
What is the Minimal Set of Fragments that Achieves Maximal Parse Accuracy?

Rens Bod
School of Computing, University of Leeds, Leeds LS2 9JT
Institute for Logic, Language and Computation, University of Amsterdam, Spuistraat 134, 1012 VB Amsterdam
rens@

Abstract

We aim at finding the minimal set of fragments which achieves maximal parse accuracy in Data Oriented Parsing. Experiments with the Penn Wall Street Journal treebank show that counts of almost arbitrary fragments within parse trees are important, leading to improved precision and recall over previous models tested on this treebank. We isolate some dependency relations which previous models neglect but which contribute to higher parse accuracy.

1 Introduction

One of the goals in statistical natural language parsing is to find the minimal set of statistical dependencies between words and syntactic structures that achieves maximal parse accuracy. Many stochastic parsing models use linguistic intuitions to find this minimal set, for example by restricting the statistical dependencies to the locality of headwords of constituents (Collins 1997, 1999; Eisner 1997), leaving it as an open question whether there exist important statistical dependencies that go beyond linguistically motivated dependencies. The Data Oriented Parsing (DOP) model, on the other hand, takes a rather extreme view on this issue: given an annotated corpus, all fragments (i.e. subtrees) seen in that corpus, regardless of size and lexicalization, are in principle taken to form a grammar (see Bod 1993, 1998; Goodman 1998; Sima'an 1999). The set of subtrees that is used is thus very large and extremely redundant. Both from a theoretical and from a computational perspective, we may wonder whether it is possible to impose constraints on the subtrees that are used, in such a way that the accuracy of the model does not deteriorate or perhaps even improves. That is the main question addressed in this paper. We report on .
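To make the notion of corpus fragments concrete, the sketch below enumerates all DOP subtrees of a single parse tree, where a subtree keeps, for every node it contains, either all or none of that node's children. This is an illustrative sketch rather than code from the paper; the tuple-based tree encoding, the function names rooted_fragments and all_fragments, and the toy sentence are assumptions made only for the example.

from itertools import product

# A parse tree is encoded as a nested tuple: (label, child_1, ..., child_n).
# Terminals (words) are plain strings.

def rooted_fragments(tree):
    # All DOP fragments rooted at this node. For each child we may either
    # cut it off (keeping only its root label as a frontier node) or plug in
    # any fragment rooted at that child.
    if isinstance(tree, str):        # a word has no fragments rooted at it
        return []
    label, children = tree[0], tree[1:]
    child_options = []
    for child in children:
        frontier = child if isinstance(child, str) else (child[0],)
        child_options.append([frontier] + rooted_fragments(child))
    return [(label,) + combo for combo in product(*child_options)]

def all_fragments(tree):
    # All DOP fragments of a parse tree: fragments rooted at every node.
    if isinstance(tree, str):
        return []
    frags = list(rooted_fragments(tree))
    for child in tree[1:]:
        frags.extend(all_fragments(child))
    return frags

if __name__ == "__main__":
    t = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "dogs")))
    frags = all_fragments(t)
    for f in frags:
        print(f)
    print(len(frags), "fragments")   # 17 fragments for this small tree

Even this five-rule toy tree yields 17 fragments, and the number of fragments grows exponentially with tree size, which is why the resulting DOP grammar is described above as very large and extremely redundant.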