Scientific report: "An All-Subtrees Approach to Unsupervised Parsing"

Rens Bod
School of Computer Science, University of St Andrews
North Haugh, St Andrews KY16 9SX, Scotland, UK
rb@

Abstract

We investigate generalizations of the all-subtrees "DOP" approach to unsupervised parsing. Unsupervised DOP models assign all possible binary trees to a set of sentences and next use (a large random subset of) all subtrees from these binary trees to compute the most probable parse trees. We will test both a relative frequency estimator for unsupervised DOP and a maximum likelihood estimator which is known to be statistically consistent. We report state-of-the-art results on English (WSJ), German (NEGRA) and Chinese (CTB) data. To the best of our knowledge this is the first paper which tests a maximum likelihood estimator for DOP on the Wall Street Journal, leading to the surprising result that an unsupervised parsing model beats a widely used supervised model (a treebank PCFG).

1 Introduction

The problem of bootstrapping syntactic structure from unlabeled data has regained considerable interest. While supervised parsers suffer from shortage of hand-annotated data, unsupervised parsers operate with unlabeled raw data of which unlimited quantities are available. During the last few years there has been steady progress in the field.
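The procedure summarized in the abstract (assign every binary tree to each sentence, collect subtrees, estimate their probabilities by relative frequency) can be illustrated with a toy sketch. This is not the paper's implementation: the single nonterminal label `X`, the nested-tuple tree encoding, and the function names `binary_trees`, `fragments`, `all_subtrees` and `relative_frequencies` are assumptions made for illustration. The sketch enumerates everything exhaustively instead of sampling a random subset, and it stops before the search for the most probable parse.

```python
from collections import Counter
from itertools import product

# Assumption: one unlabeled nonterminal "X"; FRONTIER marks a nonterminal
# that was cut off at the boundary of a subtree (fragment).
FRONTIER = ("X",)

def binary_trees(words):
    """Yield every binary bracketing of `words` as nested ("X", left, right) tuples."""
    if len(words) == 1:
        yield words[0]
        return
    for split in range(1, len(words)):
        for left in binary_trees(words[:split]):
            for right in binary_trees(words[split:]):
                yield ("X", left, right)

def fragments(node):
    """Yield all fragments rooted at an internal node: each nonterminal
    child is either cut off (FRONTIER) or expanded into one of its own fragments."""
    _, left, right = node

    def options(child):
        if isinstance(child, str):
            return [child]  # a word is always kept
        return [FRONTIER] + list(fragments(child))

    for l, r in product(options(left), options(right)):
        yield ("X", l, r)

def all_subtrees(tree):
    """Yield the fragments rooted at every internal node of `tree`."""
    if isinstance(tree, str):
        return
    yield from fragments(tree)
    yield from all_subtrees(tree[1])
    yield from all_subtrees(tree[2])

def relative_frequencies(sentences):
    """Relative frequency of each fragment among all fragments.
    With a single root label, the normalizer is the total fragment count."""
    counts = Counter()
    for sentence in sentences:
        for tree in binary_trees(sentence.split()):
            counts.update(all_subtrees(tree))
    total = sum(counts.values())
    return {frag: n / total for frag, n in counts.items()}

probs = relative_frequencies(["a b c", "b c"])
```

The number of binary trees grows with the Catalan numbers and the number of fragments grows faster still, which is consistent with the abstract's mention of working from a large random subset of subtrees.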
Where van Zaanen (2000) achieved unlabeled f-score on ATIS word strings, Clark (2001) reports on the same data, and Klein and Manning (2002) obtain f-score on ATIS part-of-speech strings using a constituent-context model called CCM. On Penn Wall Street Journal p-o-s strings of length ≤ 10 (WSJ10), Klein and Manning (2002) report unlabeled f-score with CCM. And the hybrid approach of Klein and Manning (2004), which combines constituency and dependency models, yields f-score. Bod (2006) shows that a further improvement on the WSJ10 can be achieved by an unsupervised generalization of the all-subtrees approach known as Data-Oriented Parsing (DOP). This unsupervised DOP model coined .
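The results cited above are unlabeled f-scores, which compare the bracketing of a predicted tree with that of a gold tree. A minimal sketch of the per-sentence computation follows. The nested-tuple tree encoding and the names `spans` and `unlabeled_f1` are illustrative assumptions, and the choice to count every multi-word span, including the whole-sentence span, is only one of several conventions used in the literature.

```python
def spans(tree, start=0):
    """Return (set of (start, end) spans covering more than one word, end index)."""
    if isinstance(tree, str):
        return set(), start + 1
    found, pos = set(), start
    for child in tree:
        child_spans, pos = spans(child, pos)
        found |= child_spans
    if pos - start > 1:  # assumption: one-word spans are not scored
        found.add((start, pos))
    return found, pos

def unlabeled_f1(gold, predicted):
    """Harmonic mean of bracket precision and recall for one sentence."""
    gold_spans, _ = spans(gold)
    pred_spans, _ = spans(predicted)
    matched = len(gold_spans & pred_spans)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_spans)
    recall = matched / len(gold_spans)
    return 2 * precision * recall / (precision + recall)

gold = (("the", "dog"), ("barks", "loudly"))
predicted = ("the", ("dog", ("barks", "loudly")))
score = unlabeled_f1(gold, predicted)
```

Here two of the three predicted brackets match the gold brackets, so precision and recall are both 2/3. Corpus-level scores are usually computed from bracket counts pooled over all sentences, which this per-sentence sketch does not do.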
