Supervised Grammar Induction using Training Data with Limited Constituent Information*

Rebecca Hwa
Division of Engineering and Applied Sciences
Harvard University, Cambridge, MA 02138, USA
rebecca@eecs.harvard.edu

Abstract

Corpus-based grammar induction generally relies on hand-parsed training data to learn the structure of the language. Unfortunately, the cost of building large annotated corpora is prohibitively expensive. This work aims to improve the induction strategy when there are few labels in the training data. We show that the most informative linguistic constituents are the higher nodes in the parse trees, typically denoting complex noun phrases and sentential clauses. They account for only 20% of all constituents. For inducing grammars from sparsely labeled training data (i.e., only higher-level constituent labels), we propose an adaptation strategy which produces grammars that parse almost as well as grammars induced from fully labeled corpora. Our results suggest that for a partial parser to replace human annotators, it must be able to automatically extract higher-level constituents rather than base noun phrases.

1 Introduction

The availability of large hand-parsed corpora such as the Penn Treebank Project has made high-quality statistical parsers possible. However, the parsers risk becoming so tailored to the labeled training data that they cannot reliably process sentences from an arbitrary domain.
Thus, while a parser trained on the Wall Street Journal corpus can fairly accurately parse a new Wall Street Journal article, it may not perform as well on a New Yorker article. To parse sentences from a new domain, one would normally directly induce a new grammar from that domain, in which the training process would require ...

* This material is based upon work supported by the National Science Foundation under Grant No. IRI 9712068. We thank Stuart Shieber for his guidance, and Lillian Lee, Ric Crabbe, and the three anonymous reviewers for their comments on the paper.
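The distinction between higher-level constituents (complex noun phrases, sentential clauses) and base constituents can be made concrete by walking a bracketed parse tree. The sketch below is illustrative only: the toy sentence, the minimal bracket parser, and the rule "a constituent is higher-level if it dominates another phrasal node" are assumptions made for this example, not the paper's exact counting criteria.

```python
# Illustrative sketch (not the paper's method): count higher-level versus
# base constituents in a Penn-Treebank-style bracketed parse. POS tags
# (preterminals) are not counted as constituents here.

def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Parse one '(LABEL child ...)' expression; return (label, children)."""
    assert tokens.pop(0) == "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))   # a nested constituent
        else:
            children.append(tokens.pop(0))   # a terminal word
    tokens.pop(0)                            # consume ")"
    return (label, children)

def is_preterminal(node):
    """True for nodes like (DT the) whose children are all words."""
    return all(isinstance(c, str) for c in node[1])

def count(node, counts):
    """Classify each phrasal node as 'higher' or 'base'; recurse on subtrees."""
    if is_preterminal(node):
        return counts
    subtrees = [c for c in node[1] if isinstance(c, tuple)]
    if any(not is_preterminal(t) for t in subtrees):
        counts["higher"] += 1   # dominates at least one other phrasal node
    else:
        counts["base"] += 1     # spans only POS tags, e.g. a base noun phrase
    for t in subtrees:
        count(t, counts)
    return counts

tree = parse(tokenize(
    "(S (NP (DT the) (NN parser)) (VP (VBZ needs) (NP (JJ labeled) (NNS trees))))"))
print(count(tree, {"higher": 0, "base": 0}))  # → {'higher': 2, 'base': 2}
```

Here S and VP are higher-level (each dominates another phrasal node), while the two NPs are base noun phrases; under such a split, the higher-level nodes are the minority of constituents, in line with the 20% figure cited above.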