Scientific report: "Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation"

Shane Bergsma, David Yarowsky, Kenneth Church
Department of Computer Science and Human Language Technology Center of Excellence
Johns Hopkins University
sbergsma@ yarowsky@

Abstract

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don't do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g., the Penn Treebank), (2) bitexts (e.g., Europarl), and (3) unannotated monolingual (e.g., Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words, and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data and makes 20% fewer errors than a supervised system trained with Treebank annotations.

1 Introduction

Determining which words are being linked by a coordinating conjunction is a classic hard problem.
Consider the pair:

+ellipsis: rocket (w1) and mortar (w2) attacks (h)
-ellipsis: asbestos (w1) and polyvinyl (w2) chloride (h)

+ellipsis is about both rocket attacks and mortar attacks, unlike -ellipsis, which is not about asbestos chloride. We use h to refer to the head of the phrase, and w1 and w2 to refer to the other two lexical items. Natural Language Processing applications need to recognize NP ellipsis in order to make sense of new sentences. For example, if an Internet search engine is given the phrase rocket attacks as a query, it ...
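
The contrast between "rocket and mortar attacks" and "asbestos and polyvinyl chloride" can be illustrated with a simple lexical association test. The sketch below is only an illustration of the general idea that association between w1 and h signals ellipsis; it is not the paper's feature set or classifier. The counts, the corpus size, the smoothing constant, the threshold, and the function names `pmi` and `predicts_ellipsis` are all invented for this example.

```python
import math

# Hypothetical toy counts standing in for a large N-gram collection.
# All numbers are invented for illustration only.
UNIGRAMS = {
    "rocket": 5000, "mortar": 3000, "attacks": 8000,
    "asbestos": 2000, "polyvinyl": 1500, "chloride": 4000,
}
BIGRAMS = {
    ("rocket", "attacks"): 400,
    ("mortar", "attacks"): 350,
    ("polyvinyl", "chloride"): 900,
    # ("asbestos", "chloride") is absent: the bigram was never observed.
}
TOTAL = 1_000_000  # assumed corpus size in tokens


def pmi(w, h, smoothing=0.5):
    """Pointwise mutual information of the bigram 'w h'.

    A small constant is added to the bigram count so that unseen
    bigrams get a low but finite score instead of log(0).
    """
    joint = (BIGRAMS.get((w, h), 0) + smoothing) / TOTAL
    independent = (UNIGRAMS[w] / TOTAL) * (UNIGRAMS[h] / TOTAL)
    return math.log(joint / independent)


def predicts_ellipsis(w1, h, threshold=0.0):
    """Guess that 'w1 and w2 h' means 'w1 h and w2 h'.

    Only w1 is tested: w2 modifies h under either reading, so the
    informative question is whether w1 also associates with h.
    """
    return pmi(w1, h) > threshold


for w1, w2, h in [("rocket", "mortar", "attacks"),
                  ("asbestos", "polyvinyl", "chloride")]:
    label = "+ellipsis" if predicts_ellipsis(w1, h) else "-ellipsis"
    print(f"{w1} and {w2} {h}: {label}")
```

With these toy counts, "rocket attacks" is far more frequent than chance would predict, so the first phrase is labelled as elliptical, while the unseen bigram "asbestos chloride" scores below the threshold. A real system would need counts from an actual corpus and a threshold tuned on labelled data.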
