tailieunhanh - Scientific report: "Multi-Tagging for Lexicalized-Grammar Parsing"

James R. Curran
School of IT, University of Sydney, NSW 2006, Australia
james@

Stephen Clark
Computing Laboratory, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD, UK
sclark@

David Vadas
School of IT, University of Sydney, NSW 2006, Australia
dvadas1@

Abstract

With performance above 97% accuracy for newspaper text, part of speech (POS) tagging might be considered a solved problem. Previous studies have shown that allowing the parser to resolve POS tag ambiguity does not improve performance. However, for grammar formalisms which use more fine-grained grammatical categories, for example TAG and CCG, tagging accuracy is much lower. In fact, for these formalisms, premature ambiguity resolution makes parsing infeasible. We describe a multi-tagging approach which maintains a suitable level of lexical category ambiguity for accurate and efficient CCG parsing. We extend this multi-tagging approach to the POS level to overcome errors introduced by automatically assigned POS tags. Although POS tagging accuracy seems high, maintaining some POS tag ambiguity in the language processing pipeline results in more accurate CCG supertagging.

1 Introduction

State-of-the-art part of speech (POS) tagging accuracy is now above 97% for newspaper text (Collins, 2002; Toutanova et al., 2003).
One possible conclusion from the POS tagging literature is that accuracy is approaching the limit, and any remaining improvement is within the noise of the Penn Treebank training data (Ratnaparkhi, 1996; Toutanova et al., 2003). So why should we continue to work on the POS tagging problem? Here we give two reasons. First, for lexicalized grammar formalisms such as TAG and CCG, the tagging problem is much harder. Second, any errors in POS tagger output, even at 97% accuracy, can have a significant impact on components further down the language processing pipeline. In previous work we have shown that using automatically assigned rather than
