Detecting Errors in Automatically-Parsed Dependency Relations
Markus Dickinson
Indiana University
md7@

Abstract

We outline different methods to detect errors in automatically-parsed dependency corpora by comparing so-called dependency rules to their representation in the training data and flagging anomalous ones. By comparing each new rule to every relevant rule from training, we can identify parts of parse trees which are likely erroneous. Even the relatively simple methods of comparison we propose show promise for speeding up the annotation process.

1 Introduction and Motivation

Given the need for high-quality dependency parses in applications such as statistical machine translation (Xu et al., 2009), natural language generation (Wan et al., 2009), and text summarization evaluation (Owczarzak, 2009), there is a corresponding need for high-quality dependency annotation for the training and evaluation of dependency parsers (Buchholz and Marsi, 2006). Furthermore, parsing accuracy degrades unless sufficient amounts of labeled training data from the same domain are available (Gildea, 2001; Sekine, 1997), and thus we need larger and more varied annotated treebanks, covering a wide range of domains. However, there is a bottleneck in obtaining annotation, due to the need for manual intervention in annotating a treebank. One approach is to develop automatically-parsed corpora (van Noord and Bouma, 2009), but a natural disadvantage with such data is that it contains parsing errors. Identifying the most problematic parses for human post-processing could combine the benefits of automatic and manual annotation, by allowing a human annotator to efficiently correct automatic errors. We thus set out in this paper to detect errors in automatically-parsed data. If annotated corpora are to grow in scale and retain a high quality, annotation errors which arise from automatic processing must be minimized, as errors have a negative impact on the training and evaluation of NLP technology (see discussion).
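The rule-comparison approach sketched in the abstract can be made concrete. The following Python code is a minimal sketch, not the paper's actual method: it assumes a deliberately simplified notion of a dependency rule (a head's POS tag paired with the ordered dependency relations of its children) and flags a rule as anomalous simply because it is unseen or rare in training, whereas the paper compares each new rule against every relevant training rule. The token representation, function names, and frequency threshold are all illustrative assumptions.

    from collections import Counter

    def extract_rules(tree):
        # A tree is a list of tokens, each a dict with 'id' (1-based),
        # 'head' (0 for the artificial root), 'pos', and 'rel' (the
        # dependency relation to the head). A rule here pairs a head's
        # POS with the ordered relations of its dependents.
        children = {}
        for tok in tree:
            children.setdefault(tok['head'], []).append(tok)
        rules = []
        for head_id, deps in children.items():
            if head_id == 0:  # skip dependents of the artificial root
                continue
            head_pos = next(t['pos'] for t in tree if t['id'] == head_id)
            rels = tuple(t['rel'] for t in sorted(deps, key=lambda t: t['id']))
            rules.append((head_pos, rels))
        return rules

    def train_rule_counts(treebank):
        # Count every rule observed in the (automatically-parsed)
        # training treebank.
        counts = Counter()
        for tree in treebank:
            counts.update(extract_rules(tree))
        return counts

    def flag_anomalies(tree, counts, threshold=1):
        # Return rules of a newly parsed sentence that are unseen or
        # rare in training; these mark likely erroneous parts of the
        # parse tree for a human annotator to inspect first.
        return [r for r in extract_rules(tree) if counts[r] <= threshold]

In practice one would want a richer rule representation (e.g., including the dependents' POS tags and the head's position among them) and a comparison that credits partial overlap with relevant training rules, but the frequency threshold above already captures the basic flagging idea.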