On the Evaluation and Comparison of Taggers: the Effect of Noise in Testing Corpora

Lluís Padró and Lluís Màrquez
Dep. LSI, Technical University of Catalonia
c/ Jordi Girona 1-3, 08034 Barcelona
{padro, lluism}@

Abstract

This paper addresses the issue of POS tagger evaluation. Such evaluation is usually performed by comparing the tagger output with a reference test corpus, which is assumed to be error-free. Currently used corpora contain noise, which causes the obtained performance to be a distortion of the real value. We analyze to what extent this distortion may invalidate the comparison between taggers or the measure of the improvement given by a new system. The main conclusion is that a more rigorous testing experimentation setting/designing is needed to reliably evaluate and compare tagger accuracies.

1 Introduction and Motivation

Part-of-Speech (POS) tagging is a quite well-defined NLP problem, which consists of assigning to each word in a text the proper morphosyntactic tag for the given context. Although many words are ambiguous regarding their POS, in most cases they can be completely disambiguated by taking into account an adequate context. Successful taggers have been built using several approaches, such as statistical techniques, symbolic machine learning techniques, neural networks, etc. The accuracy reported by most current taggers ranges from 96-97% to almost 100% in the linguistically-motivated Constraint Grammar environment. Unfortunately, there have been very few direct comparisons of alternative taggers on identical test data; one of the exceptions is the work by Samuelsson and Voutilainen (1997), in which a very strict comparison between taggers is performed. However, in most current papers it is argued that the performance of some taggers is better than that of others as a result of some kind of indirect comparison between them. We think that there are a number of insufficiently controlled or considered factors that make these conclusions dubious in most cases. In this
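The kind of distortion the abstract refers to can be made concrete with a small simulation. The sketch below is ours, not the authors': it assumes tagger errors and reference-annotation errors are independent and uniformly spread over the wrong tags, and its figures (a tagger with 96.5% true accuracy, a reference corpus with 3% noise, a 20-tag tagset) are purely illustrative.

    import random

    def measured_accuracy(n_tokens=100_000, true_acc=0.965, noise=0.03,
                          tagset_size=20, seed=0):
        """Apparent accuracy of a tagger scored against a noisy reference.

        Hypothetical setup: tagger errors and annotation errors are
        independent and uniform over the wrong tags -- a simplification,
        not the model analyzed in the paper.
        """
        rng = random.Random(seed)
        matches = 0
        for _ in range(n_tokens):
            gold = 0  # the (unobserved) correct tag; its identity is irrelevant
            # Tagger output: correct with probability true_acc, else a wrong tag.
            tag = gold if rng.random() < true_acc else rng.randrange(1, tagset_size)
            # Reference annotation: correct with probability 1 - noise.
            ref = gold if rng.random() >= noise else rng.randrange(1, tagset_size)
            matches += (tag == ref)
        return matches / n_tokens

    print(measured_accuracy())  # ~0.936, although the tagger's true accuracy is 0.965

Under these assumptions the measured accuracy comes out near 93.6%, i.e. roughly the true accuracy minus the noise rate. Since accuracy differences reported between taggers are often only a few tenths of a point, they are of the same order as this distortion, which is precisely why uncontrolled comparisons are hard to trust.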