They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

Nitin Madnani, Joel Tetreault (Educational Testing Service, Princeton, NJ), Martin Chodorow (Hunter College of CUNY), Alla Rozovskaya (University of Illinois at Urbana-Champaign)

Abstract

Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing.

1 Motivation and Contributions

One of the fastest growing areas in need of NLP tools is the field of grammatical error detection for learners of English as a Second Language (ESL). According to Guo and Beckett (2007), over a billion people speak English as their second or foreign language. This high demand has resulted in many NLP research papers on the topic, a Synthesis Series book (Leacock et al., 2010), and a recurring workshop (Tetreault et al., 2010a), all in the last five years. In this year's ACL conference there are four long papers devoted to this topic.

Despite the growing interest, two major factors encumber the growth of this subfield. First, the lack of consistent and appropriate score reporting is an issue. Most work reports results in the form of precision and recall as measured against the judgment of a single human rater. This is problematic because most usage errors, such as those in article and preposition usage, are a matter of degree rather than simple rule violations such as number agreement. As a consequence, it is common for two native speakers to have different judgments of usage. Therefore, an appropriate evaluation should take this into account by not only enlisting multiple human judges but ...
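The point about single-rater scoring can be made concrete with a small sketch (not from the paper; the rater names, token positions, and judgments below are hypothetical): the same set of system flags receives different precision and recall when scored against one rater than when credit is given for agreement with any of several raters.

# Minimal illustration of single-rater vs. multi-rater scoring.
# All judgments and positions are invented for this example.

def precision_recall(system, gold):
    """system, gold: sets of token positions flagged as errors."""
    tp = len(system & gold)
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical judgments over a short learner sentence.
system_flags = {1, 4, 7}
rater_a = {1, 4}        # the single rater used in a typical evaluation
rater_b = {1, 7, 9}     # a second rater who disagrees on position 4

# Scored against one rater only.
print(precision_recall(system_flags, rater_a))             # (0.67, 1.0)

# Scored against the pooled raters (credit for matching any rater).
print(precision_recall(system_flags, rater_a | rater_b))   # (1.0, 0.75)

Because usage judgments legitimately differ between native speakers, the single-rater numbers can under- or over-state a system's quality, which is the gap the paper's crowdsourcing-based methodologies are meant to address.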
