Automatic Detection and Correction of Errors in Dependency Treebanks

Alexander Volokh
DFKI, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
alexander.volokh@dfki.de

Günter Neumann
DFKI, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
neumann@dfki.de

Abstract

Annotated corpora are essential for almost all NLP applications. Although they are expected to be of very high quality because of their importance for follow-up developments, they still contain a considerable number of errors. With this work we want to draw attention to this fact. Additionally, we try to estimate the number of errors and propose a method for their automatic correction. Our approach is able to find only a portion of the errors that we suppose are contained in almost any annotated corpus, owing to the nature of the process of its creation. However, it has a very high precision and is thus in any case beneficial for the quality of the corpus it is applied to. Finally, we compare it to a different method for error detection in treebanks and find that the errors we are able to detect are mostly different and that the two approaches are complementary.

1 Introduction

Treebanks and other annotated corpora have become essential for almost all NLP applications. Papers about corpora like the Penn Treebank [1] have thousands of citations, since most algorithms profit from annotated data during development and testing, and such corpora are thus widely used in the field. Treebanks are therefore expected to be of very high quality in order to guarantee reliability for their theoretical and practical uses.
The construction of an annotated corpus involves a lot of work performed by large groups. However, despite the large amount of human post-editing and automatic quality assurance that is done, errors cannot be avoided completely [5].

In this paper we propose an approach for finding and correcting errors in dependency treebanks. We apply our method to the English dependency corpus - conversion of the Penn .