Creating a manually error-tagged and shallow-parsed learner corpus

Ryo Nagata
Konan University
8-9-1 Okamoto, Kobe 658-0072 Japan
rnagata@konan-u.ac.jp

Edward Whittaker, Vera Sheinman
The Japan Institute for Educational Measurement Inc.
3-2-4 Kita-Aoyama, Tokyo 107-0061 Japan
{whittaker,sheinman}@jiem.co.jp

Abstract

The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited. This means that researchers do not have a common development and test set for natural language processing of learner English such as for grammatical error detection. Given this background, we created a novel learner corpus that was manually error-tagged and shallow-parsed. This corpus is available for research and educational purposes on the web. In this paper, we describe it in detail together with its data-collection method and annotation schemes. Another contribution of this paper is that we take the first step toward evaluating the performance of existing POS-tagging/chunking techniques on learner corpora using the created corpus. These contributions will facilitate further research in related areas such as grammatical error detection and automated essay scoring.

1 Introduction

The availability of learner corpora is still somewhat limited despite the obvious usefulness of such data in conducting research on natural language processing of learner English in recent years. In particular, learner corpora tagged with grammatical errors are rare because of the difficulties inherent in learner corpus creation, as will be described in Sect. 2. As shown in Table 1, error-tagged learner corpora are very few among existing learner corpora (see Leacock et al. (2010) for a more detailed discussion of learner corpora). Even if data is error-tagged, it is often not available to the public or its access is severely restricted. For example, the Cambridge Learner Corpus, which is one of the largest error-tagged learner corpora, can only be used by .