Scientific report: "CSniper: Annotation-by-query for non-canonical constructions in large corpora"

Sabine Bartsch
English linguistics, Department of Linguistics and Literary Studies, Technische Universität Darmstadt

Richard Eckart de Castilho, Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt

Abstract

We present CSniper (Corpus Sniper), a tool that implements (i) a web-based multiuser scenario for identifying and annotating non-canonical grammatical constructions in large corpora based on linguistic queries and (ii) evaluation of annotation quality by measuring inter-rater agreement. This annotation-by-query approach efficiently harnesses expert knowledge to identify instances of linguistic phenomena that are hard to identify by means of existing automatic annotation tools.

1 Introduction

Linguistic annotation by means of automatic procedures, such as part-of-speech (POS) tagging, is a backbone of modern corpus linguistics; POS-tagged corpora enhance the possibilities of corpus query. However, many linguistic phenomena are not amenable to automatic annotation and are not readily identifiable on the basis of surface features. Non-canonical constructions (NCCs), which are the use case of the tool presented in this paper, are a case in point. NCCs, of which cleft sentences are a well-known example, raise a number of issues that prevent their reliable automatic identification in corpora. Yet they warrant corpus study due to the relatively low frequency of individual instances, their deviation from canonical construction patterns, and frequent ambiguity. This makes them hard to distinguish from other, seemingly similar constructions.
Expert knowledge is thus required to reliably identify and annotate such phenomena in sufficiently large corpora like the 100-million-word British National Corpus (BNC Consortium, 2007). This necessitates manual annotation, which is time-consuming and ...
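
The abstract mentions evaluating annotation quality by measuring inter-rater agreement, but this excerpt does not say which coefficient CSniper uses. As a minimal sketch, assuming Fleiss' kappa (a common choice when more than two annotators judge the same items) and hypothetical "yes"/"no" labels that record whether a query hit is an instance of the target construction, agreement could be computed roughly as follows. The function name and the sample data are illustrative and are not taken from the tool.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Assumption: every item was judged by the same number (>= 2) of annotators.
    Returns NaN when all labels fall into a single category, because chance
    agreement is then 1 and kappa is undefined.
    """
    if not ratings:
        raise ValueError("need at least one annotated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of labels")

    n_items = len(ratings)
    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: mean proportion of agreeing annotator pairs per item.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_e = sum(
        (sum(cnt[cat] for cnt in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )

    if p_e >= 1.0:
        return math.nan
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical judgments: four query hits, three annotators each.
sample = [
    ["yes", "yes", "no"],
    ["yes", "no", "no"],
    ["yes", "yes", "yes"],
    ["no", "no", "no"],
]
print(fleiss_kappa(sample))
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement at chance level, and negative values indicate systematic disagreement.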
