tailieunhanh - Scientific report: "High Precision Treebanking: Blazing Useful Trees Using POS Information"

Takaaki Tanaka†, Francis Bond†, Stephan Oepen‡, Sanae Fujita†
† NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation
‡ Universitetet i Oslo and CSLI, Stanford

Abstract

In this paper we present a quantitative and qualitative analysis of annotation in the Hinoki treebank of Japanese, and investigate a method of speeding annotation by using part-of-speech tags. The Hinoki treebank is a Redwoods-style treebank of Japanese dictionary definition sentences. 5,000 sentences are annotated by three different annotators and the agreement evaluated. An average agreement of [...] was found using strict agreement, and [...] using labeled precision. Exploiting POS tags allowed the annotators to choose the best parse with fewer decisions.

1 Introduction

It is important for an annotated corpus that the markup is both correct and, in cases where variant analyses could be considered correct, consistent. Considerable research in the field of word sense disambiguation has concentrated on showing that the annotation of word senses can be done correctly and consistently, with the normal measure being inter-annotator agreement (Kilgarriff and Rosenzweig, 2000). Surprisingly few such studies have been carried out for syntactic annotation, with the notable exceptions of Brants et al. (2003, p. 82) for the German NeGra Corpus and Civit et al. (2003) for the Spanish Cast3LB corpus. Even such valuable and widely used corpora as the Penn TreeBank have not been verified in this way.

We are constructing the Hinoki treebank as part of a larger project in cognitive and computational linguistics ultimately aimed at natural language understanding (Bond et al., 2004). In order to build the initial syntactic and semantic models, we are treebanking the dictionary definition sentences of the most familiar 28,000 words of Japanese and building an ontology from