Scientific report: "Towards a Resource for Lexical Semantics: A Large German Corpus with Extensive Semantic Annotation"

Katrin Erk, Andrea Kowalski, Sebastian Padó and Manfred Pinkal
Department of Computational Linguistics, Saarland University, Saarbrücken, Germany
{erk, kowalski, pado, pinkal}@

Abstract

We describe the ongoing construction of a large, semantically annotated corpus resource as a reliable basis for the large-scale acquisition of word-semantic information, e.g. the construction of domain-independent lexica. Semantic roles in the frame semantics paradigm form the backbone of the annotation. We report experiences and evaluate the annotated data from the first project stage. On this basis, we discuss the problems of vagueness and ambiguity in semantic annotation.

1 Introduction

Corpus-based methods for syntactic learning and processing are well established in computational linguistics. Comprehensive and carefully worked-out corpus resources are available for a number of languages, e.g. the Penn Treebank (Marcus et al., 1994) for English or the NEGRA corpus (Skut et al., 1998) for German. In semantics, the situation is different: semantic corpus annotation is only in its initial stages, and currently only a few, mostly small, corpora are available. Semantic annotation has predominantly concentrated on word senses, e.g. in the SENSEVAL initiative (Kilgarriff, 2001), a notable exception being the Prague Treebank (Hajičová, 1998).
As a consequence, most recent work in corpus-based semantics has taken an unsupervised approach, relying on statistical methods to extract semantic regularities from raw corpora, often using information from ontologies like WordNet (Miller et al., 1990). Meanwhile, the lack of large, domain-independent lexica providing word-semantic information is one of the most serious bottlenecks for language technology. To train tools for the acquisition of semantic information for such lexica, large, extensively annotated resources are necessary. In this paper we present current .
