Scientific report: "Bootstrapping Coreference Resolution Using Word Associations"

Hamidreza Kobdani, Hinrich Schütze, Michael Schiehlen and Hans Kamp
Institute for Natural Language Processing, University of Stuttgart
kobdani@

Abstract

In this paper, we present an unsupervised framework that bootstraps a complete coreference resolution (CoRe) system from word associations mined from a large unlabeled corpus. We show that word associations are useful for CoRe (e.g., the strong association between Obama and President is an indicator of likely coreference). Association information has so far not been used in CoRe because it is sparse and difficult to learn from small labeled corpora. Since unlabeled text is readily available, our unsupervised approach addresses the sparseness problem. In a self-training framework, we train a decision tree on a corpus that is automatically labeled using word associations. We show that this unsupervised system has better CoRe performance than other learning approaches that do not use manually labeled data.

1 Introduction

Coreference resolution (CoRe) is the process of finding markables (noun phrases) referring to the same real-world entity or concept. Until recently, most approaches tried to solve the problem by binary classification, where the probability of a pair of markables being coreferent is estimated from labeled data. Alternatively, a model that determines whether a markable is coreferent with a preceding cluster can be used.
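To make the idea of association-driven pair labeling concrete, the following is a minimal sketch, not the authors' implementation. It assumes pointwise mutual information (PMI) over sentence-level co-occurrence as the association measure and a fixed threshold for the pairwise decision; the text above does not specify either choice. The function names, the threshold and the toy corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def association_scores(sentences):
    """Return PMI for every unordered word pair that co-occurs in a sentence.

    PMI is an assumed stand-in for the association measure; probabilities
    are estimated as fractions of sentences.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))  # count each word once per sentence
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    n = len(sentences)
    scores = {}
    for (a, b), joint in pair_counts.items():
        # PMI = log( P(a, b) / (P(a) * P(b)) )
        scores[(a, b)] = math.log(joint * n / (word_counts[a] * word_counts[b]))
    return scores


def label_pair(scores, head_a, head_b, threshold=0.0):
    """Label two markable head words as coreferent if their PMI exceeds the
    (hypothetical) threshold. Pairs never seen together score -infinity."""
    key = tuple(sorted((head_a, head_b)))
    return scores.get(key, float("-inf")) > threshold


# Toy corpus, invented for illustration only.
corpus = [
    ["Obama", "President", "spoke"],
    ["Obama", "President", "visited"],
    ["Senate", "voted"],
    ["Senate", "spoke"],
]
scores = association_scores(corpus)
print(label_pair(scores, "Obama", "President"))
print(label_pair(scores, "Obama", "Senate"))
```

Labels produced this way could serve as the automatically labeled training data that the abstract mentions. A real system would mine the counts from a far larger corpus, which is how unlabeled text addresses the sparseness problem.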
For both pair-based and cluster-based models, a well-established feature model plays an important role. Typical systems use a rich feature space based on lexical, syntactic and semantic knowledge. Most commonly used features are described by Soon et al. (2001). Most existing systems are supervised systems, trained on human-labeled benchmark data sets for English. These systems use linguistic features based on number, gender, person etc. It is a challenge to adapt these systems to new domains, genres and languages because a .
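As an illustration of the kind of pairwise linguistic features mentioned above, here is a small sketch in the spirit of the feature set of Soon et al. (2001). It covers only a simplified subset (string match, number agreement, gender agreement, sentence distance). The `Markable` fields and the `pair_features` function are hypothetical names chosen for this example, and treating an unknown gender as compatible is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Markable:
    """A noun phrase with the attributes needed for agreement features.
    The field names are illustrative assumptions."""
    text: str
    number: str    # "sg" or "pl"
    gender: str    # "m", "f", "n" or "unknown"
    sentence: int  # index of the sentence containing the markable


def pair_features(antecedent, anaphor):
    """Build a feature dictionary for one candidate markable pair."""
    genders = (antecedent.gender, anaphor.gender)
    return {
        # Case-insensitive exact match of the two surface strings.
        "string_match": antecedent.text.lower() == anaphor.text.lower(),
        "number_agree": antecedent.number == anaphor.number,
        # Simplification: an unknown gender is treated as compatible.
        "gender_agree": "unknown" in genders or genders[0] == genders[1],
        "sentence_distance": anaphor.sentence - antecedent.sentence,
    }


obama = Markable("Obama", "sg", "m", 0)
he = Markable("he", "sg", "m", 1)
they = Markable("they", "pl", "unknown", 2)

print(pair_features(obama, he))
print(pair_features(obama, they))
```

Feature dictionaries of this form are what a pairwise classifier, such as a decision tree, would take as input. Features like number and gender rely on language-specific resources.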
