tailieunhanh - Scientific report: "Dictionary Definitions based Homograph Identification using a Generative Hierarchical Model"

Anagha Kulkarni, Jamie Callan
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA 15213, USA
anaghak callan @

Abstract

A solution to the problem of homograph (words with multiple distinct meanings) identification is proposed and evaluated in this paper. It is demonstrated that a mixture model based framework is better suited for this task than the standard classification algorithms: a relative improvement of 7% in F1 measure and 14% in Cohen's kappa score is observed.

1 Introduction

Lexical ambiguity resolution is an important research problem for the fields of information retrieval and machine translation (Sanderson, 2000; Chan et al., 2007). However, making fine-grained sense distinctions for words with multiple closely-related meanings is a subjective task (Jorgenson, 1990; Palmer et al., 2005), which makes it difficult and error-prone. Fine-grained sense distinctions aren't necessary for many tasks; thus a possibly-simpler alternative is lexical disambiguation at the level of homographs (Ide and Wilks, 2006).

Homographs are a special case of semantically ambiguous words: words that can convey multiple distinct meanings. For example, the word "bark" can imply two very different concepts, "outer layer of a tree trunk" or "the sound made by a dog", and thus is a homograph. Ironically, the definition of the word "homograph" is itself ambiguous and much debated; however, in this paper we consistently use the above definition.
If the goal is to do word-sense disambiguation of homographs in a very large corpus, a manually-generated homograph inventory may be impractical. In this case, the first step is to determine which words in a lexicon are homographs. This problem is the subject of this paper.

2 Finding the Homographs in a Lexicon

Our goal is to identify the homographs in a large lexicon. We assume that manual labor is a scarce resource