Scientific report: "Determining Word Sense Dominance Using a Thesaurus"

Saif Mohammad and Graeme Hirst
Department of Computer Science, University of Toronto, Toronto, ON M5S 3G4, Canada
smm gh @

Abstract

The degree of dominance of a sense of a word is the proportion of occurrences of that sense in text. We propose four new methods to accurately determine word sense dominance using raw text and a published thesaurus. Unlike the McCarthy et al. (2004) system, these methods can be used on relatively small target texts, without the need for a similarly-sense-distributed auxiliary text. We perform an extensive evaluation using artificially generated thesaurus-sense-tagged data. In the process, we create a word-category co-occurrence matrix, which can be used for unsupervised word sense disambiguation and for estimating distributional similarity of word senses as well.

1 Introduction

The occurrences of the senses of a word usually have a skewed distribution in text. Further, the distribution varies in accordance with the domain or topic of discussion. For example, the "assertion of illegality" sense of charge is more frequent in the judicial domain, while in the domain of economics the "expense/cost" sense occurs more often.
Formally, the degree of dominance of a particular sense of a word (the target word) in a given text (the target text) may be defined as the ratio of the occurrences of the sense to the total occurrences of the target word. The sense with the highest dominance in the target text is called the predominant sense of the target word.

Determination of word sense dominance has many uses. An unsupervised system will benefit by backing off to the predominant sense in case of insufficient evidence. The dominance values may be used as prior probabilities for the different senses, obviating the need for labeled training data in a sense disambiguation task. Natural language systems can choose to ignore infrequent senses of words or consider only the most dominant senses (McCarthy et al., 2004).
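The definition of dominance as a ratio can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each occurrence of the target word already carries a sense label; the methods proposed in the paper are meant to estimate dominance without such labels, so this shows only the quantity being estimated, not how the paper computes it. The function names and the sense labels for charge are hypothetical.

```python
from collections import Counter


def sense_dominance(tagged_occurrences):
    """Return, for each sense, its share of the target word's occurrences.

    tagged_occurrences: one (hypothetical) sense label per occurrence
    of the target word in the target text.
    """
    counts = Counter(tagged_occurrences)
    total = sum(counts.values())
    if total == 0:
        return {}  # the target word does not occur in the target text
    return {sense: n / total for sense, n in counts.items()}


def predominant_sense(dominance):
    """Return the sense with the highest dominance, or None if there is none."""
    return max(dominance, key=dominance.get) if dominance else None


# Illustrative (made-up) labels for ten occurrences of "charge" in a judicial text.
occurrences = ["accusation"] * 6 + ["expense"] * 3 + ["electric"] * 1
dominance = sense_dominance(occurrences)
best = predominant_sense(dominance)
```

Because the dominance values of all senses of a word sum to 1, the resulting dictionary can be used directly as a prior distribution over senses, and `predominant_sense` gives the back-off choice when a disambiguation system lacks other evidence.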
