Scientific report: "Automatic Retrieval and Clustering of Similar Words"

Dekang Lin
Department of Computer Science
University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2
lindek@

Abstract

Bootstrapping semantics from text is one of the greatest challenges in natural language learning. We first define a word similarity measure based on the distributional pattern of words. The similarity measure allows us to construct a thesaurus using a parsed corpus. We then present a new evaluation methodology for the automatically constructed thesaurus. The evaluation results show that the thesaurus is significantly closer to WordNet than Roget Thesaurus is.

1 Introduction

The meaning of an unknown word can often be inferred from its context. Consider the following example, slightly modified from (Nida, 1975):

(1) A bottle of tezguino is on the table.
    Everyone likes tezguino.
    Tezguino makes you drunk.
    We make tezguino out of corn.

The contexts in which the word tezguino is used suggest that tezguino may be a kind of alcoholic beverage made from corn mash.

Bootstrapping semantics from text is one of the greatest challenges in natural language learning. It has been argued that similarity plays an important role in word acquisition (Gentner, 1982). Identifying similar words is an initial step in learning the definition of a word. This paper presents a method for making this first step. For example, given a corpus that includes the sentences in (1), our goal is to be able to infer that tezguino is similar to beer, wine, vodka, etc.
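The idea that words sharing contexts are likely to be similar can be illustrated with a toy sketch. This is not the similarity measure defined in this paper, which is computed over a parsed corpus; as a stand-in, the sketch uses Jaccard overlap between sets of neighbouring words. The sentences about wine and book, the window size, and the function names `context_sets` and `jaccard` are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy corpus: the sentences in (1) plus invented sentences about two
# other words. The extra sentences are assumptions for illustration only.
corpus = [
    "a bottle of tezguino is on the table",
    "everyone likes tezguino",
    "tezguino makes you drunk",
    "we make tezguino out of corn",
    "a bottle of wine is on the table",
    "everyone likes wine",
    "wine makes you drunk",
    "a book is on the table",
    "everyone reads the book",
]


def context_sets(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Neighbours to the left and right, excluding the word itself.
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts


def jaccard(a, b):
    """Jaccard overlap of two sets (a simple stand-in, not the paper's measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


contexts = context_sets(corpus)
sim_wine = jaccard(contexts["tezguino"], contexts["wine"])
sim_book = jaccard(contexts["tezguino"], contexts["book"])
print(f"tezguino ~ wine: {sim_wine:.2f}")
print(f"tezguino ~ book: {sim_book:.2f}")
```

On this toy corpus, tezguino shares far more of its contexts with wine than with book, which is the kind of evidence the paper aims to exploit at scale.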
In addition to the long-term goal of bootstrapping semantics from text, automatic identification of similar words has many immediate applications. The most obvious one is thesaurus construction. An automatically created thesaurus offers many advantages over manually constructed thesauri. Firstly, the terms can be corpus- or genre-specific. Manually constructed general-purpose dictionaries and thesauri include many usages that are very infrequent in a particular corpus or genre