tailieunhanh - Scientific report: "Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction"

1. For a given category, choose a small set of exemplars (or 'seed words')
2. Count co-occurrence of words and seed words within a corpus
3. Use a figure of merit based upon these counts to select new seed words
4. Return to step 2 and iterate n times
5. Use a figure of merit to rank words for category membership and output a ranked list

Our algorithm uses roughly this same generic structure, but achieves notably superior results, by changing the specifics of: what counts as co-occurrence; which figures of merit to use for.

Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction

Brian Roark, Cognitive and Linguistic Sciences, Box 1978, Brown University, Providence, RI 02912, USA, Brian_Roark@Brown.edu
Eugene Charniak, Computer Science, Box 1910, Brown University, Providence, RI 02912, USA

Abstract

Generating semantic lexicons semi-automatically could be a great time saver, relative to creating them by hand. In this paper, we present an algorithm for extracting potential entries for a category from an on-line corpus, based upon a small set of exemplars. Our algorithm finds more correct terms and fewer incorrect ones than previous work in this area. Additionally, the entries that are generated potentially provide broader coverage of the category than would occur to an individual coding them by hand. Our algorithm finds many terms not included within Wordnet (many more than previous algorithms), and could be viewed as an enhancer of existing broad-coverage resources.

1 Introduction

Semantic lexicons play an important role in many natural language processing tasks. Effective lexicons must often include many domain-specific terms, so that available broad-coverage resources, such as Wordnet (Miller, 1990), are inadequate. For example, both Escort and Chinook are, among other things, types of vehicles (a car and a helicopter, respectively), but neither is listed as such in Wordnet. Manually building domain-specific lexicons can be a costly, time-consuming affair. Utilizing existing resources, such as on-line corpora, to aid in this task could improve performance both by decreasing the time to construct the lexicon and by improving its quality.

Extracting semantic information from word co-occurrence statistics has been effective, particularly for sense disambiguation (Schütze, 1992; Gale et al., 1992; Yarowsky, 1995). In Riloff and Shepherd (1997), noun co-occurrence statistics were used to indicate nominal category membership, for the purpose of aiding in the construction of semantic .
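The generic five-step structure listed at the top of this excerpt can be sketched in a few lines of Python. This is only an illustration of the loop, not the authors' algorithm. Several things here are assumptions made for brevity: the definition of co-occurrence (two words appearing in the same tokenized sentence), the figure of merit (the share of a word's sentences that also contain a seed word), the toy corpus, and all function and parameter names.

```python
from collections import Counter


def figure_of_merit(cooccur_count, total_count):
    # Placeholder score (an assumption, not the paper's measure): the share
    # of a word's sentences that also contain at least one seed word.
    return cooccur_count / total_count


def score_words(sentences, seeds, exclude):
    """Step 2: count co-occurrence with seed words and score each candidate."""
    total = Counter()
    with_seed = Counter()
    for sentence in sentences:
        words = set(sentence)
        seeds_here = words & seeds
        for word in words - exclude:
            total[word] += 1
            # A seed word does not count as co-occurring with itself.
            if seeds_here - {word}:
                with_seed[word] += 1
    return {w: figure_of_merit(with_seed[w], total[w]) for w in with_seed}


def bootstrap_lexicon(sentences, initial_seeds, n_iterations=2, per_round=1):
    initial = set(initial_seeds)            # Step 1: hand-picked exemplars
    seeds = set(initial)
    for _ in range(n_iterations):           # Step 4: iterate n times
        scores = score_words(sentences, seeds, exclude=seeds)
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        seeds.update(ranked[:per_round])    # Step 3: promote the best candidates
    # Step 5: rank every non-initial word against the final seed set.
    final = score_words(sentences, seeds, exclude=initial)
    return sorted(final.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy corpus: each sentence is a list of tokens.
corpus = [
    ["car", "truck", "road"],
    ["car", "truck"],
    ["truck", "jeep"],
    ["jeep", "helicopter"],
    ["road", "tree"],
    ["tree", "leaf"],
]
ranking = bootstrap_lexicon(corpus, {"car"}, n_iterations=2, per_round=1)
for word, score in ranking:
    print(f"{word}\t{score:.2f}")
```

Even on this toy corpus the sketch shows why the choice of co-occurrence definition and figure of merit matters: a word such as "road" shares a sentence with a seed word and so receives a nonzero score, although it is not a vehicle.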
