Scientific report: "Finding Predominant Word Senses in Untagged Text"

Finding Predominant Word Senses in Untagged Text

Diana McCarthy, Rob Koeling, Julie Weeds and John Carroll
Department of Informatics, University of Sussex, Brighton BN1 9QH, UK
dianam, robk, juliewe, johnca @

Abstract

In word sense disambiguation (WSD), the heuristic of choosing the most common sense is extremely powerful because the distribution of the senses of a word is often skewed. The problem with using the predominant, or first sense, heuristic, aside from the fact that it does not take surrounding context into account, is that it assumes some quantity of hand-tagged data. Whilst there are a few hand-tagged corpora available for some languages, one would expect the frequency distribution of the senses of words, particularly topical words, to depend on the genre and domain of the text under consideration. We present work on the use of a thesaurus acquired from raw textual corpora and the WordNet similarity package to find predominant noun senses automatically. The acquired predominant senses give a precision of 64% on the nouns of the SENSEVAL-2 English all-words task. This is a very promising result given that our method does not require any hand-tagged text, such as SemCor. Furthermore, we demonstrate that our method discovers appropriate predominant senses for words from two domain-specific corpora.
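As an illustration of the first sense heuristic that the abstract takes as its starting point, the following is a minimal Python sketch. It derives each word's predominant sense from hand-tagged counts, which is the supervised baseline, not the thesaurus-based method the paper proposes. The function names, the toy corpus and the sense labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def predominant_senses(tagged_tokens):
    """Map each word to its most frequent sense in hand-tagged data.

    tagged_tokens is an iterable of (word, sense) pairs. Ties are broken
    in favour of the sense seen first, which is how Counter.most_common
    orders equal counts.
    """
    counts = defaultdict(Counter)
    for word, sense in tagged_tokens:
        counts[word][sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def first_sense_tagger(words, predominant):
    """Tag every word with its predominant sense, ignoring context.

    Words never seen in the hand-tagged data receive None, which shows
    the heuristic's dependence on tagged data.
    """
    return [(w, predominant.get(w)) for w in words]


# Toy hand-tagged data; the sense labels are invented for this example.
toy_corpus = [
    ("star", "celestial_body"),
    ("star", "celestial_body"),
    ("star", "celebrity"),
    ("pipe", "tube"),
]

predominant = predominant_senses(toy_corpus)
tagged = first_sense_tagger(["star", "pipe", "tiger"], predominant)
```

Under these assumptions, every occurrence of "star" receives the same sense whatever its context, and "tiger" receives no sense at all because it never appears in the tagged data. These are the two limitations the abstract points out.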
1 Introduction

The first sense heuristic, which is often used as a baseline for supervised WSD systems, outperforms many of these systems which take surrounding context into account. This is shown by the results of the English all-words task in SENSEVAL-2 (Cotton et al., 1998) in figure 1 below, where the first sense is that listed in WordNet for the PoS given by the Penn TreeBank (Palmer et al., 2001). The senses in WordNet are ordered according to the frequency data in the manually tagged resource SemCor (Miller et al., 1993). Senses that have not occurred in SemCor are ordered arbitrarily and after those senses of the word that have occurred. The figure distinguishes systems which
