Scientific report: "UNSUPERVISED WORD SENSE DISAMBIGUATION RIVALING SUPERVISED METHODS"

David Yarowsky
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Abstract

This paper presents an unsupervised learning algorithm for sense disambiguation that, when trained on unannotated English text, rivals the performance of supervised techniques that require time-consuming hand annotations. The algorithm is based on two powerful constraints - that words tend to have one sense per discourse and one sense per collocation - exploited in an iterative bootstrapping procedure. Tested accuracy exceeds 96%.

1 Introduction

This paper presents an unsupervised algorithm that can accurately disambiguate word senses in a large, completely untagged corpus [1]. The algorithm avoids the need for costly hand-tagged training data by exploiting two powerful properties of human language:

1. One sense per collocation [2]: Nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship.
2. One sense per discourse: The sense of a target word is highly consistent within any given document.

Moreover, language is highly redundant, so that the sense of a word is effectively overdetermined by (1) and (2) above. The algorithm uses these properties to incrementally identify collocations for target senses of a word, given a few seed collocations for each sense. This procedure is robust and ...

[1] Note that the problem here is sense disambiguation: assigning each instance of a word to established sense definitions (such as in a dictionary). This differs from sense induction: using distributional similarity to partition word instances into clusters that may have no relation to standard sense partitions.
[2] Here I use the traditional dictionary definition of collocation - "appearing in the same location; a juxtaposition of words". No idiomatic or non-compositional interpretation is implied.
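
The introduction describes growing a set of sense-indicating collocations from a few seeds, using the one-sense-per-collocation and one-sense-per-discourse constraints. The following Python sketch illustrates that general idea only. It is not the paper's procedure, whose details do not appear in this excerpt. The function name `bootstrap_senses`, the unanimity rule for labelling, the `min_count` threshold, and the toy data for the word "plant" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def bootstrap_senses(instances, seeds, min_count=2, max_iters=10):
    """Toy bootstrapping loop (illustrative assumption, not the paper's method).

    instances: list of (doc_id, set_of_context_words) for one target word.
    seeds: {collocate: sense} starting rules.
    """
    rules = dict(seeds)                 # collocate -> sense, grown each round
    labels = [None] * len(instances)    # sense assigned to each instance

    for _ in range(max_iters):
        changed = False

        # One sense per collocation: label an instance when every known
        # collocate in its context points to the same sense.
        for i, (_, context) in enumerate(instances):
            if labels[i] is None:
                votes = {rules[w] for w in context if w in rules}
                if len(votes) == 1:
                    labels[i] = votes.pop()
                    changed = True

        # One sense per discourse: if all labelled instances in a document
        # agree, give that sense to the document's unlabelled instances.
        doc_senses = defaultdict(set)
        for (doc, _), sense in zip(instances, labels):
            if sense is not None:
                doc_senses[doc].add(sense)
        for i, (doc, _) in enumerate(instances):
            if labels[i] is None and len(doc_senses[doc]) == 1:
                labels[i] = next(iter(doc_senses[doc]))
                changed = True

        # Promote new collocates that appear with exactly one sense
        # at least min_count times among the labelled instances.
        counts = defaultdict(Counter)
        for (_, context), sense in zip(instances, labels):
            if sense is not None:
                for w in context:
                    counts[w][sense] += 1
        for w, by_sense in counts.items():
            if w not in rules and len(by_sense) == 1:
                sense, n = next(iter(by_sense.items()))
                if n >= min_count:
                    rules[w] = sense
                    changed = True

        if not changed:
            break
    return labels, rules

# Hypothetical toy data: six occurrences of "plant" in four documents.
instances = [
    ("d1", {"life", "species"}),
    ("d1", {"species", "animal"}),
    ("d2", {"manufacturing", "equipment"}),
    ("d2", {"equipment", "workers"}),
    ("d3", {"species", "animal"}),
    ("d4", {"equipment", "workers"}),
]
seeds = {"life": "living", "manufacturing": "factory"}
labels, rules = bootstrap_senses(instances, seeds)
print(labels)
```

In this toy run the seeds label only two instances directly. The discourse constraint labels two more, and the collocates learned from those four ("species", "equipment") then reach documents d3 and d4, which contain no seed word. A realistic implementation would need a principled way to rank and threshold candidate collocations rather than the simple count used here.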
