Scientific report: "Deriving an Ambiguous Word's Part-of-Speech Distribution from Unannotated Text"

Reinhard Rapp
Universitat Rovira i Virgili
Pl. Imperial Tarraco 1, E-43005 Tarragona, Spain

Abstract

A distributional method for part-of-speech induction is presented which, in contrast to most previous work, determines the part-of-speech distribution of syntactically ambiguous words without explicitly tagging the underlying text corpus. This is achieved by assuming that the word pair consisting of the left and right neighbor of a particular token is characteristic of the part of speech at this position, and by clustering the neighbor pairs on the basis of their middle words as observed in a large corpus. The results obtained in this way are evaluated by comparing them to the part-of-speech distributions as found in the manually tagged Brown corpus.

1 Introduction

The purpose of this study is to automatically induce a system of word classes that is in agreement with human intuition, and then to assign all possible parts of speech to a given ambiguous or unambiguous word. Two of the pioneering studies concerning this as yet not satisfactorily solved problem are Finch (1993) and Schütze (1993), who classify words according to their context vectors as derived from a corpus.
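The core idea stated in the abstract, that a (left neighbor, right neighbor) pair is described by the middle words observed between its two members, can be illustrated with a minimal Python sketch. The toy corpus, the function names `neighbor_pair_profiles` and `cosine`, and the use of cosine similarity as the comparison measure are assumptions made for illustration only; this excerpt does not specify the clustering procedure or similarity measure actually used.

```python
from collections import Counter, defaultdict


def neighbor_pair_profiles(tokens):
    """Map each (left, right) neighbor pair to counts of the middle words seen between them."""
    profiles = defaultdict(Counter)
    # Slide a three-token window over the corpus: left neighbor, middle word, right neighbor.
    for left, middle, right in zip(tokens, tokens[1:], tokens[2:]):
        profiles[(left, right)][middle] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure, not from the paper)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus; a real experiment would use a large unannotated corpus.
tokens = ("the dog can run . the cat can run . "
          "the dog will run . the cat will run .").split()
profiles = neighbor_pair_profiles(tokens)

# Both pairs frame the same middle words (dog, cat), so they look alike.
noun_like = cosine(profiles[("the", "can")], profiles[("the", "will")])
# These pairs frame disjoint middle words (dog/cat vs. can/will), so they do not.
mixed = cosine(profiles[("the", "can")], profiles[("dog", "run")])
print(noun_like, mixed)
```

Neighbor pairs with similar middle-word profiles would end up in the same cluster under any reasonable clustering scheme, and each cluster can then be read as one part-of-speech position. An ambiguous word's distribution would follow from how often it occurs as the middle word of pairs in each cluster.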
More recent studies try to solve the problem of POS induction by combining distributional and morphological information (Clark, 2003; Freitag, 2004), or by clustering words and projecting them to POS vectors (Rapp, 2005). Whereas all these studies are based on global co-occurrence vectors which reflect the overall behavior of a word in a corpus, i.e. which in the case of syntactically ambiguous words are based on POS mixtures, in this paper we raise the question of whether it is really necessary to use an approach based on mixtures, or whether there is some way to avoid the mixing beforehand. For this purpose we suggest looking at local contexts instead of global co-occurrence vectors. As can be seen from human performance in almost all