Scientific report: "PART-OF-SPEECH INDUCTION FROM SCRATCH"

PART-OF-SPEECH INDUCTION FROM SCRATCH

Hinrich Schütze
Center for the Study of Language and Information
Ventura Hall
Stanford, CA 94305-4115
schuetze@csli.stanford.edu

Abstract

This paper presents a method for inducing the parts of speech of a language and part-of-speech labels for individual words from a large text corpus. Vector representations for the part of speech of a word are formed from entries of its near lexical neighbors. A dimensionality reduction creates a space representing the syntactic categories of unambiguous words. A neural net trained on these spatial representations classifies individual contexts of occurrence of ambiguous words. The method classifies both ambiguous and unambiguous words correctly with high accuracy.

INTRODUCTION

Part-of-speech information about individual words is necessary for any kind of syntactic and higher-level processing of natural language. While it is easy to obtain lists with part-of-speech labels for frequent English words, such information is not available for less common languages. Even for English, a categorization of words that is tailored to a particular genre may be desired. Finally, there are rare words that need to be categorized even if frequent words are covered by an available electronic dictionary.
This paper presents a method for inducing the parts of speech of a language and part-of-speech labels for individual words from a large text corpus. Little, if any, language-specific knowledge is used, so the method is applicable in principle to any language. Since the part-of-speech representations are derived from the corpus, the resulting categorization is highly text-specific and does not contain categories that are inappropriate for the genre in question. The method is efficient enough for vocabularies of tens of thousands of words, thus addressing the problem of coverage. The problem of how syntactic categories can be induced is also of theoretical interest in language acquisition and learnability. Syntactic .
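
The first step named in the abstract, forming a word's vector from its near lexical neighbors, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy corpus, the window of one word on each side, the raw counts, and the use of cosine similarity to compare vectors. The dimensionality reduction and the neural net are left out.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens):
    """Map each word to counts of its immediate left and right neighbors.

    Assumption: a window of one word per side; the excerpt does not
    specify the window actually used.
    """
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if i > 0:
            vectors[word][("L", tokens[i - 1])] += 1
        if i + 1 < len(tokens):
            vectors[word][("R", tokens[i + 1])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus: two nouns and two verbs in parallel contexts.
corpus = "the cat sleeps . the dog sleeps . a cat runs . a dog runs .".split()
vectors = context_vectors(corpus)

sim_nouns = cosine(vectors["cat"], vectors["dog"])     # same neighbors
sim_mixed = cosine(vectors["cat"], vectors["sleeps"])  # no shared neighbors
```

In this toy setting, words that occur between the same neighbors ("cat" and "dog") get identical vectors, while a noun and a verb share no neighbor entries. A real corpus would give much noisier counts, which is one reason a dimensionality reduction step is useful before any classification.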
