Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering
Chris Biemann
University of Leipzig, NLP Department
Augustusplatz 10 11, 04109 Leipzig, Germany
biem@

Abstract

An unsupervised part-of-speech (POS) tagging system that relies on graph clustering methods is described. Unlike in current state-of-the-art approaches, the kind and number of different tags are generated by the method itself. We compute and merge two partitionings of word graphs: one based on context similarity of high-frequency words, another on log-likelihood statistics for words of lower frequencies. Using the resulting word clusters as a lexicon, a Viterbi POS tagger is trained, which is refined by a morphological component. The approach is evaluated on three different languages by measuring agreement with existing taggers.

1 Introduction

Motivation

Assigning syntactic categories to words is an important pre-processing step for most NLP applications. Essentially, two things are needed to construct a tagger: a lexicon that contains tags for words, and a mechanism to assign tags to running words in a text. There are words whose tags depend on their use. Further, we also need to be able to tag previously unseen words. Lexical resources have to offer the possible tags, and our mechanism has to choose the appropriate tag based on the context.
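The division of labour between lexicon and tagging mechanism can be illustrated with a small Viterbi decoder. The following is a minimal sketch under stated assumptions, not the tagger described in this paper: the function name `viterbi_tag`, the tag names, the toy lexicon, and the transition probabilities are all invented for illustration, and the model uses transition probabilities only (no emission probabilities, no morphological component). The lexicon restricts each known word to its listed tags, unseen words may take any tag, and the context decides among the candidates.

```python
import math

def viterbi_tag(words, lexicon, trans, all_tags, smooth=1e-6):
    """Return the most probable tag sequence for `words`.

    lexicon  : dict mapping a word to the set of tags it may take
               (hypothetical toy data; unseen words may take any tag)
    trans    : dict mapping (previous_tag, tag) to a probability;
               "<s>" marks the sentence start
    all_tags : set of every tag, used for words missing from the lexicon
    smooth   : probability assumed for transitions absent from `trans`
    """
    if not words:
        return []

    def logp(prev, tag):
        return math.log(trans.get((prev, tag), smooth))

    # columns[i][tag] = (best log score of a path ending in `tag`, previous tag)
    columns = []
    for i, word in enumerate(words):
        candidates = lexicon.get(word, all_tags)
        column = {}
        for tag in candidates:
            if i == 0:
                column[tag] = (logp("<s>", tag), None)
            else:
                prev_tag, score = max(
                    ((p, s + logp(p, tag)) for p, (s, _) in columns[-1].items()),
                    key=lambda item: item[1],
                )
                column[tag] = (score, prev_tag)
        columns.append(column)

    # Backtrace from the best final tag.
    tag = max(columns[-1], key=lambda t: columns[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = columns[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy data, assumed for illustration only.
ALL_TAGS = {"DET", "NOUN", "VERB", "PRON"}
LEXICON = {
    "the": {"DET"},
    "she": {"PRON"},
    "plants": {"NOUN", "VERB"},  # tag depends on use
    "grow": {"VERB"},
}
TRANS = {
    ("<s>", "DET"): 0.6, ("<s>", "PRON"): 0.4,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.01,
    ("NOUN", "VERB"): 0.7, ("PRON", "VERB"): 0.8,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.3,
}

print(viterbi_tag(["the", "plants", "grow"], LEXICON, TRANS, ALL_TAGS))
print(viterbi_tag(["she", "plants", "the", "plants"], LEXICON, TRANS, ALL_TAGS))
```

In this toy setting, "plants" is resolved as a noun after a determiner and as a verb after a pronoun, and a word absent from the lexicon is assigned whichever tag fits its neighbours best.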
Given a sufficient amount of manually tagged text, several approaches have demonstrated the ability to learn the instance of a tagging mechanism from manually labelled data and apply it successfully to unseen data. Those high-quality resources are typically unavailable for many languages, and their creation is labour-intensive. We will describe an alternative needing much less human intervention. In this work, steps are undertaken to derive a lexicon of syntactic categories from unstructured text without prior linguistic knowledge. We employ two different techniques, one for high- and medium-frequency terms, one for medium- and low-frequency