tailieunhanh - Scientific report: "Minimized Models for Unsupervised Part-of-Speech Tagging"
We describe a novel method for the task of unsupervised POS tagging with a dictionary, one that uses integer programming to explicitly search for the smallest model that explains the data, and then uses EM to set parameter values. We evaluate our method on a standard test corpus using different standard tagsets (a 45-tagset as well as a smaller 17-tagset), and show that our approach performs better than existing state-of-the-art systems in both settings.
Minimized Models for Unsupervised Part-of-Speech Tagging. Sujith Ravi and Kevin Knight, University of Southern California, Information Sciences Institute, Marina del Rey, California 90292. sravi knight @
1 Introduction
In recent years, we have seen increased interest in using unsupervised methods for attacking different NLP tasks like part-of-speech (POS) tagging. The classic Expectation Maximization (EM) algorithm has been shown to perform poorly on POS tagging when compared to other techniques, such as Bayesian methods. In this paper, we develop new methods for unsupervised part-of-speech tagging. We adopt the problem formulation of Merialdo (1994), in which we are given a raw word sequence and a dictionary of legal tags for each word type. The goal is to tag each word token so as to maximize accuracy against a gold tag sequence. Whether this is a realistic problem set-up is arguable, but an interesting collection of methods and results has accumulated around it, and these can be clearly compared with one another.
We use the standard test set for this task, a 24,115-word subset of the Penn Treebank, for which a gold tag sequence is available. There are 5,878 word types in this test set. We use the standard tag dictionary, consisting of 57,388 word/tag pairs derived from the entire Penn Treebank. 8,910 dictionary entries are relevant to the 5,878 word types in the test set. Per-token ambiguity is about [...] tags/token, yielding approximately 10^6425 possible ways to tag the data. There are 45 distinct grammatical tags. In ...
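The quantities in this set-up (per-token ambiguity, the number of possible taggings, and accuracy against a gold sequence) can be illustrated with a minimal sketch. It assumes that the number of possible taggings is the product, over tokens, of the number of legal tags the dictionary allows for each token, and that per-token ambiguity is the mean of those counts. The toy dictionary, the token list, and the function names `ambiguity_stats` and `accuracy` are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy tag dictionary: word type -> set of legal tags.
# The real dictionary has 57,388 word/tag pairs; this one is illustrative only.
tag_dict = {
    "the": {"DT"},
    "can": {"MD", "NN", "VB"},
    "rusts": {"NNS", "VBZ"},
}


def ambiguity_stats(tokens, tag_dict):
    """Return (mean legal tags per token, log10 of the number of taggings).

    Assumes the number of taggings is the product of per-token tag counts,
    so its log10 is the sum of the per-token log10 counts.
    """
    counts = [len(tag_dict[word]) for word in tokens]
    per_token = sum(counts) / len(counts)
    log10_taggings = sum(math.log10(c) for c in counts)
    return per_token, log10_taggings


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


tokens = ["the", "can", "rusts"]
per_token, log10_taggings = ambiguity_stats(tokens, tag_dict)
print(per_token)       # mean number of legal tags per token
print(log10_taggings)  # log10 of the number of candidate tag sequences
print(accuracy(["DT", "MD", "VBZ"], ["DT", "NN", "VBZ"]))
```

Working in log10 keeps the count representable even when it has thousands of digits, as it does for a corpus of this size.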