Scientific report: "EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start)∗"

Yoav Goldberg, Meni Adler and Michael Elhadad
Ben Gurion University of the Negev, Department of Computer Science
POB 653, Be'er Sheva 84105, Israel
{yoavg, adlerm, elhadad}@

Abstract

We address the task of unsupervised POS tagging. We demonstrate that good results can be obtained using the robust EM-HMM learner when provided with good initial conditions, even with incomplete dictionaries. We present a family of algorithms to compute effective initial estimations p(t|w). We test the method on the task of full morphological disambiguation in Hebrew, achieving an error reduction of 25% over a strong uniform distribution baseline. We also test the same method on the standard WSJ unsupervised POS tagging task and obtain results competitive with recent state-of-the-art methods, while using simple and efficient learning methods.

1 Introduction

The task of unsupervised (or semi-supervised) part-of-speech (POS) tagging is the following: given a dictionary mapping words in a language to their possible POS, and large quantities of unlabeled text data, learn to predict the correct part of speech for a given word in context.
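The task setup and the uniform-distribution baseline mentioned above can be illustrated with a minimal sketch. It builds an initial p(t|w) that is uniform over the tags the dictionary allows for each word. The function name, the toy corpus, the toy tagset, and the choice to let out-of-dictionary words take every tag are all assumptions made for illustration; they are not taken from the paper.

```python
def uniform_tag_given_word(corpus, dictionary, tagset):
    """Return an initial p(t|w) for every word type in the corpus.

    The distribution is uniform over the tags the dictionary allows.
    Assumption for this sketch: a word missing from the (possibly
    incomplete) dictionary is allowed every tag in the tagset.
    """
    p = {}
    word_types = {w for sentence in corpus for w in sentence}
    for word in word_types:
        allowed = sorted(dictionary.get(word) or tagset)
        p[word] = {tag: 1.0 / len(allowed) for tag in allowed}
    return p


# Toy data (hypothetical): unlabeled sentences plus an incomplete dictionary.
corpus = [["the", "dog", "barks"], ["the", "bark", "fell"]]
dictionary = {
    "the": {"DET"},
    "dog": {"NOUN"},
    "barks": {"NOUN", "VERB"},
    "bark": {"NOUN", "VERB"},
    # "fell" is deliberately missing from the dictionary.
}
tagset = {"DET", "NOUN", "VERB"}

p_t_given_w = uniform_tag_given_word(corpus, dictionary, tagset)
```

In this toy setup an unambiguous word such as "the" gets all of its probability mass on one tag, an ambiguous word such as "barks" splits it evenly between its two tags, and the unknown word "fell" spreads it evenly over the whole tagset. An EM-HMM learner would start from a table of this kind and re-estimate it from the unlabeled text.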
The only supervision given to the learning process is the dictionary, which in a realistic scenario contains only part of the word types observed in the corpus to be tagged.

∗ This work is supported in part by the Lynn and William Frankel Center for Computer Science.

Unsupervised POS tagging has been traditionally approached with relative success (Merialdo, 1994; Kupiec, 1992) by HMM-based generative models, employing EM parameter estimation using the Baum-Welch algorithm. However, as recently noted by Banko and Moore (2004), these works made use of filtered dictionaries: dictionaries in which only relatively probable analyses of a given word are preserved. This kind of filtering requires serious supervision: in theory, an expert is needed to go over the dictionary elements and filter out unlikely analyses. In