Scientific report: "Weakly Supervised Part-of-Speech Tagging for Morphologically-Rich, Resource-Scarce Languages"
Kazi Saidul Hasan and Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
Richardson, TX 75083-0688
saidul vince @

Abstract

This paper examines unsupervised approaches to part-of-speech (POS) tagging for morphologically-rich, resource-scarce languages, with an emphasis on Goldwater and Griffiths's (2007) fully-Bayesian approach originally developed for English POS tagging. We argue that existing unsupervised POS taggers unrealistically assume as input a perfect POS lexicon, and consequently, we propose a weakly supervised fully-Bayesian approach to POS tagging, which relaxes the unrealistic assumption by automatically acquiring the lexicon from a small amount of POS-tagged data. Since such relaxation comes at the expense of a drop in tagging accuracy, we propose two extensions to the Bayesian framework and demonstrate that they are effective in improving a fully-Bayesian POS tagger for Bengali, our representative morphologically-rich, resource-scarce language.

1 Introduction

Unsupervised POS tagging requires neither manual encoding of tagging heuristics nor the availability of data labeled with POS information. Rather, an unsupervised POS tagger operates by only assuming as input a POS lexicon, which consists of a list of possible POS tags for each word.
As we can see from the partial POS lexicon for English in Figure 1, "the" is unambiguous with respect to POS tagging, since it can only be a determiner (DT), whereas "sting" is ambiguous, since it can be a common noun (NN), a proper noun (NNP), or a verb (VB). In other words, the lexicon imposes constraints on the possible POS tags of each word, and such constraints are then used by an unsupervised tagger to label a new sentence. Conceivably, tagging accuracy decreases with the increase in ambiguity: unambiguous words ...

Word      POS tag(s)
running   NN, JJ
sting     NN, NNP, VB
the       DT

Figure 1: A partial lexicon for English
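
To make the role of the lexicon concrete, the following is a minimal sketch in Python that encodes the three entries of Figure 1 and uses them to restrict the tags a word may receive. The function names (candidate_tags, is_ambiguous, count_tag_sequences), the lowercasing of input words, and the fallback that lets an unseen word take any tag found in the lexicon are illustrative assumptions; they are not taken from the tagger described in this paper.

```python
from math import prod
from typing import Dict, FrozenSet, List

# Partial POS lexicon copied from Figure 1.
LEXICON: Dict[str, FrozenSet[str]] = {
    "running": frozenset({"NN", "JJ"}),
    "sting": frozenset({"NN", "NNP", "VB"}),
    "the": frozenset({"DT"}),
}

# Assumption (not from the paper): a word missing from the lexicon
# may take any tag that appears somewhere in the lexicon.
ALL_TAGS: FrozenSet[str] = frozenset().union(*LEXICON.values())


def candidate_tags(word: str) -> FrozenSet[str]:
    """Return the tags the lexicon allows for `word` (case-insensitive)."""
    return LEXICON.get(word.lower(), ALL_TAGS)


def is_ambiguous(word: str) -> bool:
    """A word is ambiguous if the lexicon allows more than one tag."""
    return len(candidate_tags(word)) > 1


def count_tag_sequences(sentence: List[str]) -> int:
    """Number of tag sequences consistent with the lexicon constraints."""
    return prod(len(candidate_tags(w)) for w in sentence)


if __name__ == "__main__":
    print(sorted(candidate_tags("the")))    # ['DT']
    print(sorted(candidate_tags("sting")))  # ['NN', 'NNP', 'VB']
    print(is_ambiguous("the"))              # False
    print(count_tag_sequences(["the", "running", "sting"]))  # 6
```

The count of consistent tag sequences grows multiplicatively with per-word ambiguity, which gives one simple way to see why a tagger has more room for error as the lexicon becomes less constraining.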