Scientific report: "A global model for joint lemmatization and part-of-speech prediction"

We present a global joint model for lemmatization and part-of-speech prediction. Using only morphological lexicons and unlabeled data, we learn a partially-supervised part-of-speech tagger and a lemmatizer, which are combined using features on a dynamically linked dependency structure of words. We evaluate our model on English, Bulgarian, Czech, and Slovene, and demonstrate substantial improvements over both a direct transduction approach to lemmatization and a pipelined approach, which predicts part-of-speech tags before lemmatization.

Kristina Toutanova, Microsoft Research, Redmond, WA 98052, kristout@
Colin Cherry, Microsoft Research, Redmond, WA 98052, colinc@

1 Introduction

The traditional problem of morphological analysis is, given a word form, to predict the set of all of its possible morphological analyses. A morphological analysis consists of a part-of-speech tag (POS), possibly other morphological features, and a lemma (basic form) corresponding to this tag and feature combination (see Table 1 for examples). We address this problem in the setting where we are given a morphological dictionary for training and can additionally make use of un-annotated text in the language. We present a new machine learning model for this task setting.
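To make the shape of the task concrete, the sketch below represents a morphological analysis as a (tag, lemma) pair and a morphological lexicon as a map from word forms to sets of analyses. It is only an illustration of the input and output types, not the model described in this paper. The names `Analysis`, `LEXICON`, `analyses`, `tag_set`, and `lemma_set` are hypothetical, and the toy entries use Penn Treebank style tags chosen for illustration; they are not taken from Table 1.

```python
from typing import Dict, FrozenSet, NamedTuple, Set


class Analysis(NamedTuple):
    """One morphological analysis of a word form."""
    tag: str    # part-of-speech tag (could also carry other morphological features)
    lemma: str  # basic form corresponding to this tag


# Hypothetical toy lexicon: word form -> set of all its possible analyses.
# The entries are illustrative assumptions, not data from the paper.
LEXICON: Dict[str, FrozenSet[Analysis]] = {
    "saw": frozenset({
        Analysis("VBD", "see"),  # past tense of "see"
        Analysis("NN", "saw"),   # the tool
        Analysis("VB", "saw"),   # the verb "to saw"
    }),
    "dogs": frozenset({
        Analysis("NNS", "dog"),
        Analysis("VBZ", "dog"),
    }),
}


def analyses(word: str) -> FrozenSet[Analysis]:
    """Return every analysis listed for a word form (empty if unlisted)."""
    return LEXICON.get(word, frozenset())


def tag_set(word: str) -> Set[str]:
    """Project the analyses of a word onto its set of possible tags."""
    return {a.tag for a in analyses(word)}


def lemma_set(word: str) -> Set[str]:
    """Project the analyses of a word onto its set of possible lemmas."""
    return {a.lemma for a in analyses(word)}
```

A lookup like this only covers word forms already listed in the dictionary. The learning problem is to predict the same kind of set for word forms that are not listed, using the dictionary as training data.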
In addition to the morphological analysis task, we are interested in performance on two subtasks: tag-set prediction (predicting the set of possible tags of words) and lemmatization (predicting the set of possible lemmas). The result of these subtasks is directly useful for some applications [1]. If we are interested in the results of each of these two subtasks in isolation, we might

[1] Tag sets are useful, for example, as a basis of sparsity-reducing features for text labeling tasks; lemmatization is useful for information retrieval and for machine translation from a morphologically rich to a morphologically poor language, where full analysis may not be important.
