Scientific report: "Tagset Reduction Without Information Loss"

A technique for reducing a tagset used for n-gram part-of-speech disambiguation is introduced and evaluated in an experiment. The technique ensures that all information that is provided by the original tagset can be restored from the reduced one. This is crucial, since we are interested in the linguistically motivated tags for part-of-speech disambiguation. The reduced tagset needs fewer parameters for its statistical model and allows more accurate parameter estimation. Additionally, there is a slight but not significant improvement of tagging accuracy.

Tagset Reduction Without Information Loss
Thorsten Brants
Universität des Saarlandes, Computerlinguistik
D-66041 Saarbrücken, Germany
thorsten

1 Motivation

Statistical part-of-speech disambiguation can be efficiently done with n-gram models (Church, 1988; Cutting et al., 1992). These models are equivalent to Hidden Markov Models (HMMs) (Rabiner, 1989) of order n - 1. The states represent parts of speech (categories, tags), there is exactly one state for each category, and each state outputs words of a particular category. The transition and output probabilities of the HMM are derived from smoothed frequency counts in a text corpus.

Generally, the categories for part-of-speech tagging are linguistically motivated and do not reflect the probability distributions or co-occurrence probabilities of words belonging to that category.
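To make the estimation of transition and output probabilities from smoothed frequency counts concrete, here is a minimal sketch for the bigram case (n = 2, an HMM of order 1). The toy corpus, the sentence-start pseudo-tag `START`, the function names, and the choice of add-one smoothing are all assumptions made for illustration; the text only says that the counts are smoothed, not which method is used.

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs; tag names follow the LOB style.
corpus = [
    [("the", "AT"), ("steep", "JJ"), ("cliff", "NN")],
    [("the", "AT"), ("cliff", "NN")],
    [("a", "AT"), ("steep", "JJ"), ("hill", "NN")],
]

START = "<s>"  # assumed pseudo-tag marking the start of a sentence

transition_counts = Counter()  # (previous tag, tag) -> frequency
context_counts = Counter()     # previous tag -> frequency
output_counts = Counter()      # (tag, word) -> frequency
tag_counts = Counter()         # tag -> frequency

for sentence in corpus:
    prev = START
    for word, tag in sentence:
        transition_counts[(prev, tag)] += 1
        context_counts[prev] += 1
        output_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

tags = sorted(tag_counts)
vocab = sorted({word for sentence in corpus for word, _ in sentence})


def p_transition(tag, prev):
    """Add-one smoothed estimate of p(tag | prev)."""
    return (transition_counts[(prev, tag)] + 1) / (context_counts[prev] + len(tags))


def p_output(word, tag):
    """Add-one smoothed estimate of p(word | tag)."""
    return (output_counts[(tag, word)] + 1) / (tag_counts[tag] + len(vocab))


print(p_transition("NN", "JJ"))  # transition probability JJ -> NN
print(p_output("cliff", "NN"))   # output probability of "cliff" in state NN
```

The number of transition parameters grows with the number of tags raised to the power n, which is why a smaller tagset leaves more observations per parameter.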
It is an implicit assumption for statistical part-of-speech tagging that words belonging to the same category have similar probability distributions. But this assumption does not hold in many cases. Take for example the word cliff, which could be a proper noun (NP)¹ or a common noun (NN), ignoring capitalization of proper nouns for the moment. The two previous words are a determiner (AT) and an adjective (JJ). The probability of cliff being a common noun is the product of the respective .

¹ All tag names used in this paper are inspired by those used for the LOB Corpus (Garside et al., 1987).
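In a standard trigram HMM tagger, the score of a tag for a word is the product of a contextual probability (the tag given the two previous tags) and a lexical probability (the word given the tag). Assuming this is the product meant in the cliff example, the comparison between NN and NP can be sketched as follows. All probability values and the names `contextual`, `lexical`, and `tag_score` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical probability tables; the numbers are made up for illustration.
contextual = {  # p(tag | two previous tags)
    (("AT", "JJ"), "NN"): 0.60,
    (("AT", "JJ"), "NP"): 0.05,
}
lexical = {  # p(word | tag)
    ("cliff", "NN"): 0.0004,
    ("cliff", "NP"): 0.0010,
}


def tag_score(word, tag, history):
    """Contextual probability times lexical probability for one candidate tag."""
    return contextual[(history, tag)] * lexical[(word, tag)]


history = ("AT", "JJ")  # determiner followed by adjective
scores = {tag: tag_score("cliff", tag, history) for tag in ("NN", "NP")}
best = max(scores, key=scores.get)
print(scores, best)
```

With these made-up numbers the context outweighs the lexical preference, so the common-noun reading scores higher even though cliff is assumed to be relatively more likely as an output of NP.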