Scientific report: "Mistake-Driven Mixture of Hierarchical Tag Context Trees"

Masahiko Haruno
NTT Communication Science Laboratories
1-1 Hikari-No-Oka, Yokosuka-Shi, Kanagawa 239, Japan

Yuji Matsumoto
NAIST
8916-5 Takayama-cho, Ikoma-Shi, Nara 630-01, Japan

Abstract

This paper proposes a mistake-driven mixture method for learning a tag model. The method iteratively performs two procedures: (1) constructing a tag model based on the current data distribution, and (2) updating the distribution by focusing on data that are not well predicted by the constructed model. The final tag model is constructed by mixing all the models according to their performance. To reflect the data distribution well, we represent each tag model as a hierarchical tag (i.e. NTT¹ < proper noun < noun) context tree. By using the hierarchical tag context tree, the constituents of sequential tag models gradually change from broad-coverage tags (e.g. noun) to specific exceptional words that cannot be captured by general tags. In other words, the method incorporates not only frequent connections but also infrequent ones that are often considered to be collocational. We evaluate several tag models by implementing Japanese part-of-speech taggers that share all conditions (i.e. dictionary and word model) other than their tag models. The experimental results show that the proposed method significantly outperforms both hand-crafted and conventional statistical methods.

¹ NTT is an abbreviation of Nippon Telegraph and Telephone Corporation.

1 Introduction

The last few years have seen the great success of stochastic part-of-speech (POS) taggers (Church, 1988; Kupiec, 1992; Charniak et al., 1993; Brill, 1992; Nagata, 1994). The stochastic approach generally attains 94 to 96% accuracy and replaces the labor-intensive compilation of linguistic rules with an automated learning algorithm. However, practical systems require more accuracy because POS tagging is an inevitable pre-processing step for all practical
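
The two-step loop described in the abstract (build a model on the current data distribution, then shift the distribution toward badly predicted data, and finally mix the models by performance) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: this excerpt does not give the update formulas, so the sketch borrows an AdaBoost-style reweighting and vote, and it replaces the hierarchical tag context tree with a flat table from a context (for example the previous tag) to a tag. The names `train_context_model`, `mistake_driven_mixture`, and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def train_context_model(examples, weights):
    """Build a flat table: context -> tag with the largest total weight.

    Stand-in for the paper's hierarchical tag context tree (assumption).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for (context, tag), w in zip(examples, weights):
        totals[context][tag] += w
    return {ctx: max(tags, key=tags.get) for ctx, tags in totals.items()}


def mistake_driven_mixture(examples, rounds=5):
    """Return a list of (model, vote_weight) pairs.

    The reweighting rule is AdaBoost-style, assumed here for illustration.
    """
    n = len(examples)
    weights = [1.0 / n] * n  # start from the uniform data distribution
    mixture = []
    for _ in range(rounds):
        # Step 1: construct a model on the current distribution.
        model = train_context_model(examples, weights)
        mistakes = [model.get(ctx) != tag for ctx, tag in examples]
        error = sum(w for w, m in zip(weights, mistakes) if m)
        if error == 0:
            mixture.append((model, 1.0))  # perfect model, nothing to refocus on
            break
        if error >= 0.5:
            break  # model too weak to be useful in the vote
        beta = error / (1.0 - error)
        mixture.append((model, math.log(1.0 / beta)))
        # Step 2: shrink the weight of well-predicted examples so the
        # next model focuses on the current mistakes, then renormalize.
        weights = [w if m else w * beta for w, m in zip(weights, mistakes)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return mixture


def predict(mixture, context):
    """Mix all models: each votes for its tag with its performance weight."""
    votes = defaultdict(float)
    for model, vote_weight in mixture:
        if context in model:
            votes[model[context]] += vote_weight
    return max(votes, key=votes.get) if votes else None


# Toy data: (previous tag, current tag) pairs with one infrequent connection.
examples = [("DET", "NOUN")] * 3 + [("DET", "ADJ")] + [("NOUN", "VERB")] * 2
mixture = mistake_driven_mixture(examples, rounds=5)
print(predict(mixture, "DET"))
```

In this toy run the first model covers the frequent connection, while the second one, trained after reweighting, picks up the infrequent one. That is the behaviour the abstract describes, although the real method works on tag context trees rather than a flat table.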
