Scientific report: "HunPos – an open source trigram tagger"

Peter Halacsy, Budapest U. of Technology, MOKK Media Research, H-1111 Budapest, Stoczek u 2, peter@
Andras Kornai, MetaCarta Inc., 350 Massachusetts Ave., Cambridge MA 02139, andras@
Csaba Oravecz, Hungarian Academy of Sciences, Institute of Linguistics, H-1068 Budapest, Benczur u. 33., oravecz@

Abstract

In the world of non-proprietary NLP software the standard, and perhaps the best, HMM-based POS tagger is TnT (Brants, 2000). We argue here that some of the criticism aimed at HMM performance on languages with rich morphology should more properly be directed at TnT's peculiar license, free but not open source, since it is those details of the implementation which are hidden from the user that hold the key for improved POS tagging across a wider variety of languages. We present HunPos¹, a free and open source (LGPL-licensed) alternative, which can be tuned by the user to fully utilize the potential of HMM architectures, offering performance comparable to more complex models, but preserving the ease and speed of the training and tagging process.

0 Introduction

Even without a formal survey it is clear that TnT (Brants, 2000) is used widely in research labs throughout the world: Google Scholar shows over 400 citations. For research purposes TnT is freely available, but only in executable form (closed source).
Its greatest advantage is its speed, important both for a fast tuning cycle and when dealing with large corpora, especially when the POS tagger is but one component in a larger information retrieval, information extraction, or question answering system. Though taggers based on dependency networks (Toutanova et al., 2003), SVM (Gimenez and Marquez, 2003), MaxEnt (Ratnaparkhi, 1996), CRF (Smith et al., 2005), and other methods may reach slightly better results, their train/test cycle is orders of magnitude longer. A ubiquitous problem in HMM tagging originates from the standard way of calculating lexical .

¹ http resources hunpos