Scientific report: "Subword-based Tagging for Confidence-dependent Chinese Word Segmentation"

Ruiqiang Zhang1,2 and Genichiro Kikui and Eiichiro Sumita1,2
1 National Institute of Information and Communications Technology
2 ATR Spoken Language Communication Research Laboratories
2-2-2 Hikaridai, Seiika-cho, Soraku-gun, Kyoto 619-0288, Japan
(Now the second author is affiliated with NTT.)

Abstract

We proposed a subword-based tagging for Chinese word segmentation to improve the existing character-based tagging. The subword-based tagging was implemented using the maximum entropy (MaxEnt) and the conditional random fields (CRF) methods. We found that the proposed subword-based tagging outperformed the character-based tagging in all comparative experiments. In addition, we proposed a confidence measure approach to combine the results of a dictionary-based and a subword-tagging-based segmentation. This approach can produce an ideal tradeoff between the in-vocabulary rate and the out-of-vocabulary rate. Our techniques were evaluated using the test data from Sighan Bakeoff 2005. We achieved higher F-scores than the best results in three of the four corpora: PKU, CITYU and MSR.

1 Introduction

Many approaches to Chinese word segmentation have been proposed in the past decades. Segmentation performance has improved significantly, from the earliest maximal match (dictionary-based) approaches to HMM-based approaches (Zhang et al., 2003) and recent state-of-the-art machine learning approaches such as maximum entropy (MaxEnt) (Xue and Shen, 2003), support vector machine (SVM) (Kudo and Matsumoto, 2001), conditional random fields (CRF) (Peng and McCallum, 2004), and minimum error rate training (Gao et al., 2004). By analyzing the top results in the first and second Bakeoffs (Sproat and Emerson, 2003; Emerson, 2005), we found that the top results were produced by direct or indirect use of so-called IOB tagging, which converts the problem of word segmentation into one of character tagging so ...
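To make the idea of treating segmentation as character tagging concrete, here is a minimal Python sketch. The excerpt does not define the tag set, so the three-tag scheme used here is an assumption: B marks the first character of a multi-character word, I marks any later character of that word, and O marks a single-character word. The function names `words_to_iob` and `iob_to_words` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def words_to_iob(words):
    """Convert a segmented sentence (list of words) into (character, tag) pairs.

    Assumed tag scheme (not defined in the excerpt):
      B = first character of a multi-character word
      I = non-initial character of a multi-character word
      O = character that forms a word on its own
    """
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "O"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "I") for ch in word[1:])
    return tagged


def iob_to_words(tagged):
    """Rebuild the word sequence from (character, tag) pairs."""
    words = []
    for ch, tag in tagged:
        if tag == "I" and words:
            # Continue the word opened by the preceding B (or I) tag.
            words[-1] += ch
        else:
            # B or O starts a new word; a stray leading I is also
            # treated as a word start so that no character is lost.
            words.append(ch)
    return words


if __name__ == "__main__":
    sentence = ["北京市", "是", "首都"]
    pairs = words_to_iob(sentence)
    print(pairs)
    print(iob_to_words(pairs))
```

Under this scheme a tagger (for example a MaxEnt or CRF model) only has to predict one label per character, and the word boundaries follow from the predicted labels. A subword-based variant would apply the same conversion to units longer than one character.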
