tailieunhanh - Scientific report: "Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data"

Sun Maosong, Shen Dayang, Benjamin K Tsou

State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing, China
Email: lkc-dcs@
Computer Science Institute, Shantou University, Guangdong, China
Language Information Sciences Research Centre, City University of Hong Kong, Hong Kong

This work was supported in part by the National Natural Science Foundation of China under grant No. 69433010.

Abstract

Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without using any lexicon or hand-crafted linguistic resource. The statistical data required by the algorithm, namely the mutual information and the difference of t-score between characters, is derived automatically from raw Chinese corpora. A preliminary experiment shows that the segmentation accuracy of the algorithm is acceptable. We hope that the gains from this approach will help improve the performance of existing segmenters, especially their ability to cope with unknown words and to adapt to various domains, though the algorithm itself can also be used as a stand-alone segmenter in some NLP applications.

1. Introduction

Any Chinese word is composed of one or more characters. Chinese texts are simply concatenations of characters: words are not delimited by spaces as they are in English. Chinese word segmentation is therefore the first step for any Chinese information processing system [1].

Almost all methods for Chinese word segmentation developed so far, both statistical and rule-based, exploit two kinds of important resources: lexicon and hand-crafted linguistic resources (manually segmented and tagged corpus, knowledge for unknown words, and linguistic rules) [1, 2, 3, 5, 6, 8, 9, 10]. Lexicon is usually used as the means for finding segmentation candidates for input sentences, while linguistic
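
The abstract names mutual information between characters as one of the statistics derived automatically from raw corpora. The sketch below illustrates how such a statistic could be collected, assuming the standard pointwise definition mi(x:y) = log2( p(x,y) / (p(x) p(y)) ) estimated from relative frequencies. The function names, the thresholding rule in `segment`, and its default threshold are hypothetical choices made for illustration; they are not the algorithm of this paper, which also relies on the difference of t-score.

```python
import math
from collections import Counter


def adjacent_mutual_information(corpus):
    """Estimate mutual information for every adjacent character pair.

    `corpus` is an iterable of raw, unsegmented strings.
    Assumption: probabilities are plain relative frequencies, no smoothing.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for line in corpus:
        char_counts.update(line)
        # Pairs are counted within a line only, never across line boundaries.
        pair_counts.update(zip(line, line[1:]))

    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    mi = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = char_counts[x] / n_chars
        p_y = char_counts[y] / n_chars
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi


def segment(sentence, mi, threshold=0.0):
    """Toy segmenter (hypothetical rule): keep two adjacent characters in the
    same word when their mutual information reaches `threshold`, else split.
    Unseen pairs are always split.
    """
    if not sentence:
        return []
    words = [sentence[0]]
    for x, y in zip(sentence, sentence[1:]):
        if mi.get((x, y), float("-inf")) >= threshold:
            words[-1] += y
        else:
            words.append(y)
    return words
```

In use, `adjacent_mutual_information` would be run once over a large raw corpus and the resulting table passed to `segment` for each input sentence. A single fixed threshold is a crude decision rule; it is shown only to make the role of the statistic concrete.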
