tailieunhanh - Scientific report: "Adaptive Chinese Word Segmentation"

Adaptive Chinese Word Segmentation

Jianfeng Gao, Andi Wu, Mu Li, Chang-Ning Huang, Hongqiao Li, Xinsong Xia, Haowei Qin

Microsoft Research. jfgao, andiwu, muli, cnhuang @
Beijing Institute of Technology, Beijing. lhqtxm@
Peking University, Beijing. xia_xinsong@
Shanghai Jiaotong University, Shanghai. haoweiqin@

Abstract

This paper presents a Chinese word segmentation system which can adapt to different domains and standards. We first present a statistical framework where domain-specific words are identified in a unified approach to word segmentation based on linear models. We explore several features and describe how to create training data by sampling. We then describe a transformation-based learning method used to adapt our system to different word segmentation standards. Evaluation of the proposed system on five test sets with different standards shows that the system achieves state-of-the-art performance on all of them.

1 Introduction

Chinese word segmentation has been a longstanding research topic in Chinese language processing. Recent development in this field shows that, in addition to ambiguity resolution and unknown word detection, the usefulness of a Chinese word segmenter also depends crucially on its ability to adapt to different domains of texts and different segmentation standards.

The need for adaptation involves two research issues that we will address in this paper. The first is new word detection. Different domains or applications may have different vocabularies which contain new words or terms that are not available in a general dictionary. In this paper, new words refer to OOV words other than named entities, factoids, and morphologically derived words. These words are mostly domain-specific terms (e.g., "cellular") and time-sensitive political, social, or cultural terms (e.g., "Three Links", "SARS").

The second issue concerns the customizable display of word segmentation. Different Chinese NLP-enabled applications may have ... For example, speech recognition systems prefer “longer words” to achieve higher accuracy, whereas information retrieval systems prefer “shorter words” to obtain higher recall rates, etc. (Wu, 2003).
