Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification

Chu-Ren Huang and Petr Simon, Institute of Linguistics, Academia Sinica, Taiwan (churen@, sim@)
Shu-Kai Hsieh, DoFLAL, NIU, Taiwan (shukai@)
Laurent Prevot, CLLE-ERSS, CNRS, Universite de Toulouse, France (prevot@)

Abstract

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CBs) into either word-boundaries (WBs) or non-word-boundaries. In Chinese, CBs are delimited and distributed in between two characters. Hence we can use the distributional properties of CBs among the background character strings to predict which CBs are WBs.

1 Introduction: modeling and theoretical challenges

The fact that word segmentation remains a main research topic in the field of Chinese language processing indicates that there may be unresolved theoretical and processing issues. In terms of processing, the fact is that none of the existing algorithms is robust enough to reliably segment unfamiliar types of texts before fine-tuning with massive training data. It is true that the performance of participating teams has steadily improved since the first SigHAN Chinese segmentation bakeoff (Sproat and Emerson, 2004). Bakeoff 3 in 2006 produced best F-scores of 95% and higher. However, these can only be achieved after training with the pre-segmented training dataset. This is still very far from real-world applications.
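To make the character-boundary view concrete, the following is a minimal sketch of one way a CB classifier could be built from distributional statistics over raw, unsegmented text. It is not the authors' model: the choice of pointwise mutual information as the statistic, the threshold value, and all function names are illustrative assumptions, used only to show how CBs can be labelled as WBs without any lexicon.

```python
# Minimal sketch (not the authors' model): classify each character boundary (CB)
# as a word boundary (WB) using a distributional statistic computed from raw,
# unsegmented text. Here the statistic is pointwise mutual information (PMI) of
# the two characters flanking the boundary; a weak association suggests a WB.
import math
from collections import Counter

def train_counts(corpus):
    """Collect character unigram and bigram counts from raw (unsegmented) text."""
    unigrams = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    return unigrams, bigrams

def pmi(left, right, unigrams, bigrams, total):
    """PMI of the character pair flanking a CB; lower values suggest a WB."""
    p_pair = bigrams.get(left + right, 0) / max(total - 1, 1)
    p_left = unigrams.get(left, 0) / total
    p_right = unigrams.get(right, 0) / total
    if p_pair == 0 or p_left == 0 or p_right == 0:
        return float("-inf")  # unseen pair: treat the bond as very weak
    return math.log(p_pair / (p_left * p_right))

def segment(sentence, unigrams, bigrams, total, threshold=0.0):
    """Insert a word break at every CB whose flanking-pair PMI is below threshold."""
    words, current = [], sentence[0]
    for i in range(1, len(sentence)):
        if pmi(sentence[i - 1], sentence[i], unigrams, bigrams, total) < threshold:
            words.append(current)
            current = sentence[i]
        else:
            current += sentence[i]
    words.append(current)
    return words

# Usage (hypothetical file name): train on any large raw character corpus,
# then segment new sentences character-boundary by character-boundary.
# corpus = open("raw_chinese.txt", encoding="utf-8").read()
# uni, bi = train_counts(corpus)
# print(segment("這是一個例子", uni, bi, sum(uni.values())))
```

The point of the sketch is only that every decision is made at a CB, using statistics of the surrounding character strings rather than a word list, which is the framing the abstract proposes.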
