Chinese Unknown Word Identification Using Character-based Tagging and Chunking

GOH Chooi Ling, Masayuki ASAHARA, Yuji MATSUMOTO
Graduate School of Information Science, Nara Institute of Science and Technology
ling-g masayu-a matsu @

Abstract

Since written Chinese has no spaces to delimit words, segmenting Chinese text is an essential task. During this task, the problem of unknown words arises. It is impossible to register all words in a dictionary, as new words can always be created by combining characters. We propose a unified solution for detecting unknown words in Chinese texts. First, a morphological analysis is performed to obtain an initial segmentation and POS tags; then a chunker is used to detect unknown words.

1 Introduction

Like many other Asian languages (Thai, Japanese, etc.), written Chinese does not delimit words with spaces, and there is no clue to tell where the word boundaries are. Therefore, it is usually necessary to segment Chinese texts prior to further processing. Previous research has addressed segmentation; however, the results are not quite satisfactory when unknown words occur in the texts. An unknown word is defined as a word that is not found in the dictionary. As in any other language, all possibilities of derivational morphology cannot be foreseen in the form of a dictionary with a fixed number of entries. Therefore, proper solutions are necessary for the detection of unknown words.
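To make the dictionary-based definition of an unknown word concrete, the following is a minimal sketch, not the method of this paper. It assumes a greedy forward maximum-matching segmenter as a simple stand-in for the morphological analysis mentioned in the abstract, and a hypothetical toy dictionary (`TOY_DICT`). The function names `max_match` and `unknown_tokens` are illustrative only. It shows how a word missing from the dictionary tends to be broken into single characters during the initial segmentation.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (a stand-in segmenter, assumed here).

    At each position the longest dictionary entry is taken; if nothing
    matches, a single character is emitted as its own token.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


def unknown_tokens(tokens, dictionary):
    """Return the tokens that are not registered in the dictionary."""
    return [t for t in tokens if t not in dictionary]


# Hypothetical toy dictionary; "奈良" (Nara) is deliberately left out.
TOY_DICT = {"我", "喜欢", "北京", "大学"}

segmented = max_match("我喜欢奈良", TOY_DICT)
fragments = unknown_tokens(segmented, TOY_DICT)
print(segmented)
print(fragments)
```

In this toy case the out-of-dictionary word is split into two single-character tokens. Detecting an unknown word then amounts to deciding which such fragments belong together, which is the role the abstract assigns to the chunker.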
In traditional methods, unknown word detection has been done using rules for guessing their location. This can ensure high precision for the detection of unknown words, but unfortunately the recall is not quite satisfactory. This is mainly because new patterns can always be created in Chinese, so the rules can hardly be maintained efficiently by hand. Since the introduction of statistical techniques in NLP, research has been done on Chinese unknown word detection using such techniques, and the results showed that statistical based model .