Scientific report: "A Hybrid Approach to Word Segmentation and POS Tagging"
Tetsuji Nakagawa
Oki Electric Industry Co., Ltd.
2-5-7 Honmachi, Chuo-ku, Osaka 541-0053, Japan
nakagawa378@

Abstract

In this paper, we present a hybrid method for word segmentation and POS tagging. The target languages are those in which word boundaries are ambiguous, such as Chinese and Japanese. In the method, word-based and character-based processing is combined, and word segmentation and POS tagging are conducted simultaneously. Experimental results on multiple corpora show that the integrated method has high accuracy.

1 Introduction

Part-of-speech (POS) tagging is an important task in natural language processing, and is often necessary for other processing such as syntactic parsing. English POS tagging can be handled as a sequential labeling problem, and has been extensively studied. However, in Chinese and Japanese, words are not separated by spaces, and word boundaries must be identified before or during POS tagging. Therefore, POS tagging cannot be conducted without word segmentation, and how to combine these two processes is an important issue.

A major problem in word segmentation and POS tagging is the existence of unknown words. Unknown words are defined as words that are not in the system's word dictionary. It is difficult to determine the word boundaries and the POS tags of unknown words, and unknown words often cause errors in both processes.

In this paper, we study a hybrid method for Chinese and Japanese word segmentation and POS tagging, in which word-based and character-based processing is combined, and word segmentation and POS tagging are conducted simultaneously. In the method, word-based processing is used to handle known words, and character-based processing is used to handle unknown words. Furthermore, information about word boundaries and POS tags is used at the same time in this method. The following sections describe the hybrid method and the results of experiments on Chinese and Japanese corpora.
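
The division of labor described above (dictionary lookup for known words, character-level nodes for unknown words, one search over both) can be illustrated with a small sketch. This is not the paper's model: the toy dictionary, the `UNK` tag, the names `build_lattice` and `segment_and_tag`, and the fixed costs `word_cost` and `char_cost` are all assumptions made for illustration, standing in for whatever scoring the method actually uses.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, int, str, str]  # (start, end, surface, tag)
UNK = "UNK"  # hypothetical tag for characters not covered by the dictionary

# Hypothetical toy dictionary: surface form -> possible POS tags.
DICTIONARY: Dict[str, List[str]] = {
    "東京": ["noun"],
    "京都": ["noun"],
    "都": ["noun"],
    "に": ["particle"],
    "住む": ["verb"],
}


def build_lattice(sentence: str, dictionary: Dict[str, List[str]],
                  max_word_len: int = 4) -> List[Node]:
    """Collect word-based and character-based candidates in one lattice."""
    nodes: List[Node] = []
    n = len(sentence)
    for start in range(n):
        # Word-based nodes: every dictionary entry that starts here.
        for end in range(start + 1, min(n, start + max_word_len) + 1):
            surface = sentence[start:end]
            for pos in dictionary.get(surface, []):
                nodes.append((start, end, surface, pos))
        # Character-based node: one character of a possible unknown word.
        nodes.append((start, start + 1, sentence[start], UNK))
    return nodes


def segment_and_tag(sentence: str, dictionary: Dict[str, List[str]],
                    word_cost: float = 1.0,
                    char_cost: float = 2.5) -> List[Tuple[str, str]]:
    """Find the cheapest path, so boundaries and tags are chosen together."""
    n = len(sentence)
    best = [float("inf")] * (n + 1)
    back: List[Optional[Node]] = [None] * (n + 1)
    best[0] = 0.0
    # Nodes are generated in order of start position, so best[start] is
    # already final when a node starting there is processed.
    for node in build_lattice(sentence, dictionary):
        start, end, _, tag = node
        cost = best[start] + (char_cost if tag == UNK else word_cost)
        if cost < best[end]:
            best[end] = cost
            back[end] = node
    # Trace the best path back from the end of the sentence.
    path: List[Node] = []
    pos = n
    while pos > 0:
        node = back[pos]
        path.append(node)
        pos = node[0]
    path.reverse()
    # Merge adjacent character nodes into a single unknown word.
    result: List[Tuple[str, str]] = []
    for _, _, surface, tag in path:
        if tag == UNK and result and result[-1][1] == UNK:
            result[-1] = (result[-1][0] + surface, UNK)
        else:
            result.append((surface, tag))
    return result
```

With this toy dictionary, `segment_and_tag("パリに住む", DICTIONARY)` groups the two uncovered characters into one unknown word while the remaining words come from the dictionary. The constant costs only make the search concrete; a real system would replace them with scores learned from a corpus.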