tailieunhanh - Scientific report: "Use of Mutual Information Based Character Clusters in Dictionary-less Morphological Analysis of Japanese"

Hideki Kashioka, Yasuhiro Kawata, Yumiko Kinjo, Andrew Finch and Ezra W. Black
{kashioka, ykawata, kinjo, finch, black}@
ATR Interpreting Telecommunications Research Laboratories

Abstract

For languages whose character set is very large and whose orthography does not require spacing between words, such as Japanese, tokenizing and part-of-speech tagging are often the difficult parts of any morphological analysis. For practical systems to tackle this problem, uncontrolled heuristics are primarily used. The use of information on character sorts, however, mitigates this difficulty. This paper presents our method of incorporating character clustering based on mutual information into Decision-Tree Dictionary-less morphological analysis. By using natural classes, we have confirmed that our morphological analyzer has been significantly improved in both tokenizing and tagging Japanese text.

1 Introduction

Recent papers have reported cases of successful part-of-speech tagging with statistical language modeling techniques (Church, 1988; Cutting et al., 1992; Charniak et al., 1993; Brill, 1994; Nagata, 1994; Yamamoto, 1996).
Morphological analysis of Japanese, however, is more complex because, unlike in European languages, no spaces are inserted between words. In fact, even native Japanese speakers place word boundaries inconsistently. Consequently, individual researchers have been adopting different word boundaries and tag sets based on their own theory-internal justifications. For a practical system to utilize the different word boundaries and tag sets according to the demands of an application, it is necessary to coordinate the dictionary used, the tag sets, and numerous other parameters. Unfortunately, such a task is costly. Furthermore, it is difficult to maintain the accuracy needed to regulate the word boundaries. Also, depending on the purpose, new technical terminology may have to be collected.
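
To make the abstract's notion of mutual information over character classes more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names `script_sort` and `average_mutual_information`, the coarse four-way script classification, and the sample string are all assumptions introduced here for illustration. The sketch measures, in bits, how much the class of one character tells us about the class of the next one under a given character-to-class mapping.

```python
import math
from collections import Counter


def script_sort(ch):
    """Hypothetical coarse 'character sort' based on Unicode block ranges."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def average_mutual_information(text, char_class):
    """Average mutual information (bits) between classes of adjacent characters.

    `char_class` maps a single character to a class label (an assumption of
    this sketch; any hashable label works).
    """
    classes = [char_class(ch) for ch in text]
    pairs = Counter(zip(classes, classes[1:]))  # adjacent class bigrams
    total = sum(pairs.values())
    if total == 0:
        return 0.0

    # Marginal counts for the left and right positions of each bigram.
    left, right = Counter(), Counter()
    for (a, b), n in pairs.items():
        left[a] += n
        right[b] += n

    ami = 0.0
    for (a, b), n in pairs.items():
        p_ab = n / total
        # p(a,b) / (p(a) * p(b)) simplifies to n * total / (left[a] * right[b]).
        ami += p_ab * math.log2(n * total / (left[a] * right[b]))
    return ami


if __name__ == "__main__":
    sample = "日本語のテキストを解析する"  # illustrative string, not from the paper
    print(average_mutual_information(sample, script_sort))
    print(average_mutual_information(sample, lambda ch: ch))
```

A clustering procedure in this spirit would search for a character-to-class mapping that keeps this quantity high while using few classes. How the paper itself builds its clusters is not described in this section, so the sketch stops at the measurement step.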