Scientific report: "A Part of Speech Estimation Method for Japanese Unknown Words using a Statistical Model of Morphology and Context"


Masaaki NAGATA
NTT Cyber Space Laboratories
1-1 Hikari-no-oka, Yokosuka-Shi, Kanagawa, 239-0847 Japan
nagata@nttnly.isl.ntt.co.jp

Abstract

We present a statistical model of Japanese unknown words consisting of a set of length and spelling models classified by the character types that constitute a word. The point is quite simple: different character sets should be treated differently, and the changes between character types are very important, because Japanese script has both ideograms like Chinese (kanji) and phonograms like English (katakana). Both word segmentation accuracy and part of speech tagging accuracy are improved by the proposed model. The model can achieve 96.6% tagging accuracy if unknown words are correctly segmented.

1 Introduction

In Japanese, around 95% word segmentation accuracy is reported by using a word-based language model and Viterbi-like dynamic programming procedures (Nagata, 1994; Yamamoto, 1996; Takeuchi and Matsumoto, 1997; Haruno and Matsumoto, 1997). About the same accuracy is reported in Chinese by statistical methods (Sproat et al., 1996). But there has been relatively little improvement in recent years because most of the remaining errors are due to unknown words.

There are two approaches to solving this problem: to increase the coverage of the dictionary (Fung and Wu, 1994; Chang et al., 1995; Mori and Nagao, 1996) and to design a better model for unknown words (Nagata, 1996; Sproat et al., 1996). We take the latter approach. To improve word segmentation accuracy, Nagata (1996) used a single general purpose unknown word model, while Sproat et al. (1996) used a set of specific word models, such as for plurals, personal names, and transliterated foreign words.

The goal of our research is to assign a correct part of speech to an unknown word as well as to identify it correctly. In this paper, we present a novel statistical model for Japanese unknown words. It ...
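
The abstract's central observation, that character types and the changes between them carry information, can be illustrated with a minimal sketch. It is not the paper's model: it only labels each character with a coarse type and groups a string into runs of the same type, so that type changes become visible. The function names `char_type` and `type_runs`, the label set, and the use of Unicode code point ranges to decide the type are all assumptions made for this illustration.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Return a coarse character type label for one character.

    Assumption: types are decided by Unicode code point ranges
    (CJK Unified Ideographs, Hiragana, Katakana); the paper's own
    character classes may differ.
    """
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if 0x3041 <= code <= 0x3096:
        return "hiragana"
    # U+30FC is the prolonged sound mark used inside katakana words.
    if 0x30A1 <= code <= 0x30FA or code == 0x30FC:
        return "katakana"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isdigit():
        return "digit"
    return "symbol"


def type_runs(text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of characters sharing one type."""
    return [
        (label, "".join(chars))
        for label, chars in groupby(text, key=char_type)
    ]


if __name__ == "__main__":
    # Each boundary between runs is a character type change.
    for label, run in type_runs("東京タワーに行く"):
        print(label, run)
```

Run boundaries of this kind are only a surface cue. A statistical model such as the one described above would still need length and spelling distributions estimated from data for each character type.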