Scientific report: "Language Model Based Arabic Word Segmentation"

Language Model Based Arabic Word Segmentation

Young-Suk Lee, Kishore Papineni, Salim Roukos
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598

Ossama Emam, Hany Hassan
IBM Cairo Technology Development Center, 166 El-Ahram, Giza, Egypt

Abstract

We approximate Arabic's rich morphology by a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm to build the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance, and that the algorithm can be used for many highly inflected languages provided that one can create a small manually segmented corpus of the language of interest.
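The core idea of the abstract, enumerating prefix*-stem-suffix* splits of a word and choosing the one a trigram morpheme model scores highest, can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the Latin-transliterated affix tables `PREFIXES` and `SUFFIXES`, the four-entry `TOY_CORPUS`, the `#` and `+` affix markers, the per-word (rather than per-sentence) scoring, and the add-k smoothing.

```python
import math
from collections import Counter

# Hypothetical toy affix tables (Latin transliteration); not the paper's tables.
PREFIXES = {"w", "al", "b"}
SUFFIXES = {"h", "hm", "at"}

# Hypothetical stand-in for the small manually segmented seed corpus.
TOY_CORPUS = [
    ["w#", "ktab", "+hm"],
    ["al#", "ktab"],
    ["ktab", "+h"],
    ["w#", "al#", "ktab"],
]


def candidate_segmentations(word, max_affixes=2):
    """Enumerate every split of `word` that matches prefix*-stem-suffix*."""
    results = []

    def strip_prefixes(rest, prefixes):
        # Treat `rest` as stem plus suffixes, then try peeling one more prefix.
        strip_suffixes(rest, prefixes, [])
        if len(prefixes) < max_affixes:
            for p in PREFIXES:
                if rest.startswith(p) and len(rest) > len(p):
                    strip_prefixes(rest[len(p):], prefixes + [p + "#"])

    def strip_suffixes(rest, prefixes, suffixes):
        # Record `rest` as the stem, then try peeling one more suffix.
        results.append(prefixes + [rest] + suffixes)
        if len(suffixes) < max_affixes:
            for s in SUFFIXES:
                if rest.endswith(s) and len(rest) > len(s):
                    strip_suffixes(rest[:-len(s)], prefixes, ["+" + s] + suffixes)

    strip_prefixes(word, [])
    return results


class TrigramModel:
    """Trigram model over morphemes with add-k smoothing (an assumed choice)."""

    def __init__(self, segmented_corpus, k=0.01):
        self.k = k
        self.tri = Counter()
        self.bi = Counter()
        self.vocab = set()
        for morphemes in segmented_corpus:
            padded = ["<s>", "<s>"] + list(morphemes) + ["</s>"]
            self.vocab.update(padded)
            for i in range(2, len(padded)):
                self.tri[tuple(padded[i - 2:i + 1])] += 1
                self.bi[tuple(padded[i - 2:i])] += 1

    def log_prob(self, morphemes):
        padded = ["<s>", "<s>"] + list(morphemes) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen morphemes
        total = 0.0
        for i in range(2, len(padded)):
            num = self.tri[tuple(padded[i - 2:i + 1])] + self.k
            den = self.bi[tuple(padded[i - 2:i])] + self.k * v
            total += math.log(num / den)
        return total


def segment(word, model):
    """Return the candidate morpheme sequence with the highest model score."""
    return max(candidate_segmentations(word), key=model.log_prob)


if __name__ == "__main__":
    model = TrigramModel(TOY_CORPUS)
    print(segment("wktabhm", model))
```

Exhaustive enumeration is affordable here because the affix tables are tiny and `max_affixes` bounds the recursion; a realistic system would need larger tables and some way to handle stems that never occur in the seed corpus.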
1 Introduction

Morphologically rich languages like Arabic present significant challenges to many natural language processing applications because a word often conveys complex meanings decomposable into several morphemes (i.e., prefix, stem, suffix). By segmenting words into morphemes, we can improve the performance of natural language systems, including machine translation (Brown et al. 1993) and information retrieval (Franz, M. and McCarley, S. 2002). In this paper we present a general word segmentation algorithm for handling inflectional morphology capable of segmenting a word