Scientific report: "Arabic Language Modeling with Finite State Transducers"

Ilana Heintz
Department of Linguistics
The Ohio State University, Columbus, OH

This work was supported by a student-faculty fellowship from the AFRL Dayton Area Graduate Studies Institute and worked on in partnership with Ray Slyh and Tim Anderson of the Air Force Research Labs.

Abstract

In morphologically rich languages such as Arabic, the number of word forms resulting from morpheme combinations is significantly greater than in languages with fewer inflected forms (Kirchhoff et al., 2006). This exacerbates the out-of-vocabulary (OOV) problem: test set words are more likely to be unknown, limiting the effectiveness of the model. The goal of this study is to use the regularities of Arabic inflectional morphology to reduce the OOV problem in that language. We hope that success in this task will result in a decrease in word error rate (WER) in Arabic automatic speech recognition.

1 Introduction

The task of language modeling is to predict the next word in a sequence of words (Jelinek et al., 1991). Predicting words that have not yet been seen is the main obstacle (Gale and Sampson, 1995) and is called the out-of-vocabulary (OOV) problem. In morphologically rich languages, the OOV problem is worsened by the increased number of morpheme combinations. Berton et al. (1996) and Geutner (1995) approached this problem in German, finding that language models built on decomposed words reduce the OOV rate of a test set. In Carki et al. (2000), Turkish words are split into syllables for language modeling, which also reduces the OOV rate but does not improve WER. Morphological decomposition is also used to boost language modeling scores in Korean (Kwon, 2000) and Finnish (Hirsimaki et al., 2006).

We approach the processing of Arabic morphology, both inflectional and derivational, with finite state machines (FSMs). We use a technique that produces many morphological analyses for each word, retaining information about possible stems, affixes