Modeling Morphologically Rich Languages Using Split Words and Unstructured Dependencies

Deniz Yuret, Koç University, 34450 Sariyer, Istanbul, Turkey, dyuret@ku.edu.tr
Ergun Bicici, Koç University, 34450 Sariyer, Istanbul, Turkey, ebicici@ku.edu.tr

Abstract

We experiment with splitting words into their stem and suffix components for modeling morphologically rich languages. We show that using a morphological analyzer and disambiguator results in a significant perplexity reduction in Turkish. We present flexible n-gram models, FlexGrams, which assume that the n − 1 tokens that determine the probability of a given token can be chosen anywhere in the sentence rather than the preceding n − 1 positions. Our final model achieves 27% perplexity reduction compared to the standard n-gram model.

1 Introduction

Language models, i.e. models that assign probabilities to sequences of words, have been proven useful in a variety of applications including speech recognition and machine translation (Bahl et al., 1983; Brown et al., 1990). More recently, good results on lexical substitution and word sense disambiguation using language models have also been reported (Hawker, 2007; Yuret, 2007). Morphologically rich languages pose a challenge to standard modeling techniques because of their relatively large out-of-vocabulary rates and the regularities they possess at the sub-word level.
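
The out-of-vocabulary point can be illustrated with a toy sketch. It is not the authors' pipeline: the `SEGMENTS` table is a hypothetical hand-made stand-in for a real morphological analyzer and disambiguator, and the function names and the tiny word lists are assumptions chosen only for illustration. The sketch compares how many test tokens are unseen when whole words are the units versus when each word is split into a stem and a suffix.

```python
# Hypothetical hand-made segmentation table; a real system would use a
# morphological analyzer and disambiguator instead (assumption for illustration).
SEGMENTS = {
    "evde": ("ev", "+de"),     # "in the house"
    "evler": ("ev", "+ler"),   # "houses"
    "elde": ("el", "+de"),     # "in the hand"
    "eller": ("el", "+ler"),   # "hands"
    "dilde": ("dil", "+de"),   # "in the language"
}


def split_word(word):
    """Return [stem, suffix] if the word is in the table, else [word]."""
    return list(SEGMENTS.get(word, (word,)))


def tokenize(words, split):
    """Turn a list of words into unit tokens, optionally splitting each word."""
    tokens = []
    for w in words:
        tokens.extend(split_word(w) if split else [w])
    return tokens


def oov_rate(train_words, test_words, split):
    """Fraction of test tokens that never occur among the training tokens."""
    vocab = set(tokenize(train_words, split))
    test_tokens = tokenize(test_words, split)
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)


train = ["evde", "eller", "dil"]
test = ["evler", "elde", "dilde"]

print(oov_rate(train, test, split=False))  # 1.0: every test word is unseen
print(oov_rate(train, test, split=True))   # 0.0: every stem and suffix was seen
```

In this contrived example none of the test words occurs in the training list, yet all of their stems and suffixes do, so splitting removes the out-of-vocabulary tokens entirely. Real corpora will show a smaller effect.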
The standard n-gram language model ignores long-distance relationships between words and relies on the independence assumption of a Markov chain of order n − 1. Morphemes play an important role in the syntactic dependency structure of morphologically rich languages. The dependencies hold not only between stems but also between stems and suffixes, and if we use complete words as unit tokens we cannot represent these sub-word dependencies. Our working hypothesis is that the performance of a language model is correlated with how closely its probabilistic dependencies mirror the syntactic dependencies. We present flexible n-grams.
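
As a reference point for the standard model, the following minimal sketch shows an n-gram model under the order n − 1 Markov assumption, together with the perplexity measure used to compare models. It is not the authors' implementation: the function names are hypothetical, and add-one smoothing is an assumption made here purely for simplicity, not the smoothing method of the paper. Each token is conditioned only on the n − 1 tokens immediately before it.

```python
import math
from collections import Counter


def pad(sentence, n):
    """Add n - 1 start symbols and one end symbol around a token list."""
    return ["<s>"] * (n - 1) + list(sentence) + ["</s>"]


def train_ngram(sentences, n=2):
    """Count n-grams and their (n - 1)-token contexts over tokenized sentences."""
    ngram_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for sent in sentences:
        tokens = pad(sent, n)
        vocab.update(tokens)
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])  # the preceding n - 1 tokens
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts, vocab


def prob(token, context, model):
    """P(token | context) with add-one smoothing (an illustrative choice)."""
    ngram_counts, context_counts, vocab = model
    return (ngram_counts[context + (token,)] + 1) / (
        context_counts[context] + len(vocab)
    )


def perplexity(sentences, model, n=2):
    """2 ** (negative average log2 probability per predicted token)."""
    log_sum = 0.0
    count = 0
    for sent in sentences:
        tokens = pad(sent, n)
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            log_sum += math.log2(prob(tokens[i], context, model))
            count += 1
    return 2 ** (-log_sum / count)


model = train_ngram([["a", "b"]], n=2)
print(perplexity([["a", "b"]], model, n=2))
```

Lower perplexity means the model assigns higher probability to the test tokens. The same functions accept split stem and suffix tokens in place of whole words, since they only see lists of tokens.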