Scientific report: "Modeling Morphologically Rich Languages Using Split Words and Unstructured Dependencies"


Deniz Yuret, Koç University, 34450 Sariyer, Istanbul, Turkey, dyuret@ku.edu.tr
Ergun Bicici, Koç University, 34450 Sariyer, Istanbul, Turkey, ebicici@ku.edu.tr

Abstract

We experiment with splitting words into their stem and suffix components for modeling morphologically rich languages. We show that using a morphological analyzer and disambiguator results in a significant perplexity reduction in Turkish. We present flexible n-gram models, FlexGrams, which assume that the n − 1 tokens that determine the probability of a given token can be chosen anywhere in the sentence rather than the preceding n − 1 positions. Our final model achieves 27% perplexity reduction compared to the standard n-gram model.

1 Introduction

Language models, i.e. models that assign probabilities to sequences of words, have proven useful in a variety of applications including speech recognition and machine translation (Bahl et al., 1983; Brown et al., 1990). More recently, good results on lexical substitution and word sense disambiguation using language models have also been reported (Hawker, 2007; Yuret, 2007). Morphologically rich languages pose a challenge to standard modeling techniques because of their relatively large out-of-vocabulary rates and the regularities they possess at the sub-word level.
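To make the out-of-vocabulary point concrete, the following is a minimal sketch that compares the OOV rate of whole-word tokens with that of stem and suffix tokens on a toy example. The `SEGMENTATION` table, the `+` prefix on suffix tokens, and the helper names are assumptions made for illustration; the paper relies on a morphological analyzer and disambiguator, which this hand-made lookup does not attempt to reproduce.

```python
# Hypothetical hand-made segmentations (illustration only). A real system
# would obtain these from a morphological analyzer and disambiguator.
SEGMENTATION = {
    "evler": ["ev", "+ler"],       # houses
    "evde": ["ev", "+de"],         # in the house
    "kitaplar": ["kitap", "+lar"], # books
    "kitapta": ["kitap", "+ta"],   # in the book
}


def split_tokens(words):
    """Replace each word by its stem and suffix tokens when a split is known."""
    tokens = []
    for word in words:
        tokens.extend(SEGMENTATION.get(word, [word]))
    return tokens


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


train_words = ["evler", "kitapta"]
test_words = ["evde", "kitaplar"]

word_oov = oov_rate(train_words, test_words)
split_oov = oov_rate(split_tokens(train_words), split_tokens(test_words))

print(f"whole-word OOV rate: {word_oov:.2f}")
print(f"split-token OOV rate: {split_oov:.2f}")
```

In this toy setting every test word is unseen as a whole word, while the stems `ev` and `kitap` are already known once the words are split, so only the suffix tokens remain out of vocabulary.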
The standard n-gram language model ignores long-distance relationships between words and uses the independence assumption of a Markov chain of order n − 1. Morphemes play an important role in the syntactic dependency structure of morphologically rich languages. The dependencies hold not only between stems but also between stems and suffixes, and if we use complete words as unit tokens we cannot represent these sub-word dependencies. Our working hypothesis is that the performance of a language model is correlated with how closely its probabilistic dependencies mirror the syntactic dependencies. We present flexible n-grams ...
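The contrast between conditioning on the immediately preceding token and conditioning on a token elsewhere in the sentence can be sketched with a small count-based model. This is not the authors' FlexGram procedure, which this excerpt does not specify. It is a simplified fixed-offset variant in which every token is conditioned on the token at one chosen relative position. The toy corpus, the `+x`/`+y` suffix tokens, the padding symbols, the add-one smoothing, and all function names are assumptions. Only backward-looking offsets are used so that the product of conditionals remains a valid left-to-right factorization.

```python
import math
from collections import Counter
from itertools import product


def context(tokens, i, offsets):
    """Conditioning tokens for position i, one per relative offset.

    Offset -1 is the previous token (standard bigram). Positions that fall
    outside the sentence map to a position-specific padding symbol.
    """
    ctx = []
    for d in offsets:
        j = i + d
        if 0 <= j < len(tokens):
            ctx.append(tokens[j])
        elif j < 0:
            ctx.append(f"<s{j}>")
        else:
            ctx.append(f"</s+{j - len(tokens)}>")
    return tuple(ctx)


def train(sentences, offsets):
    """Collect (context, token) counts, context counts, and the vocabulary."""
    pair_counts, ctx_counts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            ctx = context(tokens, i, offsets)
            pair_counts[(ctx, tok)] += 1
            ctx_counts[ctx] += 1
            vocab.add(tok)
    return pair_counts, ctx_counts, vocab


def perplexity(sentences, offsets, model):
    """Per-token perplexity under add-one smoothing (one extra slot for unseen tokens)."""
    pair_counts, ctx_counts, vocab = model
    log_sum, n = 0.0, 0
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            ctx = context(tokens, i, offsets)
            p = (pair_counts[(ctx, tok)] + 1) / (ctx_counts[ctx] + len(vocab) + 1)
            log_sum += math.log2(p)
            n += 1
    return 2 ** (-log_sum / n)


# Toy split-token corpus: stem, suffix, stem, suffix, where the second
# suffix always agrees with the first and the stems vary freely.
corpus = [[s1, f, s2, f] for s1, s2, f in product("AB", "AB", ["+x", "+y"])]

standard = (-1,)  # condition on the previous token
skipping = (-2,)  # condition on the token two positions back

std_ppl = perplexity(corpus, standard, train(corpus, standard))
skip_ppl = perplexity(corpus, skipping, train(corpus, skipping))

print(f"offset -1 perplexity: {std_ppl:.3f}")
print(f"offset -2 perplexity: {skip_ppl:.3f}")
```

On this constructed corpus, evaluated on its own training data, the offset of -2 lets the second suffix see the first suffix directly, so it yields the lower perplexity. The example only illustrates why the choice of conditioning position matters when stems and suffixes are separate tokens; it makes no claim about results on real data.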
