Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation

Jakob Uszkoreit*  Thorsten Brants
Google Inc.
1600 Amphitheatre Parkway
Mountain View, CA 94303, USA
{uszkoreit,brants}@google.com

Abstract

In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes. In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora. We introduce a modification of the exchange clustering algorithm with improved efficiency for certain partially class-based models, and a distributed version of this algorithm to efficiently obtain automatic word classifications for large vocabularies (1 million words) using such large training corpora (30 billion tokens). The resulting clusterings are then used in training partially class-based language models. We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.

1 Introduction

A statistical language model assigns a probability P(w) to any given string of words w = w_1 ... w_m. In the case of n-gram language models, this is done by factoring the probability

    P(w) = \prod_{i=1}^{m} P(w_i \mid w_1^{i-1})

and making a Markov assumption, approximating this by

    \prod_{i=1}^{m} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}^{i-1})

Even after making the Markov assumption, and thus treating as equal all strings of preceding words that do not differ in the last n - 1 words, one problem n-gram language models suffer from is that the training data is too sparse to reliably estimate all conditional probabilities P(w_i \mid w_{i-n+1}^{i-1}). Class-based n-gram models are intended to help overcome this data sparsity problem by grouping words into equivalence classes rather than treating them as distinct words.

* Parts of this research were conducted while the author studied at the Berlin Institute of Technology.
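To make the Markov factorization above concrete, the following is a minimal sketch of a word-based n-gram model with unsmoothed maximum-likelihood estimates. It is not the system described in the paper: all names are illustrative, and a production model of the scale discussed here would use smoothed, distributed counts rather than in-memory dictionaries.

    from collections import defaultdict

    class NGramModel:
        """Unsmoothed n-gram model; prob() returns relative-frequency estimates."""

        def __init__(self, n):
            self.n = n
            self.context_counts = defaultdict(int)  # counts of (n-1)-word histories
            self.ngram_counts = defaultdict(int)    # counts of full n-grams

        def train(self, sentences):
            for words in sentences:
                padded = ["<s>"] * (self.n - 1) + list(words) + ["</s>"]
                for i in range(self.n - 1, len(padded)):
                    history = tuple(padded[i - self.n + 1:i])
                    self.context_counts[history] += 1
                    self.ngram_counts[history + (padded[i],)] += 1

        def prob(self, word, history):
            # P(w_i | w_{i-n+1}^{i-1}): relative frequency of the n-gram.
            history = tuple(history)[-(self.n - 1):] if self.n > 1 else ()
            c = self.context_counts[history]
            return self.ngram_counts[history + (word,)] / c if c else 0.0

        def string_prob(self, words):
            # Markov factorization: the product of the conditional probabilities.
            padded = ["<s>"] * (self.n - 1) + list(words) + ["</s>"]
            p = 1.0
            for i in range(self.n - 1, len(padded)):
                p *= self.prob(padded[i], padded[i - self.n + 1:i])
            return p

    lm = NGramModel(n=3)
    lm.train([["the", "cat", "sat"], ["the", "dog", "sat"]])
    print(lm.string_prob(["the", "cat", "sat"]))  # 0.5 on this toy corpus

The zero probabilities returned for unseen histories are exactly the sparsity problem the paragraph above describes: as n grows, ever fewer of the possible n-grams are observed in training data.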
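The way equivalence classes mitigate this sparsity can also be sketched. The decomposition below is the classic two-sided class-based bigram, P(w_i | w_{i-1}) ~= P(w_i | c(w_i)) * P(c(w_i) | c(w_{i-1})); it is shown only as an assumed baseline form of class-based model, not as the partially class-based models this paper trains, and the word-to-class mapping is hand-picked for illustration rather than learned by exchange clustering.

    from collections import defaultdict

    def train_class_bigram(sentences, word_class):
        # Relative-frequency estimates for P(c | c_prev) and P(w | c).
        cc, c_tot = defaultdict(int), defaultdict(int)
        wc, w_tot = defaultdict(int), defaultdict(int)
        for words in sentences:
            classes = [word_class[w] for w in words]
            for w, c in zip(words, classes):
                wc[(w, c)] += 1
                w_tot[c] += 1
            for c_prev, c in zip(classes, classes[1:]):
                cc[(c_prev, c)] += 1
                c_tot[c_prev] += 1

        def prob(word, prev_word):
            c, c_prev = word_class[word], word_class[prev_word]
            p_emit = wc[(word, c)] / w_tot[c] if w_tot[c] else 0.0
            p_trans = cc[(c_prev, c)] / c_tot[c_prev] if c_tot[c_prev] else 0.0
            return p_emit * p_trans

        return prob

    # Hypothetical toy classification: words in one class pool their statistics.
    word_class = {"the": "DET", "cat": "ANIMAL", "dog": "ANIMAL", "sat": "VERB"}
    prob = train_class_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]], word_class)
    print(prob("dog", "the"))  # 0.5: DET->ANIMAL counts are shared by cat and dog

Because the transition model is estimated over classes rather than words, the number of parameters drops from |V|^2 to |C|^2 plus one emission parameter per word, which is what makes the estimates more reliable on sparse data.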