Scientific paper: "Distribution-Based Pruning of Backoff Language Models"

Distribution-Based Pruning of Backoff Language Models

Jianfeng Gao
Microsoft Research China
No. 49 Zhichun Road, Haidian District, 100080, China
jfgao@

Kai-Fu Lee
Microsoft Research China
No. 49 Zhichun Road, Haidian District, 100080, China
kfl@

Abstract

We propose a distribution-based pruning of n-gram backoff language models. Instead of the conventional approach of pruning n-grams that are infrequent in the training data, we prune n-grams that are likely to be infrequent in a new document. Our method is based on the n-gram distribution, i.e., the probability that an n-gram occurs in a new document. Experimental results show that our method performed 7-9% better (in terms of word perplexity reduction) than conventional cutoff methods.

1 Introduction

Statistical language modelling (SLM) has been successfully applied to many domains, such as speech recognition (Jelinek, 1990), information retrieval (Miller et al., 1999), and spoken language understanding (Zue, 1995). In particular, the n-gram language model (LM) has been demonstrated to be highly effective in these domains. An n-gram LM estimates the probability of a word given the preceding words, P(wn | w1, ..., wn-1).

In applying an SLM, more training data will usually improve the language model. However, as the size of the training data increases, so does the size of the LM, which can lead to models that are too large for practical use. To deal with this problem, count cutoff (Jelinek, 1990) is widely used to prune language models. The cutoff method deletes from the LM those n-grams that occur infrequently in the training data, on the assumption that an n-gram that is infrequent in the training data will also be infrequent in the testing data. But in the real world, training data rarely matches testing data perfectly, so the count cutoff method is not perfect.

In this paper, we propose a distribution-based cutoff method, which instead estimates whether an n-gram is likely to be infrequent in the testing data. To determine this likelihood, we divide the training ...
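To make the contrast between the two criteria concrete, the following Python sketch compares them on a toy corpus. It is a minimal illustration under assumptions introduced here, not the method of the paper: the probability that an n-gram occurs in a new document is approximated by its raw document frequency across training documents, whereas the paper fits an n-gram distribution model; the thresholds and helper names (count_ngrams, count_cutoff_prune, distribution_prune) are likewise hypothetical.

    from collections import Counter

    def count_ngrams(tokens, n):
        # All n-grams (as tuples) in one token sequence, with counts.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def count_cutoff_prune(documents, n, cutoff):
        # Conventional criterion: keep n-grams whose total count in the
        # training data exceeds the cutoff.
        corpus_counts = Counter()
        for doc in documents:
            corpus_counts.update(count_ngrams(doc, n))
        return {g for g, c in corpus_counts.items() if c > cutoff}

    def distribution_prune(documents, n, min_doc_prob):
        # Distribution-based criterion (sketch): keep n-grams that are
        # likely to occur in a new document. Here that likelihood is
        # approximated by document frequency, an assumption of this sketch.
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(count_ngrams(doc, n)))  # presence, not count
        return {g for g, df in doc_freq.items() if df / len(documents) >= min_doc_prob}

    # A bigram that is bursty in a single document survives the count
    # cutoff but is pruned by the distribution-based criterion.
    docs = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
        "zyx zyx zyx zyx zyx zyx".split(),
    ]
    print(("zyx", "zyx") in count_cutoff_prune(docs, 2, cutoff=2))          # True
    print(("zyx", "zyx") in distribution_prune(docs, 2, min_doc_prob=0.5))  # False
    print(("sat", "on") in distribution_prune(docs, 2, min_doc_prob=0.5))   # True

With these toy settings, the count cutoff keeps the bursty bigram ("zyx", "zyx") while discarding ("sat", "on"), which appears only twice in total but in two of the three documents; the distribution-based criterion reverses both decisions, which is the behaviour the paper argues for.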
