Smoothing a Tera-word Language Model

Deniz Yuret
Koc University
dyuret@

Abstract

Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omission of low-frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of MacKay and Peto (1995) with the modified back-off estimates of Kneser and Ney (1995), leading to a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting.

1 Introduction

Language models, i.e. models that assign probabilities to sequences of words, have proven useful in a variety of applications, including speech recognition and machine translation (Bahl et al., 1983; Brown et al., 1990). More recently, good results on lexical substitution and word sense disambiguation using language models have also been reported (Yuret, 2007). The recently introduced Web 1T 5-gram dataset (Brants and Franz, 2006) contains the counts of word sequences up to length five in a 10^12-word corpus derived from publicly accessible Web pages. As this corpus is several orders of magnitude larger than those used in previous language modeling studies, it holds the promise of providing more accurate, domain-independent probability estimates. However, naive application of the well-known smoothing methods does not realize the full potential of this dataset. In this paper I present experiments with modifications and combinations of various smoothing methods, using the Web 1T dataset for model building and the Brown corpus for evaluation. I describe a new smoothing method, Dirichlet-Kneser-Ney (DKN), that combines the Bayesian intuition of MacKay and Peto (1995) and the improved back-off estimation of Kneser and Ney (1995), and gives significantly …
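For orientation, the two ingredients named above can both be written as interpolations between a higher-order and a lower-order distribution. The sketch below uses notation introduced here for illustration only (c for counts, h for the history, h' for the shortened history, D for a discount constant, alpha for the Dirichlet concentration parameter); the excerpt stops before the paper's own DKN formula, so this is background on the two standard methods, not the paper's definition.

Interpolated Kneser-Ney:
\[
P_{\mathrm{KN}}(w \mid h) \;=\; \frac{\max\bigl(c(h,w) - D,\, 0\bigr)}{c(h)} \;+\; \frac{D \, N_{1+}(h,\bullet)}{c(h)} \, P_{\mathrm{KN}}(w \mid h'),
\]
where N_{1+}(h, •) is the number of distinct words observed after h, and the lower-order distribution is built from continuation counts N_{1+}(•, w) rather than raw counts.

Dirichlet prior form (MacKay and Peto):
\[
P(w \mid h) \;=\; \frac{c(h,w) + \alpha \, P(w \mid h')}{c(h) + \alpha},
\]
an interpolation whose weight lambda(h) = alpha / (c(h) + alpha) adapts to how much data has been seen for the context h.

A natural reading of the combination described in the abstract (only a reading, since the method itself is not given in this excerpt) is to keep the adaptive Dirichlet-prior interpolation weight while using Kneser-Ney-style continuation-count estimates for the lower-order distribution it backs off to.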