An Empirical Investigation of Discounting in Cross-Domain Language Models

Greg Durrett and Dan Klein
Computer Science Division, University of California, Berkeley

Abstract

We investigate the empirical behavior of n-gram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.

1 Introduction

Discounting, or subtracting from the count of each n-gram, is one of the core aspects of Kneser-Ney language modeling (Kneser and Ney, 1995). For all but the smallest n-gram counts, Kneser-Ney uses a single discount, one that does not grow with the n-gram count, because such constant discounting was observed in early experiments on held-out data (Church and Gale, 1991). However, due to increasing computational power and corpus sizes, language modeling today presents a different set of challenges than it did 20 years ago. In particular, modeling cross-domain effects has become increasingly important (Klakow, 2000; Moore and Lewis, 2010), and deployed systems must frequently process data that is out-of-domain from the standpoint of the language model.

In this work, we perform experiments on held-out data to evaluate how discounting behaves in the cross-domain setting. We find that when training and testing on corpora that are as similar as possible, empirical discounts indeed do not grow with n-gram count, which validates the parametric assumption of Kneser-Ney smoothing. However, when the train and evaluation corpora differ, even slightly, discounts generally exhibit linear growth in the count of the n-gram.
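To make the constant-discount assumption concrete, the interpolated Kneser-Ney estimate that the discussion above refers to is commonly written as below. This is the standard textbook form rather than a formula taken from this excerpt; the symbols D (the discount), gamma (the backoff weight), and the linear parameterization D(c) mentioned afterwards are notation and an assumption introduced here for illustration.

\[
P_{\mathrm{KN}}(w_i \mid w_{i-n+1}^{i-1})
  = \frac{\max\bigl(c(w_{i-n+1}^{i}) - D,\ 0\bigr)}{c(w_{i-n+1}^{i-1})}
  + \gamma(w_{i-n+1}^{i-1})\, P_{\mathrm{KN}}(w_i \mid w_{i-n+2}^{i-1})
\]

Modified Kneser-Ney replaces the single D with three constants D_1, D_2, and D_{3+}, selected by the n-gram's training count; a growing-discount variant of the kind motivated above would instead let the discount depend on the count itself, for example D(c) = d_0 + d_1 c with fitted parameters d_0 and d_1, a plausible linear form rather than necessarily the exact one used in the paper.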

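The held-out measurement of empirical discounts described in the introduction can be sketched in a few lines. The Python snippet below is an illustrative reconstruction of the general Church-and-Gale-style held-out methodology, not the authors' actual experimental code; the function name empirical_discounts and the corpus-size rescaling step are assumptions made here for concreteness.

from collections import Counter, defaultdict

def empirical_discounts(train_tokens, heldout_tokens, n=3):
    # Count n-grams in a token list.
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    train = ngram_counts(train_tokens)
    heldout = ngram_counts(heldout_tokens)

    # Rescale held-out counts so the two corpora are comparable in size
    # (an assumption; the paper's exact normalization is not shown in this excerpt).
    scale = sum(train.values()) / max(sum(heldout.values()), 1)

    # For each training count k, average the scaled held-out counts of the
    # n-gram types that occurred exactly k times in training.
    totals = defaultdict(float)
    types = defaultdict(int)
    for gram, k in train.items():
        totals[k] += scale * heldout.get(gram, 0)
        types[k] += 1

    # Empirical discount for count k: k minus the average held-out count.
    return {k: k - totals[k] / types[k] for k in sorted(totals)}

Roughly constant values across k in the returned mapping correspond to the in-domain behavior assumed by modified Kneser-Ney, while values that grow linearly in k correspond to the cross-domain behavior reported in the abstract.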