Scientific paper: "A Hierarchical Bayesian Language Model based on Pitman-Yor Processes"

A Hierarchical Bayesian Language Model based on Pitman-Yor Processes

Yee Whye Teh
School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543. tehyw@

Abstract

We propose a new hierarchical Bayesian n-gram model of natural languages. Our model makes use of a generalization of the commonly used Dirichlet distributions called Pitman-Yor processes, which produce power-law distributions that more closely resemble those in natural languages. We show that an approximation to the hierarchical Pitman-Yor language model recovers the exact formulation of interpolated Kneser-Ney, one of the best smoothing methods for n-gram language models. Experiments verify that our model gives cross-entropy results superior to interpolated Kneser-Ney and comparable to modified Kneser-Ney.

1 Introduction

Probabilistic language models are used extensively in a variety of linguistic applications, including speech recognition, handwriting recognition, optical character recognition, and machine translation. Most language models fall into the class of n-gram models, which approximate the distribution over sentences using the conditional distribution of each word given a context consisting of only the previous n-1 words:

$$P(\text{sentence}) \approx \prod_{i=1}^{T} P(\text{word}_i \mid \text{word}_{i-n+1} \cdots \text{word}_{i-1})$$

with n = 3 (trigram models) being typical. Even for such a modest value of n, the number of parameters is still tremendous due to the large vocabulary size. As a result, direct maximum-likelihood parameter fitting severely overfits to the training data, and smoothing methods are indispensable for proper training of n-gram models. A large number of smoothing methods have been proposed in the literature; see Chen and Goodman (1998), Goodman (2001), and Rosenfeld (2000) for good overviews. Most methods take a rather ad hoc approach in which n-gram probabilities for various values of n are combined using either interpolation or back-off schemes. Though some of these methods are intuitively appealing, the main justification has
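To make the n-gram factorization and the idea of interpolation smoothing concrete, the following is a minimal sketch of a trigram model whose conditionals are a fixed-weight mixture of trigram, bigram, and unigram maximum-likelihood estimates. This is an illustrative assumption, not the paper's hierarchical Pitman-Yor model or Kneser-Ney smoothing; the toy corpus, interpolation weights, and all function names are invented for the example.

```python
# Minimal trigram language model with fixed-weight interpolation smoothing.
# Illustrates P(sentence) ~= prod_i P(w_i | w_{i-2}, w_{i-1}).
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def ngram_counts(sentences):
    """Count unigrams, bigrams, and trigrams over BOS/EOS-padded sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in sentences:
        words = [BOS, BOS] + sent + [EOS]
        for i in range(2, len(words)):
            uni[words[i]] += 1
            bi[(words[i - 1], words[i])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
    return uni, bi, tri

def interpolated_prob(w, u, v, uni, bi, tri, lambdas=(0.5, 0.3, 0.2)):
    """P(w | u, v) as a fixed mixture of trigram, bigram, and unigram MLEs."""
    l3, l2, l1 = lambdas
    total = sum(uni.values())
    p1 = uni[w] / total if total else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return l3 * p3 + l2 * p2 + l1 * p1

def sentence_logprob(sent, uni, bi, tri):
    """log P(sentence) = sum_i log P(w_i | w_{i-2}, w_{i-1})."""
    words = [BOS, BOS] + sent + [EOS]
    return sum(
        math.log(interpolated_prob(words[i], words[i - 2], words[i - 1], uni, bi, tri))
        for i in range(2, len(words))
    )

if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
    ]
    uni, bi, tri = ngram_counts(corpus)
    print(sentence_logprob("the cat sat on the rug".split(), uni, bi, tri))
```

Even in this toy setting, the unigram and bigram terms keep the probability of an unseen trigram away from zero, which is the basic role that the smoothing methods discussed above (and the hierarchical Pitman-Yor model) play in a principled way.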
