PRECISE N-GRAM PROBABILITIES FROM STOCHASTIC CONTEXT-FREE GRAMMARS

Andreas Stolcke and Jonathan Segal
University of California, Berkeley, and International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
stolcke, jsegal @

Abstract

We present an algorithm for computing n-gram probabilities from stochastic context-free grammars, a procedure that can alleviate some of the standard problems associated with n-grams (estimation from sparse data and lack of linguistic structure, among others). The method operates via the computation of substring expectations, which in turn is accomplished by solving systems of linear equations derived from the grammar. The procedure is fully implemented and has proved viable and useful in practice.

INTRODUCTION

Probabilistic language modeling with n-gram grammars, particularly bigrams and trigrams, has proven extremely useful for tasks such as automated speech recognition, part-of-speech tagging, and word-sense disambiguation, and leads to simple, efficient algorithms.
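To make the "substring expectations via linear equations" idea concrete, here is a minimal sketch of the simplest (unigram, n = 1) case on a toy SCFG of our own invention (the grammar, variable names, and normalization step below are illustrative assumptions, not the paper's actual algorithm, which handles general n-grams): the expected number of occurrences of each nonterminal in a derivation satisfies a linear system determined by the rule probabilities, and expected terminal counts follow by one matrix product.

```python
import numpy as np

# Hypothetical toy SCFG (our own example, not taken from the paper):
#   S -> A B   (1.0)
#   A -> a A   (0.5)  |  a  (0.5)
#   B -> b     (1.0)
rules = {
    "S": [(1.0, ["A", "B"])],
    "A": [(0.5, ["a", "A"]), (0.5, ["a"])],
    "B": [(1.0, ["b"])],
}
nonterminals = ["S", "A", "B"]
terminals = ["a", "b"]

n = len(nonterminals)
# M[i, j]: expected number of nonterminal j produced by one expansion of nonterminal i.
# T[i, t]: expected number of terminal t produced by one expansion of nonterminal i.
M = np.zeros((n, n))
T = np.zeros((n, len(terminals)))
for i, lhs in enumerate(nonterminals):
    for p, rhs in rules[lhs]:
        for sym in rhs:
            if sym in rules:
                M[i, nonterminals.index(sym)] += p
            else:
                T[i, terminals.index(sym)] += p

# Expected occurrences e of each nonterminal in a derivation from S satisfy
#   e = delta_S + e M,   i.e.   (I - M)^T e = delta_S,
# which has a unique nonnegative solution when the grammar is consistent
# (spectral radius of M below 1).
delta_S = np.zeros(n)
delta_S[0] = 1.0
e = np.linalg.solve((np.eye(n) - M).T, delta_S)

# Expected terminal counts; normalizing gives unigram probabilities.
term_counts = e @ T
unigrams = term_counts / term_counts.sum()
print(dict(zip(terminals, unigrams)))  # {'a': 0.666..., 'b': 0.333...}
```

For this grammar every sentence is a^k b with k geometrically distributed, so the expected counts (E[a] = 2, E[b] = 1) and the resulting unigram distribution can be checked by hand. Higher-order n-grams require expectations over substrings rather than single terminals, but lead to linear systems of the same flavor.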
Unfortunately, working with these grammars can be problematic for several reasons: they have large numbers of parameters, so reliable estimation requires a very large training corpus and/or sophisticated smoothing techniques (Church and Gale, 1991); it is very hard to model linguistic knowledge directly, and thus these grammars are practically incomprehensible to human inspection; and the models are not easily extensible: if a new word is added to the vocabulary, none of the information contained in an existing n-gram model tells us anything about the n-grams containing the new item.

Stochastic context-free grammars (SCFGs), on the other hand, are not as susceptible to these problems: they have many fewer parameters, so they can be trained reasonably well from smaller corpora; they capture linguistic generalizations and are easily understood and written by linguists; and they can be extended straightforwardly based on the underlying linguistic knowledge. In this paper we present an algorithm for computing n-gram probabilities from a given SCFG. The technique of compiling higher-level grammatical models into lower-level ones has precedents: Zue et al. (1991) report building a word-pair grammar.