tailieunhanh - Scientific report: "A Stochastic Language Model using Dependency and Its Improvement by Word Clustering"

Shinsuke Mori
Tokyo Research Laboratory, IBM Japan, Ltd.
1623-14 Shimotsuruma, Yamato-shi, Japan

Makoto Nagao
Kyoto University
Yoshida-honmachi, Sakyo, Kyoto, Japan

Abstract

In this paper we present a stochastic language model for Japanese using dependency. The prediction unit in this model is an attribute of a "bunsetsu". This is represented by the product of the head of content words and that of function words. The relations between the attributes of bunsetsu are governed by a context-free grammar. The word sequences are predicted from the attribute using a word n-gram model. The spelling of an unknown word is predicted using a character n-gram model. This model is robust in that it can compute the probability of an arbitrary string, and complete in that it models everything from unknown words to dependency at the same time.

1 Introduction

The effectiveness of stochastic language modeling as a methodology of natural language processing has been attested by various applications to recognition systems, such as speech recognition, and to analysis systems, such as part-of-speech (POS) taggers. In this methodology, a stochastic language model with some parameters is built, and the parameters are estimated so as to maximize its predictive power (that is, to minimize the cross entropy) on unknown input. Considering a single application, it might be better to estimate the parameters taking into account the expected accuracy of recognition or analysis.
This method is, however, heavily dependent on the problem and, as far as we know, offers no systematic solution. The methodology of stochastic language modeling, in contrast, allows us to separate the language description model from the various frameworks of natural language processing that share it, and enables a systematic improvement of each application. In this framework, a description of a language is represented as a map from a sequence of alphabetic characters to a probability value. The first model is c. E. .
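The notion of a language description as a map from a character sequence to a probability value, evaluated by its cross entropy on unseen input, can be illustrated with a small sketch. This is not the dependency model of the paper: it is a plain character bigram model, and the class name `CharBigramModel`, the boundary symbols, and the add-one smoothing are all assumptions made only for illustration.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary symbols


class CharBigramModel:
    """Toy character bigram model with add-one smoothing (illustrative only)."""

    def __init__(self, corpus, alphabet):
        # The alphabet is fixed in advance so that every string over it
        # receives a non-zero probability.
        self.vocab = set(alphabet) | {EOS}
        self.bigrams = Counter()
        self.contexts = Counter()
        for sentence in corpus:
            symbols = [BOS] + list(sentence) + [EOS]
            for prev, cur in zip(symbols, symbols[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def cond_prob(self, prev, cur):
        # Add-one smoothing: unseen bigrams still get some probability mass.
        return (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + len(self.vocab))

    def log2_prob(self, sentence):
        # Log probability (base 2) of the whole string, including its end marker.
        if any(ch not in self.vocab for ch in sentence):
            raise ValueError("string contains a character outside the alphabet")
        symbols = [BOS] + list(sentence) + [EOS]
        return sum(math.log2(self.cond_prob(p, c)) for p, c in zip(symbols, symbols[1:]))


def cross_entropy(model, test_corpus):
    """Bits per predicted symbol (characters plus one end marker per sentence)."""
    total_log = sum(model.log2_prob(s) for s in test_corpus)
    total_symbols = sum(len(s) + 1 for s in test_corpus)
    return -total_log / total_symbols


model = CharBigramModel(["abab", "abba"], alphabet="ab")
print(2 ** model.log2_prob("ba"))      # probability assigned to an unseen string
print(cross_entropy(model, ["abab"]))  # lower values mean better prediction
```

Under these assumptions, estimating parameters "to minimize the cross entropy" amounts to choosing the model whose `cross_entropy` on held-out text is smallest, independently of any particular recognition or analysis application.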