Scientific report: "Word Association and MI-Trigger-based Language Modeling"
In Chinese, a word is made up of one or more characters. Hence, there also exist preferred relationships between Chinese characters. [Sproat+90] employed a statistical method to group neighboring Chinese characters in a sentence into two-character words by making use of a measure of character association based on mutual information. Here, we will focus instead on the preferred relationships between words. The preference relationships between words can extend from a short to a long distance.

Word Association and MI-Trigger-based Language Modeling

GuoDong ZHOU, KimTeng LUA
Department of Information Systems and Computer Science
National University of Singapore
Singapore 119260
zhougd luakt @

Abstract

There exists strong word association in natural language. Based on mutual information, this paper proposes a new MI-Trigger-based modeling approach to capture the preferred relationships between words over a short or long distance. Both the distance-independent (DI) and distance-dependent (DD) MI-Trigger-based models are constructed within a window. It is found that proper MI-Trigger modeling is superior to the word bigram model, and that the DD MI-Trigger models perform better than the DI MI-Trigger models for the same window size. It is also found that the number of trigger pairs in an MI-Trigger model can be kept to a reasonable size without losing too much of its modeling power. Finally, it is concluded that the preferred relationships between words are useful for language disambiguation and can be modeled efficiently by the MI-Trigger-based modeling approach.

Introduction

In natural language there always exist many preferred relationships between words. Lexicographers always use the concepts of collocation, co-occurrence and lexis to describe them. Psychologists have a similar concept: word association. Two highly associated word pairs are "not only / but also" and "doctor / nurse".
Psychological experiments in [Meyer+75] indicated that the human's reaction to a highly associated word pair was stronger and faster than that to a poorly associated word pair. The strength of word association can be measured by mutual information. By computing the mutual information of a word pair, we can get much useful preference information from the corpus, such as the semantic preference between noun and noun (e.g. "doctor / nurse"), the particular preference between adjective and noun (e.g. "strong / currency"), and solid structure (e.g. "pay / attention") [Calzolori90]. This information is
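
As a rough illustration of measuring word association by mutual information within a window, the following Python sketch computes pointwise mutual information, log2(P(a, b) / (P(a) P(b))), for ordered word pairs. The function name `window_pmi`, the toy corpus, and the relative-frequency estimates of the probabilities are assumptions made for illustration; the excerpt does not give the authors' exact estimator. Distance inside the window is ignored, which corresponds only loosely to a distance-independent (DI) setting.

```python
import math
from collections import Counter


def window_pmi(sentences, window=5):
    """Pointwise mutual information for ordered word pairs (a, b).

    A pair is counted once for every position where b follows a
    within `window` words in the same sentence. Probabilities are
    plain relative frequencies (an assumption of this sketch).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        word_counts.update(sent)
        for i, a in enumerate(sent):
            # Words that follow position i within the window.
            for b in sent[i + 1 : i + 1 + window]:
                pair_counts[(a, b)] += 1

    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())

    pmi = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# Hypothetical toy corpus, tokenized by hand.
corpus = [
    ["the", "doctor", "called", "the", "nurse"],
    ["the", "nurse", "helped", "the", "doctor"],
    ["they", "pay", "attention", "to", "the", "strong", "currency"],
]
scores = window_pmi(corpus, window=3)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(value, 3))
```

A distance-dependent (DD) variant could key the pair counts by (a, b, distance) instead of (a, b); that too is an assumption of this sketch rather than the procedure stated in the paper.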