Scientific report: "A STOCHASTIC FINITE-STATE WORD-SEGMENTATION ALGORITHM FOR CHINESE"

Richard Sproat, Chilin Shih, William Gale
AT&T Bell Laboratories
600 Mountain Avenue, Room 2d-451, 2d-453, 2c-278
Murray Hill, NJ, USA, 07974-0636
rws cis gale @

Nancy Chang
Division of Applied Sciences, Harvard University
Cambridge, MA 02138
nchang@

Abstract

We present a stochastic finite-state model for segmenting Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.

THE PROBLEM

The initial step of any text analysis task is the tokenization of the input into words. For many writing systems, using whitespace as a delimiter for words yields reasonable results. However, for Chinese and other systems where whitespace is not used to delimit words, such trivial schemes will not work. Chinese writing is morphosyllabic (DeFrancis, 1984), meaning that each hanzi (Chinese character) nearly always represents a single syllable that is usually also a single morpheme. Since in Chinese, as in English, words may be polysyllabic, and since hanzi are written with no intervening spaces, it is not trivial to reconstruct which hanzi to group into words.
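The difficulty described above can be seen in a small sketch. The Python below is only an illustration under stated assumptions: the example strings and the five-entry `toy_lexicon` are hypothetical, and the exhaustive enumeration in `candidate_segmentations` is a stand-in used to expose the ambiguity, not the stochastic finite-state method the paper presents.

```python
def whitespace_tokenize(text):
    """Split on whitespace; adequate for many space-delimited scripts."""
    return text.split()


def candidate_segmentations(text, lexicon):
    """Enumerate every way to split `text` into words found in `lexicon`.

    Naive recursion over all prefixes; fine for a short demo string,
    exponential in the worst case.
    """
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in candidate_segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon (not taken from the paper's dictionary).
toy_lexicon = {"日", "日文", "文章", "章鱼", "鱼"}

english_tokens = whitespace_tokenize("the goal of text analysis")
chinese_tokens = whitespace_tokenize("日文章鱼")
segmentations = candidate_segmentations("日文章鱼", toy_lexicon)

print(english_tokens)   # five separate words
print(chinese_tokens)   # the whole string comes back as a single token
for seg in segmentations:
    print(" / ".join(seg))
```

Whitespace splitting returns the unsegmented Chinese string as one token, while the lexicon lookup finds more than one dictionary-consistent grouping for the same four hanzi. Choosing among such groupings is the step for which a model with weights or probabilities is needed.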
While for some applications it may be possible to bypass the word-segmentation problem and work straight from hanzi, there are several reasons why this approach will not work in a text-to-speech (TTS) system for Mandarin Chinese, the primary intended application of our segmenter. These reasons include:

1. Many hanzi are homographs whose pronunciation depends upon word affiliation. So 的 is pronounced de0 when it is a prenominal modification marker, but di4 in the word 目的 mu4di4 'goal'; 乾 is normally gan1 'dry', but qian2 in a person's given name.
2. Some phonological rules depend upon .