tailieunhanh - Scientific report: "Improved Source-Channel Models for Chinese Word Segmentation"
The source model is used to estimate the generative probability of a word sequence, in which each word belongs to one word type. For each word type, a channel model is used to estimate the generative probability of a character string given the word type, so there are multiple channel models. We shall show in this paper that our models provide a statistical framework to incorporate a wide variety of linguistic knowledge and statistical models in a unified way. We evaluate the performance of our system using an annotated test set.

Improved Source-Channel Models for Chinese Word Segmentation
Jianfeng Gao, Mu Li and Chang-Ning Huang
Microsoft Research, Asia, Beijing 100080, China
{jfgao, t-muli, cnhuang}@

Abstract
This paper presents a Chinese word segmentation system that uses improved source-channel models of Chinese sentence generation. Chinese words are defined as one of the following four types: lexicon words, morphologically derived words, factoids, and named entities. Our system provides a unified approach to the four fundamental features of word-level Chinese language processing: (1) word segmentation, (2) morphological analysis, (3) factoid detection, and (4) named entity recognition. The performance of the system is evaluated on a manually annotated test set and is also compared with several state-of-the-art systems, taking into account the fact that the definition of Chinese words often varies from system to system.

1 Introduction
Chinese word segmentation is the initial step of many Chinese language processing tasks and has attracted a lot of attention in the research community. It is a challenging problem because there is no standard definition of Chinese words. In this paper, we define Chinese words as one of the following four types: entries in a lexicon, morphologically derived words, factoids, and named entities.
We then present a Chinese word segmentation system which provides a solution to the four fundamental problems of word-level Chinese language processing: word segmentation, morphological analysis, factoid detection, and named entity recognition (NER). There are no word boundaries in written Chinese text. Therefore, unlike in English, it may not be desirable to separate the solution to word segmentation from the solutions to the other three problems. Ideally, we would like to propose a unified approach to all four problems. The unified approach used in our system is based on the improved source-channel models of Chinese sentence generation, with two components: a source model ...
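
As a rough illustration of the source-channel idea described above, the following Python sketch scores each candidate word as P(type) * P(string | type) and picks the best segmentation by dynamic programming. It is not the authors' system. Everything in it is an assumption made up for this example: the three word types shown, the toy probabilities, the use of a unigram source model over word types, and the names SOURCE, CHANNELS, and segment.

```python
import math

# Toy source model (assumption): a unigram distribution over word types.
SOURCE = {"LEXICON": 0.7, "FACTOID": 0.2, "NAME": 0.1}

# Toy lexicon and surname list (assumptions, invented for illustration).
LEXICON = {"我": 0.3, "在": 0.3, "北京": 0.4}
SURNAMES = {"李", "王"}


def channel_lexicon(s):
    """P(string | LEXICON): relative frequency in the toy lexicon."""
    return LEXICON.get(s, 0.0)


def channel_factoid(s):
    """P(string | FACTOID): toy rule, ASCII digit strings only."""
    return 0.1 ** len(s) if s.isascii() and s.isdigit() else 0.0


def channel_name(s):
    """P(string | NAME): toy rule, a surname plus one or two characters."""
    if len(s) in (2, 3) and s[0] in SURNAMES:
        return 0.01 ** (len(s) - 1)
    return 0.0


# One channel model per word type.
CHANNELS = {
    "LEXICON": channel_lexicon,
    "FACTOID": channel_factoid,
    "NAME": channel_name,
}


def segment(sentence, max_len=4):
    """Return the highest-scoring list of (word, type) pairs, or None."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best log-probability of sentence[:i]
    back = [None] * (n + 1)       # (start index, word type) of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == -math.inf:
                continue
            piece = sentence[start:end]
            for wtype, channel in CHANNELS.items():
                p = SOURCE[wtype] * channel(piece)
                if p <= 0.0:
                    continue
                score = best[start] + math.log(p)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, wtype)
    if n > 0 and back[n] is None:
        return None  # no segmentation has non-zero probability
    words = []
    i = n
    while i > 0:
        start, wtype = back[i]
        words.append((sentence[start:i], wtype))
        i = start
    words.reverse()
    return words


if __name__ == "__main__":
    print(segment("我在北京2003"))
    print(segment("李明在北京"))
```

Because every candidate string is scored by the channel model of each word type, segmentation and word-type labelling fall out of the same search, which is the sense in which the approach is unified. A more realistic source model would condition each word type on its predecessors rather than treating them independently as this sketch does.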