Scientific report: "Unsupervised Segmentation of Chinese Text by Use of Branching Entropy"

Zhihui Jin and Kumiko Tanaka-Ishii
Graduate School of Information Science and Technology, University of Tokyo

Abstract

We propose an unsupervised segmentation method based on an assumption about language data: that the increasing point of entropy of successive characters is the location of a word boundary. A large-scale experiment was conducted using 200 MB of unsegmented training data and 1 MB of test data, and a precision of 90% was attained with recall being around 80%. Moreover, we found that the precision was stable at around 90% independently of the learning data size.

1 Introduction

The theme of this paper is the following assumption:

The uncertainty of tokens coming after a sequence helps determine whether a given position is at a boundary. (A)

Intuitively, as illustrated in Figure 1, the variety of successive tokens at each character inside a word monotonically decreases according to the offset length, because the longer the preceding character n-gram, the longer the preceding context and the more it restricts the appearance of possible next tokens. For example, it is easier to guess which character comes after "natura" than after "na". On the other hand, the uncertainty at the position of a word border becomes greater, and the complexity increases, as the position is out of context. With the same example, it is difficult to guess which character comes after "natural ". This suggests that a word border can be detected by focusing on the differentials of the uncertainty of branching.

Figure 1: Intuitive illustration of a variety of successive tokens and a word boundary

In this paper, we report our study on applying this assumption to Chinese word segmentation by formalizing the uncertainty of successive tokens via the branching entropy (which we mathematically define in the next section). Our intention in this paper is above all to study the fundamental and scientific statistical property underlying language data, so that it can be applied to language engineering.

The above assumption (A) dates back to the fundamental work done by Harris (Harris, 1955), where he says that when the number of different tokens coming after every prefix of a word marks ...
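The intuition behind assumption (A) can be tried out on a toy example. The Python sketch below is illustrative only and is not the authors' implementation: the paper's formal definition of branching entropy is deferred to its next section, so the sketch assumes the ordinary Shannon entropy of the next-character distribution, estimated from raw counts. The function names `successor_counts` and `branching_entropy`, and the tiny unsegmented English corpus, are hypothetical and chosen only to mirror the "na" / "natura" / "natural" example.

```python
import math
from collections import Counter, defaultdict


def successor_counts(corpus, max_n):
    """Count which character follows every context of length 1..max_n."""
    counts = defaultdict(Counter)
    for text in corpus:
        for i in range(len(text)):
            for n in range(1, max_n + 1):
                if i + n < len(text):
                    counts[text[i:i + n]][text[i + n]] += 1
    return counts


def branching_entropy(counts, context):
    """Shannon entropy (in bits) of the character that follows `context`.

    Assumption: plain maximum-likelihood estimate from counts; this is a
    stand-in for the definition given later in the paper.
    """
    followers = counts.get(context)
    if not followers:
        return None  # context never observed with a successor
    total = sum(followers.values())
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


# Hypothetical toy corpus, written without spaces to imitate unsegmented text.
corpus = [
    "naturallanguage",
    "naturalselection",
    "naturalresources",
    "naturalgas",
    "nationalpark",
    "nativespeaker",
    "nasalsound",
]
counts = successor_counts(corpus, max_n=7)

for prefix in ["na", "nat", "natura", "natural"]:
    print(prefix, round(branching_entropy(counts, prefix), 3))
```

In this toy corpus the entropy drops to 0 bits by "natura", where only "l" can follow, and jumps to 2 bits after "natural", where four different characters follow. That rise is the kind of increasing point the assumption treats as a candidate word boundary.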
