Scientific report: "Document Classification Using a Finite Mixture Model"

Hang Li, Kenji Yamanishi
C&C Res. Labs., NEC
4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216, Japan
Email: lihang, yamanisi @

Abstract

We propose a new method of classifying documents into categories. We define for each category a finite mixture model based on soft clustering of words. We treat the problem of classifying documents as that of conducting statistical hypothesis testing over finite mixture models, and employ the EM algorithm to efficiently estimate parameters in a finite mixture model. Experimental results indicate that our method outperforms existing methods.

1 Introduction

We are concerned here with the issue of classifying documents into categories. More precisely, we begin with a number of categories (e.g., "tennis, soccer, skiing"), each already containing certain documents. Our goal is to determine into which categories newly given documents ought to be assigned, and to do so on the basis of the distribution of each document's words.[1]

Many methods have been proposed to address this issue, and a number of them have proved to be quite effective (e.g., Apte, Damerau, and Weiss, 1994; Cohen and Singer, 1996; Lewis, 1992; Lewis and Ringuette, 1994; Lewis et al., 1996; Schütze, Hull, and Pedersen, 1995; Yang and Chute, 1994).

The simple method of conducting hypothesis testing over word-based distributions in categories (defined in Section 2) is not efficient in storage and suffers from the data sparseness problem, i.e., the number of parameters in the distributions is large and the data size is not sufficiently large for accurately estimating them. In order to address this difficulty, Guthrie, Walker, and Guthrie (1994) have proposed using distributions based on what we refer to as hard clustering of words, i.e., in which a

[1] A related issue is the retrieval from a database of documents which are relevant to a given query (pseudodocument) (e.g., Deerwester et al., 1990; Fuhr, 1989; Robertson and Jones, 1976; Salton and McGill, 1983; Wong and Yao, 1989).
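
The idea summarized in the abstract (a per-category finite mixture over soft word clusters, with classification decided by comparing likelihoods) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the excerpt does not give the exact test statistic or estimation procedure, so the sketch assumes that P(word | category) = sum over clusters k of P(k | category) * P(word | k), that a document is assigned to the category with the highest log-likelihood, and that EM re-estimates only the mixing weights while P(word | k) stays fixed. All cluster names, probabilities, and function names below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy parameters, invented purely for illustration.
# P(word | cluster): a word may have nonzero probability in several
# clusters, which is what makes the clustering "soft".
P_WORD_GIVEN_CLUSTER = {
    "racket_sports": {"racket": 0.5, "net": 0.3, "ball": 0.2},
    "ball_sports": {"ball": 0.5, "goal": 0.3, "net": 0.2},
}

# P(cluster | category): mixing weights, one set per category (assumed values).
P_CLUSTER_GIVEN_CATEGORY = {
    "tennis": {"racket_sports": 0.9, "ball_sports": 0.1},
    "soccer": {"racket_sports": 0.1, "ball_sports": 0.9},
}


def word_prob(word, category):
    """P(word | category) = sum_k P(k | category) * P(word | k)."""
    return sum(
        weight * P_WORD_GIVEN_CLUSTER[cluster].get(word, 0.0)
        for cluster, weight in P_CLUSTER_GIVEN_CATEGORY[category].items()
    )


def log_likelihood(doc_words, category):
    """Log-likelihood of a document, treating its words as independent draws."""
    total = 0.0
    for word in doc_words:
        p = word_prob(word, category)
        if p == 0.0:
            return float("-inf")  # the category cannot generate this word
        total += math.log(p)
    return total


def classify(doc_words):
    """Pick the category whose mixture model gives the highest log-likelihood."""
    return max(
        P_CLUSTER_GIVEN_CATEGORY,
        key=lambda category: log_likelihood(doc_words, category),
    )


def em_mixing_weights(doc_words, iterations=50):
    """EM for the mixing weights only, with P(word | cluster) held fixed."""
    clusters = list(P_WORD_GIVEN_CLUSTER)
    weights = {k: 1.0 / len(clusters) for k in clusters}
    for _ in range(iterations):
        expected = {k: 0.0 for k in clusters}
        for word in doc_words:
            # E-step: posterior responsibility of each cluster for this word.
            joint = {
                k: weights[k] * P_WORD_GIVEN_CLUSTER[k].get(word, 0.0)
                for k in clusters
            }
            norm = sum(joint.values())
            if norm == 0.0:
                continue  # word unknown to every cluster; skip it
            for k in clusters:
                expected[k] += joint[k] / norm
        total = sum(expected.values())
        if total == 0.0:
            return weights  # nothing to learn from; keep current weights
        # M-step: new weights are the normalized expected counts.
        weights = {k: expected[k] / total for k in clusters}
    return weights
```

For example, `classify(["racket", "net", "ball"])` compares the two category likelihoods computed from the toy tables, and `em_mixing_weights(["racket", "racket", "net"])` returns weights that sum to one. Because each category shares the same small set of cluster distributions and differs only in its mixing weights, far fewer parameters are needed than with one full word distribution per category, which is the storage and sparseness concern raised in the introduction.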
