Scientific report: "Ensemble Document Clustering Using Weighted Hypergraph Generated by NMF"

Hiroyuki Shinnou, Minoru Sasaki
Ibaraki University, 4-12-1 Nakanarusawa, Hitachi, Ibaraki, Japan 316-8511
shinnou msasaki @

Abstract

In this paper, we propose a new ensemble document clustering method. The novelty of our method is the use of Non-negative Matrix Factorization (NMF) in the generation phase and a weighted hypergraph in the integration phase. In our experiment, we compared our method with several other clustering methods. Our method achieved the best results.

1 Introduction

In this paper, we propose a new ensemble document clustering method using Non-negative Matrix Factorization (NMF) in the generation phase and a weighted hypergraph in the integration phase.

Document clustering is the task of dividing a document data set into groups based on document similarity. This is a basic intelligent procedure and is important in text mining systems (M. W. Berry, 2003). As a specific application, relevance feedback in IR, where retrieved documents are clustered, is actively researched (Hearst and Pedersen, 1996; Kummamuru et al., 2004).

In document clustering, each document is represented as a vector, which typically uses the bag-of-words model and the TF-IDF term weight. A vector represented in this manner is high-dimensional and sparse. Thus, in document clustering, a dimensionality reduction method such as PCA or SVD is applied before the actual clustering (Boley et al., 1999; Deerwester et al., 1990). Dimensionality reduction maps data in a high-dimensional space into a low-dimensional space and improves both clustering accuracy and speed.

NMF is a dimensionality reduction method (Xu et al., 2003) that is based on the aspect model used in Probabilistic Latent Semantic Indexing (Hofmann, 1999). Because each axis in the space reduced by NMF corresponds to a topic, the reduced vector itself represents the clustering result. For a given term-document matrix and cluster number, we can obtain the NMF result with an iterative .
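The NMF-based clustering outlined in the introduction can be illustrated with a minimal sketch. It is not the authors' implementation, and it covers only a single NMF clustering run, not the ensemble or hypergraph phases. The function names (`tfidf_matrix`, `nmf`, `cluster_labels`), the whitespace tokenizer, the smoothed IDF formula, and the toy documents are all hypothetical. The source text breaks off before naming the iterative scheme, so the multiplicative update rule for the squared-error objective used here is an assumption, chosen because it is a common way to compute NMF.

```python
import numpy as np


def tfidf_matrix(docs):
    """Return a term-by-document TF-IDF matrix with unit-length columns."""
    tokenized = [d.lower().split() for d in docs]  # naive tokenizer (assumption)
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for w in toks:
            tf[index[w], j] += 1.0
    df = np.count_nonzero(tf, axis=1)        # document frequency of each term
    idf = np.log(len(docs) / df) + 1.0       # +1 keeps ubiquitous terms non-zero
    X = tf * idf[:, None]
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0                  # empty documents stay zero vectors
    return X / norms, vocab


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (terms x docs) by U @ V.T with non-negative U and V.

    Uses multiplicative updates for the squared Frobenius error
    (an assumed scheme; the source does not specify one).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps             # terms x topics
    V = rng.random((n, k)) + eps             # docs x topics
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V


def cluster_labels(V):
    """Assign each document to the topic axis with the largest coefficient."""
    return V.argmax(axis=1)


docs = [
    "apple banana apple",
    "banana apple fruit",
    "car engine car",
    "engine car road",
]
X, vocab = tfidf_matrix(docs)
U, V = nmf(X, k=2)
print(cluster_labels(V))
```

Taking the row-wise argmax of `V` mirrors the statement that each axis of the reduced space corresponds to a topic, so the reduced vector can be read directly as a cluster assignment. Because the factors start from random values, different seeds can give different clusterings.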