Đang chuẩn bị liên kết để tải về tài liệu:
Báo cáo khoa học: "K-means Clustering with Feature Hashing"
Đang chuẩn bị nút TẢI XUỐNG, xin hãy chờ
Tải xuống
One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. | K-means Clustering with Feature Hashing Hajime Senuma Department of Computer Science University of Tokyo 7-3-1 Hongo Bunkyo-ku Tokyo 113-0033 Japan hajime.senuma@gmail.com Abstract One of the major problems of K-means is that one must use dense vectors for its centroids and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing Weinberger et al. 2009 a dimension-reduction technique which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to K-means by showing how much will the objective of K-means be additively distorted. Furthermore to empirically verify our method we experimented on a document clustering task. 1 Introduction In natural language processing NLP and text mining clustering methods are crucial for various tasks such as document clustering. Among them K-means MacQueen 1967 Lloyd 1982 is the most important flat clustering algorithm Manning et al. 2008 both for its simplicity and performance. One of the major problems of K-means is that it has K centroids which are dense vectors where K is the number of clusters. Thus it is infeasible to store them in memory and slow to compute if the dimension of inputs is huge as is often the case with NLP and text mining tasks. A well-known heuristic is truncating after the most significant features Manning et al. 2008 but it is difficult to analyze its effect and to determine which features are significant. 122 Recently Weinberger et al. 2009 introduced feature hashing a simple yet effective and analyzable dimension-reduction technique for large-scale multitask learning. The idea is to combine features which have the same hash value. For example given a hash function h and a vector x if h 1012 h 41234 42 we make a new vector y by setting 12 X1012 X41234 or equally possibly x1012 x41234 x1012 x41234 or