Choosing seeds for semi-supervised graph based clustering

Journal of Computer Science and Cybernetics, 2019, pp. 373-384. DOI 1813-9663 35 4 14123

CHOOSING SEEDS FOR SEMI-SUPERVISED GRAPH BASED CLUSTERING

CUONG LE (1), VIET-VU VU (1), LE THI KIEU OANH (2), NGUYEN THI HAI YEN (3)

(1) VNU Information Technology Institute, Vietnam National University, Hanoi
(2) University of Economic and Technical Industries
(3) Hanoi Procuratorate University
vuvietvu@

Abstract. Though clustering algorithms have a long history, clustering still attracts considerable attention because of the need for efficient data analysis tools in applications such as social networks, electronic commerce, and GIS. Recently, semi-supervised clustering (for example, semi-supervised K-Means, semi-supervised DBSCAN, and semi-supervised graph-based clustering (SSGC)), which uses side information to boost clustering performance, has received a great deal of attention. Generally, there are two forms of side information: the seed form (labeled data) and the constraint form (must-link, cannot-link). By integrating information provided by the user or a domain expert, semi-supervised clustering can produce the results that users expect. In fact, clustering results usually depend on the side information provided, so different side information will produce different results.
In some cases, clustering performance may even decrease if the side information is not carefully chosen. This paper addresses the problem of choosing seeds for semi-supervised clustering, especially for graph-based clustering by seeding (SSGC). Properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from users. For this purpose, we propose an active learning algorithm called SKMMM for the seed collection task, which identifies candidates to present to users by using the K-Means and min-max algorithms. Experiments conducted on several real data sets from UCI and on a real collected document data set show the effectiveness of our approach.
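
The abstract names K-Means and min-max as the two building blocks of SKMMM but does not say how they are combined. The Python sketch below is therefore only an illustration under stated assumptions, not the authors' algorithm: it assumes K-Means first compresses the data into k representatives, and that min-max (farthest-first) selection then picks well-spread representatives whose nearest data points become the candidates shown to the user. The function names `kmeans`, `min_max_select`, and `seed_candidates`, and all parameters, are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-Means on a list of coordinate tuples; returns k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random distinct points as init
    for _ in range(iters):
        # Assignment step: index of the nearest center for every point.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers


def min_max_select(candidates, n_queries, start=0):
    """Farthest-first traversal: each new pick maximizes the minimum
    distance to the picks made so far. Returns indices into candidates."""
    chosen = [start]
    min_dist = [math.dist(c, candidates[start]) for c in candidates]
    while len(chosen) < min(n_queries, len(candidates)):
        nxt = max(range(len(candidates)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(c, candidates[nxt]))
                    for d, c in zip(min_dist, candidates)]
    return chosen


def seed_candidates(points, k, n_queries):
    """Assumed combination: K-Means representatives, then min-max spread.
    Returns indices of data points to show to the user for labeling."""
    centers = kmeans(points, k)
    picked = min_max_select(centers, n_queries)
    # Centers are synthetic, so query the real data point nearest to each one.
    return [min(range(len(points)), key=lambda i: math.dist(points[i], centers[c]))
            for c in picked]
```

Calling `seed_candidates(points, k, n_queries)` returns indices of points that a user or domain expert would be asked to label; the resulting labels would then serve as seeds for a seed-based method such as SSGC. Choosing k larger than the expected number of clusters is one plausible way to keep small clusters represented, but that choice is also an assumption of this sketch.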
