A CLUSTERING TECHNIQUE FOR THE VIETNAMESE WORD CATEGORIZATION

Nguyen Minh Hiep (a, *), Nguyen Thi Minh Huyen (b), Ngo The Quyen (b), Tran Thi Phuong Linh (a)

(a) The Faculty of Information Technology, Dalat University, Lamdong, Vietnam
(b) The Faculty of Informatics, VNU University of Science, Hanoi, Vietnam
(*) Corresponding author. Email: hiepnm@

Dalat University Journal of Science (Tạp chí Khoa học Đại học Đà Lạt) [Special Issue on Information Technology], Volume 6, Issue 2, 2016, 207–218

Article history
Received: January 04th, 2016
Received in revised form: March 10th, 2016
Accepted: March 16th, 2016

Abstract
In natural language processing, part-of-speech (POS) tagging plays an important role, as its output is the input of many other tasks (syntax analysis, semantic analysis, etc.). One of the problems related to POS tagging is to define the POS set. This could be solved using unsupervised machine learning methods. This paper presents an application of the DBSCAN clustering algorithm to classify Vietnamese words from a large corpus. The features used to characterize each word are naturally defined by the context of that word in a sentence. We use a large corpus containing sentences automatically extracted from the online Nhan Dan newspaper.

Keywords: Clustering; Corpus; DBSCAN; POS; POS tagging; Tag set.

1. INTRODUCTION
The question of Vietnamese word classification has been discussed in several linguistic studies [1]. This problem can be addressed with unsupervised machine learning. We present a technique that clusters Vietnamese words from a document collection in order to identify lexical classes (a tag set). The feature used to cluster each word is the context of that word in the sentence. The DBSCAN algorithm is used to cluster the words. The training data is a large collection of Vietnamese documents extracted automatically from the Nhan Dan and Thanh Nien online newspapers.

This article comprises three parts. Part 1 introduces the research motivation.
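The approach outlined in the introduction (describe each word by its context, then group words with DBSCAN) can be illustrated with a small sketch. This excerpt does not specify the exact features, distance measure, or parameters used in the paper, so everything below is an assumption made for illustration: the context of a word is taken to be its immediate left and right neighbours, distance is cosine distance, `eps=0.3` and `min_pts=2` are arbitrary values, and the toy English sentences stand in for the newspaper corpus. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

NOISE = -1  # label for words that belong to no cluster


def context_vectors(sentences):
    """Map each word to counts of its immediate left/right neighbours (assumed feature set)."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            vectors[tokens[i]]["L:" + tokens[i - 1]] += 1
            vectors[tokens[i]]["R:" + tokens[i + 1]] += 1
    return dict(vectors)


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return 1.0 - dot / (norm_u * norm_v)


def dbscan(items, distance, eps, min_pts):
    """Minimal DBSCAN: returns {item: cluster id}, with NOISE for outliers."""
    labels = {}

    def region(p):
        # All items within eps of p, including p itself.
        return [q for q in items if distance(p, q) <= eps]

    cluster = 0
    for p in items:
        if p in labels:
            continue
        neighbours = region(p)
        if len(neighbours) < min_pts:
            labels[p] = NOISE  # may be relabelled later as a border point
            continue
        labels[p] = cluster
        queue = [q for q in neighbours if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == NOISE:
                labels[q] = cluster  # border point reached from a core point
            if q in labels:
                continue
            labels[q] = cluster
            q_neighbours = region(q)
            if len(q_neighbours) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_neighbours)
        cluster += 1
    return labels


# Toy corpus (hypothetical; stands in for the newspaper sentences).
sentences = [
    "the cat sleeps", "the dog sleeps", "a cat runs",
    "a dog runs", "the cat runs", "a dog sleeps",
]
vectors = context_vectors(sentences)
labels = dbscan(
    list(vectors),
    lambda a, b: cosine_distance(vectors[a], vectors[b]),
    eps=0.3,      # assumed neighbourhood radius
    min_pts=2,    # assumed minimum neighbourhood size (counts the point itself)
)

clusters = defaultdict(set)
for word, label in labels.items():
    clusters[label].add(word)
print(dict(clusters))
```

On this toy corpus the words fall into three groups that resemble determiners, nouns, and verbs, because words in the same group share left and right neighbours. A real corpus would need larger context windows or frequency thresholds, and `eps` and `min_pts` would have to be tuned.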