Journal of Computer Science and Cybernetics (2015), 185–202

PHRASAL SEMANTIC DISTANCE FOR VIETNAMESE TEXTUAL DOCUMENT RETRIEVAL

DO THI THANH TUYEN† AND NGUYEN TUAN DANG‡
University of Information Technology, VNU-HCM; † tuyendtt@; ‡ dangnt@

Abstract. In this paper, a computational semantic method is proposed to estimate the phrasal semantic distance used in our model of a Vietnamese document retrieval system. The semantic distances between phrases are defined in terms of semantic classes and semantic relations, so that they reflect how different two given phrases are. To estimate the semantic distance, the semantic classes of a phrase are first identified using the n-gram model. The semantic relations between these classes are then identified using a Vietnamese Lexicon Ontology. This handcrafted ontology explicitly defines the semantic classes and their potential relations in the Vietnamese language. For evaluation, a phrasal semantic retrieval system was built and tested on a data set of 720 phrases and 30 queries. The evaluation reports the precision and the recall on the experimental results.

Keywords. Lexicon ontology, phrasal semantic analysis, semantic class, semantic distance, semantic information retrieval.

1. INTRODUCTION

Most modern approaches to information retrieval aim to exploit the semantic features of phrases in both documents and queries in order to identify which documents are relevant to the user’s needs.
Systems built on such approaches are called “semantic information retrieval systems”, and they are distinct from information retrieval systems that work with documents in Semantic Web standards, as in [1, 2]. In an information retrieval system, the key problem is how to estimate the “semantic similarity” between a keyword-based query and each text document. To solve this problem, the