Scientific report: "Searching for Topics in a Large Collection of Texts"

Martin Holub, Jiri Semecky, Jiri Divis
Center for Computational Linguistics, Charles University, Prague
holub semecky @

Abstract

We describe an original method that automatically finds specific topics in a large collection of texts. Each topic is first identified as a specific cluster of texts and then represented as a virtual concept, which is a weighted mixture of words. Our intention is to employ these virtual concepts in document indexing. In this paper we show some preliminary experimental results and discuss directions of future work.

1 Introduction

In the field of information retrieval (for a detailed survey see Baeza-Yates and Ribeiro-Neto, 1999), document indexing and the representation of documents as vectors are among the most successful techniques. Within the framework of the well-known vector model, the indexed elements are usually individual words, which leads to high-dimensional vectors. However, several approaches try to reduce the dimensionality of the vectors in order to improve the effectiveness of retrieval. Probably the best known is the method called Latent Semantic Indexing (LSI), introduced by Deerwester et al. (1990), which applies a specific linear transformation to the original word-based vectors using a system of latent semantic concepts. Two other approaches that inspired us, namely Dhillon and Modha (2001) and Torkkola (2002), are similar to LSI but differ in how they project the document vectors into a space of lower dimension.
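To make the dimensionality issue concrete, the following is a minimal sketch of the word-based vector model, not code from the paper. The whitespace tokenizer, the raw term-frequency weighting, the toy documents, and the function names `build_vocabulary`, `to_vector`, and `cosine` are all assumptions chosen for illustration.

```python
from collections import Counter
import math


def build_vocabulary(docs):
    """Return a sorted list of all distinct lowercase tokens (assumed tokenizer: split on whitespace)."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})


def to_vector(doc, vocab):
    """Term-frequency vector with one coordinate per vocabulary word."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy collection (hypothetical data, not from the paper).
docs = [
    "topic clusters in text collections",
    "clusters of texts form a topic",
    "linear algebra for vectors",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]

# The vector dimension equals the vocabulary size, so it grows with the collection.
print(len(vocab))
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Even three short documents already need one coordinate per distinct word. In a large collection the vocabulary, and with it the vector dimension, grows far beyond this, which is what the dimensionality-reduction approaches mentioned above set out to address.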
Our idea is to establish a system of virtual concepts, which are linear functions represented by vectors extracted from automatically discovered concept-formative clusters of documents. In short, concept-formative clusters are semantically coherent and specific sets of documents that represent specific topics. This idea was originally proposed by Holub (2003), who hypothesizes that concept-oriented vector .
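As a rough illustration of a virtual concept acting as a linear function on document vectors, consider the sketch below. It rests on a strong assumption: the concept weights are taken to be the unit-length centroid of a cluster's document vectors. The text above does not say how the weights are extracted, so this is a stand-in and not the authors' procedure. The four-word vocabulary, the two clusters, and the names `centroid_concept`, `apply_concept`, and `index_document` are likewise hypothetical.

```python
import math


def centroid_concept(cluster_vectors):
    """ASSUMPTION: concept weights = unit-length centroid of the cluster's vectors."""
    dim = len(cluster_vectors[0])
    n = len(cluster_vectors)
    mean = [sum(v[i] for v in cluster_vectors) / n for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean


def apply_concept(weights, doc_vector):
    """A virtual concept as a linear function: weighted sum of word coordinates."""
    return sum(w * x for w, x in zip(weights, doc_vector))


def index_document(concepts, doc_vector):
    """Represent a document by one value per concept instead of one per word."""
    return [apply_concept(c, doc_vector) for c in concepts]


# Hypothetical vocabulary order: ["cluster", "topic", "matrix", "vector"]
cluster_a = [[2, 1, 0, 0], [1, 2, 0, 0]]  # documents about topics and clusters
cluster_b = [[0, 0, 1, 2], [0, 0, 2, 1]]  # documents about linear algebra
concepts = [centroid_concept(cluster_a), centroid_concept(cluster_b)]

doc = [1, 1, 0, 0]
indexed = index_document(concepts, doc)
print(indexed)  # two coordinates, one per concept
```

With k concepts, every document is mapped from a word-based vector to a k-dimensional one, which is the sense in which such concepts could serve for indexing.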
