Cross-Lingual Latent Topic Extraction

Duo Zhang
University of Illinois at Urbana-Champaign
dzhang22@

Qiaozhu Mei
University of Michigan
qmei@

ChengXiang Zhai
University of Illinois at Urbana-Champaign
czhai@

Abstract

Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics, simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA), which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.

1 Introduction

As a robust, unsupervised way to perform shallow latent semantic analysis of topics in text, probabilistic topic models (Hofmann, 1999a; Blei et al., 2003b) have recently attracted much attention. The common idea behind these models is the following. A topic is represented by a multinomial word distribution, so that words characterizing a topic generally have higher probabilities than other words. We can then hypothesize the existence of multiple topics in text and define a generative model based on the hypothesized topics. By fitting the model to text data, we can obtain an estimate of all the word distributions corresponding to the latent topics, as well as the topic distributions in text. Intuitively, the learned word ...
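To make the fitting procedure the introduction alludes to concrete, the following is a minimal, illustrative PLSA sketch in Python with NumPy (our own sketch, not the paper's code): EM alternates between computing the posterior p(z|d,w) and re-estimating the word distributions p(w|z) and document topic distributions p(z|d) from a document-term count matrix. All names here (`plsa`, `p_w_z`, `p_z_d`, `n_topics`) are our own.

```python
import numpy as np

def plsa(counts, n_topics, n_iters=50, seed=0):
    """Fit a plain PLSA model with EM on a document-term count matrix.

    counts: (n_docs, n_words) array of word counts.
    Returns p_w_z (n_topics, n_words) and p_z_d (n_docs, n_topics).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialization, normalized into valid distributions.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: posterior p(z | d, w) proportional to p(z|d) * p(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (docs, topics, words)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12  # normalize over topics
        # M-step: re-estimate both distributions from expected counts
        # n(d, w) * p(z | d, w).
        expected = counts[:, None, :] * joint
        p_w_z = expected.sum(axis=0)                       # aggregate over docs
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)                       # aggregate over words
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

On a toy corpus, words that co-occur across documents end up with high probability under the same topic. This is precisely what fails in the multilingual setting the abstract describes: an English word and its Chinese translation never co-occur, so plain PLSA has no signal to place them in a shared topic.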
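The abstract states that PCLSA addresses this by regularizing the PLSA likelihood with soft constraints from a bilingual dictionary. The precise regularizer is defined later in the paper; as a sketch of the general shape such an objective takes, one can combine the data log-likelihood with a penalty that pulls dictionary-linked word pairs toward similar probabilities within each topic. The notation below is ours, not the paper's:

```latex
% Sketch of a dictionary-regularized PLSA objective (general shape only;
% the exact form and weighting are specified in the paper itself).
\Lambda^{*} = \arg\max_{\Lambda}\;
  (1-\lambda)\,\log p(C \mid \Lambda)
  \;-\; \lambda \sum_{k=1}^{K} \sum_{(u,v)\in\mathcal{D}}
      w(u,v)\,\bigl(p(u \mid \theta_k) - p(v \mid \theta_k)\bigr)^{2}
```

Here C is the multilingual collection, Lambda the model parameters, theta_k the k-th topic's word distribution, D the set of bilingual dictionary word pairs, w(u,v) an optional translation weight, and lambda in [0,1] trades off data fit against the soft dictionary constraints.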
