Improving Probabilistic Latent Semantic Analysis with Principal Component Analysis

Ayman Farahat
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto CA 94304
ayman.farahat@gmail.com

Francine Chen
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto CA 94304
chen@fxpal.com

Abstract

Probabilistic Latent Semantic Analysis (PLSA) models have been shown to provide a better model for capturing polysemy and synonymy than Latent Semantic Analysis (LSA). However, the parameters of a PLSA model are trained using the Expectation Maximization (EM) algorithm, and as a result, the trained model depends on the initialization values, so performance can be highly variable. In this paper we present a method for using LSA analysis to initialize a PLSA model. We also investigated the performance of our method on the tasks of text segmentation and retrieval on personal-size corpora, and present results demonstrating the efficacy of our proposed approach.

1 Introduction

In modeling a collection of documents for information access applications, the documents are often represented as a bag of words, i.e., as term vectors composed of the terms and corresponding counts for each document. The term vectors for a document collection can be organized into a term-by-document co-occurrence matrix.
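To make the bag-of-words representation concrete, the following sketch builds a term-by-document count matrix from tokenized documents using only the Python standard library. The function name and structure are illustrative, not from the paper.

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a term-by-document co-occurrence matrix.

    docs: list of token lists, one list per document.
    Returns (vocabulary, matrix), where matrix[i][j] is the count
    of vocabulary[i] in document j.
    """
    # The vocabulary is the sorted set of all terms in the collection.
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    # One row per term, one column per document, initialized to zero.
    matrix = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            matrix[index[term]][j] = count
    return vocab, matrix
```

For example, `term_document_matrix([["cat", "sat", "cat"], ["dog", "sat"]])` returns the vocabulary `["cat", "dog", "sat"]` and the matrix `[[2, 0], [0, 1], [1, 1]]`.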
When directly using these representations, synonyms and polysemous terms, that is, terms with multiple senses or meanings, are not handled well. Methods for smoothing the term distributions through the use of latent classes have been shown to improve the performance of a number of information access tasks, including retrieval over smaller collections (Deerwester et al., 1990), text segmentation (Brants et al., 2002), and text classification (Wu and Gunopulos, 2002). The Probabilistic Latent Semantic Analysis model (PLSA) (Hofmann, 1999) provides a probabilistic framework that attempts to capture polysemy and synonymy in text for applications such as retrieval and segmentation. It uses a mixture decomposition to model the word-document co-occurrence data.
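The mixture decomposition in PLSA models the joint probability as P(w, d) = Σ_z P(z|d) P(w|z), with parameters fit by EM (Hofmann, 1999). A minimal NumPy sketch of this EM loop (asymmetric parameterization with P(w|z) and P(z|d); the function name and defaults are illustrative, not from the paper) shows where the random initialization enters, which is the source of the run-to-run variability that the paper's LSA-based initialization aims to reduce:

```python
import numpy as np

def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit a PLSA model by EM on a term-by-document count matrix.

    counts: (n_terms, n_docs) array of co-occurrence counts n(d, w).
    Returns P(w|z) of shape (n_terms, n_topics) and
            P(z|d) of shape (n_topics, n_docs).
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = counts.shape
    # Random initialization: the trained model depends on these values,
    # so different seeds can yield models of noticeably different quality.
    p_w_z = rng.random((n_terms, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_topics, n_docs))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(w|z) * P(z|d).
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]    # (w, z, d)
        post = joint / joint.sum(axis=1, keepdims=True)  # normalize over z
        # M-step: reweight the posteriors by the observed counts n(d, w).
        weighted = counts[:, None, :] * post             # (w, z, d)
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d
```

Because each EM run converges only to a local maximum of the likelihood, the returned distributions depend on the `seed`; replacing the random initialization with values derived from an LSA decomposition is the idea the paper develops.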