Scientific report: "Learning Bigrams from Unigrams"

Learning Bigrams from Unigrams

Xiaojin Zhu and Andrew B. Goldberg and Michael Rabbat and Robert Nowak
Department of Computer Sciences, University of Wisconsin-Madison
Department of Electrical and Computer Engineering, McGill University
Department of Electrical and Computer Engineering, University of Wisconsin-Madison
jerryzhu goldberg @ nowak@

Abstract

Traditional wisdom holds that once documents are turned into bag-of-words (unigram count) vectors, word orders are completely lost. We introduce an approach that, perhaps surprisingly, is able to learn a bigram language model from a set of bag-of-words documents. At its heart, our approach is an EM algorithm that seeks a model which maximizes the regularized marginal likelihood of the bag-of-words documents. In experiments on seven corpora, we observed that our learned bigram language models: (i) achieve better test set perplexity than unigram models trained on the same bag-of-words documents, and are not far behind "oracle bigram models" trained on the corresponding ordered documents; (ii) assign higher probabilities to sensible bigram word pairs; (iii) improve the accuracy of ordered-document recovery from a bag-of-words. Our approach opens the door to novel phenomena, for example, privacy leakage from index files.
1 Introduction

A bag-of-words (BOW) is a basic document representation in natural language processing. In this paper, we consider a BOW in its simplest form, that is, a unigram count vector or word histogram over the vocabulary. When performing the counting, word order is ignored. For example, the phrases "really neat" and "neat really" contribute equally to a BOW. Obviously, once a set of documents is turned into a set of BOWs, the word order information within them is completely lost, or is it? In this paper, we show that one can in fact partly recover the order information. Specifically, given a set of documents in unigram-count BOW representation, one can recover a non-trivial bigram language model.
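The "really neat" example and the notion of a marginal likelihood over bags can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the start symbol "<d>", and the toy bigram table are all assumptions made for illustration. The sketch treats the probability of a bag as the sum, over every ordering consistent with it, of that ordering's bigram probability, and it computes this by brute-force enumeration.

```python
from collections import Counter
from itertools import permutations


def bag_of_words(tokens):
    """Return the unigram count vector (word histogram) of a token list."""
    return Counter(tokens)


def consistent_orderings(bow):
    """Enumerate the distinct ordered documents that share this BOW."""
    tokens = list(bow.elements())
    return sorted(set(permutations(tokens)))


def marginal_likelihood(bow, bigram, start="<d>"):
    """Sum the bigram probability of every ordering consistent with the BOW.

    `bigram[(u, v)]` is a hypothetical toy table holding P(v | u); pairs
    missing from the table are treated as probability zero.
    """
    total = 0.0
    for ordering in consistent_orderings(bow):
        prob, prev = 1.0, start
        for word in ordering:
            prob *= bigram.get((prev, word), 0.0)
            prev = word
        total += prob
    return total


bow_a = bag_of_words("really neat".split())
bow_b = bag_of_words("neat really".split())
print(bow_a == bow_b)  # both phrases yield the same count vector

# Toy conditional probabilities, invented for this example only.
toy_bigram = {
    ("<d>", "really"): 0.7, ("<d>", "neat"): 0.3,
    ("really", "neat"): 0.9, ("really", "really"): 0.1,
    ("neat", "really"): 0.2, ("neat", "neat"): 0.8,
}
print(consistent_orderings(bow_a))
print(marginal_likelihood(bow_a, toy_bigram))
```

The number of orderings grows factorially with document length, so this enumeration is only practical for very short bags. It is meant to make the latent ordering concrete, not to suggest how the model is fitted.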
