Scientific report: "From Bilingual Dictionaries to Interlingual Document Representations"

Jagadeesh Jagarlamudi, University of Maryland, College Park, USA, jags@
Hal Daume III, University of Maryland, College Park, USA, hal@
Raghavendra Udupa, Microsoft Research India, Bangalore, India, raghavu@

Abstract

Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word-by-word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

1 Introduction

The growth of text corpora in different languages poses an inherent problem of aligning documents across languages. Obtaining an explicit alignment, or a different way of bridging the language barrier, is an important step in many natural language processing (NLP) applications such as document retrieval (Gale and Church, 1991; Rapp, 1999; Ballesteros and Croft, 1996; Munteanu and Marcu, 2005; Vu et al., 2009), transliteration mining (Klementiev and Roth, 2006; Hermjakob et al., 2008; Udupa et al., 2009; Ravi and Knight, 2009), and multilingual web search (Gao et al., 2008; Gao et al., 2009).

The need to align documents from different languages arises in all of the problems mentioned above. In this paper, we address this problem by mapping documents into a common subspace (an interlingual representation). This common subspace
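
The first step described in the abstract, using a bilingual dictionary to find candidate document alignments, can be illustrated with a minimal sketch. This excerpt does not say how candidates are scored, so the scoring here (cosine similarity between a dictionary-translated bag of words and each target document) is an assumption for illustration only, not the authors' method. The function names and the toy Spanish-English dictionary are hypothetical, and the second step (robustly learning the interlingual representation from the noisy candidates) is not shown.

```python
import math
from collections import Counter


def translate_bow(tokens, dictionary):
    """Map source-language tokens to a target-language bag of words.

    Every dictionary translation of a token is counted once; tokens with
    no dictionary entry are dropped. (Illustrative assumption.)
    """
    bag = Counter()
    for token in tokens:
        for translation in dictionary.get(token, ()):
            bag[translation] += 1
    return bag


def cosine(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_alignments(source_docs, target_docs, dictionary):
    """Pair each source document with its most similar target document.

    Returns (source_index, target_index, score) triples. The pairs are
    only candidates: dictionary ambiguity and missing entries make them
    noisy. Assumes target_docs is non-empty.
    """
    target_bags = [Counter(doc) for doc in target_docs]
    pairs = []
    for i, doc in enumerate(source_docs):
        bag = translate_bow(doc, dictionary)
        scores = [cosine(bag, target) for target in target_bags]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((i, best, scores[best]))
    return pairs


# Hypothetical toy data: tokenized Spanish and English documents.
dictionary = {
    "gato": ["cat"],
    "negro": ["black"],
    "banco": ["bank", "bench"],  # ambiguous entry, one source of noise
    "dinero": ["money"],
}
source_docs = [["gato", "negro"], ["banco", "dinero"]]
target_docs = [["money", "bank", "loan"], ["black", "cat", "sleeps"]]

pairs = candidate_alignments(source_docs, target_docs, dictionary)
for src, tgt, score in pairs:
    print(f"source {src} -> target {tgt} (score {score:.3f})")
```

In this toy run the ambiguous entry for "banco" adds an unrelated word ("bench") to the translated bag and lowers the score of an otherwise correct pair, which hints at why the abstract treats candidate alignments as noisy.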
