Scientific report: "Learning Bilingual Lexicons from Monolingual Corpora"

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein
Computer Science Division, University of California at Berkeley
aria42 pliang tberg klein @

Abstract

We present a method for learning bilingual translation lexicons from monolingual corpora. Word types in each language are characterized by purely monolingual features, such as context counts and orthographic substrings. Translations are induced using a generative model based on canonical correlation analysis, which explains the monolingual lexicons in terms of latent matchings. We show that high-precision lexicons can be learned in a variety of language pairs and from a range of corpus types.

1 Introduction

Current statistical machine translation systems use parallel corpora to induce translation correspondences, whether those correspondences be at the level of phrases (Koehn, 2004), treelets (Galley et al., 2006), or simply single words (Brown et al., 1994). Although parallel text is plentiful for some language pairs, such as English-Chinese or English-Arabic, it is scarce or even non-existent for most others, such as English-Hindi or French-Japanese. Moreover, parallel text could be scarce for a language pair even if monolingual data is readily available for both languages.

In this paper, we consider the problem of learning translations from monolingual sources alone.
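
To make the notion of purely monolingual features concrete, the following is a minimal sketch of how context counts and orthographic substrings might be collected for the word types of a single corpus. It is an illustration only, not the feature set used in the paper: the function names `orthographic_features` and `context_features`, the `#` boundary marker, the window size, and the toy sentence are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def orthographic_features(word, max_n=3):
    """Count character substrings of length 1..max_n.

    The word is padded with '#' on both sides so that prefixes and
    suffixes are distinguishable (the marker is an assumption of this sketch).
    """
    padded = f"#{word}#"
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats


def context_features(tokens, window=2):
    """For each word type, count the word types seen within `window` positions."""
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                contexts[word][tokens[j]] += 1
    return contexts


# Toy monolingual "corpus" (hypothetical data).
tokens = "the cat sat on the mat".split()
contexts = context_features(tokens, window=1)
substrings = orthographic_features("cat", max_n=2)

print(contexts["the"])
print(substrings)
```

Both functions look only at one language at a time, which is the property the abstract emphasizes: no parallel text is consulted when the feature vectors are built.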
This task, though clearly more difficult than the standard parallel text approach, can operate on language pairs and in domains where standard approaches cannot. We take as input two monolingual corpora and perhaps some seed translations, and we produce as output a bilingual lexicon, defined as a list of word pairs deemed to be word-level translations. Precision and recall are then measured over these bilingual lexicons. This setting has been considered before, most notably in Koehn and Knight (2002) and Fung (1995), but the current paper is the first to use a probabilistic model and present results