Scientific report: "Bilingual-LSA Based LM Adaptation for Spoken Language Translation"

Yik-Cheung Tam, Ian Lane, and Tanja Schultz
InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213
yct tanj a @

Abstract

We propose a novel approach to crosslingual language model (LM) adaptation based on bilingual Latent Semantic Analysis (bLSA). A bLSA model is introduced which enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework, crosslingual LM adaptation can be performed by first inferring the topic posterior distribution of the source text and then applying the inferred distribution to the target-language N-gram LM via marginal adaptation. The proposed framework also enables rapid bootstrapping of LSA models for new languages based on a source LSA model from another language. On Chinese-to-English speech and text translation, the proposed bLSA framework successfully reduced word perplexity of the English LM by over 27 for a unigram LM and up to [...] for a 4-gram LM. Furthermore, the proposed approach consistently improved machine translation quality on both speech-based and text-based adaptation.
1 Introduction

Language model adaptation is crucial to numerous speech and translation tasks, as it enables higher-level contextual information to be effectively incorporated into a background LM, improving recognition or translation performance. One approach is to employ Latent Semantic Analysis (LSA) to capture in-domain word unigram distributions, which are then integrated into the background N-gram LM. This approach has been successfully applied in automatic speech recognition (ASR) (Tam and Schultz, 2006) using Latent Dirichlet Allocation (LDA) (Blei et al., 2003). The LDA model can be viewed as a Bayesian topic mixture model with the topic mixture weights drawn from a Dirichlet distribution. For LM adaptation, the topic mixture weights ...
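
To make the adaptation step concrete, the following is a minimal sketch of how an in-domain unigram distribution derived from topic mixture weights might be folded into a background N-gram LM by marginal adaptation. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy probabilities, and the exponent `beta` are all hypothetical, and the scaling rule shown (background conditional probability multiplied by the ratio of adapted to background unigram probability raised to `beta`, then renormalized) is the commonly used form of marginal adaptation, assumed here because the excerpt does not spell out the formula.

```python
import numpy as np

def lsa_unigram(topic_weights, topic_word):
    """Mix per-topic word distributions: P_lsa(w) = sum_k theta_k * P(w | k)."""
    return topic_weights @ topic_word

def marginal_adapt(bg_cond, bg_unigram, lsa_uni, beta=0.7):
    """Scale P_bg(w | h) by (P_lsa(w) / P_bg(w))**beta, then renormalize over w.

    beta is a hypothetical tuning exponent, not a value taken from the paper.
    """
    scaled = bg_cond * (lsa_uni / bg_unigram) ** beta
    return scaled / scaled.sum()

# Toy numbers, all hypothetical: 2 topics over a 3-word target vocabulary.
topic_word = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
theta = np.array([0.9, 0.1])            # topic posterior inferred from the source text
bg_unigram = np.array([0.3, 0.4, 0.3])  # background unigram P_bg(w)
bg_cond = np.array([0.2, 0.5, 0.3])     # background P_bg(w | h) for one fixed history h

lsa_uni = lsa_unigram(theta, topic_word)
adapted = marginal_adapt(bg_cond, bg_unigram, lsa_uni)
print(adapted)
```

In the crosslingual setting described in the abstract, `theta` would be inferred from the source-language text and reused with the target-language topic-word distributions, which is what the one-to-one topic correspondence makes possible.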
