Scientific report: "Pseudo-word for Phrase-based Machine Translation"

Xiangyu Duan, Min Zhang, Haizhou Li
Institute for Infocomm Research, A-STAR, Singapore
Xduan, mzhang, hli @

Abstract

The pipeline of most Phrase-Based Statistical Machine Translation (PB-SMT) systems starts from an automatically word-aligned parallel corpus. But words appear to be too fine-grained in some cases, such as non-compositional phrasal equivalences, where no clear word alignments exist. Using words as inputs to the PB-SMT pipeline therefore has an inherent deficiency. This paper proposes the pseudo-word as a new starting point for the PB-SMT pipeline. A pseudo-word is a kind of basic multi-word expression that characterizes a minimal sequence of consecutive words in the sense of translation. By casting the pseudo-word search problem into a parsing framework, we search for pseudo-words in a monolingual way and in a bilingual synchronous way. Experiments show that pseudo-words significantly outperform words for the PB-SMT model in both the travel translation domain and the news translation domain.

1 Introduction

The pipeline of most Phrase-Based Statistical Machine Translation (PB-SMT) systems starts from an automatically word-aligned parallel corpus generated from word-based models (Brown et al., 1993), proceeds with the induction of a phrase table (Koehn et al., 2003) or synchronous grammar (Chiang, 2007), and with a model weight tuning step. Words are taken as inputs to PB-SMT at the very beginning of the pipeline.
But this approach has a deficiency: words are too fine-grained in some cases, such as non-compositional phrasal equivalences, where clear word alignments do not exist. For example, in Chinese-to-English translation, a single Chinese word and "would like to" constitute a 1-to-n phrasal equivalence, and a multi-word Chinese phrase and "how much is it" constitute an m-to-n phrasal equivalence. No clear word alignments exist in such phrasal equivalences. Moreover, whether the basic translational unit should be the word or a coarse-grained multi-word unit is an open problem for optimizing SMT models. Some researchers have explored .
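
As a rough illustration of what it means to feed coarser units than words into the pipeline, the minimal Python sketch below collapses given spans of consecutive words into single tokens before any alignment step. The function name `merge_spans`, the underscore joiner, and the hand-picked spans are assumptions made for this example only; the sketch does not implement the parsing-based search that decides where the spans fall.

```python
from typing import List, Tuple


def merge_spans(tokens: List[str],
                spans: List[Tuple[int, int]],
                joiner: str = "_") -> List[str]:
    """Collapse each half-open span [start, end) of consecutive tokens
    into one token. Spans must be non-empty, in range, and non-overlapping.

    Hypothetical helper for illustration; the spans are supplied by hand
    here rather than found by any search procedure.
    """
    merged: List[str] = []
    pos = 0  # index of the next token not yet copied to the output
    for start, end in sorted(spans):
        if start < pos or start >= end or end > len(tokens):
            raise ValueError(f"invalid or overlapping span: ({start}, {end})")
        merged.extend(tokens[pos:start])               # words outside any span
        merged.append(joiner.join(tokens[start:end]))  # one merged unit
        pos = end
    merged.extend(tokens[pos:])                        # trailing words
    return merged


if __name__ == "__main__":
    # The English sides of the two phrasal equivalences mentioned above.
    print(merge_spans("i would like to go".split(), [(1, 4)]))
    print(merge_spans("how much is it".split(), [(0, 4)]))
```

With the merged tokens as input, a word aligner would see "would_like_to" or "how_much_is_it" as a single unit, so a 1-to-n or m-to-n equivalence reduces to an ordinary one-to-one link on the merged side.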
