tailieunhanh - Scientific report: "Partial Matching Strategy for Phrase-based Statistical Machine Translation"
Partial Matching Strategy for Phrase-based Statistical Machine Translation

Zhongjun He (1, 2), Qun Liu (1), and Shouxun Lin (1)
(1) Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
(2) Graduate University of Chinese Academy of Sciences, Beijing 100049, China
{zjhe, liuqun, sxlin}@

Abstract

This paper presents a partial matching strategy for phrase-based statistical machine translation (PBSMT). Source phrases that do not appear in the training corpus can be translated by word substitution according to partially matched phrases. The advantage of this method is that it can alleviate the data sparseness problem when the amount of bilingual corpus is limited. We incorporate our approach into the state-of-the-art PBSMT system Moses and achieve statistically significant improvements on both small and large corpora.

1 Introduction

Currently, most phrase-based statistical machine translation (PBSMT) models (Marcu and Wong, 2002; Koehn et al., 2003) adopt a full matching strategy for phrase translation, which means that a phrase pair $(\tilde{f}, \tilde{e})$ can be used for translating a source phrase $\bar{f}$ only if $\tilde{f} = \bar{f}$. Because it lacks generalization ability, the full matching strategy has some limitations. On the one hand, the data sparseness problem is serious, especially when the amount of bilingual data is limited. On the other hand, for a given source text the phrase table is redundant, since most of the bilingual phrases cannot be fully matched.
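To make the contrast between full and partial matching concrete, here is a minimal Python sketch. It is not the authors' implementation: the phrase table entries are invented, the function names `full_match` and `partial_matches` are hypothetical, and the partial-match criterion (same length, differing in at most `max_mismatches` word positions) is an assumption chosen for illustration, since this excerpt does not state the paper's actual criterion.

```python
from typing import Dict, List, Optional, Tuple

Phrase = Tuple[str, ...]

# Toy phrase table: source phrase -> target phrase.
# These entries are invented for illustration, not taken from the paper.
PHRASE_TABLE: Dict[Phrase, str] = {
    ("das", "rote", "haus"): "the red house",
    ("das", "alte", "haus"): "the old house",
    ("ein", "buch"): "a book",
}


def full_match(source: Phrase, table: Dict[Phrase, str]) -> Optional[str]:
    """Full matching: a pair is usable only if its source side equals `source`."""
    return table.get(source)


def partial_matches(
    source: Phrase, table: Dict[Phrase, str], max_mismatches: int = 1
) -> List[Tuple[Phrase, str, List[int]]]:
    """Assumed partial-match criterion (hypothetical): same length as `source`
    and differing in at least one but at most `max_mismatches` positions.

    Returns (table source phrase, target phrase, mismatched positions).
    """
    results = []
    for candidate, target in table.items():
        if len(candidate) != len(source):
            continue
        diff = [i for i, (a, b) in enumerate(zip(candidate, source)) if a != b]
        if 0 < len(diff) <= max_mismatches:
            results.append((candidate, target, diff))
    return results


unseen: Phrase = ("das", "blaue", "haus")
print(full_match(unseen, PHRASE_TABLE))       # no exact entry for the unseen phrase
print(partial_matches(unseen, PHRASE_TABLE))  # near matches and where they differ
```

Under full matching the unseen phrase has no usable entry, while the assumed partial-match lookup returns near matches together with the positions that differ. The word substitution step that would turn such a near match into a translation is left out, because this excerpt does not describe how it is done.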
In this paper, we address the problem of translating unseen phrases, that is, source phrases that are not observed in the training corpus. The alignment template model (Och and Ney, 2004) enhanced phrasal generalization by using word classes rather than the words themselves, but the resulting phrases are overly generalized. The hierarchical phrase-based model (Chiang, 2005) used hierarchical phrase pairs to strengthen the generalization ability of phrases and to allow long-distance reordering. However the