Scientific report: "Modified Distortion Matrices for Phrase-Based Statistical Machine Translation"
Arianna Bisazza and Marcello Federico
Fondazione Bruno Kessler, Trento, Italy
bisazza federico @

Abstract

This paper presents a novel method to suggest long word reorderings to a phrase-based SMT decoder. We address language pairs where long reordering concentrates on a few patterns, and use fuzzy chunk-based rules to predict likely reorderings for these phenomena. We then use reordered n-gram LMs to rank the resulting permutations and select the n-best for translation. Finally, we encode these reorderings by modifying selected entries of the distortion cost matrix on a per-sentence basis. In this way, we expand the search space by a much finer degree than if we simply raised the distortion limit. The proposed techniques are tested on Arabic-English and German-English using well-known SMT benchmarks.

1 Introduction

Despite the large research effort devoted to the modeling of word reordering, this remains one of the main obstacles to the development of accurate SMT systems for many language pairs. On one hand, the phrase-based approach (PSMT; Och, 2002; Zens et al., 2002; Koehn et al., 2003), with its shallow and loose modeling of linguistic equivalences, appears as the most competitive choice for closely related language pairs with similar clause structures, both in terms of quality and of efficiency. On the other, tree-based approaches (Wu, 1997; Yamada, 2002; Chiang, 2005) gain advantage, at the cost of higher complexity and isomorphism assumptions, on language pairs with radically different word orders.
Lying between these two extremes are language pairs where most of the reordering happens locally and where long reorderings can be isolated and described by a handful of linguistic rules. Notable examples are the family-unrelated Arabic-English and the related German-English language pairs. Interestingly, on these pairs PSMT generally prevails over tree-based SMT, producing overall high-quality outputs and
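The last step of the abstract, modifying selected entries of the distortion cost matrix for each sentence, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the usual distance-based distortion cost |next - prev - 1| between the last translated source word and the next one, and it assumes that every jump along a suggested permutation is given cost 0; the excerpt does not say which values are actually used. The function names and the 8-word example are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Matrix = List[List[int]]


def base_distortion_matrix(n: int) -> Matrix:
    """Distance-based cost of translating source word `nxt` right after `prev`.

    Row index is prev + 1, so row 0 stands for the sentence start (prev = -1).
    """
    return [[abs(nxt - prev - 1) for nxt in range(n)] for prev in range(-1, n)]


def apply_suggested_reorderings(
    matrix: Matrix, permutations: Sequence[Sequence[int]], new_cost: int = 0
) -> Matrix:
    """Return a copy in which jumps along each suggested permutation are cheap.

    Assumption: suggested jumps get `new_cost` (0 by default); entries that
    are already cheaper are left unchanged.
    """
    modified = [row[:] for row in matrix]
    for perm in permutations:
        prev = -1  # start of sentence
        for nxt in perm:
            modified[prev + 1][nxt] = min(modified[prev + 1][nxt], new_cost)
            prev = nxt
    return modified


def allowed_jumps(matrix: Matrix, limit: int) -> Set[Tuple[int, int]]:
    """Jumps (prev, nxt) whose cost does not exceed the distortion limit."""
    return {
        (row - 1, nxt)
        for row, costs in enumerate(matrix)
        for nxt, cost in enumerate(costs)
        if row - 1 != nxt and cost <= limit
    }


# Hypothetical 8-word sentence: a rule suggests translating word 6 right after word 1.
base = base_distortion_matrix(8)
modified = apply_suggested_reorderings(base, [[0, 1, 6, 2, 3, 4, 5, 7]])

# With the distortion limit kept at 2, only the suggested long jumps are added.
newly_allowed = allowed_jumps(modified, 2) - allowed_jumps(base, 2)
print(sorted(newly_allowed))
```

Under these assumptions the decoder gains only the two long jumps needed by the suggested reordering (from word 1 to word 6, and from word 6 back to word 2), whereas raising the distortion limit itself would admit every jump up to the new limit.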