Scientific report: "Phrase Table Training For Precision and Recall: What Makes a Good Phrase and a Good Phrase Pair?"

Yonggang Deng, Jia Xu, and Yuqing Gao
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA ({ydeng, yuqing}@)
Chair of Computer Science VI, RWTH Aachen University, D-52056 Aachen, Germany (xujia@)

Abstract

In this work, the problem of extracting phrase translations is formulated as an information retrieval process, implemented with a log-linear model that aims for balanced precision and recall. We present a generic phrase training algorithm that is parameterized with feature functions and can be optimized jointly with the translation engine to directly maximize end-to-end system performance. Multiple data-driven feature functions are proposed to capture the quality and confidence of phrases and phrase pairs. Experimental results demonstrate consistent and significant improvements over the widely used method that is based on the word alignment matrix only.

1 Introduction

The phrase has become the standard basic translation unit in Statistical Machine Translation (SMT), since it naturally captures context dependency and models internal word reordering. In a phrase-based SMT system, the phrase translation table is the defining component: it specifies the alternative translations, and their probabilities, for a given source phrase.
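To make the notion of a phrase translation table concrete, the following is a minimal sketch rather than the training method proposed in this paper. It assumes that phrase pairs have already been extracted and estimates p(target | source) by plain relative frequency, a common baseline. The function name `build_phrase_table` and the toy German-English pairs are invented for illustration.

```python
from collections import Counter, defaultdict


def build_phrase_table(phrase_pairs):
    """Estimate p(target | source) by relative frequency.

    phrase_pairs: list of (source_phrase, target_phrase) tuples, one entry
    per extracted occurrence (duplicates are meaningful: they are counts).
    """
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        # Probability mass for each source phrase sums to 1 over its targets.
        table[src][tgt] = count / source_counts[src]
    return dict(table)


# Toy, invented data: each tuple is one extracted phrase-pair occurrence.
pairs = [
    ("das haus", "the house"),
    ("das haus", "the house"),
    ("das haus", "the building"),
    ("haus", "house"),
]
phrase_table = build_phrase_table(pairs)

# List the alternative translations of one source phrase, best first.
for tgt, prob in sorted(phrase_table["das haus"].items(), key=lambda kv: -kv[1]):
    print(f"das haus -> {tgt}: {prob:.2f}")
```

Each source phrase maps to its alternative translations with probabilities that sum to one, which is the lookup a phrase-based decoder performs for every source span.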
In learning such a table from a parallel corpus, two related issues need to be addressed, either separately or jointly: which pairs are considered valid translations, and how to assign weights, such as probabilities, to them. The first problem is referred to as phrase pair extraction, which identifies phrase pairs that are supposed to be translations of each other. Syntax-based methods have been proposed that take advantage of linguistic constraints and the alignment of grammatical structure, such as in (Yamada and Knight, 2001) and (Wu, 1995). The most widely used approach derives phrase pairs from the word alignment matrix (Och and Ney, 2003; Koehn et al., 2003)
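The alignment-matrix approach can be sketched as follows. This is a simplified illustration of the usual consistency heuristic, not code from the cited papers: a phrase pair is kept if it contains at least one alignment link and no link connects a word inside the pair to a word outside it. The sketch takes the minimal target span for each source span and does not extend over unaligned boundary words. The function name `extract_phrase_pairs`, the `max_len` parameter, and the toy sentence pair are assumptions made for the example.

```python
def extract_phrase_pairs(src_words, tgt_words, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    alignment: set of (i, j) links, source index i aligned to target index j.
    """
    pairs = set()
    n = len(src_words)
    for s_start in range(n):
        for s_end in range(s_start, min(s_start + max_len, n)):
            # Target positions linked to any word in the source span.
            linked = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not linked:
                continue  # a pair must contain at least one link
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency: every link touching the target span must
            # originate inside the source span.
            consistent = all(
                s_start <= i <= s_end
                for (i, j) in alignment
                if t_start <= j <= t_end
            )
            if consistent:
                pairs.add((
                    " ".join(src_words[s_start:s_end + 1]),
                    " ".join(tgt_words[t_start:t_end + 1]),
                ))
    return pairs


# Toy example with a noun-adjective reordering (alignment links are invented).
src = ["la", "maison", "bleue"]
tgt = ["the", "blue", "house"]
links = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrase_pairs(src, tgt, links)
for s, t in sorted(pairs):
    print(f"{s} ||| {t}")
```

In this toy case "la maison" is not extracted, because its minimal target span "the blue house" also covers "blue", which is linked to "bleue" outside the source span. A heuristic of this kind depends entirely on the alignment links.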