Scientific report: "Robust Machine Translation Evaluation with Entailment Features"

Sebastian Padó
Stuttgart University
pado@ims.uni-stuttgart.de

Michel Galley, Dan Jurafsky, Chris Manning
Stanford University
{mgalley,jurafsky,manning}@stanford.edu

Abstract

Existing evaluation metrics for machine translation lack crucial robustness: their correlations with human quality judgments vary considerably across languages and genres. We believe that the main reason is their inability to properly capture meaning: a good translation candidate means the same thing as the reference translation, regardless of formulation. We propose a metric that evaluates MT output based on a rich set of features motivated by textual entailment, such as lexical-semantic (in-)compatibility and argument structure overlap. We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings. The combination metric outperforms the individual scores, but is bested by the entailment-based metric. Combining the entailment and traditional features yields further improvements.

1 Introduction

Constant evaluation is vital to the progress of machine translation (MT).
Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002). BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. The resulting scores are cheap and objective. However, studies such as Callison-Burch et al. (2006) have identified a number of problems with BLEU and related n-gram-based scores: (1) BLEU-like metrics are unreliable at the level of individual sentences due to data sparsity; (2) BLEU metrics can be gamed by permuting word order; (3) for some corpora and languages, the correlation to human ratings is very low even at the system