tailieunhanh - Scientific report: "Re-evaluating the Role of Bleu in Machine Translation Research"
Re-evaluating the Role of Bleu in Machine Translation Research

Chris Callison-Burch, Miles Osborne, Philipp Koehn
School of Informatics, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW
callison-burch@

Abstract

We argue that the machine translation community is overly reliant on the Bleu machine translation evaluation metric. We show that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and give two significant counterexamples to Bleu's correlation with human judgments of quality. This offers new potential for research which was previously deemed unpromising by an inability to improve upon Bleu scores.

1 Introduction

Over the past five years, progress in machine translation, and to a lesser extent progress in natural language generation tasks such as summarization, has been driven by optimizing against n-gram-based evaluation metrics such as Bleu (Papineni et al., 2002). The statistical machine translation community relies on the Bleu metric for the purposes of evaluating incremental system changes and optimizing systems through minimum error rate training (Och, 2003). Conference papers routinely claim improvements in translation quality by reporting improved Bleu scores, while neglecting to show any actual example translations. Workshops commonly compare systems using Bleu scores, often without confirming these rankings through manual evaluation.
All these uses of Bleu are predicated on the assumption that it correlates with human judgments of translation quality, which has been shown to hold in many cases (Doddington, 2002; Coughlin, 2003). However, there is a question as to whether minimizing the error rate with respect to Bleu does indeed guarantee genuine translation improvements. If Bleu's correlation with human judgments has been overestimated, then the field needs to ask itself whether it should continue to be driven by Bleu to the extent that it currently is. In this paper we give a number