tailieunhanh - Scientific report: "The Back-translation Score: Automatic MT Evaluation at the Sentence Level without Reference Translations"
The Back-translation Score: Automatic MT Evaluation at the Sentence Level without Reference Translations

Reinhard Rapp
Universitat Rovira i Virgili
Avinguda Catalunya, 35
43002 Tarragona, Spain

Abstract

Automatic tools for machine translation (MT) evaluation such as BLEU are well established, but have the drawbacks that they do not perform well at the sentence level and that they presuppose manually translated reference texts. Assuming that the MT system to be evaluated can deal with both directions of a language pair, in this research we propose conducting automatic MT evaluation by determining the orthographic similarity between a back-translation and the original source text. This way we eliminate the need for human-translated reference texts. By correlating BLEU and back-translation scores with human judgments, it could be shown that the back-translation score gives an improved performance at the sentence level.

1 Introduction

The manual evaluation of the results of machine translation systems requires considerable time and effort. For this reason, fast and inexpensive automatic methods were developed. They are based on the comparison of a machine translation with a reference translation produced by humans. The comparison is done by determining the number of matching word sequences between both translations.
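The two comparisons described so far (matching word sequences against a reference, and orthographic similarity between a source sentence and its back-translation) can be illustrated with a minimal sketch. It is not the paper's scoring method: the excerpt does not specify the similarity measure, so the Dice coefficient over n-grams is used here as a stand-in assumption, and the function names `ngrams`, `dice_overlap`, `word_overlap`, and `back_translation_similarity` are hypothetical. The back-translation itself is assumed to have been produced elsewhere by the MT system under evaluation.

```python
from collections import Counter


def ngrams(items, n):
    """Return a multiset (Counter) of all contiguous n-grams in a sequence."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def dice_overlap(a, b, n):
    """Dice coefficient over n-gram multisets: 2 * |A & B| / (|A| + |B|).

    Assumption: Dice is a stand-in similarity, not the measure used in the paper.
    """
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both inputs are shorter than n, so there is nothing to compare.
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    matches = sum((grams_a & grams_b).values())
    return 2.0 * matches / total


def word_overlap(candidate, reference, n=2):
    """Reference-based comparison: overlap of word n-grams."""
    return dice_overlap(candidate.lower().split(), reference.lower().split(), n)


def back_translation_similarity(source, back_translation, n=3):
    """Reference-free comparison: overlap of character n-grams between the
    source sentence and its back-translation."""
    return dice_overlap(source.lower(), back_translation.lower(), n)


if __name__ == "__main__":
    source = "The weather is nice today."
    back = "The weather is fine today."  # hypothetical back-translation
    print(back_translation_similarity(source, back))
```

Working on character n-grams rather than whole words means that near-matches such as inflectional variants still contribute to the score, which is one reason an orthographic measure may behave better on single sentences than exact word matching.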
It could be shown that such methods, of which BLEU (Papineni et al., 2002) is the most common, can deliver evaluation results that show a high agreement with human judgments (Papineni et al., 2002; Coughlin, 2003; Koehn & Monz, 2006). Disadvantages of BLEU and related methods are that a human reference translation is required and that the results are reliable only at corpus level, i.e. when computed over many sentence pairs (see e.g. Callison-Burch et al., 2006). However, at the sentence level, due to data sparseness, the results tend to be unsatisfactory (Agarwal & Lavie, 2008; Callison-Burch et al., 2008). Papineni et al. (2002) describe this as follows: BLEU's ...