Scientific report: "Extending the BLEU MT Evaluation Method with Frequency Weightings"
Bogdan Babych
Centre for Translation Studies, University of Leeds, Leeds LS2 9JT, UK
bogdan@comp.leeds.ac.uk

Anthony Hartley
Centre for Translation Studies, University of Leeds, Leeds LS2 9JT, UK
a.hartley@leeds.ac.uk

Abstract

We present the results of an experiment on extending BLEU, the automatic method of Machine Translation evaluation, with statistical weights for lexical items, such as tf.idf scores. We show that this extension gives additional information about evaluated texts; in particular, it allows us to measure translation Adequacy, which, for statistical MT systems, is often overestimated by the baseline BLEU method. The proposed model uses a single human reference translation, which increases the usability of the proposed method for practical purposes. The model suggests a linguistic interpretation which relates frequency weights and human intuition about translation Adequacy and Fluency.

1. Introduction

Automatic methods for evaluating different aspects of MT quality, such as Adequacy, Fluency and Informativeness, provide an alternative to an expensive and time-consuming process of human MT evaluation. They are intended to yield scores that correlate with human judgments of translation quality and enable systems (machine or human) to be ranked on this basis. Several such automatic methods have been proposed in recent years.
Some of them use human reference translations, e.g. the BLEU method (Papineni et al., 2002), which is based on comparison of N-gram models in MT output and in a set of human reference translations. However, a serious problem for the BLEU method is the lack of a model for the relative importance of matched and mismatched items. Words in text usually carry an unequal informational load and, as a result, are of differing importance for translation. It is reasonable to expect that the choices of right translation equivalents for certain key items, such as expressions denoting principal events event