Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics

Chin-Yew Lin and Franz Josef Och
Information Sciences Institute, University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292, USA
cyl och @

Abstract

In this paper we describe two new objective automatic evaluation methods for machine translation. The first method is based on the longest common subsequence (LCS) between a candidate translation and a set of reference translations. The longest common subsequence naturally takes sentence-level structural similarity into account and automatically identifies the longest co-occurring in-sequence n-grams. The second method relaxes strict n-gram matching to skip-bigram matching. A skip-bigram is any pair of words in their sentence order. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. The empirical results show that both methods correlate very well with human judgments of both adequacy and fluency.

1 Introduction

Using objective functions to automatically evaluate machine translation quality is not new. Su et al. (1992) proposed a method based on measuring the edit distance (Levenshtein, 1966) between candidate and reference translations. Akiba et al. (2001) extended the idea to accommodate multiple references. Nießen et al. (2000) calculated a length-normalized edit distance, called word error rate (WER), between a candidate and multiple reference translations. Leusch et al.
(2003) proposed a related measure, position-independent word error rate (PER), which disregards word position by matching bags of words instead. Instead of error measures, we can also use accuracy measures that compute the similarity between candidate and reference translations in proportion to the number of words they have in common, as suggested by Melamed (1995). BLEU, an n-gram co-occurrence measure proposed by Papineni et al. (2001), calculates co-occurrence .
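The two statistics underlying the proposed methods can be sketched in a few lines of Python. This is a minimal illustration of LCS length and skip-bigram overlap over whitespace tokens, not the authors' implementation; the F-measure scoring the paper builds on top of these counts is omitted, and the example sentences are drawn from the paper's "police killed the gunman" examples.

```python
# Minimal sketch of the two statistics: LCS length (dynamic programming)
# and skip-bigram overlap. Not the authors' implementation; scoring
# (precision/recall/F-measure over these counts) is omitted.
from itertools import combinations


def lcs_length(cand, ref):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(cand), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if cand[i - 1] == ref[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def skip_bigrams(tokens):
    """All ordered word pairs in sentence order, with any gap allowed.
    A set keeps the example simple (fine when tokens are distinct)."""
    return set(combinations(tokens, 2))


def skip_bigram_overlap(cand, ref):
    """Number of skip-bigrams shared by candidate and reference."""
    return len(skip_bigrams(cand) & skip_bigrams(ref))


cand = "police killed the gunman".split()
ref = "police kill the gunman".split()
print(lcs_length(cand, ref))           # 3: "police ... the gunman"
print(skip_bigram_overlap(cand, ref))  # 3: shared in-order word pairs
```

Note how LCS rewards in-sequence matches without requiring them to be contiguous, while skip-bigrams relax contiguity in the opposite direction: every in-order pair counts, however far apart the two words are.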