MaxSim: A Maximum Similarity Metric for Machine Translation Evaluation
Yee Seng Chan and Hwee Tou Ng
Department of Computer Science, National University of Singapore, Law Link, Singapore 117590
chanys, nght @

Abstract

We propose an automatic machine translation (MT) evaluation metric that calculates a similarity score (based on precision and recall) of a pair of sentences. Unlike most metrics, we compute a similarity score between items across the two sentences. We then find a maximum weight matching between the items, such that each item in one sentence is mapped to at most one item in the other sentence. This general framework allows us to use arbitrary similarity functions between items, and to incorporate different information in our comparison, such as n-grams, dependency relations, etc. When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.

1 Introduction

In recent years, machine translation (MT) research has made much progress, including the introduction of automatic metrics for MT evaluation. Since human evaluation of MT output is time consuming and expensive, having a robust and accurate automatic MT evaluation metric that correlates well with human judgement is invaluable. Among all the automatic MT evaluation metrics, BLEU (Papineni et al., 2002) is the most widely used.
Although BLEU has played a crucial role in the progress of MT research, it is becoming evident that BLEU does not correlate well enough with human judgement, and it suffers from several other deficiencies, such as the lack of an intuitive interpretation of its scores. During the recent ACL-07 workshop on statistical MT (Callison-Burch et al., 2007), a total of 11 automatic MT evaluation metrics were evaluated for correlation with human judgement. The results show that, as compared to BLEU, several recently proposed metrics, such as Semantic-role overlap .
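The matching framework described in the abstract can be sketched in a few lines. The sketch below is illustrative only, not the paper's exact metric: it assumes items are unigrams and uses a toy character-overlap similarity function (both assumptions of this example), then finds a maximum weight matching with the Hungarian algorithm so that each item in one sentence is mapped to at most one item in the other, and derives precision, recall, and an F-score from the total matched weight.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def similarity(a: str, b: str) -> float:
    """Toy item similarity: Dice coefficient over character sets.
    (Illustrative stand-in for an arbitrary similarity function.)"""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def max_sim_score(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    # Pairwise similarity between items across the two sentences.
    sim = np.array([[similarity(c, r) for r in ref] for c in cand])
    # Maximum weight matching: each candidate item is mapped to at
    # most one reference item, maximizing the total similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    total = sim[rows, cols].sum()
    precision = total / len(cand)
    recall = total / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the similarity function is a free parameter, the same skeleton could score n-grams or dependency relations instead of unigrams, which is the generality the abstract highlights.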