Correlating Human and Automatic Evaluation of a German Surface Realiser

Aoife Cahill
Institut für Maschinelle Sprachverarbeitung (IMS), University of Stuttgart, 70174 Stuttgart, Germany

Abstract

We examine correlations between native speaker judgements on automatically generated German text against automatic evaluation metrics. We look at a number of metrics from the MT and Summarisation communities and find that for a relative ranking task, most automatic metrics perform equally well and have fairly strong correlations to the human judgements. In contrast, on a naturalness judgement task, the General Text Matcher (GTM) tool correlates best overall, although in general, correlation between the human judgements and the automatic metrics was quite weak.

1 Introduction

During the development of a surface realisation system, it is important to be able to quickly and automatically evaluate its performance. The evaluation of a string realisation system usually involves string comparisons between the output of the system and some gold-standard set of strings. Typically, automatic metrics from the fields of Machine Translation (e.g. BLEU) or Summarisation (e.g. ROUGE) are used, but it is not clear how successful or even appropriate these are. Belz and Reiter (2006) and Reiter and Belz (2009) describe comparison experiments between the automatic evaluation of system output and human expert and non-expert evaluation of the same data (English weather forecasts). Their findings show that the NIST metric correlates best with the human judgements and that all automatic metrics favour systems that generate based on frequency. They conclude that automatic evaluations should be accompanied by human evaluations where possible. Stent et al. (2005) investigate a number of automatic evaluation methods for generation in terms of adequacy and fluency on automatically generated English paraphrases. They find that the automatic metrics are reasonably good at measuring adequacy but not fluency.
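To make the evaluation setup concrete, the sketch below shows one way such a comparison might be implemented: scoring realiser outputs against gold strings with an automatic metric (sentence-level BLEU via NLTK, standing in for the metrics discussed here) and correlating those scores with human judgements using Spearman's rank correlation from SciPy. This is a minimal illustration under assumed libraries; the sentences and judgement values are invented placeholders, not data from the paper.

# Minimal sketch (not the paper's actual setup): score realiser outputs
# against gold strings with an automatic metric and correlate the metric
# scores with human judgements. All sentences and ratings are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

gold = [
    "der Hund jagt die Katze",
    "die Katze schläft auf dem Sofa",
    "heute scheint die Sonne",
]
system_output = [
    "der Hund jagt die Katze",
    "auf dem Sofa schläft die Katze",
    "die Sonne scheint",
]
human_scores = [5.0, 3.5, 2.0]  # hypothetical naturalness ratings (1-5 scale)

# Sentence-level BLEU with smoothing: one score per generated string.
smooth = SmoothingFunction().method1
metric_scores = [
    sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
    for ref, hyp in zip(gold, system_output)
]

# Rank correlation between the automatic metric and the human judgements.
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.2f})")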
