Comparing Automatic and Human Evaluation of NLG Systems

Anja Belz
Natural Language Technology Group, CMIS, University of Brighton, UK
A.S.Belz@brighton.ac.uk

Ehud Reiter
Dept of Computing Science, University of Aberdeen, UK
ereiter@csd.abdn.ac.uk

Abstract

We consider the evaluation problem in Natural Language Generation (NLG) and present results for evaluating several NLG systems with similar functionality, including a knowledge-based generator and several statistical systems. We compare evaluation results for these systems by human domain experts, human non-experts, and several automatic evaluation metrics, including NIST, BLEU, and ROUGE. We find that NIST scores correlate best (0.8) with human judgments, but that all automatic metrics we examined are biased in favour of generators that select on the basis of frequency alone. We conclude that automatic evaluation of NLG systems has considerable potential, in particular where high-quality reference texts and only a small number of human evaluators are available. However, in general it is probably best for automatic evaluations to be supported by human-based evaluations, or at least by studies that demonstrate that a particular metric correlates well with human judgments in a given domain.

1 Introduction

Evaluation is becoming an increasingly important topic in Natural Language Generation (NLG), as in other fields of computational linguistics.
Some NLG researchers are impressed by the success of the BLEU evaluation metric (Papineni et al., 2002) in Machine Translation (MT), which has transformed the MT field by allowing researchers to quickly and cheaply evaluate the impact of new ideas, algorithms, and data sets. BLEU and related metrics work by comparing the output of an MT system to a set of reference (gold standard) translations, and in principle this kind of evaluation could be done with NLG systems as well. Indeed, NLG researchers are already starting to use BLEU (Habash, 2004; Belz, 2005) in their evaluations, as this is much cheaper and easier to