Validating the web-based evaluation of NLG systems

Alexander Koller (Saarland University), Kristina Striegnitz (Union College), Donna Byron (Northeastern University), Justine Cassell (Northwestern University), Robert Dale (Macquarie University), Sara Dalzel-Job (University of Edinburgh), Jon Oberlander (University of Edinburgh), Johanna Moore (University of Edinburgh)

Abstract

The GIVE Challenge is a recent shared task in which NLG systems are evaluated over the Internet. In this paper, we validate this novel NLG evaluation methodology by comparing the Internet-based results with results we collected in a lab experiment. We find that the results delivered by both methods are consistent, but the Internet-based approach offers the statistical power necessary for more fine-grained evaluations and is cheaper to carry out.

1 Introduction

Recently, there has been increased interest in evaluating and comparing natural language generation (NLG) systems on shared tasks (Belz, 2009; Dale and White, 2007; Gatt et al., 2008). However, this is a notoriously hard problem (Scott and Moore, 2007). Task-based evaluations with human experimental subjects are time-consuming and expensive, and corpus-based evaluations of NLG systems are problematic because a mismatch between human-generated output and system-generated output does not necessarily mean that the system's output is inferior (Belz and Gatt, 2008). This lack of evaluation methods which are both effective and efficient is a serious obstacle to progress in NLG research.

The GIVE Challenge (Byron et al., 2009) is a recent shared task which takes a third approach to NLG evaluation: by connecting NLG systems to experimental subjects over the Internet, it achieves a true task-based evaluation at a much lower cost. Indeed, the first GIVE Challenge acquired data from over 1100 experimental subjects online. However, it still remains to be shown that the results obtained in this way are valid.