Human Evaluation of a German Surface Realisation Ranker

Aoife Cahill
Institut für Maschinelle Sprachverarbeitung (IMS), University of Stuttgart, 70174 Stuttgart, Germany

Martin Forst
Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304, USA
mforst@

Abstract

In this paper we present a human-based evaluation of surface realisation alternatives. We examine the relative rankings of naturally occurring corpus sentences and automatically generated strings chosen by statistical models (a language model and a log-linear model), as well as the naturalness of the strings chosen by the log-linear model. We also investigate to what extent preceding context has an effect on choice. We show that native speakers do accept considerable variation in word order, but that there are also clearly factors that make certain realisation alternatives more natural.

1 Introduction

An important component of research on surface realisation, the task of generating strings for a given abstract representation, is evaluation, especially if we want to be able to compare across systems. There is consensus that exact match with respect to an actually observed corpus sentence is too strict a metric, and that BLEU score measured against corpus sentences can only give a rough impression of the quality of the system output. It is unclear, however, what kind of metric would be most suitable for the evaluation of string realisations, so that, as a result, a range of automatic metrics have been applied, including, inter alia, exact match, string edit distance, NIST SSA, BLEU, NIST, ROUGE, generation string accuracy, generation tree accuracy and word accuracy (Bangalore et al., 2000; Callaway, 2003; Nakanishi et al., 2005; Velldal and Oepen, 2006; Belz and Reiter, 2006). It is not always clear how appropriate these metrics are, especially at the level of individual sentences. Using automatic evaluation metrics cannot be avoided, but ideally a metric for the evaluation of realisation rankers would rank …
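To make the limitation concrete, the following minimal sketch (not code from the paper; it assumes NLTK is available, and the German example sentences are invented) computes three of the string-based metrics listed above for a word-order variant of a reference sentence:

```python
# Minimal sketch of three automatic realisation metrics, assuming NLTK.
# The reference and candidate sentences are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.metrics import edit_distance

reference = "die Katze schläft auf dem Sofa".split()   # corpus sentence
candidate = "auf dem Sofa schläft die Katze".split()   # reordered realisation

# Exact match: 1 only if the realisation reproduces the corpus string verbatim.
exact = int(candidate == reference)

# Sentence-level BLEU against the single corpus reference (smoothed, since
# short sentences often have zero higher-order n-gram overlap).
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)

# Word-level string edit distance between realisation and reference.
dist = edit_distance(reference, candidate)

print(exact, round(bleu, 3), dist)
```

A word-order variant like this scores 0 on exact match, poorly on BLEU and far from 0 on edit distance, even though German speakers may find both orders perfectly acceptable; this gap between string-based scores and speaker judgements is exactly what the human evaluation in this paper probes.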
