Correlation between ROUGE and Human Evaluation of Extractive Meeting Summaries

Feifan Liu, Yang Liu
The University of Texas at Dallas, Richardson, TX 75080, USA
ffliu, yangl@

Abstract

Automatic summarization evaluation is critical to the development of summarization systems. While ROUGE has been shown to correlate well with human evaluation for content match in text summarization, the multiparty meeting domain has many characteristics that may pose problems for ROUGE. In this paper, we carefully examine how well ROUGE scores correlate with human evaluation for extractive meeting summarization. Our experiments show that the correlation is generally rather low, but that a significantly better correlation can be obtained by accounting for several unique meeting characteristics, such as disfluencies and speaker information, especially when evaluating system-generated summaries.

1 Introduction

Meeting summarization has drawn increasing attention recently; a study of automatic evaluation metrics for this task is therefore timely. Automatic evaluation helps to advance system development and avoids labor-intensive and potentially inconsistent human evaluation. ROUGE (Lin, 2004) has been widely used for summarization evaluation. In the news article domain, ROUGE scores have been shown to be generally highly correlated with human evaluation of content match (Lin, 2004). However, there are many differences between written texts (e.g., news wire) and spoken documents, especially in the meeting domain: for example, the presence of disfluencies and multiple speakers, and the lack of structure in spontaneous utterances. Whether ROUGE is a good metric for meeting summarization is therefore unclear. Murray et al. (2005) reported that ROUGE-1 (unigram match) scores have low correlation with human evaluation in meetings. In this paper, we investigate the correlation between ROUGE and human evaluation of extractive meeting summaries, and focus on two issues.
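To make the quantities under discussion concrete, the sketch below (a minimal Python illustration, not code from the paper) computes ROUGE-1 recall, i.e., clipped unigram overlap with a reference summary, for a few summaries, then measures the rank correlation between those scores and human ratings; this score-versus-judgment correlation is the kind of measurement the paper examines. All data in the example (candidates, references, human_ratings) are hypothetical, and SciPy's spearmanr stands in for whichever correlation statistic the study actually reports.

    import re
    from scipy.stats import spearmanr

    def rouge_1_recall(candidate: str, reference: str) -> float:
        """ROUGE-1 recall: the fraction of reference unigrams that also
        appear in the candidate, with per-token counts clipped."""
        def bag(text):
            counts = {}
            for tok in re.findall(r"\w+", text.lower()):
                counts[tok] = counts.get(tok, 0) + 1
            return counts

        cand, ref = bag(candidate), bag(reference)
        overlap = sum(min(n, cand.get(tok, 0)) for tok, n in ref.items())
        return overlap / sum(ref.values()) if ref else 0.0

    # Entirely hypothetical toy data: one system summary, one human
    # reference, and one human content rating per meeting.
    candidates = [
        "we should um finalize the budget",
        "the design team uh met on friday",
        "marketing will send the report",
    ]
    references = [
        "we should finalize the budget today",
        "the design team met twice on friday",
        "marketing promised to send the full report",
    ]
    human_ratings = [4.0, 3.0, 3.5]

    rouge_scores = [rouge_1_recall(c, r) for c, r in zip(candidates, references)]
    rho, p_value = spearmanr(rouge_scores, human_ratings)
    print(f"ROUGE-1 recall per meeting: {[round(s, 2) for s in rouge_scores]}")
    print(f"Rank correlation with human ratings: {rho:.2f}")

Note the filler words ("um", "uh") in the toy candidates: they add candidate tokens without matching the reference. The abstract's finding, that accounting for disfluencies and speaker information yields a significantly better correlation, suggests that handling such meeting-specific phenomena before or during scoring changes how well ROUGE tracks human judgments.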
