Different Structures for Evaluating Answers to Complex Questions: Pyramids Won't Topple, and Neither Will Human Assessors

Hoa Trang Dang
Information Access Division
National Institute of Standards and Technology
Gaithersburg, MD 20899

Jimmy Lin
College of Information Studies
University of Maryland
College Park, MD 20742
jimmylin@

Abstract

The idea of "nugget pyramids" has recently been introduced as a refinement to the nugget-based methodology used to evaluate answers to complex questions in the TREC QA tracks. This paper examines data from the 2006 evaluation, the first large-scale deployment of the nugget pyramids scheme. We show that this method of combining judgments of nugget importance from multiple assessors increases the stability and discriminative power of the evaluation while introducing only a small additional burden in terms of manual assessment. We also consider an alternative method for combining assessor opinions, which yields a distinction similar to micro- and macro-averaging in the context of classification tasks. While the two approaches differ in terms of underlying assumptions, their results are nevertheless highly correlated.

1 Introduction

The emergence of question answering (QA) systems for addressing complex information needs has necessitated the development and refinement of new methodologies for evaluating and comparing systems. In the Text REtrieval Conference (TREC) QA tracks organized by the U.S. National Institute of Standards and Technology (NIST), improvements in evaluation processes have kept pace with the evolution of QA tasks. For the past several years, NIST has implemented an evaluation methodology based on the notion of information nuggets to assess answers to complex questions. As it has become the de facto standard for evaluating such systems, the research community stands to benefit from a better understanding of the characteristics of this evaluation methodology. This paper explores recent refinements to the nugget-based evaluation methodology.
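To make the contrast described in the abstract concrete, the following Python sketch illustrates, in deliberately simplified form, how judgments of nugget importance from several assessors could be combined into per-nugget weights (the "pyramid" idea) and how that differs from a macro-style alternative that scores a response against each assessor separately and then averages. The nugget identifiers, the assessor judgments, and the recall-only score are hypothetical illustrations; the official TREC nugget F-measure (length allowances, the vital/okay distinction in precision, and nugget matching) is intentionally omitted.

# Illustrative sketch (not the official TREC scoring code): combining multiple
# assessors' vital-nugget judgments into weights vs. macro-averaging per-assessor
# scores. Nugget names and judgments below are hypothetical.

from typing import Dict, List, Set

def pyramid_weights(vital_sets: List[Set[str]]) -> Dict[str, float]:
    """Weight each nugget by how many assessors marked it vital,
    normalized so the most-agreed-upon nugget has weight 1.0."""
    counts: Dict[str, int] = {}
    for vitals in vital_sets:
        for nugget in vitals:
            counts[nugget] = counts.get(nugget, 0) + 1
    top = max(counts.values())
    return {n: c / top for n, c in counts.items()}

def weighted_recall(returned: Set[str], weights: Dict[str, float]) -> float:
    """Simplified recall: weight mass of matched nuggets over total weight mass."""
    total = sum(weights.values())
    matched = sum(w for n, w in weights.items() if n in returned)
    return matched / total if total else 0.0

# Hypothetical vital-nugget judgments from three assessors for one question.
assessor_vitals = [
    {"n1", "n2"},   # assessor A
    {"n1", "n3"},   # assessor B
    {"n1"},         # assessor C
]
system_response = {"n1", "n3"}

# Pyramid-style scoring: combine judgments into weights first, then score once.
weights = pyramid_weights(assessor_vitals)
pyramid_score = weighted_recall(system_response, weights)

# Macro-style alternative: score against each assessor separately, then average.
per_assessor = [
    weighted_recall(system_response, {n: 1.0 for n in vitals})
    for vitals in assessor_vitals
]
macro_score = sum(per_assessor) / len(per_assessor)

print(f"pyramid-weighted recall: {pyramid_score:.2f}")   # 0.80 on this toy data
print(f"macro-averaged recall:   {macro_score:.2f}")     # 0.83 on this toy data

The two strategies embody different assumptions: the pyramid pools assessor opinions into a single weighted answer key before scoring, whereas the macro-style variant treats each assessor's answer key as equally authoritative and averages the resulting scores, a distinction analogous to micro- versus macro-averaging in classification evaluation.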