Scientific report: "Exploiting Structure for Event Discovery Using the MDI Algorithm"

Martina Naughton
School of Computer Science and Informatics
University College Dublin, Ireland

Abstract

Effectively identifying events in unstructured text is a very difficult task, largely because an individual event can be expressed by several sentences. In this paper, we investigate the use of clustering methods for the task of grouping the text spans in a news article that refer to the same event. The key idea is to cluster the sentences using a novel distance metric that exploits regularities in the sequential structure of events within a document. When this approach is compared to a simple bag-of-words baseline, a statistically significant increase in performance is observed.

1 Introduction

Accurately identifying events in unstructured text is an important goal for many applications that require natural language understanding, and the problem has received increased attention in recent years. The Automatic Content Extraction (ACE) program¹ is dedicated to developing methods that automatically infer meaning from language data. Its tasks include the detection and characterisation of Entities, Relations, and Events. Extensive research has been dedicated to entity recognition and binary relation detection, with significant results (Bikel et al., 1999). However, event extraction is still considered one of the most challenging tasks because an individual event can be expressed by several sentences (Xu et al., 2006).
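To make the clustering setting concrete, the following is a minimal sketch of a bag-of-words baseline of the kind the abstract mentions: each sentence becomes a word-count vector, and sentences are merged bottom-up by cosine distance. The function names, the whitespace tokenizer, the average-link merge rule, and the example sentences are all illustrative assumptions rather than the paper's implementation, and the sketch does not include the structure-aware distance metric.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Word counts from a lower-cased, whitespace-split sentence (crude tokenizer, assumed)."""
    return Counter(sentence.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty sentence is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def cluster_sentences(sentences, num_clusters):
    """Agglomerative clustering with average linkage (an assumed choice).

    Returns a sorted list of clusters, each a sorted list of sentence indices.
    """
    vectors = [bag_of_words(s) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]  # start with singletons

    def average_link(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        # Pick the pair of clusters with the smallest average distance.
        _, x, y = min(
            (average_link(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical sentences describing two events.
sentences = [
    "a bomb exploded in the market",
    "the bomb exploded near the market",
    "the president visited paris",
    "the president left paris",
]
print(cluster_sentences(sentences, 2))  # prints [[0, 1], [2, 3]]
```

A distance built only from shared words ignores where sentences sit in the article, which is the gap the sequence-aware metric described in the abstract is meant to address.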
In this paper, we primarily focus on techniques for identifying events within a given news article. Specifically, we describe and evaluate clustering methods for the task of grouping sentences in a news article that refer to the same event. We generate sentence clusters using three variations of the well-documented Hierarchical Agglomerative Clustering (HAC) (Manning and Schütze, 1999) as a baseline for this task. We provide .

¹ http speech tests ace