Scientific report: "Instance Splitting Strategies for Dependency Relation-based Information Extraction"

ARE: Instance Splitting Strategies for Dependency Relation-based Information Extraction

Mstislav Maslennikov, Hai-Kiat Goh, Tat-Seng Chua
Department of Computer Science, School of Computing, National University of Singapore
{maslenni, gohhaiki, chuats}@

Abstract

Information Extraction (IE) is a fundamental technology for NLP. Previous methods for IE relied on co-occurrence relations, soft patterns and properties of the target (for example, syntactic role), which leads to problems in handling paraphrasing and aligning instances. Our system ARE (Anchor and Relation) is based on the dependency relation model and tackles these problems by unifying entities according to their dependency relations, which we found to provide more invariant relations between entities in many cases. To exploit the complexity and characteristics of relation paths, we further classify the relation paths into the categories easy, average and hard, and apply different extraction strategies based on the characteristics of each category. Our extraction method improves performance by 3% and 6% for MUC4 and MUC6, respectively, compared to state-of-the-art IE systems.

1 Introduction

Information Extraction (IE) is one of the fundamental problems of natural language processing. Progress in IE is important for improving results in tasks such as Question Answering, Information Retrieval and Text Summarization.
Multiple efforts in the MUC series allowed IE systems to achieve near-human performance in domains such as biology (Humphreys et al., 2000), terrorism (Kaufmann, 1992; Kaufmann, 1993) and management succession (Kaufmann, 1995). In the MUC series, the IE task is formulated as filling several predefined slots in a template. The terrorism template consists of the slots Perpetrator, Victim and Target; the slots in the management succession template are Org, PersonIn, PersonOut and Post. We decided to choose both the terrorism and management succession domains, from MUC4 and MUC6 respectively, in ...
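The slot-filling formulation described above can be pictured as a small data structure. The following is only a minimal sketch: the slot names come from the text, while the `Template` class, the `TEMPLATE_SLOTS` table, the `fill` method and the example filler are hypothetical names introduced for illustration and are not part of ARE or of the MUC specifications.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Slot names as listed in the text; the domain keys are illustrative labels.
TEMPLATE_SLOTS: Dict[str, List[str]] = {
    "terrorism": ["Perpetrator", "Victim", "Target"],
    "management_succession": ["Org", "PersonIn", "PersonOut", "Post"],
}


@dataclass
class Template:
    """Hypothetical container for one template with predefined slots."""

    domain: str
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every predefined slot of the domain starts out empty.
        for name in TEMPLATE_SLOTS[self.domain]:
            self.slots.setdefault(name, [])

    def fill(self, slot: str, value: str) -> None:
        # Only slots predefined for this domain may be filled.
        if slot not in self.slots:
            raise KeyError(f"{slot!r} is not a slot of the {self.domain} template")
        self.slots[slot].append(value)


# Usage with an invented filler (not taken from any MUC document).
template = Template("management_succession")
template.fill("PersonIn", "Jane Doe")
```

Each slot holds a list here on the assumption that a slot may receive more than one filler; a single-valued slot would work equally well for the illustration.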