Scientific report: "Dependency Tree Kernels for Relation Extraction"

Aron Culotta, University of Massachusetts, Amherst, MA 01002, USA, culotta@
Jeffrey Sorensen, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA, sorenj@

Abstract

We extend previous work on tree kernels to estimate the similarity between the dependency trees of sentences. Using this kernel within a Support Vector Machine, we detect and classify relations between entities in the Automatic Content Extraction (ACE) corpus of news articles. We examine the utility of different features such as WordNet hypernyms, parts of speech, and entity types, and find that the dependency tree kernel achieves a 20% F1 improvement over a "bag-of-words" kernel.

1 Introduction

The ability to detect complex patterns in data is limited by the complexity of the data's representation. In the case of text, a more structured data source (e.g., a relational database) allows richer queries than does an unstructured data source (e.g., a collection of news articles). For example, current web search engines would not perform well on the query "list all California-based CEOs who have social ties with a United States Senator." Only a structured representation of the data can effectively provide such a list. The goal of Information Extraction (IE) is to discover relevant segments of information in a data stream that will be useful for structuring the data.
In the case of text, this usually amounts to finding mentions of interesting entities and the relations that join them, transforming a large corpus of unstructured text into a relational database with entries such as those in Table 1. IE is commonly viewed as a three-stage process: first, an entity tagger detects all mentions of interest; second, coreference resolution resolves disparate mentions of the same entity; third, a relation extractor finds relations between these entities. Entity tagging has been thoroughly addressed by many statistical machine learning techniques, obtaining greater than 90% F1.
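
The three-stage view of IE described above can be illustrated with a minimal Python sketch. Everything in it is a hypothetical placeholder chosen for illustration: the gazetteer, the pronoun rule, the "joined" trigger word, and the `employed_by` label are assumptions, not components or relation types from this paper, which uses a kernel-based classifier for the third stage. The sketch only shows how data flows from mentions, to resolved entities, to database-style relation entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str         # surface form of the mention
    entity_type: str  # hypothetical type label, e.g. "PERSON"
    start: int        # token index of the mention


@dataclass(frozen=True)
class Relation:
    arg1: str
    relation: str
    arg2: str


# Hypothetical toy gazetteer standing in for a trained entity tagger.
GAZETTEER = {"Smith": "PERSON", "he": "PERSON", "Acme": "ORGANIZATION"}


def tag_entities(tokens):
    """Stage 1: detect mentions of interest (toy dictionary lookup)."""
    return [Mention(tok, GAZETTEER[tok], i)
            for i, tok in enumerate(tokens) if tok in GAZETTEER]


def resolve_coreference(mentions):
    """Stage 2: map each mention to a canonical entity name.

    Toy rule (an assumption): the pronoun "he" refers to the most
    recent preceding PERSON mention.
    """
    canonical = {}
    last_person = None
    for m in mentions:
        if m.text == "he" and last_person is not None:
            canonical[m] = last_person
        else:
            canonical[m] = m.text
            if m.entity_type == "PERSON":
                last_person = m.text
    return canonical


def extract_relations(tokens, mentions, canonical):
    """Stage 3: find relations between resolved entities.

    Toy rule (an assumption): a PERSON followed later by an ORGANIZATION
    with the word "joined" in between yields an "employed_by" entry.
    """
    relations = []
    for a in mentions:
        for b in mentions:
            if (a.entity_type == "PERSON"
                    and b.entity_type == "ORGANIZATION"
                    and a.start < b.start
                    and "joined" in tokens[a.start + 1:b.start]):
                relations.append(
                    Relation(canonical[a], "employed_by", canonical[b]))
    # Coreferent mentions can produce the same entry twice; keep one copy.
    return list(dict.fromkeys(relations))


def run_pipeline(sentence):
    """Run the three stages in order on a whitespace-tokenized sentence."""
    tokens = sentence.split()
    mentions = tag_entities(tokens)
    canonical = resolve_coreference(mentions)
    return extract_relations(tokens, mentions, canonical)


if __name__ == "__main__":
    print(run_pipeline("Smith said he joined Acme"))
```

Because "he" resolves to "Smith" in the second stage, both person mentions collapse into a single relation entry, which is why coreference resolution sits between tagging and relation extraction in this view.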