tailieunhanh - Scientific report: "Multi-Field Information Extraction and Cross-Document Fusion"

Multi-Field Information Extraction and Cross-Document Fusion

Gideon S. Mann and David Yarowsky
Department of Computer Science
The Johns Hopkins University
Baltimore, MD 21218 USA
gsm yarowsky @

Abstract

In this paper, we examine the task of extracting a set of biographic facts about target individuals from a collection of Web pages. We automatically annotate training text with positive and negative examples of fact extractions and train Rote, Naïve Bayes, and Conditional Random Field extraction models for fact extraction from individual Web pages. We then propose and evaluate methods for fusing the extracted information across documents to return a consensus answer. A novel cross-field bootstrapping method leverages data interdependencies to yield improved performance.

1 Introduction

Much recent statistical information extraction research has applied graphical models to extract information from one particular document after training on a large corpus of annotated data (Leek, 1997; Freitag and McCallum, 1999).¹ Such systems are widely applicable, yet there remain many information extraction tasks that are not readily amenable to these methods. Annotated data required for training statistical extraction systems is sometimes unavailable, while there are examples of the desired information. Further, the goal may be to find a few interrelated pieces of information that are stated multiple times in a set of documents.

Here, we investigate one task that meets the above criteria. Given the name of a celebrity such as Frank Zappa, our goal is to extract a set of biographic facts (e.g., birthdate, birth place, and occupation) about that person from documents on the Web.

First, we describe a general method of automatic annotation for training from positive and negative examples and use the ...

¹ Alternatively, Riloff (1996) trains on in-domain and out-of-domain texts and then has a human filtering step. Huffman (1995) proposes a method to train a different type of extraction system by example.
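As a rough illustration of the cross-document fusion idea mentioned in the abstract, the sketch below picks a consensus answer by simple frequency voting over per-document extractions. This is an assumption made for illustration only: the function name `fuse_by_vote`, the lowercasing normalization, and the sample values are hypothetical, and the sketch is not claimed to match the fusion methods the paper proposes and evaluates.

```python
from collections import Counter


def fuse_by_vote(extractions):
    """Return (consensus_answer, vote_share) for one field.

    `extractions` holds one candidate string per document; empty strings
    stand for documents where nothing was extracted (hypothetical format).
    """
    # Light normalization so trivially different spellings vote together.
    normalized = [c.strip().lower() for c in extractions if c and c.strip()]
    if not normalized:
        return None, 0.0
    counts = Counter(normalized)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(normalized)


# Made-up per-document candidates for a single field (illustrative only).
candidates = ["Baltimore", "baltimore ", "Los Angeles", "", "Baltimore"]
answer, share = fuse_by_vote(candidates)
print(answer, share)
```

Voting of this kind only relies on the property stated in the introduction, namely that the target facts are repeated across many documents, so an occasional wrong extraction from a single page can be outvoted.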
