Scientific report: "Names and Similarities on the Web: Fact Extraction in the Fast Lane"

Marius Pasca, Google Inc., Mountain View, CA 94043, mars@
Dekang Lin, Google Inc., Mountain View, CA 94043, lindek@
Jeffrey Bigham, University of Washington, Seattle, WA 98195, jbigham@
Andrei Lifchits, University of British Columbia, Vancouver, BC V6T 1Z4, alifchit@
Alpa Jain, Columbia University, New York, NY 10027, alpa@

Abstract

In a new approach to large-scale extraction of facts from unstructured text, distributional similarities become an integral part of both the iterative acquisition of high-coverage contextual extraction patterns, and the validation and ranking of candidate facts. The evaluation measures the quality and coverage of facts extracted from one hundred million Web documents, starting from ten seed facts and using no additional knowledge, lexicons or complex tools.

1 Introduction

Background. The potential impact of structured fact repositories containing billions of relations among named entities on Web search is enormous. They enable the pursuit of new search paradigms, the processing of database-like queries, and alternative methods of presenting search results. The preparation of exhaustive lists of hand-written extraction rules is impractical given the need for domain-independent extraction of many types of facts from unstructured text.
In contrast, the idea of bootstrapping for relation and information extraction was first proposed in (Riloff and Jones, 1999), and successfully applied to the construction of semantic lexicons (Thelen and Riloff, 2002), named entity recognition (Collins and Singer, 1999), extraction of binary relations (Agichtein and Gravano, 2000), and acquisition of structured data for tasks such as Question Answering (Lita and Carbonell, 2004; Fleischman et al., 2003). In the context of fact extraction, the resulting iterative acquisition framework starts from a small set of seed facts, finds .

Work done during internships at Google Inc.
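The general bootstrapping idea cited above, which starts from a few seed facts and alternates between inducing contextual patterns and extracting new facts, can be illustrated with a minimal generic sketch. It is not the method of this paper: it omits the distributional similarities that the abstract describes for generalizing patterns and for validating and ranking candidates. The (name, year) relation, the toy sentences, the function names and the matching rules are all assumptions made only for illustration.

```python
import re
from typing import Iterable, List, Set, Tuple

Fact = Tuple[str, str]  # hypothetical (name, year) pair, chosen for illustration


def induce_patterns(sentences: Iterable[str], facts: Set[Fact]) -> Set[str]:
    """Collect the text found between the two arguments of each known fact."""
    patterns: Set[str] = set()
    for sentence in sentences:
        for left, right in facts:
            i = sentence.find(left)
            j = sentence.find(right)
            # Keep the infix only when the left argument precedes the right one.
            if i != -1 and j != -1 and i + len(left) <= j:
                infix = sentence[i + len(left):j]
                if infix.strip():
                    patterns.add(infix)
    return patterns


def apply_patterns(sentences: Iterable[str], patterns: Set[str]) -> Set[Fact]:
    """Match each infix against the text to propose new candidate facts."""
    found: Set[Fact] = set()
    for infix in patterns:
        # Assumption: names are runs of capitalized words, years are 4 digits.
        rx = re.compile(
            r"([A-Z]\w+(?: [A-Z]\w+)*)" + re.escape(infix) + r"(\d{4})"
        )
        for sentence in sentences:
            for match in rx.finditer(sentence):
                found.add((match.group(1), match.group(2)))
    return found


def bootstrap(sentences: List[str], seeds: Set[Fact], iterations: int = 2) -> Set[Fact]:
    """Alternate pattern induction and fact extraction for a fixed number of rounds."""
    facts = set(seeds)
    for _ in range(iterations):
        patterns = induce_patterns(sentences, facts)
        facts |= apply_patterns(sentences, patterns)
    return facts


# Toy corpus and a single seed fact (both invented for this example).
SENTENCES = [
    "Wolfgang Mozart was born in 1756 in Salzburg.",
    "Ludwig Beethoven was born in 1770 in Bonn.",
    "Ludwig Beethoven, born 1770, wrote nine symphonies.",
    "Franz Schubert, born 1797, wrote many songs.",
]
SEEDS = {("Wolfgang Mozart", "1756")}

if __name__ == "__main__":
    for fact in sorted(bootstrap(SENTENCES, SEEDS, iterations=2)):
        print(fact)
```

In this toy run the first round learns the infix " was born in " from the seed and finds the Beethoven fact. The second round learns ", born " from that new fact and reaches the Schubert fact. Because every extracted candidate is accepted without validation, a real system built this way would be prone to drift, which is why candidate validation and ranking matter in practice.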