Extracting and Querying a Comprehensive Web Database

Michael J. Cafarella
University of Washington
Seattle, WA 98107, USA
mjc@

ABSTRACT

Recent research in domain-independent information extraction holds the promise of an automatically constructed structured database derived from the Web. A query system based on this database would offer the same breadth as a Web search engine, but with much more sophisticated query tools than are common today. Unfortunately, these domain-independent Web extractors are usually not model-independent; e.g., an extractor that only finds binary relations from text will be blind to relational data found in tables. Because a topic area often has a data model that is a natural fit (e.g., population statistics are usually in tables, while biographical facts about Einstein are embedded in text), even a high-quality domain-independent extractor will miss a substantial amount of data. Our OMNIVORE system attempts to build a comprehensive Web database by running multiple domain-independent extractors in parallel over a Web crawl, then combining their outputs into a single large entity-relationship database. Each item in the database describes a single real-world entity and can contain information drawn from a number of popular Web data models.
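The extract-then-combine step described above can be illustrated with a minimal sketch. It is not the OMNIVORE implementation: the two extractors, the page layout (`sentences`, `tables`), and the lowercase-name entity matching are all hypothetical stand-ins chosen only to show how outputs from different data models might land in one entity-keyed store.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def text_extractor(page):
    # Hypothetical stand-in for a binary-relation extractor over text.
    # Assumes each "sentence" is already a (subject, relation, object) triple.
    return [(s, r, o, "text") for (s, r, o) in page.get("sentences", [])]


def table_extractor(page):
    # Hypothetical stand-in for a table extractor.
    # Assumes the first column of every table names the entity.
    facts = []
    for table in page.get("tables", []):
        header = table["header"]
        for row in table["rows"]:
            for column, cell in zip(header[1:], row[1:]):
                facts.append((row[0], column, cell, "table"))
    return facts


def build_database(pages, extractors):
    # Run every extractor over every page concurrently, then merge all
    # facts into one store keyed by a (naively) normalized entity name.
    database = defaultdict(lambda: defaultdict(list))
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(ex, page) for ex in extractors for page in pages]
        for future in futures:
            for entity, attribute, value, model in future.result():
                key = entity.strip().lower()
                # Keep the source data model alongside each value.
                database[key][attribute].append((value, model))
    return database
```

Recording the originating model with each value keeps the provenance available, so a later stage could still decide whether a result is better rendered as a table or as text.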
The user can correct flaws in the database and can query it using either a structured query language or a search-like interface. Due to the Web's sheer size, users cannot be expected to know the result set's metadata a priori, so OMNIVORE automatically chooses an output model and schema when it renders results. In this paper we outline the OMNIVORE architecture and provide specific details about our current prototype.

Categories and Subject Descriptors
H.3 Information Storage and Retrieval: Online Information Services; Database Management: Miscellaneous

I. INTRODUCTION

Domain-independent information extraction has been an active research area in the last few years, often using a large ...