Scientific report: "Mining Web Sites Using Adaptive Information Extraction"
Alexiei Dingli, Fabio Ciravegna, David Guthrie and Yorick Wilks
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street, S1 4DP Sheffield, UK

1 Introduction

Adaptive Information Extraction systems (IES) are currently used by some Semantic Web (SW) annotation tools as support to annotation (Handschuh et al., 2002; Vargas-Vera et al., 2002). They are generally based on fully supervised methodologies requiring fairly intense domain-specific annotation. Unfortunately, selecting representative examples may be difficult, and annotations can be incorrect and take time to produce. In this paper we present a methodology that drastically reduces (or even removes) the amount of manual annotation required when annotating consistent sets of pages. A very limited number of user-defined examples are used to bootstrap learning. Simple, high-precision (and possibly high-recall) IE patterns are induced from these examples; these patterns then discover more examples, which in turn discover more patterns, and so on.

The key feature that enables such bootstrapping is the redundancy on the Web. Redundancy is given by the presence of multiple citations of the same facts in different superficial formats, and it is currently used for several tasks, such as improving question answering systems (Dumais et al., 2002) and performing information extraction using machine learning (Mitchell, 2001). When known information is presented in different sources, it is possible to use its multiple occurrences to bootstrap recognisers that, when generalised, will retrieve other pieces of information, producing in turn more generic recognisers. In our model, redundancy of information is increased by using preexisting services (e.g. search engines, digital libraries). This improves the effectiveness of bootstrapping.

Another typical feature of Web pages that we exploit for learning is document formatting. HTML and XML pages often contain formatting …
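
The bootstrapping loop described in this introduction (examples induce patterns, patterns discover more examples) can be illustrated with a minimal Python sketch. It is not the authors' system: the function names, the fixed-width character contexts used as "patterns", and the toy pages are all assumptions made purely for illustration. A real adaptive IE system would induce far more general patterns.

```python
import re

def induce_patterns(texts, examples, width=4):
    """Collect (left, right) character contexts around every known example.

    Assumption: a 'pattern' is just the `width` characters on each side
    of an occurrence; this is a deliberately crude stand-in.
    """
    patterns = set()
    for text in texts:
        for example in examples:
            for m in re.finditer(re.escape(example), text):
                left = text[max(0, m.start() - width):m.start()]
                right = text[m.end():m.end() + width]
                if left and right:
                    patterns.add((left, right))
    return patterns

def apply_patterns(texts, patterns):
    """Return every string found between a known left and right context."""
    found = set()
    for text in texts:
        for left, right in patterns:
            regex = re.escape(left) + r"(.+?)" + re.escape(right)
            for m in re.finditer(regex, text):
                found.add(m.group(1))
    return found

def bootstrap(texts, seeds, rounds=3):
    """Alternate pattern induction and example discovery until nothing new appears."""
    examples = set(seeds)
    for _ in range(rounds):
        patterns = induce_patterns(texts, examples)
        new_examples = apply_patterns(texts, patterns) - examples
        if not new_examples:
            break
        examples |= new_examples
    return examples

# Hypothetical toy pages: "Bob Jones" appears in two different formats,
# which is the kind of redundancy the bootstrapping relies on.
pages = [
    "<li>Alice Smith</li><li>Bob Jones</li>",
    "<td>Bob Jones</td><td>Carol White</td>",
]
print(sorted(bootstrap(pages, {"Alice Smith"})))
# -> ['Alice Smith', 'Bob Jones', 'Carol White']
```

In this toy run the single seed yields a list-item pattern that finds "Bob Jones"; the second occurrence of "Bob Jones" in a table cell then yields a new pattern that finds "Carol White", mirroring how repeated facts in different formats can drive the loop.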