Automatic Set Instance Extraction using the Web

Richard C. Wang
Language Technologies Institute
Carnegie Mellon University
rcwang@

William W. Cohen
Machine Learning Department
Carnegie Mellon University
wcohen@

Abstract

An important and well-studied problem is the production of semantic lexicons from a large corpus. In this paper, we present a system named ASIA (Automatic Set Instance Acquirer), which takes in the name of a semantic class as input (e.g., “car makers”) and automatically outputs its instances (e.g., “ford”, “nissan”, “toyota”). ASIA is based on recent advances in web-based set expansion, the problem of finding all instances of a set given a small number of “seed” instances. This approach effectively exploits web resources and can be easily adapted to different languages. In brief, we use language-dependent hyponym patterns to find a noisy set of initial seeds, and then use a state-of-the-art language-independent set expansion system to expand these seeds. The proposed approach matches or outperforms prior systems on several English-language benchmarks. It also shows excellent performance on three dozen additional benchmark problems from English, Chinese, and Japanese, thus demonstrating language-independence.

1 Introduction

An important and well-studied problem is the production of semantic lexicons for classes of interest; that is, the generation of all instances of a set (e.g., “apple”, “orange”, “banana”) given a name of that set (e.g., “fruits”). This task is often addressed by linguistically analyzing very large collections of text (Hearst, 1992; Kozareva et al., 2008; Etzioni et al., 2005; Pantel and Ravichandran, 2004; Pasca, 2004), often using hand-constructed or machine-learned shallow linguistic patterns to detect hyponym instances. A hyponym is a word or phrase whose semantic range

[Figure 1 table, columns: English, Chinese, Japanese. English entries: Amazing Race, Survivor, Big Brother, The Mole, The Apprentice, Project Runway, The Bachelor. The Chinese and Japanese entries were lost in extraction.]
Figure 1: Examples of SEAL's