Scientific report: "Extracting Hypernym Pairs from the Web"

Erik Tjong Kim Sang
ISLA, Informatics Institute
University of Amsterdam
erikt@

Abstract

We apply pattern-based methods for collecting hypernym relations from the web. We compare our approach with hypernym extraction from morphological clues and from large text corpora. We show that the abundance of available data on the web enables obtaining good results with relatively unsophisticated techniques.

1 Introduction

WordNet is a key lexical resource for natural language applications. However, its coverage (currently 155k synsets for the English WordNet) is far from complete. For languages other than English, the available WordNets are considerably smaller; the Dutch WordNet, for example, has 44k synsets. Here, the lack of coverage creates bigger problems. A manual extension of the WordNets is costly. Currently, there is a lot of interest in automatic techniques for updating and extending taxonomies like WordNet.

Hearst (1992) was the first to apply fixed syntactic patterns like "such NP as NP" for extracting hypernym-hyponym pairs. Caraballo (1999) built noun hierarchies from evidence collected from conjunctions. Pantel, Ravichandran and Hovy (2004) learned syntactic patterns for identifying hypernym relations and combined these with clusters built from co-occurrence information. Recently, Snow, Jurafsky and Ng (2005) generated tens of thousands of hypernym patterns and combined these with noun clusters to generate high-precision suggestions for unknown noun insertion into WordNet (Snow et al., 2006).

The previously mentioned papers deal with English. Little work has been done for other languages. IJzereef (2004) used fixed patterns to extract Dutch hypernyms from text and encyclopedias. Van der Plas and Bouma (2005) employed noun distribution characteristics for extending the Dutch part of EuroWordNet.

In earlier work, different techniques have been applied to large and very large text corpora. Today, the web contains more data than the largest .
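
To make the fixed-pattern idea from the introduction concrete, the following is a minimal sketch of matching the "such NP as NP" pattern attributed to Hearst (1992). It is not the implementation used in the paper. It assumes, purely for illustration, that a noun phrase can be approximated by a single alphabetic word; the names `WORD`, `SUCH_AS` and `extract_hypernym_pairs` are hypothetical, and a real system would likely rely on a part-of-speech tagger or chunker to find noun phrase boundaries.

```python
import re

# Assumption: a noun phrase is approximated by one alphabetic word.
WORD = r"[A-Za-z]+"

# Matches "such HYPERNYM as HYPONYM, HYPONYM, ... (and|or) HYPONYM".
# The negative lookahead keeps "and"/"or" from being read as a list item.
SUCH_AS = re.compile(
    rf"\bsuch\s+(?P<hyper>{WORD})\s+as\s+"
    rf"(?P<hypos>{WORD}"
    rf"(?:\s*,\s*(?!(?:and|or)\b){WORD})*"
    rf"(?:\s*,?\s+(?:and|or)\s+{WORD})?)",
    re.IGNORECASE,
)

# Separators inside the hyponym list: commas (optionally followed by
# "and"/"or") or a bare "and"/"or" between two words.
LIST_SEPARATOR = re.compile(
    r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+",
    re.IGNORECASE,
)


def extract_hypernym_pairs(text):
    """Return (hyponym, hypernym) pairs found by the 'such NP as NP' pattern.

    Words are returned as they appear in the text (no lemmatization).
    """
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group("hyper")
        for hyponym in LIST_SEPARATOR.split(match.group("hypos")):
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs
```

Called on a sentence such as "He read works by such authors as Herrick, Goldsmith, and Shakespeare.", the function pairs each listed name with the word "authors". The sketch does no lemmatization or noise filtering, and handling of multi-word noun phrases is left out.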