Scientific report: "SEXTANT: EXPLORING UNEXPLORED CONTEXTS FOR SEMANTIC EXTRACTION FROM SYNTACTIC ANALYSIS"

Gregory Grefenstette
Computer Science Department, University of Pittsburgh, Pittsburgh, PA 15260
grefen@cs.pitt.edu

Abstract

For a very long time, it has been considered that the only way of automatically extracting similar groups of words from a text collection for which no semantic information exists is to use document co-occurrence data. But, with robust syntactic parsers that are becoming more frequently available, syntactically recognizable phenomena about word usage can be confidently noted in large collections of texts. We present here a new system called SEXTANT which uses these parsers and the finer-grained contexts they produce to judge word similarity.

BACKGROUND

Many machine-based approaches to term similarity, such as those found in TRUMP (Jacobs and Zernik 1988) and FERRET (Mauldin 1991), can be characterized as knowledge-rich in that they presuppose that known lexical items possess Conceptual Dependence (CD)-like descriptions. Such an approach necessitates a great amount of manual encoding of semantic information and suffers from the drawbacks of cost (in terms of initial coding, coherence checking, maintenance after modifications, and costs derivable from a host of other software engineering concerns); of domain dependence (a semantic structure developed for one domain would not be applicable to another; for example, sugar would have very different semantic relations in a medical domain than in a commodities exchange domain); and of rigidity (even within a well-established domain, new subdomains spring up, e.g. AIDS; can hand-coded systems keep up with new discoveries and new relations with an acceptable latency?).

In the Information Retrieval community, researchers have consistently considered that the linguistic apparatus required for effective domain-independent analysis is not yet at hand, and have concentrated on counting document co-occurrence statistics (Peat and Willet 1991), based on the idea ...
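
As a rough illustration of what counting document co-occurrence statistics can look like, the minimal Python sketch below counts, for every pair of words, the number of documents in which both appear. The function name `document_cooccurrence`, the whitespace tokenization, and the toy documents are assumptions made for this illustration; they are not taken from SEXTANT or from the cited work.

```python
from collections import Counter
from itertools import combinations


def document_cooccurrence(documents):
    """Count, for each unordered word pair, how many documents contain both words.

    Hypothetical helper for illustration only: tokenization is a plain
    lowercase whitespace split, with no stemming or stop-word removal.
    """
    pair_counts = Counter()
    for text in documents:
        # Each word counts once per document; sorting gives a canonical pair order.
        vocabulary = sorted(set(text.lower().split()))
        pair_counts.update(combinations(vocabulary, 2))
    return pair_counts


# Toy collection (invented for this example).
docs = [
    "sugar raises blood glucose",
    "sugar futures rose on the commodities exchange",
    "insulin lowers blood glucose",
]

counts = document_cooccurrence(docs)
print(counts[("blood", "glucose")])   # pair keys are in alphabetical order
print(counts[("insulin", "sugar")])
```

Note that a count of this kind records only that two words occur in the same document, not how they relate to each other within a sentence, which is the coarser kind of context the abstract contrasts with parser-derived contexts.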
