Scientific report: "An Integrated Term-Based Corpus Query System"

Irena Spasic, Computer Science, University of Salford, I. Spasic@
Goran Nenadic, Dept of Computation, UMIST
Kostas Manios, Computer Science, University of Salford
Sophia Ananiadou, Computer Science, University of Salford

Abstract

In this paper we describe the X-TRACT workbench, which enables efficient term-based querying against a domain-specific literature corpus. Its main aim is to aid domain specialists in locating and extracting new knowledge from scientific literature corpora. Before querying, a corpus is automatically terminologically analysed by the ATRACT system, which performs terminology recognition based on the C/NC-value method enhanced by incorporation of term variation handling. The results of terminology processing are annotated in XML, and the produced XML documents are stored in an XML-native database. All corpus retrieval operations are performed against this database using an XML query language. We illustrate the way in which the X-TRACT workbench can be utilised for knowledge discovery, literature mining and conceptual information extraction.

1 Introduction

New scientific discoveries usually result in an abundance of publications verbalising these findings in an attempt to share new knowledge with other scientists.
Electronically available texts are continually being created and updated, and thus the knowledge represented in such texts is more up-to-date than in any other media. The sheer amount of published papers1 makes it difficult for a human to efficiently localise the information of interest, not only in a collection of documents but also within a single document. The growing number of electronically available ...

1 For example, the Medline database PubMed currently contains over 12 million abstracts in the domains of molecular biology, biomedicine and medicine, growing by more than abstracts each month.
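The pipeline outlined in the abstract (terms annotated in XML, then retrieved with an XML query language) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the element names (corpus, abstract, sentence, term), the norm attribute holding a normalised form that groups term variants, and the sample sentences are hypothetical and are not taken from X-TRACT or ATRACT. Python's xml.etree.ElementTree and its limited XPath subset stand in for the XML-native database and the XML query language the paper refers to.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation scheme: tag and attribute names are illustrative
# only and do not reproduce the actual X-TRACT/ATRACT markup.
SAMPLE = """
<corpus>
  <abstract id="a1">
    <sentence><term norm="retinoic acid receptor">Retinoic acid receptors</term> bind <term norm="dna">DNA</term>.</sentence>
  </abstract>
  <abstract id="a2">
    <sentence>The <term norm="nuclear receptor">nuclear receptor</term> family includes the <term norm="retinoic acid receptor">RAR</term>.</sentence>
  </abstract>
</corpus>
""".strip()


def abstracts_mentioning(root, norm):
    """Return ids of abstracts containing a term whose normalised form is `norm`."""
    return [
        a.get("id")
        for a in root.findall("abstract")
        if a.find(f".//term[@norm='{norm}']") is not None
    ]


def cooccurring_terms(root, norm):
    """Return normalised terms that share a sentence with the given term."""
    found = set()
    for sentence in root.iter("sentence"):
        norms = {t.get("norm") for t in sentence.findall("term")}
        if norm in norms:
            found |= norms - {norm}
    return sorted(found)


root = ET.fromstring(SAMPLE)
print(abstracts_mentioning(root, "retinoic acid receptor"))
print(cooccurring_terms(root, "retinoic acid receptor"))
```

Querying on the normalised form rather than the surface string is what lets a variant such as "RAR" match the same query as "Retinoic acid receptors" in this sketch. It is one plausible way to exploit term variation handling, not necessarily the one the authors implemented.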
