Scientific report: "Noun-Phrase Analysis in Unrestricted Text for Information Retrieval"

David A. Evans, Chengxiang Zhai
Laboratory for Computational Linguistics
Carnegie Mellon University
Pittsburgh, PA 15213
dae@ cz25@

Abstract

Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient, noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe a hybrid approach to the extraction of meaningful (continuous or discontinuous) subcompounds from complex noun phrases using both corpus statistics and linguistic heuristics. Results of experiments show that indexing based on such extracted subcompounds improves both recall and precision in an information retrieval system. The noun-phrase analysis techniques are also potentially useful for book indexing and automatic thesaurus extraction.

1 Introduction

Information Retrieval

Information retrieval (IR) is an important application area of natural-language processing (NLP).1 The IR (or perhaps more accurately, text retrieval) task may be characterized as the problem of selecting a subset of documents from a document collection whose content is relevant to the information need of a user as expressed by a query. The document collections involved in IR are often gigabytes of unrestricted natural-language text. A user's query may be expressed in a controlled language (e.g., a boolean expression of keywords) or, more desirably, a natural language, such as English.

A typical IR system works as follows. The documents to be retrieved are processed to extract indexing terms or content carriers, which are usually single words or (less typically) phrases. The indexing terms provide a description of the document's

1 (Evans, 1990; Evans et al., 1993; Smeaton, 1992; Lewis and Sparck Jones, 1996)
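
As a rough illustration of the indexing step described above, the following minimal sketch extracts single-word indexing terms from a few documents and answers a boolean keyword query against them. It is not the authors' system: the function names (extract_terms, build_index, boolean_and), the tokenization rule, and the sample documents are all assumptions made for this example, and it indexes single words only rather than the noun-phrase subcompounds the paper proposes.

```python
import re
from collections import defaultdict


def extract_terms(text):
    """Return the set of lowercase single-word indexing terms in a text.

    Assumption: a term is any run of letters or digits; hyphens split words.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs):
    """Map each indexing term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in extract_terms(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, keywords):
    """Return the ids of documents that contain every keyword (boolean AND)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()


# Hypothetical three-document collection, invented for illustration.
docs = {
    "d1": "Noun-phrase analysis for information retrieval",
    "d2": "Automatic thesaurus extraction from text",
    "d3": "Information extraction from unrestricted text",
}
index = build_index(docs)

print(sorted(boolean_and(index, ["information", "retrieval"])))  # ['d1']
print(sorted(boolean_and(index, ["extraction", "text"])))        # ['d2', 'd3']
```

With single-word terms, a document about "information extraction" and one about "information retrieval" both match the keyword "information", which is the kind of imprecision that phrase-based indexing aims to reduce.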
