Scientific report: "Detecting Novel Compounds: The Role of Distributional Evidence"


Mirella Lapata
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK
mlap@dcs.shef.ac.uk

Alex Lascarides
School of Informatics, The University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, UK
alex@inf.ed.ac.uk

Abstract

Research on the discovery of terms from corpora has focused on word sequences whose recurrent occurrence in a corpus is indicative of their terminological status, and has not addressed the issue of discovering terms when data is sparse. This becomes apparent in the case of noun compounding, which is extremely productive: more than half of the candidate compounds extracted from a corpus are attested only once. We show how evidence about established (i.e., frequent) compounds can be used to estimate features that can discriminate rare valid compounds from rare nonce terms, in addition to a variety of linguistic features that can be easily gleaned from corpora without relying on parsed text.

1 Introduction

The nature and properties of compounds have been studied at length in the theoretical linguistics literature. It is a well-known fact that compound noun formation in English is relatively productive (see (1)). Although compounds are typically binary (see (1a,b)), they can also be longer than two words (see (1e)). Compounds are commonly written as a concatenation of words (see (1a,b)) or as single words (see (1c)); sometimes a hyphen is also used (see (1e)).

(1) a. income tax
    b. AT&T headquarters
    c. bathroom
    d. public-relations
    e. income-tax relief

The use of noun compounds is frequent not only in technical writing and newswire text (McDonald 1982) but also in fictional prose (Leonard 1984) and spoken language (Liberman and Sproat 1992). Novel compounds are used as a text compression device (Marsh 1984), i.e., to pack meaning into a minimal amount of linguistic structure, as a deictic device, or as a means to classify an entity which has no specific name (Downing 1977).
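The sparsity observation in the abstract (more than half of the candidate compounds are attested only once) can be made concrete with a small sketch. The function name `hapax_share` and the toy candidate list are hypothetical illustrations, not the authors' extraction procedure or data; the sketch only assumes that candidates have already been extracted as word tuples.

```python
from collections import Counter


def hapax_share(candidates):
    """Return the fraction of distinct candidates attested exactly once."""
    counts = Counter(candidates)
    if not counts:
        return 0.0
    # Count the distinct candidate types whose corpus frequency is 1.
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return hapaxes / len(counts)


# Toy, made-up candidate list (an assumption for illustration only).
candidates = [
    ("income", "tax"), ("income", "tax"), ("income", "tax"),
    ("public", "relations"), ("public", "relations"),
    ("tax", "relief"),
    ("relief", "agency"),
    ("agency", "headquarters"),
]

share = hapax_share(candidates)
print(f"{share:.0%} of distinct candidates occur once")
```

On real corpus data, a share above one half would correspond to the situation the abstract describes, where frequency alone cannot separate rare valid compounds from nonce sequences.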