Scientific report: "Evaluating Roget’s Thesauri"

Alistair Kennedy
School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario, Canada
akennedy@

Stan Szpakowicz
School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario, Canada, and Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
szpak@

Abstract

Roget’s Thesaurus has gone through many revisions since it was first published 150 years ago. But how do these revisions affect Roget’s usefulness for NLP? We examine the differences in content between the 1911 and 1987 versions of Roget’s, and we test both versions with each other and WordNet on problems such as synonym identification and word relatedness. We also present a novel method for measuring sentence relatedness that can be implemented in either version of Roget’s or in WordNet. Although the 1987 version of the Thesaurus is better, we show that the 1911 version performs surprisingly well and that often the differences between the versions of Roget’s and WordNet are not statistically significant. We hope that this work will encourage others to use the 1911 Roget’s Thesaurus in NLP tasks.

1 Introduction

Roget’s Thesaurus, first introduced over 150 years ago, has gone through many revisions to reach its current state. We compare two versions, the 1987 and 1911 editions of the Thesaurus, with each other and with WordNet. Roget’s Thesaurus has a unique structure, quite different from WordNet, of which the NLP community has yet to take full advantage. In this paper we demonstrate that, although the 1911 version of the Thesaurus is very old, it can give results comparable to systems that use WordNet or newer versions of Roget’s Thesaurus.

The main motivation for working with the 1911 Thesaurus instead of newer versions is that it is in the public domain, along with related NLP-oriented software packages. For applications that call for an NLP-friendly thesaurus, WordNet has become the de-facto .
