Scaling Distributional Similarity to Large Corpora

James Gorman and James R. Curran
School of Information Technologies
University of Sydney
NSW 2006, Australia

Abstract

Accurately representing synonymy using distributional similarity requires large volumes of data to reliably represent infrequent words. However, the naïve nearest-neighbour approach to comparing context vectors extracted from large corpora scales poorly (O(n²) in the vocabulary size). In this paper, we compare several existing approaches to approximating the nearest-neighbour search for distributional similarity. We investigate the trade-off between efficiency and accuracy, and find that SASH (Houle and Sakuma, 2005) provides the best balance.

1 Introduction

It is a general property of Machine Learning that increasing the volume of training data increases the accuracy of results. This is no more evident than in Natural Language Processing (NLP), where massive quantities of text are required to model rare language events. Despite the rapid increase in computational power available for NLP systems, the volume of raw data available still outweighs our ability to process it. Unsupervised learning, which does not require the expensive and time-consuming human annotation of data, offers an opportunity to use this wealth of data. Curran and Moens (2002) show that synonymy extraction for lexical semantic resources using distributional similarity produces continuing gains in accuracy as the volume of input data increases.

Extracting synonymy relations using distributional similarity is based on the distributional hypothesis that similar words appear in similar contexts. Terms are described by collating information about their occurrence in a corpus into vectors. These context vectors are then compared for similarity. Existing approaches differ primarily in their definition of context, e.g. the surrounding words or the entire document, and their choice of distance metric for calculating similarity between context vectors.
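As a rough illustration of the naïve baseline the paper sets out to approximate, the sketch below collates word-window context vectors and then compares every term against every other term. The window size, raw co-occurrence counts, cosine similarity and the toy sentence are placeholder choices for the illustration, not the contexts, weights or distance metric evaluated in the paper; the point is only that the all-pairs search costs O(n²) comparisons in the vocabulary size n.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(tokens, window=2):
    """Collate each term's co-occurrence counts within a small word window
    (one placeholder notion of 'context'; the paper surveys several)."""
    vectors = defaultdict(Counter)
    for i, term in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[term][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (a stand-in metric)."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def nearest_neighbours(vectors):
    """Naive all-pairs search: O(n^2) comparisons in the vocabulary size n,
    which is the cost the approximation methods compared in the paper avoid."""
    terms = list(vectors)
    best = {}
    for t in terms:
        best[t] = max(
            ((cosine(vectors[t], vectors[o]), o) for o in terms if o != t),
            default=(0.0, None),
        )
    return best

tokens = "the cat sat on the mat while the dog sat on the rug".split()
for term, (score, neighbour) in nearest_neighbours(context_vectors(tokens)).items():
    print(f"{term:>6} -> {neighbour} ({score:.2f})")
```

Even on this toy input the quadratic loop over the vocabulary is visible; with the multi-million-word vocabularies extracted from large corpora it becomes the bottleneck that approximate nearest-neighbour structures such as SASH are meant to relieve.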
