tailieunhanh - Báo cáo khoa học: "Efficient Online Locality Sensitive Hashing via Reservoir Counting"
We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint. | Efficient Online Locality Sensitive Hashing via Reservoir Counting Benjamin Van Durme HLTCOE Johns Hopkins University Ashwin Lall Mathematics and Computer Science Denison University Abstract We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting allowing for maintaining a larger number of signatures or an increased level of approximation accuracy at a similar memory footprint. 1 Introduction Feature vectors based on lexical co-occurrence are often of a high dimension d. This leads to O d operations to calculate cosine similarity a fundamental tool in distributional semantics. This is improved in practice through the use of data structures that exploit feature sparsity leading to an expected O f operations where f is the number of unique features we expect to have non-zero entries in a given vector. Ravichandran et al. 2005 showed that the Lo cality Sensitive Hash LSH procedure of Charikar 2002 following from Indyk and Motwani 1998 and Goemans and Williamson 1995 could be suc cessfully used to compress textually derived fea ture vectors in order to achieve speed efficiencies in large-scale noun clustering. Such LSH bit signa tures are constructed using the following hash func- tion where V E Rd is a vector in the original feature space and r is randomly drawn from N 0 1 d h V I 0 if V r 0 otherwise. If hb v is the b-bit signature resulting from b such hash functions then the cosine similarity between vectors u and V is approximated by ce s u. v U L cos DfoMk W TT _ cos u V U V cos b n where D - is Hamming distance the number of bits that disagree. This technique is used when b d which leads to faster pair-wise comparisons between vectors and a lower memory footprint. Van Durme and Lall 2010 observed1 that if the feature values are additive over a dataset . when collecting word co-occurrence frequencies then these signatures may be .
đang nạp các trang xem trước