tailieunhanh - Báo cáo khoa học: "Online Generation of Locality Sensitive Hash Signatures"

Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique. . | Online Generation of Locality Sensitive Hash Signatures Benjamin Van Durme HLTCOE Johns Hopkins University Baltimore MD 21211 USA Ashwin Lall College of Computing Georgia Institute of Technology Atlanta GA 30332 USA Abstract Motivated by the recent interest in streaming algorithms for processing large text collections we revisit the work of Ravichandran et al. 2005 on using the Locality Sensitive Hash LSH method of Charikar 2002 to enable fast approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream we show that LSH signatures can be maintained online without additional approximation error and with lower memory requirements than when using the standard offline technique. 1 Introduction There has been a surge of interest in adapting results from the streaming algorithms community to problems in processing large text collections. The term streaming refers to a model where data is made available sequentially and it is assumed that resource limitations preclude storing the entirety of the data for offline batch processing. Statistics of interest are approximated via online randomized algorithms. Examples of text applications include collecting approximate counts Talbot 2009 Van Durme and Lall 2009a finding top-n elements Goyal et al. 2009 estimating term co-occurrence Li et al. 2008 adaptive language modeling Levenberg and Osborne 2009 and building top-k ranklists based on pointwise mutual information Van Durme and Lall 2009b . Here we revisit the work of Ravichandran et al. 2005 on building word similarity measures from large text collections by using the Locality Sensitive Hash LSH method of Charikar 2002 . For the common case of feature updates being additive over a data stream such as when tracking lexical co-occurrence we show that LSH signatures can be maintained online without additional approximation error and with lower memory requirements than when using the standard offline technique. We .

TỪ KHÓA LIÊN QUAN