Randomized Algorithms and NLP: Using Locality Sensitive Hash Function for High Speed Noun Clustering
Deepak Ravichandran, Patrick Pantel, and Eduard Hovy
Information Sciences Institute, University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292
{ravichan, pantel, hovy}@

Abstract

In this paper, we explore the power of randomized algorithms to address the challenge of working with very large amounts of data. We apply these algorithms to generate noun similarity lists from 70 million pages. We reduce the running time from quadratic to practically linear in the number of elements to be computed.

1 Introduction

In the last decade, the field of Natural Language Processing (NLP) has seen a surge in the use of corpus-motivated techniques. Several NLP systems are modeled on empirical data and have had varying degrees of success. Of late, however, corpus-based techniques seem to have reached a plateau in performance. Three possible areas of future research for overcoming this plateau include:

1. Working with large amounts of data (Banko and Brill, 2001).
2. Improving semi-supervised and unsupervised algorithms.
3. Using more sophisticated feature functions.

The above list may not be exhaustive, but it is probably not a bad bet to work in one of these directions. In this paper, we investigate the first two avenues. Handling terabytes of data requires more efficient algorithms than are currently used in NLP. We propose a web-scalable solution to clustering nouns that employs randomized algorithms. In doing so, we explore the literature and techniques of randomized algorithms.

All clustering algorithms make use of some distance (similarity) measure, e.g., cosine similarity, to compute the pairwise distance between sets of vectors. Assume that we are given n points to cluster, each with a maximum of k features. Calculating the full similarity matrix would take time O(n²k). With large amounts of data, say n in the order of millions or even billions, such a computation quickly becomes intractable.
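To make the O(n²k) cost concrete, the following is a minimal sketch (not from the paper) of the brute-force computation: each of the n² vector pairs requires an O(k) dot product. The function name and toy data are illustrative.

```python
import numpy as np

def full_similarity_matrix(vectors):
    """Brute-force pairwise cosine similarity: O(n^2 k) for n vectors of dimension k."""
    # Normalize each row to unit length so a plain dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    n = unit.shape[0]
    sim = np.empty((n, n))
    for i in range(n):                      # n rows ...
        for j in range(n):                  # ... times n columns ...
            sim[i, j] = unit[i] @ unit[j]   # ... each an O(k) dot product
    return sim

# Toy usage: 4 "nouns" with 3 features each.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 4.0]])
print(full_similarity_matrix(X))
```

For n in the millions this double loop is hopeless, which motivates the randomized alternative the paper pursues.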
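The locality sensitive hash function named in the title sidesteps this bottleneck by replacing k-dimensional vectors with short bit signatures. Below is a minimal sketch of random-hyperplane hashing in the spirit of Charikar (2002), on which this line of work builds; the function names and parameters (e.g., num_bits) are illustrative assumptions, not the authors' code. Each bit records which side of a random hyperplane a vector falls on, and the Hamming distance between two signatures estimates the angle between the original vectors, so candidate similar pairs can be found by comparing cheap signatures instead of full vectors.

```python
import numpy as np

def lsh_signatures(vectors, num_bits=64, seed=0):
    """Random-hyperplane LSH: map each k-dim vector to a num_bits-bit signature.
    Pr[bit agrees] = 1 - theta/pi, so Hamming distance estimates the cosine angle."""
    rng = np.random.default_rng(seed)
    k = vectors.shape[1]
    hyperplanes = rng.standard_normal((num_bits, k))  # one random hyperplane per bit
    # Bit b of a signature is the sign of the projection onto hyperplane b.
    return (vectors @ hyperplanes.T) > 0  # boolean array, shape (n, num_bits)

def estimated_cosine(sig_a, sig_b):
    """Estimate cos(theta) between two vectors from their bit signatures alone."""
    hamming_fraction = np.mean(sig_a != sig_b)
    return np.cos(hamming_fraction * np.pi)

# Toy usage: signatures are cheap to compare, so near neighbors can be found
# by bucketing/sorting signatures instead of all-pairs O(k) dot products.
X = np.array([[1.0, 0.0, 2.0],
              [1.1, 0.1, 2.0],   # nearly parallel to the first vector
              [0.0, 1.0, 0.0]])
sigs = lsh_signatures(X, num_bits=256)
print(estimated_cosine(sigs[0], sigs[1]))  # close to 1.0
print(estimated_cosine(sigs[0], sigs[2]))  # much lower
```

Comparing two fixed-length bit signatures costs far less than an O(k) dot product when k runs into the hundreds of thousands of features, which is what makes a practically linear procedure possible.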