tailieunhanh - Approximate String Joins in a Database (Almost) for Free

ProDom(Corpet et al., 2000) is one of the earliest clustered protein family databases and continually updates its methods and services. Currently, it coordinates some of its larger entries with Pfam-A and uses PSI-BLAST to cluster the remaining sequences in SWISS-PROT and TrEMBL. While only large entries have been scrutinized manually, the consistency of all families is assessed by computing a series of numerical measurements. The resulting families are represented as consensus sequences and gapped multiple alignments. Phylogenetic trees are computed from these alignments and used to display a family in overlapped subfamilies based on distances in the tree. Documentation consists of links to the protein sequence databases and to other protein family databases (PROSITE,. | Approximate String Joins in a Database Almost for Free Luis Gravano Columbia University gravano@ Nick Koudas AT T Labs-Research koudas@ Panagiotis G. Ipeirotis Columbia University pirot@ S. Muthukrishnan AT T Labs-Research muthu@ H. V. Jagadish University of Michigan jag@ Divesh Srivastava AT T Labs-Research divesh@ Abstract String data is ubiquitous and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due for example to the prevalence of typographical errors in data and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly and it is a challenge to implement this functionality efficiently with user-defined functions UDFs . In this paper we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core our technique relies on matching short substrings of length q called g-grams and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching with a variety of possible edit distance functions. The approximate string match predicate with a suitable edit distance threshold can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs using commercial database systems and real data. To study the I O and CPU behavior of approximate string join algorithms with variations in edit distance and -gram length we also describe detailed experiments based on a prototype implementation. Permission to copy .