tailieunhanh - A Linear-Time Probabilistic Counting Algorithm for Database Applications

There is no fundamental reason that a transaction must abort as aresultofanynondeterministicevent;whensystemsdochoose to abort transactions due to outside events, it is due to practical consideration. After all, forcing all other nodes in a system to wait for the node that experienced a nondeterministic event (such as a hardware failure) to recover could bring a system to a painfully long stand-still. If there is a replica node performing the exact same operations in parallel to a failed node, however, then other nodes that depend on communication with the afflicted node to execute a transaction need not wait for the failed node to recover back to its original state—rather they can make requests. | A Linear-Time Probabilistic Counting Algorithm for Database Applications KYU-YOUNG WHANG Korea Advanced Institute of Science and Technology BRAD T. VANDER-ZANDEN Cornell University and HOWARD M. TAYLOR University of Delaware We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has 0 q time complexity where q is the number of values including duplicates and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally accurate counts of unique values were obtained by sorting which has 0 q log q time complexity. Our technique called linear counting is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result A load factor number of unique values hash table size much larger than . 12 can be used for accurate estimation . 1 of error . We present this technique with two important applications to database problems namely 1 obtaining the column cardinality the number of unique values in a column of a relation and 2 obtaining the join selectivity the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to be joined . These two parameters are important statistics that are used in relational query optimization and physical database design. Categories and Subject Descriptors Mathematics of Computing Probability and Statistics probabilistic algorithms Mathematics of Computing Mathematical Software algorithm analysis Database Management Physical Design Database Management Systems query processing General Terms Algorithms Experimentation Performance Theory Additional Key Words and Phrases Bit map column cardinality hashing join selectivity physical database design query optimization statistical databases 1. INTRODUCTION The number of .

TỪ KHÓA LIÊN QUAN