Random sampling techniques for space efficient online computation of order statistics of large datasets
- 1 June 1999
- journal article
- conference paper
- Published by Association for Computing Machinery (ACM) in ACM SIGMOD Record
- Vol. 28 (2) , 251-262
- https://doi.org/10.1145/304181.304204
Abstract
In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper.First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length.Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median.Keywords
This publication has 7 references indexed in Scilit:
- Random sampling for histogram constructionPublished by Association for Computing Machinery (ACM) ,1998
- New sampling-based summary statistics for improving approximate query answersPublished by Association for Computing Machinery (ACM) ,1998
- Approximate medians and other quantiles in one pass and with limited memoryACM SIGMOD Record, 1998
- Improved histograms for selectivity estimation of range predicatesPublished by Association for Computing Machinery (ACM) ,1996
- Random sampling with a reservoirACM Transactions on Mathematical Software, 1985
- Access path selection in a relational database management systemPublished by Association for Computing Machinery (ACM) ,1979
- Time bounds for selectionJournal of Computer and System Sciences, 1973