On computing correlated aggregates over continual data streams
- 1 May 2001
- journal article
- conference paper
- Published by Association for Computing Machinery (ACM) in ACM SIGMOD Record
- Vol. 30 (2) , 13-24
- https://doi.org/10.1145/376284.375665
Abstract
In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible.We propose single-pass techniques for approximate computation of correlated aggregates over both landmark and sliding window views of a data stream of tuples, using a very limited amount of space. We consider both the case where the independent aggregate (average duration in the example above) is an extrema value and the case where it is an average value, with any standard aggregate as the dependent aggregate; these can be used as building blocks for more sophisticated aggregates. We present an extensive experimental study based on some real and a wide variety of synthetic data sets to demonstrate the accuracy of our techniques. We show that this effectiveness is explained by the fact that our techniques exploit monotonicity and convergence properties of aggregates over data streams.Keywords
This publication has 9 references indexed in Scilit:
- Mining high-speed data streamsPublished by Association for Computing Machinery (ACM) ,2000
- EddiesPublished by Association for Computing Machinery (ACM) ,2000
- Random sampling techniques for space efficient online computation of order statistics of large datasetsPublished by Association for Computing Machinery (ACM) ,1999
- BOAT—optimistic decision tree constructionPublished by Association for Computing Machinery (ACM) ,1999
- Ripple joins for online aggregationPublished by Association for Computing Machinery (ACM) ,1999
- The Space Complexity of Approximating the Frequency MomentsJournal of Computer and System Sciences, 1999
- Data networks as cascadesPublished by Association for Computing Machinery (ACM) ,1998
- Approximate medians and other quantiles in one pass and with limited memoryPublished by Association for Computing Machinery (ACM) ,1998
- Online aggregationPublished by Association for Computing Machinery (ACM) ,1997