Join-distinct aggregate estimation over update streams

Dimensions

Ganguly, Sumit ; Garofalakis, Minos ; Kumar, Amit ; Rastogi, Rajeev (2005) Join-distinct aggregate estimation over update streams In: PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 13 - 15, 2005, Baltimore Maryland.

Full text not available from this repository.

Official URL: http://doi.org/10.1145/1065167.1065200

Related URL: http://dx.doi.org/10.1145/1065167.1065200

Abstract

There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. Providing (perhaps approximate) answers to queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. The ability to estimate the number of distinct (sub)tuples in the result of a join operation correlating two data streams (i.e., the cardinality of a projection with duplicate elimination over a join) is an important requirement for several data-analysis scenarios. For instance, to enable real-time traffic analysis and load balancing, a network-monitoring application may need to estimate the number of distinct (source, destination) IP-address pairs occurring in the stream of IP packets observed by router R1, where the source address is also seen in packets routed through a different router R2. Earlier work has presented solutions to the individual problems of distinct counting and join-size estimation (without duplicate elimination) over streams. These solutions, however, are fundamentally different and extending or combining them to handle our more complex "Join-Distinct" estimation problem is far from obvious. In this paper, we propose the first space-efficient algorithmic solution to the general Join-Distinct estimation problem over continuous data streams (our techniques can actually handle general update streams comprising tuple deletions as well as insertions). Our estimators are probabilistic in nature and rely on novel algorithms for building and combining a new class of hash-based synopses (termed "JD sketches") for individual update streams. We demonstrate that our algorithms can provide low error, high-confidence Join-Distinct estimates using only small space and small processing time per update. In fact, we present lower bounds showing that the space usage of our estimators is within small factors of the best possible for the Join-Distinct problem. Preliminary experimental results verify the effectiveness of our approach.

Item Type:	Conference or Workshop Item (Paper)
Source:	Copyright of this article belongs to Association for Computing Machinery.
ID Code:	123550
Deposited On:	30 Sep 2021 10:24
Last Modified:	30 Sep 2021 10:24

Repository Staff Only: item control page

PlumX Metrics