Sarawagi, Sunita ; Kirpal, Alok (2004) Efficient set joins on similarity predicates In: 2004 ACM SIGMOD international conference on Management of data.
PDF
273kB |
Official URL: http://doi.org/10.1145/1007568.1007652
Related URL: http://dx.doi.org/10.1145/1007568.1007652
Abstract
In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as, set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Source: | Copyright of this article belongs to ACM, Inc |
ID Code: | 128407 |
Deposited On: | 20 Oct 2022 06:28 |
Last Modified: | 14 Nov 2022 11:34 |
Repository Staff Only: item control page