Efficient set joins on similarity predicates

Sarawagi, Sunita ; Kirpal, Alok (2004) Efficient set joins on similarity predicates In: 2004 ACM SIGMOD international conference on Management of data.

PDF
273kB

Official URL: http://doi.org/10.1145/1007568.1007652

Related URL: http://dx.doi.org/10.1145/1007568.1007652

Abstract

In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as, set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.

Item Type:	Conference or Workshop Item (Paper)
Source:	Copyright of this article belongs to ACM, Inc
ID Code:	128407
Deposited On:	20 Oct 2022 06:28
Last Modified:	14 Nov 2022 11:34

Repository Staff Only: item control page