Sarawagi, Sunita; Bhamidipaty, Anuradha (2002) Interactive deduplication using active learning. SIGKDD Explorations, p. 269. ISSN 1931-0145
Official URL: http://doi.org/10.1145/775047.775087
Related URL: http://dx.doi.org/10.1145/775047.775087
Abstract
Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refers to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
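The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea of choosing challenging training pairs interactively. It uses plain uncertainty sampling with a single similarity threshold, which is an assumption made here for illustration and not the authors' system. The names `similarity`, `fit_threshold`, `active_learn`, and `oracle` are hypothetical, and the `oracle` function stands in for the human who labels each selected pair.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1].

    A stand-in for the richer field-level features a real system would use.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(labeled):
    """Place the decision threshold midway between the two classes.

    `labeled` is a list of (similarity score, is_duplicate) tuples and must
    contain at least one example of each class.
    """
    dup_scores = [s for s, is_dup in labeled if is_dup]
    non_scores = [s for s, is_dup in labeled if not is_dup]
    return (min(dup_scores) + max(non_scores)) / 2


def active_learn(pairs, oracle, budget):
    """Ask the oracle to label the `budget` most uncertain pairs, one at a time."""
    # Assumed seed: identical strings are duplicates, fully dissimilar ones are not.
    labeled = [(1.0, True), (0.0, False)]
    pool = list(pairs)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)
        # Uncertainty sampling: pick the pair closest to the current threshold.
        pick = min(pool, key=lambda p: abs(similarity(*p) - threshold))
        pool.remove(pick)
        labeled.append((similarity(*pick), oracle(*pick)))
    return fit_threshold(labeled), labeled


# Hypothetical toy data; the oracle simulates a user by comparing surnames.
records = ["Sunita Sarawagi", "S. Sarawagi", "Anuradha Bhamidipaty", "A. Bhamidipaty"]
oracle = lambda a, b: a.split()[-1] == b.split()[-1]

threshold, labeled = active_learn(list(combinations(records, 2)), oracle, budget=3)
print(f"learned threshold: {threshold:.2f} from {len(labeled)} labeled pairs")
```

Each round refits the threshold on everything labeled so far and then queries the pair the current model is least sure about, so labeling effort goes to borderline pairs instead of obvious ones.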
Item Type: | Article |
---|---|
Source: | Copyright of this article belongs to Association for Computing Machinery |
ID Code: | 128419 |
Deposited On: | 20 Oct 2022 09:14 |
Last Modified: | 20 Oct 2022 09:14 |