Katariya, Namit; Iyer, Arun; Sarawagi, Sunita (2012). Active Evaluation of Classifiers on Large Datasets. In: IEEE 12th International Conference on Data Mining, Brussels, Belgium.
PDF (661kB)
Official URL: http://doi.org/10.1109/ICDM.2012.161
Abstract
The goal of this work is to estimate the accuracy of a classifier on a large unlabeled dataset based on a small labeled set and a human labeler. We seek to estimate accuracy and select instances for labeling in a loop via a continuously refined stratified sampling strategy. For stratifying data, we develop a novel strategy of learning r-bit hash functions that preserve similarity in accuracy values. We show that our algorithm provides better accuracy estimates than existing methods for learning distance-preserving hash functions. Experiments on a wide spectrum of real datasets show that our estimates achieve between 15% and 62% relative reduction in error compared to existing approaches. We show how to perform stratified sampling on unlabeled data that is so large that, in an interactive setting, even a single sequential scan is impractical. We present an optimal algorithm for performing importance sampling on a static index over the data that achieves close to exact estimates while reading three orders of magnitude less data.
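The core idea in the abstract, estimating accuracy by partitioning unlabeled data into strata via a hash function and combining per-stratum sample accuracies, can be sketched as follows. This is a minimal illustration, not the paper's algorithm: here `hash_fn` stands in for the learned r-bit hash, `true_label` stands in for the human labeler, and the proportional sample allocation is an assumption for the sketch.

```python
import random
from collections import defaultdict

def stratified_accuracy_estimate(items, predict, true_label, hash_fn, budget, seed=0):
    """Estimate classifier accuracy over `items` via stratified sampling.

    All argument names are illustrative: `hash_fn` maps an item to a
    stratum id (in the paper this is a learned r-bit hash that groups
    items with similar accuracy), and `true_label` plays the role of
    the human labeler queried for sampled instances only.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for x in items:                      # one pass to bucket items by hash
        strata[hash_fn(x)].append(x)
    n = len(items)
    estimate = 0.0
    for sid, members in strata.items():
        # proportional allocation of the labeling budget (an assumption;
        # the paper refines the allocation continuously in a loop)
        k = max(1, round(budget * len(members) / n))
        sample = rng.sample(members, min(k, len(members)))
        acc = sum(predict(x) == true_label(x) for x in sample) / len(sample)
        estimate += (len(members) / n) * acc  # weight by stratum size
    return estimate
```

When the hash groups items on which the classifier behaves uniformly, each stratum's sample accuracy has low variance, so even a small labeling budget yields a tight overall estimate; that is the motivation for learning accuracy-preserving hashes rather than using arbitrary strata.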
| Item Type: | Conference or Workshop Item (Paper) |
| --- | --- |
| Source: | Copyright of this article belongs to IEEE |
| ID Code: | 128357 |
| Deposited On: | 19 Oct 2022 10:25 |
| Last Modified: | 14 Nov 2022 10:49 |