Approximate Clustering with Same-Cluster Queries

Ailon, Nir; Bhattacharya, Anup; Jaiswal, Ragesh; Kumar, Amit (2018) Approximate Clustering with Same-Cluster Queries. In: 9th Innovations in Theoretical Computer Science Conference (ITCS 2018).

Full text not available from this repository.

Official URL: http://drops.dagstuhl.de/opus/volltexte/2018/8335

Abstract

Ashtiani et al. proposed a Semi-Supervised Active Clustering (SSAC) framework, in which the learner is allowed to make adaptive queries to a domain expert. The queries are of the form "do two given points belong to the same optimal cluster?", and the answers are assumed to be consistent with a unique optimal solution. There are many clustering contexts where such same-cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with an additional margin assumption, can be solved efficiently if one is allowed to make O(k^2 log k + k log n) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is NP-hard. In this paper, we extend the work of Ashtiani et al. to the approximation setting by showing that a few such same-cluster queries enable a polynomial-time (1+eps)-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is NP-hard to approximate within a factor (1+c) for a fixed constant 0 < c < 1. The number of same-cluster queries used by the algorithm is poly(k/eps), which is independent of the size n of the dataset. Our algorithm is based on the D^2-sampling technique, also known as the k-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries, showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make Omega(k/poly log k) same-cluster queries. Our algorithm can be extended to the case where the query answers are wrong with some bounded probability. We also show that a small modification of the k-means++ seeding within the SSAC framework converts it into a constant-factor approximation algorithm, instead of the well-known O(log k)-approximation algorithm.
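
For readers unfamiliar with the two ingredients named in the abstract, the following is a minimal sketch of standard D^2-sampling (k-means++ seeding) together with a toy same-cluster oracle. It is not the query algorithm from the paper, which is not reproduced here. The names `sq_dist`, `d2_sample_centers`, and `make_same_cluster_oracle` are hypothetical, and the oracle simply compares labels from a reference clustering supplied by the caller, which is an assumption made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample_centers(points, k, rng):
    """Standard k-means++ seeding (D^2-sampling).

    The first center is chosen uniformly at random. Each later center is
    chosen with probability proportional to the squared distance from the
    point to its nearest already-chosen center.
    """
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            nxt = rng.choice(points)
        else:
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers


def make_same_cluster_oracle(labels):
    """Toy same-cluster oracle (illustrative assumption).

    Answers "do points i and j belong to the same cluster?" by comparing
    their labels in a reference clustering supplied by the caller.
    """
    def same_cluster(i, j):
        return labels[i] == labels[j]
    return same_cluster
```

A caller would pass a list of coordinate tuples, the number of centers, and a `random.Random` instance, for example `d2_sample_centers(points, 3, random.Random(0))`. An algorithm in the SSAC setting could then call the oracle on pairs of point indices, with the number of such calls being the quantity the abstract bounds.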

Item Type:	Conference or Workshop Item (Paper)
Source:	Copyright of this article belongs to Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik.
Keywords:	k-Means, Semi-Supervised Learning, Query Bounds.
ID Code:	123506
Deposited On:	29 Sep 2021 09:00
Last Modified:	29 Sep 2021 09:00
