This week, I visited Afonso Bandeira at NYU to give a talk in the MaD seminar on the semidefinite relaxation of k-means. Here are the slides. The last part of the talk is very new; I worked it out with Soledad Villar while she visited me a couple weeks ago, and our paper just hit the arXiv. In this blog entry, I’ll briefly summarize the main idea of the paper.
Suppose you are given data points $x_1,\ldots,x_n\in\mathbb{R}^d$, and you are tasked with finding the partition $C_1\sqcup\cdots\sqcup C_k=\{1,\ldots,n\}$ that minimizes the k-means objective

$$\frac{1}{n}\sum_{t=1}^{k}\sum_{i\in C_t}\bigg\|x_i-\frac{1}{|C_t|}\sum_{j\in C_t}x_j\bigg\|^2.$$

(Here, we normalize the objective by $1/n$ for convenience later.) To do this, you will likely run MATLAB's built-in implementation of k-means++, which randomly selects $k$ of the data points (with an intelligent choice of random distribution), and then uses these data points as proto-centroids to initialize Lloyd's algorithm. In practice, this works very well: After running it a few times, you generally get a very nice clustering. But how do you know when to stop looking for an even better clustering?
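For concreteness, here is a minimal sketch of this workflow in Python, using scikit-learn's KMeans as a stand-in for MATLAB's built-in implementation; the dataset and the choice of k below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: n points in d dimensions (swap in your own dataset).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
k = 10

# k-means++ seeding followed by Lloyd's algorithm; n_init controls
# how many independent restarts are performed.
km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)

# Normalized k-means value (inertia_ is the un-normalized sum of squared
# distances from each point to its assigned centroid).
value = km.inertia_ / X.shape[0]
print("k-means value of best clustering found:", value)
```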
Not only does k-means++ work well in practice, it also comes with a guarantee: The initial clustering (before running Lloyd's algorithm) has random k-means value $W$ such that

$$\mathbb{E}[W]\leq 8(\ln k+2)\cdot\mathrm{OPT},$$

where $\mathrm{OPT}$ denotes the minimum of the k-means objective. As such, you can compute the initial value of k-means++ over multiple trials to estimate $\mathbb{E}[W]$, and thereby the lower bound $\mathbb{E}[W]/(8(\ln k+2))\leq\mathrm{OPT}$, producing an approximation ratio of sorts. Unfortunately, this ratio can be rather poor. For example, running k-means++ on the MNIST training set of 60,000 handwritten digits produces a clustering of value 39.22, but the above lower bound is about 2.15. So, who knows? Perhaps there's another clustering out there that's 10 times better! Actually, there isn't, and our paper provides a fast algorithm to demonstrate this.
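As a rough sketch (not from the paper) of how one might estimate this bound in Python, the snippet below uses scikit-learn's kmeans_plusplus to produce the initial proto-centroids and the $8(\ln k+2)$ form of the guarantee; the number of trials is an arbitrary placeholder.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus
from scipy.spatial.distance import cdist

def kmeanspp_lower_bound(X, k, trials=50, seed=0):
    """Estimate E[initial k-means++ value] / (8 (ln k + 2)), a heuristic
    lower bound on the optimal (normalized) k-means value."""
    n = X.shape[0]
    vals = []
    for t in range(trials):
        # k-means++ seeding only (no Lloyd iterations).
        centers, _ = kmeans_plusplus(X, n_clusters=k, random_state=seed + t)
        # Normalized k-means value of the clustering induced by the seeds.
        vals.append(np.min(cdist(X, centers, "sqeuclidean"), axis=1).sum() / n)
    return np.mean(vals) / (8 * (np.log(k) + 2))
```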
What you’d like to do is solve the k-means SDP, that is, minimize

$$\frac{1}{2n}\,\mathrm{tr}(DX)\qquad\text{subject to}\qquad\mathrm{tr}(X)=k,\quad X\mathbf{1}=\mathbf{1},\quad X\geq 0,\quad X\succeq 0,$$

where $D$ is the $n\times n$ matrix whose $(i,j)$th entry is $\|x_i-x_j\|^2$. Writing $\mathrm{SDP}(x)$ for the value of this program, we indeed have $\mathrm{SDP}(x)\leq\mathrm{OPT}$, since for any partition $C$ the matrix $X_C=\sum_{t=1}^{k}\frac{1}{|C_t|}\mathbf{1}_{C_t}\mathbf{1}_{C_t}^\top$ is feasible in the SDP with the same value as $C$ in the k-means objective. Unfortunately, solving the SDP is far slower than k-means++, and so another idea is necessary.
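For reference, here is a minimal sketch of this SDP in Python with cvxpy, assuming the $\frac{1}{2n}\mathrm{tr}(DX)$ formulation above; the helper name and the default solver settings are my own choices, not the paper's code.

```python
import numpy as np
import cvxpy as cp
from scipy.spatial.distance import pdist, squareform

def kmeans_sdp_value(X, k):
    """Value of the Peng-Wei SDP relaxation of the normalized k-means objective."""
    n = X.shape[0]
    D = squareform(pdist(X, "sqeuclidean"))  # D[i, j] = ||x_i - x_j||^2
    Z = cp.Variable((n, n), PSD=True)        # symmetric positive semidefinite
    constraints = [
        cp.trace(Z) == k,                    # tr(X) = k
        Z @ np.ones(n) == np.ones(n),        # X 1 = 1
        Z >= 0,                              # entrywise nonnegative
    ]
    prob = cp.Problem(cp.Minimize(cp.trace(D @ Z) / (2 * n)), constraints)
    prob.solve()
    return prob.value
```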
As an alternative, select some small $s$ and draw $Q$ uniformly at random from the size-$s$ subsets of $\{1,\ldots,n\}$. Then it turns out (and is not hard to show) that

$$\mathbb{E}\big[\mathrm{SDP}(x_Q)\big]\leq\mathrm{SDP}(x)\leq\mathrm{OPT},$$

where $\mathrm{SDP}(x_Q)$ denotes the value of the k-means SDP on the subsample $x_Q=(x_i)_{i\in Q}$. As such, one may quickly compute independent instances of $\mathrm{SDP}(x_Q)$ and then conduct an appropriate hypothesis test to obtain a high-confidence lower bound on $\mathrm{OPT}$. With this, you can improve k-means++'s MNIST lower bound from 2.15 to around 37. Furthermore, for a mixture of Gaussians, the size of $Q$ you need depends only on $k$ and the desired accuracy, rather than the number of data points. In particular, if you have more than a million points (say), you can use our method to compute a good lower bound faster than k-means++ can even cluster. (!)
For me, one exciting thing about this new lower bound is that it might be useful for quickly estimating the best choice of k: look for an elbow in the lower bounds as k varies, and hope that it's near the elbow in the actual cluster scores.
Yes, this is something Afonso mentioned to me just yesterday. I suspect a theorem can be proved along these lines…