# Monte Carlo approximation certificates for k-means clustering

This week, I visited Afonso Bandeira at NYU to give a talk in the MaD seminar on the semidefinite relaxation of k-means. Here are the slides. The last part of the talk is very new; I worked it out with Soledad Villar while she visited me a couple weeks ago, and our paper just hit the arXiv. In this blog entry, I’ll briefly summarize the main idea of the paper.

Suppose you are given data points $\{x_i\}_{i\in T}\subseteq\mathbb{R}^m$, and you are tasked with finding the partition $C_1\sqcup\cdots\sqcup C_k=T$ that minimizes the k-means objective

$\displaystyle{\frac{1}{|T|}\sum_{t\in[k]}\sum_{i\in C_t}\bigg\|x_i-\frac{1}{|C_t|}\sum_{j\in C_t}x_j\bigg\|^2\qquad(T\text{-IP})}$

(Here, we normalize the objective by $|T|$ for convenience later.) To do this, you will likely run MATLAB’s built-in implementation of k-means++, which randomly selects $k$ of the data points (with an intelligent choice of random distribution), and then uses these data points as proto-centroids to initialize Lloyd’s algorithm. In practice, this works very well: After running it a few times, you generally get a very nice clustering. But when do you know to stop looking for an even better clustering?

Not only does k-means++ work well in practice, it comes with a guarantee: The initial clustering has random k-means value $W$ such that

$\displaystyle{\mathrm{val}(T\text{-IP})\geq \frac{1}{8(\log k+2)}\cdot \mathbb{E}W.}$

As such, you can compute the initial value of k-means++ for multiple trials to estimate this lower bound and produce an approximation ratio of sorts. Unfortunately, this ratio can be rather poor. For example, running k-means++ on the MNIST training set of 60,000 handwritten digits produces a clustering of value 39.22, but the above lower bound is about 2.15. So, who knows? Perhaps there’s another clustering out there that’s 10 times better! Actually, there isn’t, and our paper provides a fast algorithm to demonstrate this.

What you’d like to do is solve the k-means SDP, that is, minimize

$\displaystyle{\frac{1}{2|T|}\mathrm{tr}(DX)\quad\text{subject to}\quad X1=1,~\mathrm{tr}(X)=k,~X\geq0,~X\succeq0\qquad(T\text{-SDP})}$

where $D$ is the $T\times T$ matrix whose $(i,j)$th entry is $\|x_i-x_j\|^2$. Indeed,

$\mathrm{val}(T\text{-SDP})\leq\mathrm{val}(T\text{-IP})$

since $X=\sum_{t\in[k]}\frac{1}{|C_t|}1_{C_t}1_{C_t}^\top$ is feasible in $(T\text{-SDP})$ with the same value as $\{C_t\}_{t\in[k]}$ in $(T\text{-IP})$. Unfortunately, solving the SDP is far slower than k-means++, and so another idea is necessary.

As an alternative, select $s$ small and draw $S$ uniformly from $\binom{T}{s}$. Then it turns out (and is not hard to show) that

$\mathbb{E}\mathrm{val}(S\text{-SDP})\leq\mathrm{val}(T\text{-IP}).$

As such, one may quickly compute independent instances of $\mathrm{val}(S\text{-SDP})$ and then conduct an appropriate hypothesis test to obtain a high-confidence lower bound on $\mathbb{E}\mathrm{val}(S\text{-SDP})$. With this, you can improve k-means++’s MNIST lower bound from 2.15 to around 37. Furthermore, for a mixture of Gaussians, the size of $s$ depends only on $m$ and $k$, rather than the number of data points. In particular, if you have more than a million points (say), you can use our method to compute a good lower bound faster than k-means++ can even cluster. (!)

## 2 thoughts on “Monte Carlo approximation certificates for k-means clustering”

1. Jake says:

For me, one exciting thing about this new lower bound is that it might be useful for quickly estimating the best choice for k, i.e. look for an elbow in the lower bounds, and hope that it’s near the elbow in the real cluster scores.

1. Yes, this is something Afonso mentioned to me just yesterday. I suspect a theorem can be proved along these lines…