Clustering noisy data with semidefinite relaxations

Soledad Villar recently posted our latest paper on the arXiv (this one coauthored by her advisor, Rachel Ward). The paper provides guarantees for the k-means SDP when the points are drawn from a subgaussian mixture model. This blog entry will discuss one of the main ideas in our analysis, which we borrowed from Guedon and Vershynin’s recent paper.

Let’s start with two motivating applications:

The first application comes from graph clustering. Consider the stochastic block model, in which the n vertices are secretly partitioned into two communities, each of size n/2, and edges between vertices of a common community are drawn iid with some probability p, and all other edges are drawn with probability q<p. The goal of community estimation is to estimate the communities given a random draw of the graph. For this task, you might be inclined to find the maximum likelihood estimator for this model, but this results in an integer program. Relaxing the program leads to a semidefinite program, and amazingly, this program is tight and recovers the true communities with high probability when p=(\alpha\log n)/n and q=(\beta\log n)/n for good choices of (\alpha,\beta). (See this paper.) These edge probabilities scale like the threshold for connected Erdos-Renyi graphs, and this makes sense since we wouldn’t know how to assign vertices in isolated components. If instead, the probabilities were to scale like 1/n, then we would be in the “giant component” regime, so we’d still expect enough signal to correctly assign a good fraction of the vertices, but the SDP is not tight in this regime.

As a second application, consider geometric clustering with the k-means objective. This can be formulated as an integer program, which can in turn be relaxed to an SDP. Recently, two papers (one and two) have investigated the performance of this SDP under the so-called stochastic ball model, which draws points from a mixture of different translates of a rotation-invariant distribution. In particular, if the distribution has compact support and the different translates have at least a little space between their supports, then the SDP relaxation recovers the planted clustering with high probability (here, exactly how little the space needs to be remains an open problem). Unfortunately, real-world data may not match such a nice model. To tackle messier data, one might study the performance of the SDP under a Gaussian mixture model, but the SDP is not tight for this model.

In both applications, the optimization has the following form:

maximize \langle data , z\rangle subject to z feasible

The problem is that when the data is noisy, the SDP relaxation fails to be tight. While this phenomenon may seem foreign to folks who study compressed sensing or phase retrieval, it is quite common in operations research (e.g., try solving the travelling salesman problem). The fix is to round to an integer solution, which in turn produces an approximation ratio: The optimal integer solution may be better than the rounded solution, but it’s definitely no better than the relaxed solution. As such, you have an approximation of the form

val(SDP) \approx val(IP)

However, our clustering tasks ask for a different sort of approximation guarantee: We want our rounded solution (clustering) to be close in some sense to the planted clustering. Since our rounded solution comes from the relaxed solution, and since we know the SDP recovers the planted clustering when the data is less noisy, this means we actually want the following sort of Lipschitz-type approximation:

arg(SDP with noisy data) \approx arg(SDP with less noisy data)

The only guarantee of this sort that I know of comes from Guedon and Vershynin’s paper, which analyzes minimum bisection SDP under the stochastic block model in the regime where p and q scale like 1/n. In this blog entry, I boil down the main ideas so as to facilitate its use in future applications. (Also, this captures the main ideas in the approximation portion of our k-means analysis.)

Let S be a convex set in some Euclidean space \mathbb{R}^d, and for each linear objective x\in\mathbb{R}^d, put

\displaystyle{\hat{x}:=\arg\max_{z\in S}\langle x,z\rangle}.

We’d like to show that if x and y are close in some S-specific sense, then \hat{x} is close to \hat{y}. Here, x represents noisy data, y is less noisy, and so \hat{x} is the solution we get, whereas \hat{y} is the planted solution we wish we got. It turns out that the following norm provides a useful notion of distance:

\displaystyle{\|x\|_S:=\max_{z\in S}|\langle x,z\rangle|}.

This is called the support function of the convex hull of S\cup -S. It’s a norm for the span of S, and it’s the dual of the atomic norm of S\cup-S. What follows is the approximation we use:

Theorem (cf. Guedon–Vershynin). For every y\in S with \|y\|_2\geq\|\hat{x}\|_2, we have

\displaystyle{\|\hat{x}-y\|_2^2\leq 4\inf_{c>0}\|cx-y\|_S}.

Strangely, this is somewhat different from the Lipschitz-type property we wanted: The result states that \hat{x} is close to y in Euclidean norm if some scalar multiple of x is close to y in S-norm (and if y is more “geometrically extreme” than x). In both applications, geometric extremeness holds whenever y is integral, and this makes intuitive sense, since this is saying that y lies in a more pointy portion of S than does x (assuming S is centered at zero, which it certainly is if we pass to \mathrm{hull}(S\cup-S)). It’s instructive to see how this bound behaves in the cases where S is the unit 1- or 2-ball in two dimensions. In both cases, the bound is shockingly loose when x and y are close. As such, it may be interesting to investigate an alternative bound that better captures the left-hand side in the small-deviation regime. Regardless, this bound produces good results for both graph clustering and k-means clustering. In these applications, it is convenient to bound the right-hand side with a norm that is more accessible by random matrix–type methods (e.g., Guedon and Vershynin apply a PSD version of Grothendieck’s inequality to pass to the induced \infty\rightarrow1 norm, and then apply a union-bound estimate).

Proof of Theorem: We have

\begin{aligned} \|\hat{x}-y\|_2^2&=\|\hat{x}\|_2^2+\|y\|_2^2-2\langle\hat{x},y\rangle\\&=\|\hat{x}\|_2^2-\|y\|_2^2-2\langle\hat{x}-y,y\rangle\\&\leq 2\langle y-\hat{x},y\rangle,\qquad (*) \end{aligned}

where the last step follows from the hypothesis. Pick c>0. Then \hat{x}\in S implies

\langle cx,\hat{x}\rangle-\langle y,\hat{x}\rangle\leq|\langle cx-y,\hat{x}\rangle|\leq\|cx-y\|_S,

and similarly y\in S implies

-\langle cx,y\rangle+\langle y,y\rangle\leq|\langle cx-y,y\rangle|\leq\|cx-y\|_S.

Addition then gives

2\|cx-y\|_S\geq\langle cx,\hat{x}\rangle-\langle cx,y\rangle+\langle y,y\rangle-\langle y,\hat{x}\rangle.

Next, \hat{x}=\widehat{cx} and y\in S, and so \langle cx,\hat{x}\rangle\geq\langle cx,y\rangle. Applying this to the above and combining with (*) then gives the result.     \Box

One thought on “Clustering noisy data with semidefinite relaxations

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s