On the low-rank approach for semidefinite programs arising in synchronization and community detection

From MaxCut to PhaseLift, semidefinite programming has proven to be rather powerful, especially for convex relaxation. SDP solvers take polynomial time, but the exponent is large, and anyone who’s run an SDP on CVX has experienced some frustration with the runtime. In practice, the SDP-optimal matrix tends to have extremely low rank, and so one may apply a rank constraint to facilitate the search for the SDP’s solution. This heuristic was first introduced by Burer and Monteiro, and it works well in practice, but the rank-constrained program is nonconvex and the theory is scant. Recently, the theory gap started to close with this paper:

On the low-rank approach for semidefinite programs arising in synchronization and community detection

Afonso S. Bandeira, Nicolas Boumal, Vladislav Voroninski

As the title suggests, this paper provides strong performance guarantees for the Burer-Monteiro heuristic in the particular cases of \mathbb{Z}_2 synchronization and community detection. I was very excited to see this paper, and so I interviewed one of the authors (Nicolas Boumal). I’ve lightly edited his responses for formatting and hyperlinks:

DGM: The Burer-Monteiro heuristic for solving SDPs has been around for well over a decade. Why has it taken so long for the theory to catch up? Are the techniques you leverage particularly new?

NB: One part may be that the original SDPLR papers by Burer-Monteiro aimed at solving general SDP’s with few equality constraints (even though the numerical experiments focused on a few specific types of problems, among which the Max-Cut SDP shines particularly). The rank-2 phenomenon that we observe in the paper is not general. So perhaps the first audience of these papers was not that interested in this result.

As for execution, we rely mainly on two existing assets. Both are fairly recent.

For insight, we use the 2010 paper by Journée, Bach, Absil, and Sepulchre, who observed the importance (both algorithmic and theoretical) of the smooth manifold structure of the rank restricted search space; that is: when the convex search space of the SDP is restricted to matrices of bounded rank, the resulting (nonconvex) set can be parameterized by a smooth manifold. This is not true for general SDP’s, but it is true here, and we believe this is a great facilitator. (It is perhaps even required in some sense, but it is too soon to tell.) Of course, in our setting, the manifold turns out to be particularly simple (it is a product of circles), and it doesn’t take a lot of technical know-how to exploit this structure; it was used to some extent before 2010 of course. But I do believe that framing the nonconvex optimization problem in the larger setting of optimization on manifolds, despite the simplicity of the specific manifold at hand, has helped us get better intuition; this wider vision we credit to Journée et al. We also think this can help identify a more general setup in which SDPLR is likely to work well; something that is missing for now.

For the more technical steps, such as establishing strong correlation of the ground truth with second-order critical points, we relied mostly on the proof program successful in a previous 2014 paper where we show tightness of the SDP relaxation for phase synchronization, and also more recently in 2016 for a nonconvex approach to that one: it is the same SDP as the one we treat here, but with complex numbers. It is interesting that the same proof ingredients work both to study tightness of the SDP and to establish that nonconvex approaches to solving it can be successful.

DGM: Your guarantees give conditions under which all second-order critical points are good-enough solutions to \mathbb{Z}_2 synchronization or community detection. What are the underlying proof ideas for these results?

NB: The nonconvex search space is the manifold of matrices Q of size n \times 2 whose rows each have unit norm. This indeed ensures that QQ^\top is positive semidefinite with unit diagonal: the constraints of the SDP; and also that the rank is bounded by 2.

Assuming Q further satisfies first- and second-order optimality conditions, and calling z the ground truth signal, we wanted to show that \|Q^\top z\|_F is large (much larger than what one would get by random chance). The necessary optimality conditions for the optimization problem on the manifold are simple: the Riemannian (or projected) gradient of the cost function f(Q) = \mathrm{Trace}(A Q Q^\top) must vanish (first order), and the Riemannian Hessian must be negative semidefinite since we are maximizing (second order). One thing which is clear is that both conditions must be used to reach a strong conclusion. The manifold structure makes these conditions particularly simple to obtain and manipulate. (In all generality, second-order KKT conditions are rather painful to work with…)

We start by showing that second-order conditions alone already imply some (weak) correlation. One way to understand this is that the second-order condition, even though it is purely local, turns into a more global statement in the following way. By Lemma 7, if Q satisfies second-order conditions, \mathrm{ddiag}(AQQ^\top)\succeq A \circ QQ^\top, where A is the cost matrix. Take the inner product of this matrix inequality with any matrix whose diagonal is \mathbf{1} (this is the case for all feasible points of the SDP), and you get an inequality of the type f(Q) \geq \ldots. This readily implies that the cost at Q is not too bad, and hence that Q must correlate with the dominant component of A, namely, zz^\top.

The measurement model is A = zz^\top + E, where z is the ground truth vector and E is the perturbation. For \mathbb{Z}_2-synchronization, the perturbation E is assumed to be unrelated to z and small enough, so the dominant eigenvector of A relates to z and the above explanation works out easily. For community detection on the other hand, the perturbation E relates to z and could have large operator norm. It is then important to recognize that the components of E which grossly violate the constraints on Q can easily be discarded, as described in the paper. This is done automatically by the SDP, and also by the nonconvex program.

Then, to further obtain truly strong correlation of the second-order critical points Q with z, we show that points which satisfy second-order conditions and first-order conditions must be close to rank 1. When they are exactly rank 1, they are globally optimal, as we show in the exact recovery parts of the paper. Being approximately rank 1 turns out to already help control for high correlation.

DGM: Your nontrivial correlation–type results require the SNR \lambda to be bigger than 8, and your exact-recovery result under the stochastic block model requires SNR \tilde{\lambda}(p,q) to scale like n^{1/3}. What do you think are the true transitions, and why is it difficult to take the theory down to these transitions?

NB: The n^{1/3} term for the exact recovery result is definitely a limitation of our proof technique. It is the same limitation that arises in this and that paper, mentioned earlier. In the present paper, the culprit is isolated in Lemma 13. Numerical experiments both for the exact recovery result and for the isolated lemma suggest that the real transition should be a polylog function. In our proofs, we only ever use two properties of the perturbation matrix E, namely, a bound on the value of the SDP if the cost matrix is replaced by E (which can be replaced by the operator norm of E for \mathbb{Z}_2-synchronization) and a bound on the \ell_\infty-norm of the alignment of E with the ground truth z, that is, \|E z\|_\infty. But in our experiments, noise is generated following a much more structured random model. Probably, the true transitions can only be established if the random structure is exploited more finely.

As for community detection in the stochastic block model with constant average degree, it is already known that for \lambda < 1 it is not possible to get nontrivial correlation with the ground truth (this is the information theoretic limit). It is also known, and these are recent results (see here and here), that the SDP does provide nontrivial correlation as soon as \lambda > 1. It is quite remarkable that there is no gap there. Now, it is tempting to believe that the Burer-Monteiro approach should work when the SDP works… But our current proof falls a bit short of that, and requires \lambda \geq 8. Our sentiment at this stage is that it might be reasonably easy to reduce the constant to 4, or maybe even 2 with more care in the inequalities. But getting all the way down to 1 may prove to be a bigger challenge.

DGM: Are there any barriers to applying your proof ideas to other SDPs?

NB: For now, our analysis relies rather strongly on the manifold structure of the rank-restricted search space. While this structure is present in a number of important SDP’s, it is definitely not a mild condition. This also means we cannot accommodate inequality constraints at the moment. In practice, methods such as the original SDPLR algorithm by Burer and Monteiro can be applied even without the manifold structure, but there are no guarantees. Empirical results are also mixed, so that it is not clear if this is a limitation of the proof techniques or of the algorithms, or a real limitation. Down the line, we would also like to handle nonlinear cost functions, but then we lose the guarantee that there will be an optimal extreme point. That’s problematic, because the most general tool we know to control the rank of optimal solutions is the Pataki-Barvinok theorems, which control the rank of extreme points. For applications where there are other reasons for the solutions to be low-rank nonetheless (as is the case here), extra work is needed. A number of things have to line up well for this to be within reach.

On the bright side, these results should extend rather directly to synchronization of rotations in \mathbb{R}^d, which is similar to \mathbb{Z}_2 synchronization but with the constraint that diagonal blocks of X are identity matrices of size d. This paper gives deterministic results for the low-rank approach to work there. The deterministic result gets nowhere near allowing for a mere rank d+1 relaxation, but numerical experiments with a Gaussian noise model show that this is indeed sufficient even for large noise, and I expect the proof techniques laid out in the present paper to extend to that setting as well. (In fact, we know that most parts do.)

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s