Last year, the Supreme Court found NC-1 and NC-12 (the first two districts above) to be the result of unconstitutional racial gerrymandering. We’ll discuss IL-4 (the district on the right) later.

The Court is currently evaluating a proposal for systematically detecting unconstitutional *partisan gerrymandering*, where the gerrymander significantly benefits one party over the other. The proposed measure, called the

Our paper proves that in some cases, it’s impossible to get a small efficiency gap without drawing bizarrely shaped districts. Suppose for example that the voters are distributed as follows:

Here, each 3-by-3 square contains 5 blue voters and 4 red voters. Now suppose you are tasked with drawing 5 districts for this region. You might decide to ignore the voters’ preferences and run the shortest splitline algorithm. Doing so would produce the following districts:

In this case, the districts don’t exhibit irregular shape, but blue wins every single district—even though red makes up 44% of the vote! The efficiency gap here is a whopping 38% in favor of blue. Alternatively, you could hunt for clusters of red to draw new districts like the following:

The efficiency gap is now only 2% in favor of blue, but at the price of a bizarre-looking district that the Washington Post might criticize. Our main result establishes that this is a fundamental tradeoff between shape and symmetry that can’t be removed.
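To make the numbers above concrete, here is a minimal sketch of an efficiency gap calculation. It assumes the common wasted-vote convention (every losing vote is wasted, and the winner wastes every vote beyond half of the district's turnout). The district tallies are hypothetical: five equal districts that each mirror the 5-to-4 split of a 3-by-3 square. The function name `efficiency_gap` is illustrative only.

```python
from fractions import Fraction

def efficiency_gap(districts):
    """Return the efficiency gap in favor of blue for (blue, red) tallies.

    Assumed convention: all losing votes are wasted, and the winner wastes
    every vote beyond half of the district's total. Ties are not handled.
    """
    wasted_blue = Fraction(0)
    wasted_red = Fraction(0)
    total = 0
    for blue, red in districts:
        half = Fraction(blue + red, 2)
        total += blue + red
        if blue > red:
            wasted_blue += blue - half  # surplus winning votes
            wasted_red += red           # every losing vote
        else:
            wasted_red += red - half
            wasted_blue += blue
    # Positive values mean red wasted more votes, i.e. the map favors blue.
    return (wasted_red - wasted_blue) / total

# Hypothetical tallies: blue wins all five districts 50 to 40.
sweep = [(50, 40)] * 5
gap = efficiency_gap(sweep)
print(float(gap))
```

With these tallies the gap works out to 7/18, or just under 39%, which is close to the 38% quoted above for the splitline districts.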

But is this a problem? Let’s return to IL-4. While the bizarre shape certainly suggests that the map maker had intentions, it doesn’t necessarily demonstrate bad intentions. In this case, the district was “gerrymandered” to connect two majority Hispanic parts of Chicago so as to provide a common voice to this demographic (the region between these “earmuffs” is majority African-American). Likewise, since “partisan symmetry” might be an ideal worth pursuing, our result suggests that geometry probably shouldn’t have the final say in the gerrymandering debate. (This conclusion isn’t new, by the way; see for example John Oliver’s take on the issue, which might not be safe for work.)

]]>

Suppose you are given data points , and you are tasked with finding the partition that minimizes the k-means objective

(Here, we normalize the objective by for convenience later.) To do this, you will likely run MATLAB’s built-in implementation of k-means++, which randomly selects of the data points (with an intelligent choice of random distribution), and then uses these data points as proto-centroids to initialize Lloyd’s algorithm. In practice, this works very well: After running it a few times, you generally get a very nice clustering. But when do you know to stop looking for an even better clustering?
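For readers who want to see the moving parts, here is a minimal pure-Python sketch of k-means++ seeding followed by Lloyd's algorithm. It is not MATLAB's implementation. The helper names (`kmeanspp_seeds`, `lloyd`, `kmeans_value`) and the toy data are assumptions made for illustration, and the value is normalized by the number of points.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding: the first seed is uniform, and each later seed is
    drawn with probability proportional to its squared distance to the
    nearest seed chosen so far. Assumes at least k distinct points."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(sq_dist(p, s) for s in seeds) for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda t: sq_dist(p, centers[t]))

def kmeans_value(points, centers):
    """Normalized value: mean squared distance to the nearest center."""
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points) / len(points)

def lloyd(points, centers, max_iters=100):
    """Alternate between nearest-center assignment and centroid updates."""
    for _ in range(max_iters):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for t, old in enumerate(centers):
            cluster = [p for p, label in zip(points, labels) if label == t]
            if cluster:
                updated.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                updated.append(old)  # keep an empty cluster's center in place
        if updated == centers:
            break
        centers = updated
    return centers

# Toy data (hypothetical): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
rng = random.Random(0)
trials = []
for _ in range(5):
    seeds = kmeanspp_seeds(points, 2, rng)
    centers = lloyd(points, seeds)
    trials.append((kmeans_value(points, seeds), kmeans_value(points, centers)))
best = min(final for _, final in trials)
print(best)
```

Each trial records the value of the initial seeding alongside the value after Lloyd's algorithm. The first is the quantity that the k-means++ guarantee controls, and the second is what you would actually report.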

Not only does k-means++ work well in practice, it comes with a guarantee: The initial clustering has random k-means value such that

As such, you can compute the initial value of k-means++ for multiple trials to estimate this lower bound and produce an approximation ratio of sorts. Unfortunately, this ratio can be rather poor. For example, running k-means++ on the MNIST training set of 60,000 handwritten digits produces a clustering of value 39.22, but the above lower bound is about 2.15. So, who knows? Perhaps there’s another clustering out there that’s 10 times better! Actually, there isn’t, and our paper provides a fast algorithm to demonstrate this.

What you’d like to do is solve the k-means SDP, that is, minimize

where is the matrix whose th entry is . Indeed,

since is feasible in with the same value as in . Unfortunately, solving the SDP is far slower than k-means++, and so another idea is necessary.

As an alternative, select small and draw uniformly from . Then it turns out (and is not hard to show) that

As such, one may quickly compute independent instances of and then conduct an appropriate hypothesis test to obtain a high-confidence lower bound on . With this, you can improve k-means++’s MNIST lower bound from 2.15 to around 37. Furthermore, for a mixture of Gaussians, the size of depends only on and , rather than the number of data points. In particular, if you have more than a million points (say), you can use our method to compute a good lower bound faster than k-means++ can even cluster. (!)

]]>

Let act on the finite set . This determines an action of on , and each orbit of this action corresponds to an matrix of zeros and ones supported on . These matrices span the vector space of **-stable matrices**, that is, the complex matrices such that for all and . If acts transitively on , then the adjacency matrices form a (not necessarily commutative) association scheme. Indeed, one of the adjacency matrices is the identity since acts transitively on . Next, the orbits partition , and so the adjacency matrices sum to the all-ones matrix. The orbit of is the transpose of the orbit of . Finally, -stable matrices are closed under multiplication, regardless of whether the action is transitive (just write out the entry of the product and perform a change of variables). The association schemes that arise in this way are called *Schurian*, and if the scheme is commutative, we say is a
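The construction above is easy to check by machine on a small example. The sketch below uses a hypothetical choice of group and set (the dihedral group acting on the six vertices of a hexagon) and made-up helper names. It computes the orbits on pairs, builds the corresponding zero-one matrices, and verifies the properties listed above.

```python
from itertools import product

def orbitals(n, generators):
    """Orbits of the generated permutation group acting diagonally on pairs.

    Each generator is a list g with g[x] the image of x in {0, ..., n-1}.
    Closing under the generators alone suffices because every permutation
    of a finite set has finite order.
    """
    orbits, seen = [], set()
    for pair in product(range(n), repeat=2):
        if pair in seen:
            continue
        orbit, frontier = {pair}, [pair]
        while frontier:
            x, y = frontier.pop()
            for g in generators:
                image = (g[x], g[y])
                if image not in orbit:
                    orbit.add(image)
                    frontier.append(image)
        seen |= orbit
        orbits.append(orbit)
    return orbits

def adjacency(n, orbit):
    """Zero-one matrix supported on the given orbit of pairs."""
    return [[1 if (x, y) in orbit else 0 for y in range(n)] for x in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def constant_on_orbits(M, orbits):
    """True when M is a linear combination of the adjacency matrices."""
    return all(len({M[x][y] for x, y in orbit}) == 1 for orbit in orbits)

# Hypothetical example: the dihedral group acting on the vertices of a hexagon.
n = 6
rotation = [(x + 1) % n for x in range(n)]
reflection = [(-x) % n for x in range(n)]
orbits = orbitals(n, [rotation, reflection])
mats = [adjacency(n, orbit) for orbit in orbits]

identity = [[int(x == y) for y in range(n)] for x in range(n)]
all_ones = [[sum(A[x][y] for A in mats) for y in range(n)] for x in range(n)]
closed = all(constant_on_orbits(matmul(A, B), orbits) for A in mats for B in mats)
print(len(mats), identity in mats, closed)
```

In this example the orbits are indexed by the circular distance between the two vertices, so there are four adjacency matrices, all symmetric, and the resulting scheme is commutative.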

The spectral theorem affords any commutative association scheme with another useful basis: the orthogonal projections onto the common eigenspaces of the members of . Furthermore, every orthogonal projection in can be expressed as a sum of these “primitive” projections, meaning there are only finitely many Gram matrices of tight frames in to consider. In the case of a Schurian scheme, the primitive projections can be expressed in terms of the characters of . In fact, one may construct projection matrices from the characters even in the noncommutative case, but they will no longer form a basis for .

In the past, folks have constructed so-called *group frames* by spinning a vector with the representation of some group . If fixes under this action, then you might reduce to a frame of size by removing the copies, resulting in a

I’ll conclude with a brief description of our new infinite family of ETFs with Heisenberg symmetry. Let be a finite abelian group of odd order, and define to be the set , where is the exponent of . Then we may define multiplication in using a symplectic form over to obtain the Heisenberg group over . The *symplectic group* over is the group of automorphisms of that fix the symplectic form. In this paper, we prove that is a Gelfand pair, and we find Gram matrices of tight frames in the corresponding adjacency algebra whose projective reductions are ETFs. We conclude the paper with explicit constructions of the corresponding group frames. Again, we are very interested to find any relationship with SIC-POVMs, though it seems that any correspondence may require to be cyclic.

]]>

**DGM:** Judging by your website, this project in phase retrieval appears to be a departure from your coding theory background. How did this project come about?

**MM:** Many of the tools employed in information and coding theory are very general, and they also prove useful for solving problems in other fields, such as compressed sensing, machine learning, or data analysis. So this is the general philosophy that motivated my “detour”.

The interest in the specific problem of weak recovery for phase retrieval came about for the following reason. The techniques that are typically used to derive bounds on the sample complexity of weak recovery are based on matrix concentration inequalities. This means that one computes the expected value of the data matrix and then shows that, when the number of samples is large enough, such a matrix concentrates around its mean. However, this procedure does not give tight bounds. So, we thought that we could compute the spectrum of the (random) data matrix and, consequently, obtain an exact bound. This bound is exact in the sense that it characterizes exactly the performance of the spectral method. At that point, we were wondering whether such a bound is information-theoretically optimal. This motivated us to prove a converse bound, which turned out to match the spectral upper bound.

**DGM:** I thought the information-theoretic lower bound of was obvious until I realized that weak recovery is possible in the linear case whenever . Do you have any intuition for this bound? What are the big ideas in the proof?

**MM:** Unfortunately, we do not have a good intuition of why the information-theoretic threshold is for the complex case and for the real case. Indeed, as you point out, weak recovery is possible in the linear model for any .

As for the proof, the basic idea consists in bounding the conditional entropy of the received vector via the second moment method. More formally, let be the received vector and the measurement matrix. Then, the goal is to evaluate . To do so, we compute the ratio between the second and the first moment of . In particular, we prove that for , the quantity

is sublinear in . As a result, the conditional entropy is equal to the unconditional one, which means that the received vector does not give information about the unknown signal. This also means that the error of the Bayes-optimal estimator is equal to the error of the trivial estimator that does not use the received signal at all.

**DGM:** What’s the intuition behind allowing to be negative? Your optimal choice of happens to be negative when is small. Are you somehow penalizing the directions that are nearly orthogonal to ?

**MM:** The optimal pre-processing function is given by the expression

Indeed, is negative when is small. As you suggested, the intuition is exactly that the points in which the measurement vector is basically orthogonal to the unknown signal are not informative, hence we penalize them.

This is quite different from the spectral methods that have been considered so far (see this, that, another). Earlier works employed pre-processing functions that (i) are positive and (ii) try to extract information from the large values of the data. On the contrary, especially when is close to , the function has a large negative part for small . Furthermore, it extracts useful information from the small values of the data.

**DGM:** Impressively, your spectral estimate works whenever . How do you leverage free probability along with Lu and Li’s result to demonstrate such a sharp bound?

**MM:** The analysis of the performance of the spectral method is based on the evaluation of the spectrum of the data matrix , where is a diagonal matrix that contains the vector on the diagonal. This computation boils down to the study of the spectrum of a matrix of the form , where has i.i.d. Gaussian entries and is independent of . Furthermore, the spectrum of a.s. converges weakly to a bulk, given by the probability law of , and its largest eigenvalue converges a.s. to a point outside the bulk. Now, also the spectrum of a.s. converges weakly to a bulk. The idea is that if the largest eigenvalue of converges a.s. to a point outside this bulk, then the principal eigenvector of is correlated with the unknown signal, which means that we have weak recovery.

If is PSD, then the spectrum of is studied in this paper of Bai and Yao. However, for to be PSD, we need that the pre-processing function is positive. This is the approach carried out in the paper by Lu and Li.

In order to remove this assumption, we decompose into a positive part minus a negative part . Both and are PSD, so we can apply the results of Bai and Yao. The free probability tools naturally come in at this point, as we need to study the spectrum of the sum of two random matrices.

**DGM:** The fact that the threshold in the real case is half the threshold in the complex case is reminiscent of how the injectivity thresholds are and in the real and complex cases, respectively. (See this, that, and another paper.) Do you have any intuition for this halving phenomenon in weak recovery?

**MM:** One way to think about this is as follows. In the complex case, you have twice as many variables but the same number of equations as in the real case. Hence, it makes sense that the threshold for the complex case is two times the threshold for the real case.

This is only a heuristic argument and, at a rigorous level, we could not find a mapping between the complex problem and the real problem. Indeed, we have two (quite similar) proofs that handle the real and the complex cases separately.

**DGM:** What’s next for this line of investigation? Is there any hope of establishing sharp thresholds for -weak recovery, where the desired correlation in equation (3) is prescribed?

**MM:** One possible future direction is certainly about studying -weak recovery. In this regard, I would like to mention the recent arXiv submission by Barbier et al. There, the authors consider the real case and prove that the replica-symmetric formula from statistical physics gives an exact prediction. In this way, they obtain a single-letter formula for the mutual information and for the MMSE. However, it is still not proved that this lower bound can be met by a practical algorithm.

Perhaps an even more interesting direction consists in looking at other measurement matrices. Our analysis covers the case in which is Gaussian, while in practice people use, for example, Fourier measurement matrices. The rigorous study of Fourier measurement matrices seems definitely challenging, so one intermediate step in that direction could be to consider the case in which is unitary and Haar-distributed.

Let me also mention that we tried our spectral algorithm in a coded diffraction model, where is a Fourier matrix and the unknown signal is a digital image. Our algorithm significantly outperformed the existing spectral methods and, even more surprisingly, the theoretical predictions obtained for the Gaussian case matched quite well the numerical simulations. This is still a bit of a mystery for us, especially in consideration of the fact that the theoretical predictions worked well only for our optimal choice of the pre-processing function, while they failed for a different choice of the pre-processing function.

]]>

- Packings in real projective spaces, FoCM and SPIE
- Explicit restricted isometries, ILAS
- Probably certifiably correct k-means clustering, ILAS
- Equiangular tight frames from association schemes, SIAM AG17
- Open problems in finite frame theory, SIAM AG17

**UPDATE: SIAM AG17 just posted a video of my talk.**

Now for my favorite talks from FoCM, ILAS, SIAM AG17 and SPIE:

**Ben Recht — Understanding deep learning requires rethinking generalization**

In machine learning, you hope to fit a model so as to be good at prediction. To do this, you fit to a training set and then evaluate with a test set. In general, if a simple model fits a large training set pretty well, you can expect the fit to generalize, meaning it will also fit the test set. By conventional wisdom, if the model happens to fit the training set exactly, then your model is probably not simple enough, meaning it will not fit the test set very well. According to Ben, this conventional wisdom is wrong. He demonstrates this by presenting some observations he made while training neural nets. In particular, he allowed the number of parameters to far exceed the size of the training set, and in doing so, he fit the training set exactly, and yet he still managed to fit the test set well. He suggested that generalization was successful here because stochastic gradient descent implicitly regularizes. For reference, in the linear case, stochastic gradient descent (aka the randomized Kaczmarz method) finds the solution of minimal 2-norm, and it converges faster when the optimal solution has smaller 2-norm. Along these lines, Ben has some work to demonstrate that even in the nonlinear case, fast convergence implies generalization.
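The linear case mentioned above is easy to try out. Here is a minimal sketch of the randomized Kaczmarz method on a small underdetermined system. The system, the iteration count, and the function name are choices made for this illustration, and the iteration is started at zero, which is what ties the limit to the minimum 2-norm solution.

```python
import random

def randomized_kaczmarz(A, b, iters, rng):
    """Randomized Kaczmarz: start at zero and repeatedly project the iterate
    onto the solution set of one equation, chosen with probability
    proportional to the squared norm of its row."""
    x = [0.0] * len(A[0])
    norms = [sum(a * a for a in row) for row in A]
    for _ in range(iters):
        i = rng.choices(range(len(A)), weights=norms, k=1)[0]
        row = A[i]
        residual = b[i] - sum(a * xj for a, xj in zip(row, x))
        step = residual / norms[i]
        x = [xj + step * a for xj, a in zip(x, row)]
    return x

# Two equations, three unknowns: infinitely many exact solutions.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [2.0, 3.0]
x = randomized_kaczmarz(A, b, 500, random.Random(0))
print(x)
```

Every update adds a multiple of a row of the matrix, so the iterates never leave the row space. The only exact solution in the row space is the one of minimal 2-norm, which for this system is (1/3, 4/3, 5/3).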

**Afonso Bandeira — The sample complexity of multi-reference alignment**

How do you reconstruct a function over given a collection of noisy translations of the function? Intuitively, you might use one of the noisy translations as a template and then try to find the translation of best fit for each of the others before averaging. Perhaps surprisingly, this fails miserably. For example, if you find the translation of best fit for a bunch of pure noise functions, then the average appears to approach the template, thereby demonstrating so-called model bias. Another approach is to collect translation-invariant features of the functions. For example, you can estimate the average value of the function with samples, the power spectrum with samples, and the bispectrum with samples. It turns out that the bispectrum determines generic functions up to translation, but is there an alternative that provides smaller sample complexity? Afonso’s main result here: No. In fact, there are two functions that are confusable unless you see enough samples. I wonder what sort of improvements can be made given additional structural information on the function.
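To see what "translation-invariant features" means in practice, here is a small noise-free sketch. It assumes the function lives on a cyclic group of integers mod n with cyclic shifts as the translations, and it ignores the noise and the bias corrections that the actual estimation problem has to deal with. The helper names are illustrative.

```python
import cmath

def dft(f):
    """Discrete Fourier transform of a list of numbers."""
    n = len(f)
    return [sum(f[x] * cmath.exp(-2j * cmath.pi * k * x / n) for x in range(n))
            for k in range(n)]

def invariants(f):
    """Mean, power spectrum, and bispectrum of a signal on the integers mod n."""
    n = len(f)
    fhat = dft(f)
    mean = fhat[0].real / n
    power = [abs(c) ** 2 for c in fhat]
    bispectrum = [[fhat[k] * fhat[l] * fhat[(k + l) % n].conjugate()
                   for l in range(n)] for k in range(n)]
    return mean, power, bispectrum

def shift(f, s):
    """Cyclic translation of f by s positions."""
    n = len(f)
    return [f[(x - s) % n] for x in range(n)]

signal = [1.0, 3.0, -2.0, 0.5, 4.0]
mean_a, power_a, bispec_a = invariants(signal)
mean_b, power_b, bispec_b = invariants(shift(signal, 2))

power_gap = max(abs(p - q) for p, q in zip(power_a, power_b))
bispec_gap = max(abs(p - q) for row_a, row_b in zip(bispec_a, bispec_b)
                 for p, q in zip(row_a, row_b))
print(abs(mean_a - mean_b), power_gap, bispec_gap)
```

A shift multiplies each Fourier coefficient by a phase. The power spectrum discards the phases entirely, whereas the bispectrum is built so that the phases cancel, which is why it retains enough information to pin down generic signals up to translation.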

**Vern Paulsen — Quantum chromatic numbers via operator systems**

Given a graph, color the vertices so that adjacent vertices receive distinct colors. The chromatic number of the graph is the smallest number of colors you need to accomplish this task. Here’s another way to phrase the coloring task: Put Alice and Bob in separate rooms, and simultaneously ask them the color of certain vertices. If the vertices you ask about are adjacent, Alice and Bob must report different colors. If the vertices are identical, they must report the same color. The chromatic number is the smallest number of colors for which Alice and Bob have a winning strategy without communicating. If you allow Alice and Bob access to a common random source, then this smallest number of colors does not change. However, if you allow them access to entangled particles, then the smallest number of colors frequently does change. This suggests a new graph invariant called the *quantum chromatic number*. Interestingly, the quantum version is sometimes much smaller and much easier to calculate than the classical version. For example, for the Hadamard graph of parameter , the classical chromatic number is only known to be somewhere between and , whereas the quantum chromatic number is known to be exactly . Developing a quantum protocol for any given graph amounts to finding interesting arrangements of subspaces, which I think would appeal to the frame theory community.

**Hamid Javadi — Non-negative matrix factorization via archetypal analysis**

Given a collection of points in , how do you find a small number of “archetypes” such that each is close to the convex hull of the ’s and each is close to the convex hull of the ’s? This problem has a number of applications in data science, and if we further ask for the ’s to be entrywise nonnegative, this is equivalent to the problem of nonnegative matrix factorization (NMF). A lot of work in NMF has used a generative model with a so-called *separability* assumption, which asks for each archetype to be one of the data points. Other work by Cutler and Breiman relaxed the separability assumption, merely asking for each archetype to lie in the convex hull of the data points. Unfortunately, these assumptions break if the data points avoid the corners of the hull of the archetypes. So how can we hope to reconstruct the archetypes in such cases? Well, instead of constraining the archetypes to the convex hull of the ’s, you can penalize distance from the convex hull. This amounts to regularizing the objective to encourage archetype-ness. The following illustration from the paper is helpful:

The top left illustrates how the data points were generated from the unknown true archetypes, the top right shows the output of a method that assumes separability, the bottom left assumes each archetype lies in the convex hull of the data, and the bottom right gives the regularized reconstruction, which is closest to the ground truth. Figure 3 of the paper illustrates how they can robustly reconstruct molecule spectra from mixtures better than the competition.

**Venkat Chandrasekaran — Relative entropy relaxations for signomial optimization**

A *signomial* is a function of the form

where each and each . How can we certify whether is nonnegative for every ? In the special case where the ’s have nonnegative integer entries, then can be expressed as a polynomial of for , and so we can try to certify nonnegativity by showing that is a sum of squares (SOS). Instead, Venkat’s paper provides an analogous decomposition: He uses the AM-GM inequality to certify the nonnegativity of certain signomials with at most one negative , and then he provides a tractable routine for testing whether a given signomial is a sum of such functions, i.e., a sum of AM-GM exponentials, or *SAGE*. Interestingly, testing for SAGE is often faster than testing for SOS. In fact, if you want to test whether a polynomial is nonnegative over the positive orthant, this suggests that changing variables to signomials and testing for SAGE might be a better alternative.
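As a small worked instance of an AM-GM certificate, consider a signomial with a single negative coefficient. The specific function here is an illustrative choice, not one taken from the talk:

```latex
\begin{align*}
f(x_1, x_2) &= e^{2x_1} + e^{2x_2} - 2e^{x_1 + x_2}, \\
\tfrac{1}{2} e^{2x_1} + \tfrac{1}{2} e^{2x_2}
  &\geq \left(e^{2x_1}\right)^{1/2} \left(e^{2x_2}\right)^{1/2}
  = e^{x_1 + x_2}
  \qquad \text{(AM-GM with weights $\tfrac{1}{2}, \tfrac{1}{2}$)}, \\
f(x_1, x_2) &= 2\left(\tfrac{1}{2} e^{2x_1} + \tfrac{1}{2} e^{2x_2} - e^{x_1 + x_2}\right) \geq 0 .
\end{align*}
```

The positive terms dominate the lone negative term because the exponent of the negative term is a convex combination of the exponents of the positive terms. That is the kind of structure an AM-GM certificate exploits.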

**Yaniv Plan — De-biasing low-rank projection for matrix completion**

In the real world, when you’re asked to do matrix completion, the matrix entries you’re given are far from uniformly distributed, and you don’t have time to run an SDP. Yaniv investigated how one might get around both of these bottlenecks. First, in the uniform case, instead of running an SDP, you get decent performance by just grabbing the top singular vectors of the incomplete matrix. As such, for runtime considerations, it makes sense to replicate this spectral-type approach in the non-uniform case. For the non-uniform case, notice that if you don’t see any entries of a given row or column, then the singular vectors will zero out that row/column. The quality of reconstruction should therefore be measured in terms of how well a given row or column is sampled. To quantify the extent to which a row/column is sampled, just grab the top left and right singular vectors of the 0-1 matrix with 1s at the sampled locations. This weighting actually serves two purposes: It helps to evaluate the quality of reconstruction, and it also allows one to “de-bias” the matrix samples before running the spectral matrix completion method. In particular, if we suppose that the weighting corresponds to a probability distribution over which the samples were drawn, then if we entrywise divide a random incomplete matrix by the weighting matrix, the expected quotient will be the desired completed matrix. As such, one should divide by the weighting and complete with the top singular vectors, and then the weighted Frobenius norm of the error is guaranteed to be small.

**Deanna Needell — Tolerant compressed sensing with partially coherent sensing matrices**

Compressed sensing tells us that you can reconstruct any sparse vector from its product with a short, fat matrix that has incoherent columns. But what if the columns are coherent? Of course, if columns and are identical, then you won’t be able to distinguish vector supports that include from those that include , but in applications like radar, it is permissible to confuse certain entries. For example, suppose you are willing to confuse entries whose indices are at most away from each other. Then we can get away with having nearby columns of being coherent, as long as the distant columns are incoherent. In particular, the support of any vector can be recovered to within a tolerance of provided the sparse vector’s nonzero entries are sufficiently spread apart. This reminds me a lot of the CLEAN algorithm that’s commonly used in radar, and I find it interesting that you can get a guarantee that allows for tolerance in the support recovery. For this reason, this seems fundamentally different from the superresolution work that concerns conditions for exact recovery. I wonder if it’s possible to accommodate more nonzero entries with the help of randomness (a la RIP).

]]>

**1. The minimal coherence of 6 unit vectors in is 1/3.**

The Welch bound is known to not be tight whenever lies strictly between and (see the next section for a proof sketch). As such, new techniques are required to prove optimality in this range. We leverage ideas from real algebraic geometry to show how to solve the case of vectors in for all sufficiently small . For example, our method provides a new proof of the optimality of 5 non-antipodal vertices of the icosahedron in , as well as the optimality of Sloane’s packing of 6 lines in .
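The icosahedron example can be checked numerically. The sketch below assumes the standard form of the Welch bound for n unit vectors in d real dimensions, namely the square root of (n - d) / (d(n - 1)), and uses the usual golden-ratio coordinates for the icosahedron. The helper names are mine.

```python
import math
from itertools import combinations

def welch_bound(d, n):
    """Welch lower bound on the coherence of n > d unit vectors in R^d."""
    return math.sqrt((n - d) / (d * (n - 1)))

def coherence(vectors):
    """Largest absolute inner product between distinct normalized vectors."""
    units = [[c / math.sqrt(sum(a * a for a in v)) for c in v] for v in vectors]
    return max(abs(sum(a * b for a, b in zip(u, v)))
               for u, v in combinations(units, 2))

# Six pairwise non-antipodal vertices of the icosahedron (one per line).
phi = (1 + math.sqrt(5)) / 2
icosahedron_lines = [
    (0, 1, phi), (0, -1, phi),
    (1, phi, 0), (-1, phi, 0),
    (phi, 0, 1), (phi, 0, -1),
]

six = coherence(icosahedron_lines)
five = coherence(icosahedron_lines[:5])
print(six, welch_bound(3, 6))
print(five, welch_bound(3, 5))
```

All six lines together meet the Welch bound, with coherence equal to one over the square root of 5. After dropping one vertex, the coherence is unchanged but the Welch bound for five vectors is strictly smaller, so certifying optimality of the five-vector packing needs a different argument.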

Our method hinges on an application of the Tarski–Seidenberg theorem, whose statement requires a definition: A * semialgebraic set* is any subset of a finite-dimensional real vector space that can be expressed in terms of a finite collection of polynomial equalities and inequalities. For example, the positive definite cone is semialgebraic by Sylvester’s criterion.

**Tarski–Seidenberg Theorem.** Any projection of a semialgebraic set is semialgebraic.

The proof of this theorem amounts to an explicit algorithm that finds the polynomial relations of the projection of a given semialgebraic set onto a hyperplane. One can then iterate this to project onto any lower-dimensional subspace. Notice that our packing problem is equivalent to finding the smallest for which there exists a Gram matrix of unit vectors in such that the largest squared off-diagonal entry is at most . Since the set of all such forms a semialgebraic set, one may run the Tarski–Seidenberg theorem to project onto the coordinate. The projection will have the form , and so one may conclude that is the tight lower bound on coherence in this case.

While this provides a finite-time algorithm for solving the packing problem, you wouldn’t want to solve the problem this way, since the Tarski–Seidenberg algorithm is way too slow. Instead, you’d use alternatives like cylindrical algebraic decomposition (CAD), but even this takes doubly exponential time in the number of variables. Considering our Gram matrix has variables, we are inclined to use more information about our problem to decrease this number of variables. To this end, you can show that for an optimal packing, the locations in the Gram matrix that achieve coherence correspond to the adjacency matrix of a so-called **-secure graph**, which tends to have edges. When is close to , e.g., , this reduces the number of variables to , which is far more palatable.

We used this technique to solve and . We applied Mathematica’s built-in implementation of CAD, and the code is available here and here. Both codes are fast, proving tight bounds in less than 30 seconds. We found that the runtime was extremely sensitive to the order of variables, and we didn’t have the patience to work out the case where . I mentioned this in a talk at SIAM AG17, and I was informed of even faster alternatives to CAD. We are currently investigating whether those alternatives will make larger cases more accessible.

**2. The Welch bound is within a constant factor of optimal in the Gerzon range.**

Recall that unit vectors achieve equality in the Welch bound only if they are equiangular. By lifting each vector to its outer product , one can see that the Gram matrix of these outer products is the entrywise square of the original Gram matrix. By equiangularity, the Gram matrix of the outer products has the form , where is the matrix of all ones and (provided ). As such, the Gram matrix is positive definite, meaning the outer products are not linearly dependent. Since these outer products lie in the -dimensional vector space of symmetric matrices, this then implies that the Welch bound is tight only if . Considering Welch bound equality is closed under the Naimark complement, this then produces a lower bound on the number of vectors in the nontrivial case where . Combined, we get the so-called * Gerzon range*:

Even in this range, it is known that Welch bound equality is an uncommon occurrence since certain integrality conditions must be satisfied. As such, it is interesting to consider how close the bound is to being tight in this range. To this end, we provide a linear-time constant-factor approximation algorithm for packing. In particular, given any in the Gerzon range, we provide an explicit packing whose coherence is guaranteed to be no larger than 49 times the Welch bound. Our construction is based on the following complex construction, which in turn is based on a famous character sum estimate due to André Weil:

**Theorem.** Let be a nontrivial additive character of . For each , define

For each , let denote the ’s such that and . Then

We convert this packing to a real packing by replacing each complex entry with a 2-by-2 matrix involving its real and imaginary parts. This conversion doesn’t hurt the coherence. Also, in cases where is not twice a prime, we pad with zeros, but this doesn’t hurt too much thanks to Bertrand’s postulate. Surprisingly, it was hardest to analyze the left-most edge of the Gerzon range, and we had to play interesting games with Naimark complements to make good constructions.
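Here is a minimal sketch of the complex-to-real conversion on a toy example. The particular 2-by-2 block used here, with the real part on the diagonal and plus or minus the imaginary part off the diagonal, is the usual real representation of a complex number and is my assumption about the details. The three test vectors are arbitrary unit vectors rather than the packing from the theorem.

```python
import math
from itertools import combinations

def complex_coherence(vectors):
    """Largest modulus of a Hermitian inner product between distinct vectors."""
    return max(abs(sum(a.conjugate() * b for a, b in zip(u, v)))
               for u, v in combinations(vectors, 2))

def real_coherence(vectors):
    """Largest absolute inner product between distinct real vectors."""
    return max(abs(sum(a * b for a, b in zip(u, v)))
               for u, v in combinations(vectors, 2))

def realify(vectors):
    """Replace each complex entry a+bi by the 2x2 block [[a, -b], [b, a]].

    Each complex vector of length d becomes two real vectors of length 2d,
    one per column of the blocks.
    """
    real_vectors = []
    for z in vectors:
        first, second = [], []
        for entry in z:
            a, b = entry.real, entry.imag
            first += [a, b]     # first column of each block
            second += [-b, a]   # second column of each block
        real_vectors += [first, second]
    return real_vectors

s = 1 / math.sqrt(2)
complex_vectors = [
    (1 + 0j, 0j),
    (s + 0j, s + 0j),
    (s + 0j, s * 1j),
]
real_vectors = realify(complex_vectors)
print(complex_coherence(complex_vectors), real_coherence(real_vectors))
```

The two real vectors coming from the same complex vector are orthogonal, and inner products across different complex vectors become the real and imaginary parts of the complex inner product, so the real coherence can never exceed the complex one.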

We did not attempt to optimize our analysis, and judging by Sloane’s database, we suspect the optimal constant is less than 2. We think follow-on work along these lines would make a nice project for a student.

**3. We found two new infinite families of locally optimal packings.**

Since proving global optimality is so hard, we also looked into how to prove local optimality. We can reformulate the problem as minimizing subject to being in the manifold of Gram matrices of spanning -packings in . To see how to prove local optimality of a given packing, consider the following illustration:

We want to show that is a local minimizer of subject to the manifold . To this end, we can locally model the sublevel set and manifold (left) with the descent cone and tangent space (right). If 0 is the unique member of the intersection between the descent cone and the tangent space (which can be certified using the dual linear program provided the descent cone is a polytope), then is a local minimizer.

To use this theory, we studied the packings in Sloane’s database, and we observed some interesting patterns. For example, some of Sloane’s putatively optimal packings arise by removing a vector from an equiangular tight frame (ETF). Also, in the case where , his putatively optimal packings are frequently orthobiangular, and can be constructed by lifting smaller ETFs. We show that all such packings are locally optimal by constructing the appropriate dual certificates. Furthermore, some of these packings beat Sloane’s putatively optimal packings. See Table 1 in the paper for the improvements (denoted by stars).

**4. Many of Sloane’s putatively optimal packings are tight frames with few angles.**

When looking through Sloane’s database, we found that surprisingly many of the packings happen to form tight frames. Furthermore, most of these tight frames have few angles. This suggests a generalization of the Welch bound in which equality is characterized by a generalization of ETFs. In the absence of this more general theory, we developed short descriptions of each of Sloane’s putatively optimal packings that happen to be tight with small angle sets. Most of these packings involve classical objects like polytopes or lattices, incidence structures like Steiner ETFs, or “marriages” like those introduced by Bodmann and Haas.

We were able to generalize some of the constructions based on incidence structures to infinite families whose coherence is a factor of away from the Welch bound. We were able to prove that these orthobiangular tight packings are optimal when restricting to packings that are orthobiangular and tight. It would be interesting to see some computational evidence of their optimality for larger dimensions than those investigated by Sloane.

]]>

**DGM:** What is the origin story of this project? Were you and Paul inspired by the “Compressed sensing using generative models” paper?

**VV:** I have been working extensively with applied deep learning for the last year or so, and have been inspired by recent applications of deep generative image priors to classical inverse problems, such as the super-resolution work by Fei-Fei Li et al. Moreover, recent work on regularizing with deep generative priors for synthesizing the preferred inputs to neural activations, by Yosinski et al., made me optimistic that GAN-based generative priors are capturing sophisticated natural image structure (the synthetic images obtained in this paper look incredibly realistic).

I’ve been aware of and excited about the idea of using deep learning to improve compressed sensing since CVPR 2016, but the numerics and theory provided by Bora et al. tipped me over the edge. In particular, the numerical evidence in Bora et al. is pretty strong, both because of the 10X reduction in sample complexity from traditional CS and the fact that SGD worked out of the box on empirical risk. It struck me as significant that MRI could potentially be sped up by another factor of 10X by these techniques.

My spiked interest in using generative models for CS coincided with a visit that Paul Hand made to the Bay Area during which we planned to initiate a new deep learning collaboration, and we laid down the initial theoretical and empirical groundwork for our paper during that same visit.

**DGM:** Do you have any intuition for why there are two basins of attraction? What is the significance of ?

**VV:** We have not found a particularly satisfying answer to this question. The behavior of the empirical risk objective at the distinguished negative multiple of (call it ) is a bit subtle; for instance the expected Hessian there is positive semidefinite but not positive definite, so it’s unclear if it’s a local extremum or a saddle (we suspect it’s a degenerate saddle from some preliminary calculations). Empirically, it appears that there are potentially two critical points which get pushed closer and closer to as one cranks up the expansivity of the layers. The expected gradient at is zero, and one interpretation is that this has to do with the peculiarity of the ReLU activation function, in how it “restricts” movement in the higher layers. Note that in the one-layer case, there is only one basin of attraction, so this double basin phenomenon only manifests itself for two or more layers.

A more technical explanation is as follows: consider the case of a two-layer generator. Note that any non-zero vectors and for any , get mapped to disjoint support (and thus orthogonal) vectors by any map of the form where is a linear transformation. Because of the form of the ReLU function, any perturbation for small enough only changes the positive activations of the first layer, but the expected gradient of the second layer with respect to the first can be made to be zero along the dimensions corresponding to the positive activations by choosing a particular .
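The disjoint-support observation is easy to check numerically. The sketch below assumes my reading of the statement: the two inputs are a vector x and a negative multiple -c·x with c > 0, and the map is a single ReLU layer x ↦ relu(Wx). The matrix `W`, the sizes, and the constant `c` are arbitrary illustrative choices (c is taken to be 2 so that the scaling is exact in floating point).

```python
import random


def relu(v):
    """Entrywise ReLU of a list of floats."""
    return [max(0.0, t) for t in v]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * t for w, t in zip(row, x)) for row in W]


random.seed(0)
n, k = 8, 3  # layer width and latent dimension (illustrative sizes)
W = [[random.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
x = [random.gauss(0.0, 1.0) for _ in range(k)]
c = 2.0      # any positive scalar works in exact arithmetic

a = relu(matvec(W, x))                    # first-layer activations at x
b = relu(matvec(W, [-c * t for t in x]))  # first-layer activations at -c * x

# Coordinates where both activations are positive (there should be none).
overlap = [i for i in range(n) if a[i] > 0 and b[i] > 0]
inner = sum(s * t for s, t in zip(a, b))

print(overlap, inner)
```

Each coordinate of Wx flips sign when x is replaced by -c·x, so a neuron that fires for one input is silent for the other, and the two activation vectors are orthogonal.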

**DGM:** Do your landscape results reflect the empirical behavior of (non-random) generative models that are actually learned from data?

**VV:** The main evidence for deep CS using non-random generative models that I can point to is the set of empirical results in Bora et al., where indeed the observed behavior matches our theory, in that gradient descent on empirical risk converges to the desired solution. We have also done extensive numerics on random instances, for which the same is true. Regarding the assumptions of our theory, the expansivity of the layers seems to be realistic (an ideal generator should not be collapsing information between layers, and better compression corresponds to a smaller latent code space), and the independence of weights in each layer may be closer to reality than it appears at first glance, since an ideal generator may strive for independence between layer representations to maximize information efficiency.

**DGM:** What is Helm.ai? Do your results have any implications for autonomous navigation?

**VV:** Helm.ai is a startup I’m working on, which builds robust perception systems for autonomous navigation. We are tackling the most challenging aspects of the technology required to reach full autonomy for self-driving cars, drones and consumer robots. Semi-supervised learning is a large component of what we work on, and deep generative models are certainly relevant toward that goal.

There is always a gap between theory and practice, but conceptually I believe that using deep generative models for CS will have wide implications, including for autonomous navigation. For instance, there are companies out there building LIDAR sensors with a higher resolution per time to cost ratio (which is necessary for applications) by using concepts from compressed sensing. If and when DCS takes off, we should see benefits to such efforts, but of course it takes years for new algorithmic techniques to trickle down… it took 12 years to get from the first convincing results on CS to an FDA approved CS-based MRI machine which is 10X faster.

**DGM:** What’s next for this line of investigation? Denoising? Phase retrieval? Other inverse problems?

**VV:** There are (too) many interesting follow-up directions! There are of course many technical extensions, which we will tackle in the journal version of the paper, but I will comment below on what I find to be the most interesting high-level direction, on which we are currently preparing an arXiv submission.

The theoretical framework we propose potentially applies to any inverse problem for which deep generative priors may be obtained, especially when empirical risk minimization is an appropriate reconstruction method in un-regularized versions of those inverse problems. Given the work by Candes et al. on Wirtinger Flow, and work by John Wright et al. on the geometry of quadratic recovery problems, empirical risk should be a reasonable approach to enforcing generative priors for phase retrieval.

I am particularly excited about using generative priors for phase retrieval, because of the potential for tangibly improving performance in applications and due to recent evidence of the more severe sample complexity bottlenecks in sparsity-based compressive phase retrieval as compared to traditional CS. Phase retrieval is inherently ill-posed, and classical approaches to overcoming this ill-posedness involve enforcing numerous instance-specific constraints, which is tedious and requires fairly specific expertise. More modern proposals, for instance by Candes et al., are to take redundant measurements with different “masks”, but this technique is not readily physically realizable and requires blasting the sample of interest multiple times, which rapidly degrades or destroys the sample at hand (which is typically difficult to prepare/acquire in the first place). Thus, it would be beneficial to use a minimal number of observations, without changing the measurement modality, by exploiting signal structure.

Recent attempts to combine classical sparsity-based compressed sensing with phase retrieval toward this goal have been met with potential computational complexity bottlenecks, since observations seem to be required for recovering -sparse signals via compressive phase retrieval using current methods, which makes it all the more important to exploit more sophisticated structure of natural signals. Meanwhile, building generative priors is purely a data-driven approach, which doesn’t require building new physical apparatus or acquisition methodologies, nor does it necessarily require intimate knowledge of the specific problems at hand. All it would require is a large/diverse enough dataset of reconstructed biological structures and a powerful enough deep generative model. Enforcing such a deep generative prior then becomes a purely algorithmic challenge, without putting any extra onus on experimental scientists, in fact reducing the amount of modeling they typically have to do.

]]>

[Flammia described] the SIC-POVM problem as a “heartbreaker” because every approach you take seems super promising but then inevitably fizzles out without really giving you a great insight as to why.

Case in point, Joey and I identified a promising approach involving ideas from our association schemes paper. We were fairly optimistic, and Joey even bet me $5 that our approach would work. Needless to say, I now have this keepsake from Joey:

While our failure didn’t offer any great insights (as Flammia predicted), the experience forced me to review the literature on Zauner’s conjecture a lot more carefully. A few things caught my eye, and I’ll discuss them here. Throughout, SIC denotes “symmetric informationally complete line set” and WH denotes “the Weyl-Heisenberg group.”

**1. WH over produces a SIC only if .**

Of course, this works for since this reduces to WH over a cyclic group. The case corresponds to the famous Hoggar lines (introduced here). Last week, I learned that Godsil and Roy proved that this doesn’t work in general (see Lemma 3.1 here). What’s the obstruction? The system of equations doesn’t have a solution except for these two special cases. Sadly, this is not terribly enlightening.

Before seeing this result, I had assumed that SICs would arise from all WHs over finite abelian groups. After seeing Godsil and Roy’s result, I wrote an interpretation of the numerical optimization code briefly described in the computer study paper, and I failed to find SICs from WHs over other small non-cyclic abelian groups. Apparently, the cyclic groups and are special, but I have no idea why.
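To make "WH produces a SIC" concrete, here is a minimal check in the simplest cyclic case, d = 3. It is not the optimization code mentioned above; it only takes a candidate fiducial vector, forms its orbit under the shift and modulation operators, and measures how far the pairwise squared inner products are from 1/(d+1). The fiducial (0, 1, -1)/√2 used here is the well-known one that generates the Hesse SIC; the function names are mine.

```python
import cmath
import math


def wh_orbit(fiducial):
    """Orbit of a vector under X^a Z^b, the Weyl-Heisenberg group over Z_d (phases ignored)."""
    d = len(fiducial)
    omega = cmath.exp(2j * math.pi / d)
    orbit = []
    for a in range(d):          # power of the cyclic shift X
        for b in range(d):      # power of the modulation Z
            orbit.append([
                omega ** (b * ((j - a) % d)) * fiducial[(j - a) % d]
                for j in range(d)
            ])
    return orbit


def sic_defect(vectors):
    """Largest deviation of |<u, v>|^2 from 1/(d+1) over distinct pairs of vectors."""
    d = len(vectors[0])
    target = 1.0 / (d + 1)
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            ip = sum(x.conjugate() * y for x, y in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(abs(ip) ** 2 - target))
    return worst


hesse_fiducial = [0.0, 1 / math.sqrt(2), -1 / math.sqrt(2)]
basis_fiducial = [1.0, 0.0, 0.0]  # a deliberately bad choice, for contrast

hesse_orbit = wh_orbit(hesse_fiducial)
print(len(hesse_orbit), sic_defect(hesse_orbit))
print(sic_defect(wh_orbit(basis_fiducial)))
```

The first defect is zero up to rounding, while the standard basis vector fails badly. A numerical search for SICs amounts to minimizing a defect of this kind over the fiducial, with the orbit map swapped out for the group of interest.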

**2. SICs can be generated by groups other than WH.**

Back in 2003, Renes et al. performed numerical optimization to determine whether SICs could be obtained from non-WH groups. To this end, they churned through certain members of the SmallGroups Library and found that the groups G(36,11), G(36,14), G(64,8) and G(81,9) all lead to SICs. Later, Grassl found exact SICs for the first and third of these groups by computing the appropriate Groebner bases (he also points out that G(36,14) is actually WH). For the record, the exact coordinates in these cases are about as ugly as the coordinates for the WH SICs. But as far as I can tell, these alternative constructions (which reside in dimensions 6 and 8) have been forgotten by the modern SIC literature. For example, they do not appear as “sporadic SICs” in the Exact SICs table.

**3. Prime-dimensional group-generated SICs are necessarily generated by WH.**

This was established by Huangjun Zhu in this paper back in 2010, and it suggests that WH is the “right” group to work with (if the substantial evidence in favor of Zauner’s conjecture weren’t enough). Unfortunately, the description length of exact fiducial vectors over WH scales poorly with the dimension. One is inclined to compress these descriptions into shorter, workable representations before attempting pattern recognition for theorem discovery. Based on my experience with constructing infinite families of ETFs, this is the most promising approach for a constructive proof of Zauner’s conjecture.

**4. It looks like WH SICs are always determined by explicit equations, instead of .**

Fuchs et al. recently posed what they call *the 3d conjecture*, which asserts that the WH SICs are precisely the solutions to equations they give in (27)–(29) of their paper. The conjecture holds for , and it’s held up to numerical scrutiny for . This suggests a couple of new approaches: (1) Prove the 3d conjecture. (2) Prove that 3d implies Zauner. I wouldn’t be surprised if it’s easier to determine whether 3d admits solutions, so this could be an interesting conditional proof of Zauner.

**5. A constructive proof of Zauner’s conjecture may require progress on Hilbert’s 12th problem.**

A constructive proof requires a finite-length description of an infinite family of SICs, since the proof would contain such a description. For all known non-maximal ETFs (see this paper for a survey), the Gram matrix can always be phased in such a way that all of the entries are cyclotomic, and furthermore, expressing the Gram matrix entries in this way allows patterns to emerge that enable both a short description and a proof of ETF-ness for an infinite family. (For an illustrative example, consider the harmonic ETFs.)

As established back in 2012, all of the known exact WH SICs have the property that the orthogonal projection onto the line spanned by the fiducial vector has matrix entries that lie in an abelian extension of . Since they lie in an abelian extension of an abelian extension of , these entries are expressible by radicals, and this is the representation of choice in the Exact SICs table. However, Hilbert’s 12th problem suggests that a better representation might be available. By analogy, the Kronecker–Weber theorem gives that every abelian extension of is cyclotomic, and the Kronecker Jugendtraum gives that every abelian extension of an imaginary quadratic field can be obtained with values of certain elliptic functions. By contrast, we are looking at abelian extensions of a *real* quadratic field, which is not a solved case of Hilbert’s 12th. Still, one might leverage the Stark conjectures to find a suitable basis. Apparently, the computer algebra system PARI/GP makes this a plausible enterprise, but I haven’t found the time to write the necessary code. (I’m still recoiling from my latest Zauner burn with Joey.)

]]>

The following line from the introduction caught my eye:

For instance the print-out for exact fiducial 48a occupies almost a thousand A4 pages (font size 9 and narrow margins).

As my previous blog entry illustrated, the description length of SIC-POVM fiducial vectors appears to grow rapidly with . However, it seems that the rate of growth is much better than I originally thought. Here’s a plot of the description lengths of the known fiducial vectors (the new ones due to ACFW17 — available here — appear in red):

Note that the vertical axis has logarithmic scale. Unlike my interpretation from two years ago, the description lengths appear to exhibit subexponential growth in . Putting the horizontal axis in log scale says even more:

The dotted line depicts . This suggests that the description length scales with the number of entries in the Gram matrix.

For context, let’s consider the more general problem of constructing equiangular tight frames (ETFs) of vectors in dimension ; see this paper for a survey. In the real case, it suffices to determine the sign pattern of an ETF’s Gram matrix, which can be naively described in bits. However, there are several infinite families of real ETFs with much shorter description length. Indeed, the sign patterns are determined by certain strongly regular graphs, many of which enjoy a straightforward algebro-combinatorial construction.

In the case of SIC-POVMs, the Gram matrix is complex, so it doesn’t correspond to a strongly regular graph in the same way, but the conjectures used in ACFW17 suggest that the Gram matrix may be selected so as to satisfy certain group and number theoretic properties. But even after reducing to such specific structure, the description length appears to scale with the size of the Gram matrix (i.e., the naive scaling in the real case). As such, an infinite family of explicit SIC-POVMs will likely require the identification of additional structure. This is shocking, considering the conjectured structures that are currently used already seem miraculous.

]]>

**DGM:** How were you introduced to this problem? Do you have any particular applications of shape matching or point-cloud comparison in mind with this research?

**SV:** This problem was introduced to me by Andrew Blumberg in the context of topological data science. Andrew is an algebraic topologist who is also interested in applications, in particular in computational topology. There is a vast literature on the registration problem for 3d shapes and surfaces, but the methods are usually tailored to the geometric properties of the space and rely on strong geometric assumptions. Our goal was to study this problem in an abstract setting that could have potential impact in spaces with unusual geometry. In particular, we are thinking of spaces of phylogenetic trees, protein-protein interaction data, and text processing. We don’t have experimental results for those problems yet, but we are working on it.

A reason why it is so hard to obtain meaningful results for these “real data” problems is that it is hard to validate whether the method produces a meaningful result. A simple way for a mathematician like me to validate the performance of our methods and algorithms is to compare with problems where the ground truth solution is known (like the teeth classification and shape matching), and this is what we did in the paper.

For future scientific applications, I’m working with Bianca Dumitrascu, who is a graduate student in computational biology at Princeton. Bianca works with large datasets of protein-protein interaction information. She has the intuition that the existence of isometries between protein interaction measurements in different biological systems should be correlated with similar roles between corresponding proteins. However such behavior is very hard to test in real data because of scalability issues, the large amount of noise present in the data, and the lack of a theoretical ground truth in most cases.

**DGM:** Do you have any intuition for why your polynomial-time lower bound on Gromov-Hausdorff distance satisfies the triangle inequality?

**SV:** The intuitive answer: I think this is a phenomenon aligned with the “data is not the enemy” philosophy. The Gromov-Hausdorff distance is NP-hard in the worst case, but it is actually computable in polynomial time for a generic set of metric spaces. Since our relaxed distance coincides with the Gromov-Hausdorff distance at small scales, we could intuitively expect that it is actually a distance (and therefore satisfies the triangle inequality).

The practical answer: Considering the relations that realize and , there is a straightforward way to define a relation between and so that the Gromov-Hausdorff objective value for that relation is less than or equal to . Just consider the composition! If the result of our semidefinite program is interpreted as a soft assignment between points from one metric space to another, then it is natural to ask what the composition of soft assignments is, whether it is feasible for the semidefinite program, and if it is upper bounded by . This is basically why the triangle inequality holds.
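The composition argument can be seen in a few lines for hard (0/1) relations, which is the classical Gromov-Hausdorff setting rather than the semidefinite relaxation. In this sketch, the distortion of a relation is the largest discrepancy between distances of related pairs (the Gromov-Hausdorff distance is half the smallest distortion over all correspondences). The three toy metric spaces and the relations `R` and `S` are invented for illustration.

```python
def line_metric(points):
    """Distance matrix of points on the real line."""
    return [[abs(p - q) for q in points] for p in points]


def distortion(R, dX, dY):
    """max |dX(x, x') - dY(y, y')| over pairs (x, y), (x', y') in the relation R."""
    return max(abs(dX[x][xp] - dY[y][yp]) for (x, y) in R for (xp, yp) in R)


def compose(R, S):
    """Relation between X and Z obtained by chaining R (X to Y) with S (Y to Z)."""
    return {(x, z) for (x, y) in R for (y2, z) in S if y == y2}


# Three 3-point metric spaces, each a subset of the real line.
dX = line_metric([0.0, 1.0, 3.0])
dY = line_metric([0.0, 1.5, 3.0])
dZ = line_metric([0.0, 2.0, 3.0])

# Correspondences matching the i-th point to the i-th point.
R = {(0, 0), (1, 1), (2, 2)}  # between X and Y
S = {(0, 0), (1, 1), (2, 2)}  # between Y and Z

dis_R = distortion(R, dX, dY)
dis_S = distortion(S, dY, dZ)
dis_RS = distortion(compose(R, S), dX, dZ)

print(dis_R, dis_S, dis_RS)
```

The composed relation has distortion at most the sum of the two distortions because each discrepancy between X and Z can be routed through an intermediate pair in Y; the question in the relaxed setting is whether composing soft assignments preserves feasibility and this same bound.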

**DGM:** You proved that generic finite metric spaces enjoy a neighborhood of spaces whose Gromov-Hausdorff distance from equals your lower bound (i.e., your bound is generically tight for small perturbations). However, the size of the allowable perturbation seems quite small. Later, you mention that you frequently observe tightness in practice. Do you think that tightness occurs for much larger perturbations in the average case over some reasonable distribution?

**SV:** I think tightness occurs for relatively large perturbations of the isometric case provided that the data is well conditioned. However, in an extreme case, if all pairwise distances are the same, then the solution of the semidefinite program is not unique and therefore tightness will not occur. When studying the distance from the topological point of view, a result of the form “there exists a local neighborhood such that the distances coincide” is relevant. From an applied mathematical perspective, it would be interesting to quantify how large a perturbation the semidefinite program can tolerate while remaining tight. The techniques I know for obtaining such a result rely on the construction of dual certificates. The dual certificates I managed to construct also had a dependency on (the minimum nonzero entry in ) due to degeneracy issues. I think it should be possible to obtain a tightness result for larger perturbations but I think it may be a hard problem. The way I would start thinking about this is with numerical experiments and a conjectured phase transition for tightness of the semidefinite program as a function of noise, for different ‘s.

**DGM:** How frequently does your local method GHMatch recover the Gromov-Hausdorff distance in practice? Is there a way to leverage the smallest eigenvector of to get a better initialization (a la Wirtinger Flow for phase retrieval)?

**SV:** The algorithm GHMatch often gets stuck in local minima. In many non-convex optimization algorithms, a good initialization is enough to guarantee convergence to a global optimum after some steps of gradient descent. However, our optimization problem has non-negativity constraints, which make it significantly harder because the variable needs to be at least thresholded to be non-negative after each iteration. There is a class of algorithms that attempts to do such things for synchronization problems, such as projected power methods (see for example this paper). But the right algorithm is not to project just like that, but to weight carefully with approximate message passing, as they do for example in this paper.

]]>