Talks from the Summer of ’17

This summer, I participated in several interesting conferences. This entry documents my slides and describes a few of my favorite talks from the summer. Here are links to my talks:

UPDATE: SIAM AG17 just posted a video of my talk.

Now for my favorite talks from FoCM, ILAS, SIAM AG17 and SPIE:

Ben RechtUnderstanding deep learning requires rethinking generalization

In machine learning, you hope to fit a model so as to be good at prediction. To do this, you fit to a training set and then evaluate with a test set. In general, if a simple model fits a large training set pretty well, you can expect the fit to generalize, meaning it will also fit the test set. By conventional wisdom, if the model happens to fit the training set exactly, then your model is probably not simple enough, meaning it will not fit the test set very well. According to Ben, this conventional wisdom is wrong. He demonstrates this by presenting some observations he made while training neural nets. In particular, he allowed the number of parameters to far exceed the size of the training set, and in doing so, he fit the training set exactly, and yet he still managed to fit the test set well. He suggested that generalization was successful here because stochastic gradient descent implicitly regularizes. For reference, in the linear case, stochastic gradient descent (aka the randomized Kaczmarz method) finds the solution of minimal 2-norm, and it converges faster when the optimal solution has smaller 2-norm. Along these lines, Ben has some work to demonstrate that even in the nonlinear case, fast convergence implies generalization.

Afonso BandeiraThe sample complexity of multi-reference alignment

How do you reconstruct a function over $\mathbb{Z}/n\mathbb{Z}$ given a collection of noisy translations of the function? Intuitively, you might use one of the noisy translations as a template and then try to find the translation of best fit for each of the others before averaging. Perhaps surprisingly, this fails miserably. For example, if you find the translation of best fit for a bunch of pure noise functions, then the average appears to approach the template, thereby demonstrating so-called model bias. Another approach is to collect translation-invariant features of the functions. For example, you can estimate the average value of the function with $O(1/\mathrm{SNR})$ samples, the power spectrum with $O(1/\mathrm{SNR}^2)$ samples, and the bispectrum with $O(1/\mathrm{SNR}^3)$ samples. It turns out that the bispectrum determines generic functions up to translation, but is there an alternative that provides smaller sample complexity? Afonso’s main result here: No. In fact, there are two functions that are confusable unless you see enough samples. I wonder what sort of improvements can be made given additional structural information on the function.

Vern PaulsenQuantum chromatic numbers via operator systems

Given a graph, color the vertices so that adjacent vertices receive distinct colors. The chromatic number of the graph is the smallest number of colors you need to accomplish this task. Here’s another way to phrase the coloring task: Put Alice and Bob in separate rooms, and simultaneously ask them the color of certain vertices. If the vertices you ask about are adjacent, Alice and Bob must report different colors. If the vertices are identical, they must report the same color. The chromatic number is the smallest number of colors for which Alice and Bob have a winning strategy without communicating. If you allow Alice and Bob access to a common random source, then this smallest number of colors does not change. However, if you allow them access to entangled particles, then the smallest number of colors frequently does change. This suggests a new graph invariant called the quantum chromatic number. Interestingly, the quantum version is sometimes much smaller and much easier to calculate than the classical version. For example, the Hadamard graph of parameter $n$, the classical chromatic number is only known to be somewhere between $1.98^n$ and $2^n$, whereas the quantum chromatic number is known to be exactly $n$. Developing a quantum protocol for any given graph amounts to finding interesting arrangements of subspaces, which I think would appeal to the frame theory community.

Hamid JavadiNon-negative matrix factorization via archetypal analysis

Given a collection of points $x_1,\ldots,x_n$ in $\mathbb{R}^d$, how do you find a small number of “archetypes” $h_1,\ldots,h_\ell$ such that each $x_i$ is close to the convex hull of the $h_j$‘s and each $h_j$ is close to the convex hull of the $x_i$‘s? This problem has a number of applications in data science, and if we further ask for the $h_j$‘s to be entrywise nonnegative, this is equivalent to the problem of nonnegative matrix factorization (NMF). A lot of work in NMF has used a generative model with a so-called separability assumption, which asks for each archetype to be one of the data points. Other work by Cutler and Breiman relaxed the separability assumption, merely asking for each archetype to lie in the convex hull of the data points. Unfortunately, these assumptions break if the data points avoid the corners of the hull of the archetypes. So how can we hope to reconstruct the archetypes in such cases? Well, instead of constraining the archetypes to the convex hull of the $x_i$‘s, you can penalize distance from the convex hull. This amounts to regularizing the objective to encourage achetype-ness. The following illustration from the paper is helpful:

The top left illustrates how the data points were generated from the unknown true archetypes, the top right shows the output of a method that assumes separability, the bottom left assumes each archetype lies in the convex hull of the data, and the bottom right gives the regularized reconstruction, which is closest to the ground truth. Figure 3 of the paper illustrates how they can robustly reconstruct molecule spectra from mixtures better than the competition.

Venkat ChandrasekaranRelative entropy relaxations for signomial optimization

A signomial is a function $f\colon\mathbb{R}^n\mapsto\mathbb{R}$ of the form

$\displaystyle{f(x) = \sum_{j=1}^\ell c_j \exp(\alpha_j^\top x),}$

where each $c_j\in\mathbb{R}$ and each $\alpha_j\in\mathbb{R}^n$. How can we certify whether $f(x)$ is nonnegative for every $x$? In the special case where the $\alpha_j$‘s have nonnegative integer entries, then $f(x)$ can be expressed as a polynomial of $y_i=e^{x_i}$ for $i\in[n]$, and so we can show that $f(\log y)$ is a sum of squares. Instead, Venkat’s paper provides an analogous decomposition: He uses the AM-GM inequality to certify the nonnegativity of certain signomials with at most one negative $c_j$, and then he provides a tractable routine for testing whether a given signomial is a sum of such functions, i.e., a sum of AM-GM exponentials, or SAGE. Interestingly, testing for SAGE is often faster than testing for SOS. In fact, if you want to test whether a polynomial is nonnegative over the positive orthant, this suggests that changing variables to signomials and testing for SAGE might be a better alternative.

Yaniv Plan — De-biasing low-rank projection for matrix completion

In the real world, when you’re asked to do matrix completion, the matrix entries you’re given are far from uniformly distributed, and you don’t have time to run an SDP. Yaniv investigated how one might get around both of these bottlenecks. First, in the uniform case, instead of running an SDP, you get decent performance by just grabbing the top singular vectors of the incomplete matrix. As such, for runtime considerations, it makes sense to replicate this spectral-type approach in the non-uniform case. For the non-uniform case, notice that if you don’t see any entries of a given row or column, then the singular vectors will zero out that row/column. The quality of reconstruction should therefore be measured in terms of how well a given row or column is sampled. To quantify the extent to which a row/column is sampled, just grab the top left and right singular vector of the 0-1 matrix with 1s at the sampled locations. This weighting actually serves two purposes: It helps to evaluate the quality of reconstruction, and it also allows one to “de-bias” the matrix samples before running the spectral matrix completion method. In particular, if we suppose that the weighting corresponds to a probability distribution over which the samples were drawn, then if we entrywise divide a random incomplete matrix by the weighting matrix, the expected quotient will be the desired completed matrix. As such, one should divide by the weighting and complete with the top singular vectors, and then the weighted Frobenius norm of the error is guaranteed to be small.

Deanna NeedellTolerant compressed sensing with partially coherent sensing matrices

Compressed sensing tells us that you can reconstruct any sparse vector from its product with a short, fat matrix $\Phi$ that has incoherent columns. But what if the columns are coherent? Of course, if columns $i$ and $j$ are identical, then you won’t be able to tell the difference between vector supports that include $i$ from those that include $j$, but in applications like radar, it is permissible to confuse certain entries. For example, suppose you are willing to confuse entries whose indices are at most $d$ away from each other. Then we can get away with having nearby columns of $\Phi$ being coherent, as long as the distant columns are incoherent. In particular, the support of any vector can be recovered to within a tolerance of $d$ provided the sparse vector’s nonzero entries are sufficiently spread apart. This reminds me a lot of the CLEAN algorithm that’s commonly used in radar, and I find it interesting that you can get a guarantee that allows for tolerance in the support recovery. For this reason, this seems fundamentally different from the superresolution work that concerns conditions for exact recovery. I wonder if it’s possible to accommodate more nonzero entries with the help of randomness (a la RIP).