Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk

Vlad Voroninski recently posted an arXiv preprint with Paul Hand that provides compressed sensing guarantees using a neural net-based generative signal model. This offers some theoretical justification for the shocking empirical results presented in the “Compressed sensing using generative models” paper, which demonstrates compressed sensing with 10 times fewer measurements than conventional compressed sensing. I was especially excited to see this paper, having recently read Michael Elad’s editorial on deep learning. To learn more, I interviewed Vlad (see below); I’ve lightly edited his responses for formatting and hyperlinks:

DGM: What is the origin story of this project? Were you and Paul inspired by the “Compressed sensing using generative models” paper?

VV: I have been working extensively with applied deep learning for the last year or so, and I have been inspired by recent applications of deep generative image priors to classical inverse problems, such as the super-resolution work by Fei-Fei Li et al. Moreover, recent work by Yosinski et al. on regularizing with deep generative priors to synthesize the preferred inputs of neural activations made me optimistic that GAN-based generative priors are capturing sophisticated natural image structure (the synthetic images obtained in that paper look incredibly realistic).

I’ve been aware of and excited about the idea of using deep learning to improve compressed sensing since CVPR 2016, but the numerics and theory provided by Price et al. tipped me over the edge. In particular, the numerical evidence in Price et al. is pretty strong, both because of the 10X reduction in sample complexity relative to traditional CS and because SGD worked out of the box on empirical risk. It struck me as significant that MRI could potentially be sped up by another factor of 10X by these techniques.
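
[For concreteness, here is a minimal numpy sketch of the kind of gradient descent on the empirical risk \tfrac{1}{2}\|AG(x)-y\|^2 described above. The dimensions, step size, iteration count, and the toy two-layer random ReLU generator are illustrative choices, not the exact setup from either paper, and descent may land in either of the two basins discussed below.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: latent k, hidden n1, signal n, measurements m (expansive layers, m < n).
k, n1, n, m = 10, 100, 400, 80

# Random Gaussian weights (variance 1/rows) and a random measurement matrix.
W1 = rng.normal(size=(n1, k)) / np.sqrt(n1)
W2 = rng.normal(size=(n, n1)) / np.sqrt(n)
A = rng.normal(size=(m, n)) / np.sqrt(m)

relu = lambda v: np.maximum(v, 0.0)

def G(x):
    """Two-layer random ReLU generator G(x) = ReLU(W2 ReLU(W1 x))."""
    return relu(W2 @ relu(W1 @ x))

x0 = rng.normal(size=k)          # latent code of the true signal
y = A @ G(x0)                    # noiseless compressive measurements of G(x0)

def grad_risk(x):
    """Gradient (defined a.e.) of the empirical risk 0.5 * ||A G(x) - y||^2."""
    z1 = W1 @ x
    z2 = W2 @ relu(z1)
    r = A @ relu(z2) - y
    back = (z2 > 0) * (A.T @ r)      # backprop through the outer ReLU
    back = (z1 > 0) * (W2.T @ back)  # backprop through the inner ReLU
    return W1.T @ back

x = rng.normal(size=k)           # random initialization
for _ in range(3000):
    x -= 0.5 * grad_risk(x)

# Descent may instead approach a negative multiple of x0 (the second basin).
print("relative error to x0:", np.linalg.norm(x - x0) / np.linalg.norm(x0))
```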

This spike in my interest in using generative models for CS coincided with a visit that Paul Hand made to the Bay Area, during which we planned to initiate a new deep learning collaboration, and we laid down the initial theoretical and empirical groundwork for our paper during that same visit.

DGM: Do you have any intuition for why there are two basins of attraction? What is the significance of -1/\pi?

VV: We have not found a particularly satisfying answer to this question. The behavior of the empirical risk objective at the distinguished negative multiple of x_0 (call it -\rho_d x_0) is a bit subtle; for instance, the expected Hessian there is positive semi-definite but not strictly psd, so it’s unclear whether it’s a local extremum or a saddle (we suspect it’s a degenerate saddle, based on some preliminary calculations). Empirically, it appears that there are potentially two critical points which get pushed closer and closer to -\rho_d x_0 as one cranks up the expansivity of the layers. The expected gradient at -\rho_d x_0 is zero, and one interpretation is that this has to do with the peculiarity of the ReLU activation function, in how it “restricts” movement in the higher layers. Note that in the one-layer case there is only one basin of attraction, so this double-basin phenomenon only manifests itself for two or more layers.

A more technical explanation is as follows: consider the case of a two-layer generator. Note that any nonzero vector x and any negative multiple -cx with c > 0 get mapped to vectors with disjoint supports (and which are thus orthogonal) by any map of the form \mathrm{ReLU}(Ax), where A is a linear transformation. Because of the form of the ReLU function, any perturbation x_0 + h for small enough h only changes the positive activations of the first layer, but the expected gradient of the second layer with respect to the first can be made zero along the dimensions corresponding to the positive activations by choosing a particular c > 0.
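
[As a quick sanity check of the disjoint-support observation, the following numpy snippet (with arbitrary dimensions) verifies that \mathrm{ReLU}(Ax) and \mathrm{ReLU}(-cAx) have disjoint supports and are therefore orthogonal.]

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 200, 10
A = rng.normal(size=(n, k))        # any linear map A
x = rng.normal(size=k)             # any nonzero vector
c = 2.7                            # any c > 0

u = np.maximum(A @ x, 0.0)         # ReLU(A x)
v = np.maximum(A @ (-c * x), 0.0)  # ReLU(A(-c x)) = ReLU(-c A x)

# u is supported where A x > 0 and v where A x < 0, so the supports are disjoint.
print("entries where both are nonzero:", int(np.sum((u > 0) & (v > 0))))  # 0
print("inner product <u, v>:", float(u @ v))                              # 0.0
```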

DGM: Do your landscape results reflect the empirical behavior of (non-random) generative models that are actually learned from data?

VV: The main evidence for deep CS using non-random generative models that I can point to is the set of empirical results in Price et al., where the observed behavior indeed matches our theory, in that gradient descent on empirical risk converges to the desired solution. We have also done extensive numerics on random instances, for which the same is true. Regarding the assumptions of our theory, the expansivity of the layers seems to be realistic (an ideal generator should not be collapsing information between layers, and better compression corresponds to a smaller latent code space), and the independence of weights in each layer may be closer to reality than it appears at first glance, since an ideal generator may strive for independence between layer representations to maximize information efficiency.

DGM: What is Helm.ai? Do your results have any implications for autonomous navigation?

VV: Helm.ai is a startup I’m working on, which builds robust perception systems for autonomous navigation. We are tackling the most challenging aspects of the technology required to reach full autonomy for self-driving cars, drones and consumer robots. Semi-supervised learning is a large component of what we work on, and deep generative models are certainly relevant toward that goal.

There is always a gap between theory and practice, but conceptually I believe that using deep generative models for CS will have wide implications, including for autonomous navigation. For instance, there are companies out there building LIDAR sensors with a better ratio of resolution per unit time to cost (which is necessary for applications) by using concepts from compressed sensing. If and when deep CS takes off, we should see benefits to such efforts, but of course it takes years for new algorithmic techniques to trickle down… it took 12 years to get from the first convincing results on CS to an FDA-approved CS-based MRI machine which is 10X faster.

DGM: What’s next for this line of investigation? Denoising? Phase retrieval? Other inverse problems?

VV: There are (too) many interesting follow-up directions! There are of course many technical extensions, which we will tackle in the journal version of the paper, but I will comment below on what I find to be the most interesting high-level direction, on which we are currently preparing an arXiv submission.

The theoretical framework we propose potentially applies to any inverse problem for which deep generative priors may be obtained, especially when empirical risk minimization is an appropriate reconstruction method for un-regularized versions of those inverse problems. Given the work by Candès et al. on Wirtinger Flow, and the work by John Wright et al. on the geometry of quadratic recovery problems, empirical risk should be a reasonable approach to enforcing generative priors for phase retrieval.
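
[As a rough sketch of what such an objective could look like, the following numpy snippet runs gradient descent on an intensity-based, Wirtinger-flow-style empirical risk composed with the same toy random ReLU generator as above. This is purely an illustration under those assumptions, not the formulation of the forthcoming paper, and it initializes near the truth only to show local refinement.]

```python
import numpy as np

rng = np.random.default_rng(2)

# Same toy two-layer ReLU generator as in the CS sketch; real measurements for simplicity.
k, n1, n, m = 10, 100, 400, 120
W1 = rng.normal(size=(n1, k)) / np.sqrt(n1)
W2 = rng.normal(size=(n, n1)) / np.sqrt(n)
A = rng.normal(size=(m, n)) / np.sqrt(m)

relu = lambda v: np.maximum(v, 0.0)
G = lambda x: relu(W2 @ relu(W1 @ x))

x0 = rng.normal(size=k)
b = (A @ G(x0)) ** 2                 # phaseless (intensity-only) measurements

def grad(x):
    """Gradient (a.e.) of the intensity risk 0.25 * sum(((A G(x))^2 - b)^2)."""
    z1 = W1 @ x
    z2 = W2 @ relu(z1)
    w = A @ relu(z2)
    back = A.T @ ((w ** 2 - b) * w)  # Wirtinger-flow-style derivative w.r.t. G(x)
    back = W2.T @ ((z2 > 0) * back)  # backprop through the outer ReLU
    return W1.T @ ((z1 > 0) * back)  # backprop through the inner ReLU

# Initialize near x0 purely to illustrate local refinement; whether descent
# succeeds from an arbitrary start is exactly the kind of landscape question at stake.
x = x0 + 0.3 * rng.normal(size=k)
for _ in range(2000):
    x -= 5.0 * grad(x)

print("relative error to x0:", np.linalg.norm(x - x0) / np.linalg.norm(x0))
```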

I am particularly excited about using generative priors for phase retrieval, because of the potential for tangibly improving performance in applications and because of recent evidence of more severe sample complexity bottlenecks in sparsity-based compressive phase retrieval as compared to traditional CS. Phase retrieval is inherently ill-posed, and classical approaches to overcoming this ill-posedness involve enforcing numerous instance-specific constraints, which is tedious and requires fairly specific expertise. More modern proposals, for instance by Candès et al., are to take redundant measurements with different “masks”, but this technique is not readily physically realizable and requires blasting the sample of interest multiple times, which rapidly degrades or destroys the sample at hand (a sample that is typically difficult to prepare or acquire in the first place). Thus, it would be beneficial to use a minimal number of observations, without changing the measurement modality, by exploiting signal structure.

Recent attempts to combine classical sparsity-based compressed sensing with phase retrieval toward this goal have run into a potential computational complexity bottleneck, since O(k^2 \log n) observations seem to be required for recovering k-sparse signals via compressive phase retrieval using current methods; this makes it all the more important to exploit more sophisticated structure of natural signals. Meanwhile, building generative priors is a purely data-driven approach, which doesn’t require building new physical apparatus or acquisition methodologies, nor does it necessarily require intimate knowledge of the specific problems at hand. All it would require is a large and diverse enough dataset of reconstructed biological structures and a powerful enough deep generative model. Enforcing such a deep generative prior then becomes a purely algorithmic challenge, without putting any extra onus on experimental scientists; in fact, it reduces the amount of modeling they typically have to do.
