Compressive classification and the rare eclipse problem

I recently wrote a paper with Afonso Bandeira and Ben Recht, and now it’s on the arXiv (perhaps you heard from Igor). The main idea is dimensionality reduction for classification, but the motivation is different:

Compressed sensing allows for data recovery after compressive measurements, but classification should require even less information.

Since classification is a many-to-one labeling function, we should be satisfied with classifying using compressive measurements in which points of common labels collide. The main concern is to ensure that the measurements distinguish points of differing labels.

To demonstrate this point, we focused on compressively classifying points from full-dimensional sets. This is a very different model from previous studies of compressive classification, all of which focused on low-dimensional sets (taking inspiration from compressed sensing). For 2-label classification tasks, the best-case scenario is linear separability, meaning there exists a hyperplane that separates the two sets of differing labels. This property allows a classifier to simply threshold an inner product with a certain vector to discern which set a given point belongs to. Linear separability is equivalent to the convex hulls of the two sets being disjoint. This leads to a rather natural question:
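As a quick illustration of classifying by thresholding an inner product, here is a minimal numpy sketch. The two classes (unit balls around \pm 3e_1), the separating direction, and the compressed classifier \Phi w are all illustrative choices, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50  # ambient dimension

# Two full-dimensional classes: unit balls centered at +3*e1 and -3*e1.
# These are linearly separable by the hyperplane {x : x[0] = 0}.
def sample(center, n):
    pts = rng.standard_normal((n, N))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)  # land on the unit sphere
    pts *= rng.uniform(0, 1, (n, 1)) ** (1 / N)        # fill the unit ball
    return pts + center

c = np.zeros(N); c[0] = 3.0
X_pos, X_neg = sample(c, 200), sample(-c, 200)

# Classify by thresholding a single inner product with w = e1.
w = np.zeros(N); w[0] = 1.0
assert np.all(X_pos @ w > 0) and np.all(X_neg @ w < 0)

# Compress with a random M x N Gaussian matrix (M << N).  The compressed
# classes can often still be separated by thresholding an inner product,
# e.g. with the compressed direction Phi @ w (a heuristic choice).
M = 5
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
w_c = Phi @ w
margin_pos = (X_pos @ Phi.T) @ w_c
margin_neg = (X_neg @ Phi.T) @ w_c
print(margin_pos.min(), margin_neg.max())  # separated if min > max
```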

Continue reading

Sudakov minoration: From Gaussian widths to epsilon nets

In a previous post, I went through Gordon’s escape through a mesh theorem. This theorem is unique because it leverages certain properties of Gaussian processes (and a quantification of size called Gaussian width) instead of passing to an epsilon net. The escape theorem is particularly important to areas like compressed sensing, and provides optimal guarantees in light of phase transitions. However, real-world implementations of compressed sensing tend not to apply Gaussian transforms, but rather random matrices with other distributions (such as Bernoulli or partial Fourier). As such, it is desirable to relax the hypotheses on the transform under which the escape theorem holds.

It is evident that Gaussian width is the correct measure of a set for phase transitions in the Gaussian setting, and we expect similar phase transitions in other settings. As detailed in this paper, an important measure of a set for the other settings is the size of an epsilon net. This leads to the following fundamental question:

What does the Gaussian width of a set say about the smallest possible size of an epsilon net for that set?
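One concrete way to get a feel for epsilon nets is to build one greedily for a sampled point cloud. The following sketch (the sphere and the sample size are arbitrary choices) covers the unit sphere in \mathbb{R}^3 at two scales:

```python
import numpy as np

rng = np.random.default_rng(1)

def greedy_epsilon_net(points, eps):
    """Greedily pick centers so every point is within eps of some center."""
    net = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        i = np.flatnonzero(uncovered)[0]          # take any uncovered point
        net.append(points[i])
        dists = np.linalg.norm(points - points[i], axis=1)
        uncovered &= dists > eps                  # mark newly covered points
    return np.array(net)

# Sample the unit sphere in R^3 and cover it at two scales; the net
# size should grow roughly like (1/eps)^2 for a 2-dimensional surface.
pts = rng.standard_normal((2000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
for eps in (0.5, 0.25):
    print(eps, len(greedy_epsilon_net(pts, eps)))
```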

Continue reading

Compressed Sensing 3.0

I’m attending an NSF program review, and Thomas Strohmer gave an awesome talk yesterday. Imagine a scenario in which you are trying to solve a linear inverse problem, but you don’t have complete information about the linear operator. Apparently, this is common when working with small sensors or with high-resolution data. Here’s the problem formulation: You receive

\displaystyle{y=A(\theta)x,}

where \theta\in\Theta is an unknown parameter, A\colon\Theta\rightarrow\mathbb{R}^{M\times N} is a known sensing matrix function (i.e., if you knew \theta, then you could determine the sensing matrix A(\theta)), and x\in\mathbb{R}^N is an unknown signal exhibiting some notion of sparsity or structure. So what are the conditions that allow one to recover \theta and x from y?

In the case where \theta is known, this is simply compressed sensing. Actually, you might call it

  • Compressed Sensing 1.0 if x is sparse, i.e., \#\mathrm{supp}(x) is small, or
  • Compressed Sensing 2.0 if x has low rank (when reshaped into a matrix).
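For the 1.0 setting, recovery can be sketched in a few lines: basis pursuit (minimize \|x\|_1 subject to Ax=y) is a linear program via the split x=u-v with u,v\geq 0. A toy example, with arbitrary sizes and scipy’s linprog as the solver:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
M, N, K = 40, 100, 5  # measurements, ambient dimension, sparsity

A = rng.standard_normal((M, N)) / np.sqrt(M)  # random sensing matrix
x = np.zeros(N)
support = rng.choice(N, K, replace=False)
x[support] = rng.standard_normal(K)
y = A @ x

# Basis pursuit: min ||x||_1 s.t. Ax = y, written as an LP via x = u - v
# with u, v >= 0, minimizing the sum of all entries of u and v.
c = np.ones(2 * N)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
x_hat = res.x[:N] - res.x[N:]
print(np.linalg.norm(x_hat - x))  # near zero when M is large enough
```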

Continue reading

Living on the edge: A geometric theory of phase transitions in convex optimization

Phase transitions are very common in modern data analysis (see this paper, for example). The idea is that, for a given task whose inputs are random linear measurements, there is often a magic number of measurements M^* such that if you input M\ll M^* measurements, the task will typically fail, but if you input M\gg M^* measurements, the task will typically succeed. As a toy example, suppose the task is to reconstruct an arbitrary vector in \mathbb{R}^N from its inner products with M random vectors; of course, M^*=N in this case. There are many tasks possible with data analysis, signal processing, etc., and it’s interesting to see what phase transitions emerge in these cases. The following paper introduces some useful techniques for exactly characterizing phase transitions of tasks involving convex optimization:

Living on the edge: A geometric theory of phase transitions in convex optimization

Dennis Amelunxen, Martin Lotz, Michael B. McCoy, Joel A. Tropp

Along with this paper, I highly recommend watching this lecture by Joel Tropp on the same topic. This blog entry is based on both the paper and the lecture.
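As a sanity check on the toy example above (reconstructing an arbitrary x\in\mathbb{R}^N from M random inner products), here is a minimal numpy sketch; the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20
x = rng.standard_normal(N)  # an arbitrary vector to reconstruct

results = {}
for M in (N - 1, N):
    A = rng.standard_normal((M, N))   # M random measurement vectors
    y = A @ x                         # the inner products we observe
    x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    # lstsq returns the minimum-norm consistent solution, which equals x
    # exactly when the measurement vectors span R^N, i.e. when M >= N.
    results[M] = bool(np.allclose(x_hat, x))
print(results)
```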

Continue reading

Gordon’s escape through a mesh theorem

In many applications (such as compressed sensing), we use a random projection to capture a subset of \mathbb{R}^n. Sometimes, you can show that a random projection performs well by appealing to the Johnson–Lindenstrauss lemma, which says that the pairwise distances of a finite collection of points are nearly preserved (for example, JL implies RIP). In other cases, we might call a projection “good” if its null space avoids a certain subset of \mathbb{R}^n (think the null space property), but how might we show that a random projection is good? By appealing to Gordon’s escape through a mesh theorem, which says that a random subspace avoids a subset (“escapes a mesh”) provided the subset is small in some sense. The purpose of this blog entry is to prove this theorem and provide some intuition.

Throughout, we take g_k to be a k-dimensional vector with iid N(0,1) entries, and we denote a_k:=\mathbb{E}\|g_k\|_2, which satisfies k/\sqrt{k+1}<a_k<\sqrt{k}. Given a closed subset S\subseteq\mathbb{R}^n, its Gaussian width is defined as

\displaystyle{w(S):=\mathbb{E}\Big[\max_{x\in S}g_n^\top x\Big]}.
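This expectation is easy to estimate by Monte Carlo. For the unit sphere, the maximum is attained at x=g_n/\|g_n\|_2, so the width equals a_n exactly; the sketch below (dimension and trial count are arbitrary) checks the estimate against the bounds on a_n stated above:

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 50, 20000

# For the unit sphere S = {x : ||x||_2 = 1}, the max of <g, x> over S is
# attained at x = g/||g||_2, so w(S) = E||g_n||_2 = a_n.  Estimate it:
g = rng.standard_normal((trials, n))
w_est = np.linalg.norm(g, axis=1).mean()

lower, upper = n / np.sqrt(n + 1), np.sqrt(n)
print(lower, w_est, upper)  # a_n is squeezed between these two bounds
```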

As we will see, Gaussian width is an important measure of the size of a set. Intuitively, the Gaussian width of S is more or less the expected width of S when measured in a random direction (it’s actually half of the mean width, as defined here), and it plays a crucial role in several results, including the following:

Continue reading

My research explained, using only the 1000 most common English words

[Inspired by Scott, Afonso, and Joel, using the Up-Goer Five Text Editor, which in turn was inspired by this xkcd. I actually only use 261 distinct words.]

Let’s say you have a picture, a piece of music, or a movie that you want to store on your computer. You can do this without taking up your entire hard drive, but why? Because there’s a way to look at each of these things so that they appear very simple: Imagine someone is making a movie of you reading this. You’re just sitting there. Maybe a fly is flying around the room, but not much is changing. That means each moment of the movie looks a lot like the one right before, and this makes it very easy to store the entire movie on a computer.

The fact that pictures and such are so simple allows you to do other cool stuff. Let’s say you find your favorite movie in the back of a second-hand store, but when you watch it at home, different marks pop up every now and then. Since movies are so simple, you can use a computer to fill in what you can’t see, and make it good as new.

Continue reading

Deterministic RIP matrices: Breaking the square-root bottleneck, III

This blog entry is the third in a series which seeks to maximize \epsilon_0 such that there exist explicit M\times N matrices of arbitrarily large aspect ratio N/M which are RIP for K=\Omega(M^{1/2+\epsilon_0}). Continuing our study of the breakthrough paper by Bourgain et al., this entry will identify what the set \mathcal{A} needs to satisfy, show how to construct it, and use this set to complete the proof of the main result. In particular, we will prove a general version of the result which allows for optimizing constants, and then we will perform an optimization routine that improves \epsilon_0 by a few orders of magnitude.

— Everything you need to know about \mathcal{A} —

The main result will use an even number m as a free parameter, which in turn determines another parameter \alpha=\alpha(m). For a given choice of m, the proof of the main result requires a subset \mathcal{A}\subseteq\mathbb{F}_p such that (i)

|\mathcal{A}|\leq p^\alpha \quad (2.3)

(i.e., \mathcal{A} is sufficiently small), and (ii) for any a\in\mathcal{A}, the only way for a_1,\ldots,a_{2m}\in\mathcal{A}\setminus\{a\} to satisfy

\displaystyle{\sum_{j=1}^m\frac{1}{a-a_j}=\sum_{j=m+1}^{2m}\frac{1}{a-a_j} \quad (2.4)}

is by taking (a_1,\ldots,a_m) and (a_{m+1},\ldots,a_{2m}) to be permutations of each other. Here, division (and addition) is taken in the field \mathbb{F}_p.
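Condition (2.4) can be checked by brute force at toy sizes. The sketch below uses a small prime and m=2, and the candidate sets are purely illustrative (the first fails because 1/(0-2)+1/(0-3)\equiv 1/(0-5)+1/(0-13) in \mathbb{F}_{31}):

```python
from itertools import product

p = 31   # a small prime, chosen only for illustration
m = 2    # the free (even) parameter, kept tiny so brute force is feasible

def inv(x):
    """Multiplicative inverse in F_p (p prime), via Fermat's little theorem."""
    return pow(x % p, p - 2, p)

def satisfies_24(A):
    """Brute-force check of condition (2.4): every solution of
    sum_{j<=m} 1/(a - a_j) = sum_{j>m} 1/(a - a_j) with a_j in A \ {a}
    must have (a_1,...,a_m) a permutation of (a_{m+1},...,a_{2m})."""
    for a in A:
        rest = [b for b in A if b != a]
        for tup in product(rest, repeat=2 * m):
            lhs = sum(inv(a - b) for b in tup[:m]) % p
            rhs = sum(inv(a - b) for b in tup[m:]) % p
            if lhs == rhs and sorted(tup[:m]) != sorted(tup[m:]):
                return False
    return True

# 1/(0-2) + 1/(0-3) = 1/(0-5) + 1/(0-13) in F_31, so this set fails:
print(satisfies_24([0, 2, 3, 5, 13]))  # False
print(satisfies_24([1, 2]))            # True (only trivial tuples exist)
```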

Continue reading

Matheon Workshop 2013: Compressed Sensing and its Applications

A couple of weeks ago, I was in Berlin for this workshop, and the talks were amazing — the organizers really did a fantastic job. For most of the talks, the slides are available here. There was an interesting mix of speakers; some had big problems in certain applications with the hope that compressed sensing might help, and others presented various generalizations and unifications of the standard compressed sensing theory. Overall, I noticed a few trends in the content of the talks:

  • applying CS theory to imaging and video
  • CS problems over the continuum or in a coherent dictionary
  • nonlinear CS and phase retrieval
  • dimension reduction for classification and machine learning
  • measurement derandomization

The rest of this blog entry summarizes each talk. Some summaries are briefer than others, depending mostly on how fervently I was taking notes during the talk.

Continue reading

Deterministic RIP matrices: Breaking the square-root bottleneck, II

In my previous blog entry, I kicked off an online project to make progress on the deterministic RIP matrix problem by optimizing the analysis in a breakthrough paper by Bourgain et al. Since then, I launched a wiki to maintain a “scoreboard” of world records for this problem, and Afonso Bandeira made his own contribution to the project. The purpose of this entry is to introduce some additive combinatorics and to develop some understanding of how they are leveraged to prove RIP for a particular matrix construction. While this entry does not immediately produce a new world record, it does optimize the analysis of Corollaries 3 and 4 of the paper. I plan to generalize Lemmas 9 and 10 in a future blog post, which will lead to an explicit optimization problem that incorporates the work in this entry (and a new world record, hopefully).

— A brief introduction to additive combinatorics —

In this section, we briefly detail some key ideas from additive combinatorics. Given an additive group G and finite sets A,B\subseteq G, we can define the sumset

A+B:=\{a+b:a\in A,~b\in B\},

the difference set

A-B:=\{a-b:a\in A,~b\in B\}

(not to be confused with a difference set), and the additive energy

E(A,B):=\#\big\{(a_1,a_2,b_1,b_2)\in A^2\times B^2:a_1+b_1=a_2+b_2\big\}.
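These three quantities are easy to compute directly for small sets of integers; the sketch below (the sets are illustrative) contrasts an arithmetic progression, which has lots of additive structure, with a generic-looking set, which has little:

```python
from itertools import product

def sumset(A, B):
    return {a + b for a, b in product(A, B)}

def difference_set(A, B):
    return {a - b for a, b in product(A, B)}

def energy(A, B):
    # count quadruples (a1, a2, b1, b2) in A^2 x B^2 with a1 + b1 = a2 + b2
    return sum(1 for a1, a2, b1, b2 in product(A, A, B, B) if a1 + b1 == a2 + b2)

# An arithmetic progression has lots of additive structure: its sumset
# is as small as possible and its additive energy is large.
AP = set(range(10))                      # {0, 1, ..., 9}
print(len(sumset(AP, AP)))               # 19 = 2*10 - 1, the minimum possible
print(energy(AP, AP))

# A generic-looking set behaves oppositely: large sumset, small energy.
G = {1, 2, 5, 11, 22, 33, 57, 71, 80, 94}
print(len(sumset(G, G)))
print(energy(G, G))
```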

These definitions are useful in quantifying the additive structure of a set. In particular, consider the following:

Continue reading

Deterministic RIP matrices: Breaking the square-root bottleneck

Unless you’ve been living under a rock, you probably know about Polymath8, and how it has been a dramatic success. In May, Yitang Zhang published a proof that there are infinitely many pairs of primes which differ by 70 million or less, and since then, the Polymath project has taken off, decreasing this number considerably. In a recent blog entry, Terry Tao took some time to reflect on why this Polymath project received so much attention. Here is one of the reasons he points out:

One could summarise the current state of progress by a single natural number H, which implied by infinite descent that the project was guaranteed to terminate at some point, but also made it possible to set up a “scoreboard” that could be quickly and easily updated.

In reading this, I was reminded of another big open problem — perhaps the biggest open problem in compressed sensing:

The deterministic RIP matrix problem. Construct an infinite family of deterministic matrices \{\Phi_M\}, where \Phi_M is M\times N(M) and M and N(M)/M are both arbitrarily large, such that each \Phi_M satisfies the (K,\delta)-restricted isometry property (RIP) with K=\Omega(M^{1-\epsilon}) for all \epsilon>0 and with \delta<1/2.
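For intuition, the restricted isometry constant can be computed by brute force at toy sizes (this is exponential in K and quickly becomes infeasible, which is part of why certifying RIP is hard). The matrix below is a random example for illustration, not a deterministic construction:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)

def rip_constant(Phi, K):
    """Smallest delta for which Phi satisfies (K, delta)-RIP, computed
    by brute force over all size-K supports (only feasible at toy sizes)."""
    N = Phi.shape[1]
    delta = 0.0
    for S in combinations(range(N), K):
        sub = Phi[:, list(S)]
        eigs = np.linalg.eigvalsh(sub.T @ sub)   # spectrum of the Gram matrix
        delta = max(delta, abs(eigs[0] - 1), abs(eigs[-1] - 1))
    return delta

# A toy M x N matrix with unit-norm columns; random Gaussian matrices
# achieve small delta for K on the order of M / log N, while deterministic
# constructions are largely stuck near K ~ sqrt(M).
M, N, K = 20, 30, 2
Phi = rng.standard_normal((M, N))
Phi /= np.linalg.norm(Phi, axis=0)
print(rip_constant(Phi, K))
```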

Continue reading