How Not to Be Wrong: The Power of Mathematical Thinking

I recently read this book by Jordan Ellenberg, and I wanted to briefly discuss it. My first impressions:

  • Is that really where “not” belongs in the title? (Answer: yes.)
  • This is nothing like The Grasshopper King.
  • The footnotes are fun.

Jordan introduces the book with a college student asking why math is so important. This book is an answer of sorts: It provides a bunch of simple, yet profound morsels of mathematical thinking. Actually, most of Jordan’s examples reveal how to properly think about the sort of math you might encounter in a newspaper (e.g., statistical significance that balding signifies future prostate cancer). I wonder if this book could form the basis of a “Math for the Humanities” type of class. Such a class might have more to offer the non-technical college student than calculus would. Overall, I highly recommend this book to everyone (including my mathematically disoriented wife).

Continue reading

Robust width: A characterization of uniformly stable and robust compressed sensing

Jameson Cahill and I recently posted a new paper on the arXiv that we’re pretty excited about. Suppose you want to do compressed sensing with L1 minimization. That is, you get y=\Phi x^\natural+e for some nearly K-sparse vector x^\natural and some noise e satisfying \|e\|_2\leq\epsilon, and then you attempt to reconstruct x^\natural by taking

\arg\min \quad \|x\|_1 \quad\mbox{ subject to } \quad\|\Phi x-y\|_2\leq\epsilon.

One of the main results of compressed sensing is that the optimizer x^\star of the above program satisfies

\displaystyle{\|x^\star-x^\natural\|_2\leq \frac{C_0}{\sqrt{K}}\|x^\natural-x^\natural_K\|_1+C_1\epsilon}

provided \Phi satisfies the (2K,\delta)-restricted isometry property (RIP) for some \delta<\sqrt{2}-1. Here, x^\natural_K denotes the best K-term approximation of x^\natural, and so \|x^\natural-x^\natural_K\|_1 captures how close x^\natural is to being K-sparse. Jameson and I wondered if RIP was necessary to enjoy this uniformly stable and robust performance (uniform in the sense that the fixed matrix works for all x^\natural simultaneously, stable because we don’t require x^\natural to be exactly K-sparse, and robust because we allow noise). In our paper, we completely characterize the matrices that offer this performance, and we show that RIP matrices form a strict subset.

Continue reading

Introduction to fast matrix multiplication

Recall how to multiply n\times n matrices A and B:

\displaystyle{(AB)_{ik}=\sum_{j=1}^n A_{ij}B_{jk}}

For each (i,k), it takes O(n) operations to calculate this sum, and so naive matrix multiplication takes O(n^3) operations total. Let \omega denote the infimum of \alpha for which there exists an algorithm that computes the product of any two n\times n matrices in O(n^{\alpha}) operations.

Conjecture. \omega=2.

A 3-way tensor is a complex function of index triples, i.e., T\colon I\times J\times K\rightarrow\mathbb{C}. A tensor has rank 1 if it enjoys a factorization of the form

\displaystyle{T_{ijk}=x_iy_jz_k \qquad \forall (i,j,k)\in I\times J\times K.}

The rank of a tensor is the size of its smallest decomposition into rank-1 tensors (like matrix rank).

Continue reading

Derandomizing restricted isometries via the Legendre symbol

I recently finished writing the paper based on a talk I gave at MATHEON at TU Berlin last year. (The talk was very fun to give — halfway through, it featured a game show in which Felix Krahmer won a bottle of Tabasco.) This work was a collaboration with Afonso Bandeira, Matt Fickus and Joel Moreira, and in the original version (presented in my talk), we mixed two ingredients to derandomize Bernoulli RIP matrices:

(i) a strong version of the Chowla conjecture, and

(ii) the flat RIP trick from this paper.

The Chowla conjecture essentially states that as n gets large, if you randomly draw X from \{1,\ldots,n\}, then consecutive entries of the Liouville function \lambda(X+1),\ldots,\lambda(X+r) are asymptotically independent. The strong version of the Chowla conjecture that we used in (i) not only gave asymptotic independence, but also provided a rate of convergence. The intuition is that consecutive Liouville function entries behave very similarly to iid Bernoulli \pm1‘s, and so we should expect that populating a matrix with these entries will yield RIP. Indeed, we used flat RIP to demonstrate random-like cancellations that prove RIP. Unfortunately, the strong version of the Chowla conjecture that we used implies the Chowla conjecture, which has been open for almost 50 years.

Continue reading

Compressed sensing: Variations on a theme

A couple of months ago, I attended a workshop at Oberwolfach (my first!) called “Mathematical Physics meets Sparse Recovery.” I had a great time. I was asked to give the first talk of the week to get everyone on the same page with respect to sparse recovery. Here are the slides from my talk. What follows is an extended abstract (I added hyperlinks throughout for easy navigation):

Compressed sensing has been an exciting subject of research over the last decade, and the purpose of this talk was to provide a brief overview of the subject. First, we considered certain related topics (namely image compression and denoising) which led up to the rise of compressed sensing. In particular, wavelets provide a useful model for images, as natural images tend to be approximated by linear combinations of particularly few wavelets. This sparsity model has enabled JPEG2000 to provide particularly efficient image compression with negligible distortion. Additionally, this model has been leveraged to remove random noise from natural images.

Considering natural images enjoy such a useful model, one may ask whether the model can be leveraged to decrease the number of measurements necessary to completely determine an image. For example, an MRI scan might require up to 2 hours of exposure time, and then the image might be compressed with JPEG2000 after the fact, meaning most of the measurements can be effectively ignored. So is it possible to simply measure the important parts of the image and not waste time in the image acquisition process? This is the main idea underlying compressed sensing, as introduced by Candes, Romberg and Tao and Donoho.

Continue reading

Compressive classification and the rare eclipse problem

I recently wrote a paper with Afonso Bandeira and Ben Recht, and now it’s on the arXiv (perhaps you heard from Igor). The main idea is dimensionality reduction for classification, but the motivation is different:

Compressed sensing allows for data recovery after compressive measurements, but classification should require even less information.

Since classification is a many-to-one labeling function, we should be satisfied with classifying using compressive measurements in which points of common labels collide. The main concern is to ensure that the measurements distinguish points of differing labels.

To prove this point, we focused on compressively classifying points from full-dimensional sets. This is a very different model from previous studies of compressive classification, all of which focused on low-dimensionality (taking inspiration from compressed sensing). For 2-label classification tasks, the best-case scenario is to have linear separability, which means there exists a hyperplane that separates the two sets of differing labels. This property allows a classifier to simply threshold an inner product with a certain vector to discern which set a given point must belong to. Linear separability is equivalent to having the convex hulls of the two sets be disjoint. This leads to a rather natural question:

Continue reading

Sudakov minoration: From Gaussian widths to epsilon nets

In a previous post, I went through Gordon’s escape through a mesh theorem. This theorem is unique because it leverages certain properties of Gaussian processes (and a quantification of size called Gaussian width) instead of passing to an epsilon net. The escape theorem is particularly important to areas like compressed sensing, and provides optimal guarantees in light of phase transitions. However, real-world implementations of compressed sensing tend to not apply Gaussian transforms, but rather random matrices with other distributions (such as Bernoulli or partial Fourier). As such, it is desirable to relax the hypothesis on the transform for the escape theorem to hold.

It is evident that Gaussian width is the correct measure of a set for phase transitions in the Gaussian setting, and we expect similar phase transitions in other settings. As detailed in this paper, an important measure of a set for the other settings is the size of an epsilon net. This leads to the following fundamental question:

What does the Gaussian width of a set say about the smallest possible size of an epsilon net for that set?

Continue reading

Compressed Sensing 3.0

I’m attending an NSF program review, and Thomas Strohmer gave an awesome talk yesterday. Imagine a scenario in which you are trying to solve a linear inverse problem, but you don’t have complete information about the linear operator. Apparently, this is common when working with small sensors or with high-resolution data. Here’s the problem formulation: You receive


where \theta\in\Theta is an unknown parameter, A\colon\Theta\rightarrow\mathbb{R}^{M\times N} is a known sensing matrix function (i.e., if you knew \theta, then you could determine the sensing matrix A(\theta)), and x\in\mathbb{R}^N is an unknown signal exhibiting some notion of sparsity or structure. So what are the conditions that allow one to recover \theta and x from y?

In the case where \theta is known, this is simply compressed sensing. Actually, you might call it

  • Compressed Sensing 1.0 if x is sparse, i.e., \#\mathrm{supp}(x) is small, or
  • Compressed Sensing 2.0 if x has low rank (when reshaped into a matrix).

Continue reading

Living on the edge: A geometric theory of phase transitions in convex optimization

Phase transitions are very common in modern data analysis (see this paper, for example). The idea is that, for a given task whose inputs are random linear measurements, there is often a magic number of measurements M^* such that if you input M\ll M^* measurements, the task will typically fail, but if you input M\gg M^* measurements, the task will typically succeed. As a toy example, suppose the task is to reconstruct an arbitrary vector in \mathbb{R}^N from its inner products with M random vectors; of course, M^*=N in this case. There are many tasks possible with data analysis, signal processing, etc., and it’s interesting to see what phase transitions emerge in these cases. The following paper introduces some useful techniques for exactly characterizing phase transitions of tasks involving convex optimization:

Living on the edge: A geometric theory of phase transitions in convex optimization

Dennis Amelunxen, Martin Lotz, Michael B. McCoy, Joel A. Tropp

Along with this paper, I highly recommend watching this lecture by Joel Tropp on the same topic. This blog entry is based on both the paper and the lecture.

Continue reading

Gordon’s escape through a mesh theorem

In many applications (such as compressed sensing), we use a random projection to capture a subset of \mathbb{R}^n. Sometimes, you can show that a random projection performs well by appealing to the Johnson–Lindenstrauss lemma, which says that the pairwise distances of a finite collection of points are nearly preserved (for example, JL implies RIP). In other cases, we might call a projection “good” if its null space avoids a certain subset of \mathbb{R}^n (think the null space property), but how might we show that a random projection is good? By appealing to Gordon’s escape through a mesh theorem, which says that a random subspace avoids a subset (“escapes a mesh”) provided the subset is small in some sense. The purpose of this blog entry is to prove this theorem and provide some intuition.

Throughout, we take g_k to be a k-dimensional vector with iid N(0,1) entries, and we denote a_k:=\mathbb{E}\|g_k\|_2, which satisfies k/\sqrt{k+1}<a_k<\sqrt{k}. Given a closed subset S\subseteq\mathbb{R}^n, then its Gaussian width is defined as

\displaystyle{w(S):=\mathbb{E}\Big[\max_{x\in S}g_n^\top x\Big]}.

As we will see, Gaussian width is an important measure of the size of a set. Intuitively, the Gaussian width of S is more or less the expected width of S when measured in a random direction (it’s actually half of the mean width, as defined here), and it plays a crucial role in several results, including the following:

Continue reading