# Introduction to fast matrix multiplication

Recall how to multiply $n\times n$ matrices $A$ and $B$:

$\displaystyle{(AB)_{ik}=\sum_{j=1}^n A_{ij}B_{jk}}$

For each $(i,k)$, it takes $O(n)$ operations to calculate this sum, and so naive matrix multiplication takes $O(n^3)$ operations total. Let $\omega$ denote the infimum of $\alpha$ for which there exists an algorithm that computes the product of any two $n\times n$ matrices in $O(n^{\alpha})$ operations.

Conjecture. $\omega=2$.

A 3-way tensor is a complex function of index triples, i.e., $T\colon I\times J\times K\rightarrow\mathbb{C}$. A tensor has rank 1 if it enjoys a factorization of the form

$\displaystyle{T_{ijk}=x_iy_jz_k \qquad \forall (i,j,k)\in I\times J\times K.}$

The rank of a tensor is the size of its smallest decomposition into rank-1 tensors (like matrix rank).

# Derandomizing restricted isometries via the Legendre symbol

I recently finished writing the paper based on a talk I gave at MATHEON at TU Berlin last year. (The talk was very fun to give — halfway through, it featured a game show in which Felix Krahmer won a bottle of Tabasco.) This work was a collaboration with Afonso Bandeira, Matt Fickus and Joel Moreira, and in the original version (presented in my talk), we mixed two ingredients to derandomize Bernoulli RIP matrices:

(i) a strong version of the Chowla conjecture, and

(ii) the flat RIP trick from this paper.

The Chowla conjecture essentially states that as $n$ gets large, if you randomly draw $X$ from $\{1,\ldots,n\}$, then consecutive entries of the Liouville function $\lambda(X+1),\ldots,\lambda(X+r)$ are asymptotically independent. The strong version of the Chowla conjecture that we used in (i) not only gave asymptotic independence, but also provided a rate of convergence. The intuition is that consecutive Liouville function entries behave very similarly to iid Bernoulli $\pm1$‘s, and so we should expect that populating a matrix with these entries will yield RIP. Indeed, we used flat RIP to demonstrate random-like cancellations that prove RIP. Unfortunately, the strong version of the Chowla conjecture that we used implies the Chowla conjecture, which has been open for almost 50 years.

# Compressed sensing: Variations on a theme

A couple of months ago, I attended a workshop at Oberwolfach (my first!) called “Mathematical Physics meets Sparse Recovery.” I had a great time. I was asked to give the first talk of the week to get everyone on the same page with respect to sparse recovery. Here are the slides from my talk. What follows is an extended abstract (I added hyperlinks throughout for easy navigation):

Compressed sensing has been an exciting subject of research over the last decade, and the purpose of this talk was to provide a brief overview of the subject. First, we considered certain related topics (namely image compression and denoising) which led up to the rise of compressed sensing. In particular, wavelets provide a useful model for images, as natural images tend to be approximated by linear combinations of particularly few wavelets. This sparsity model has enabled JPEG2000 to provide particularly efficient image compression with negligible distortion. Additionally, this model has been leveraged to remove random noise from natural images.

Considering natural images enjoy such a useful model, one may ask whether the model can be leveraged to decrease the number of measurements necessary to completely determine an image. For example, an MRI scan might require up to 2 hours of exposure time, and then the image might be compressed with JPEG2000 after the fact, meaning most of the measurements can be effectively ignored. So is it possible to simply measure the important parts of the image and not waste time in the image acquisition process? This is the main idea underlying compressed sensing, as introduced by Candes, Romberg and Tao and Donoho.

# Compressive classification and the rare eclipse problem

I recently wrote a paper with Afonso Bandeira and Ben Recht, and now it’s on the arXiv (perhaps you heard from Igor). The main idea is dimensionality reduction for classification, but the motivation is different:

Compressed sensing allows for data recovery after compressive measurements, but classification should require even less information.

Since classification is a many-to-one labeling function, we should be satisfied with classifying using compressive measurements in which points of common labels collide. The main concern is to ensure that the measurements distinguish points of differing labels.

To prove this point, we focused on compressively classifying points from full-dimensional sets. This is a very different model from previous studies of compressive classification, all of which focused on low-dimensionality (taking inspiration from compressed sensing). For 2-label classification tasks, the best-case scenario is to have linear separability, which means there exists a hyperplane that separates the two sets of differing labels. This property allows a classifier to simply threshold an inner product with a certain vector to discern which set a given point must belong to. Linear separability is equivalent to having the convex hulls of the two sets be disjoint. This leads to a rather natural question:

# Sudakov minoration: From Gaussian widths to epsilon nets

In a previous post, I went through Gordon’s escape through a mesh theorem. This theorem is unique because it leverages certain properties of Gaussian processes (and a quantification of size called Gaussian width) instead of passing to an epsilon net. The escape theorem is particularly important to areas like compressed sensing, and provides optimal guarantees in light of phase transitions. However, real-world implementations of compressed sensing tend to not apply Gaussian transforms, but rather random matrices with other distributions (such as Bernoulli or partial Fourier). As such, it is desirable to relax the hypothesis on the transform for the escape theorem to hold.

It is evident that Gaussian width is the correct measure of a set for phase transitions in the Gaussian setting, and we expect similar phase transitions in other settings. As detailed in this paper, an important measure of a set for the other settings is the size of an epsilon net. This leads to the following fundamental question:

What does the Gaussian width of a set say about the smallest possible size of an epsilon net for that set?

# Compressed Sensing 3.0

I’m attending an NSF program review, and Thomas Strohmer gave an awesome talk yesterday. Imagine a scenario in which you are trying to solve a linear inverse problem, but you don’t have complete information about the linear operator. Apparently, this is common when working with small sensors or with high-resolution data. Here’s the problem formulation: You receive

$y=A(\theta)x$,

where $\theta\in\Theta$ is an unknown parameter, $A\colon\Theta\rightarrow\mathbb{R}^{M\times N}$ is a known sensing matrix function (i.e., if you knew $\theta$, then you could determine the sensing matrix $A(\theta)$), and $x\in\mathbb{R}^N$ is an unknown signal exhibiting some notion of sparsity or structure. So what are the conditions that allow one to recover $\theta$ and $x$ from $y$?

In the case where $\theta$ is known, this is simply compressed sensing. Actually, you might call it

• Compressed Sensing 1.0 if $x$ is sparse, i.e., $\#\mathrm{supp}(x)$ is small, or
• Compressed Sensing 2.0 if $x$ has low rank (when reshaped into a matrix).

# Living on the edge: A geometric theory of phase transitions in convex optimization

Phase transitions are very common in modern data analysis (see this paper, for example). The idea is that, for a given task whose inputs are random linear measurements, there is often a magic number of measurements $M^*$ such that if you input $M\ll M^*$ measurements, the task will typically fail, but if you input $M\gg M^*$ measurements, the task will typically succeed. As a toy example, suppose the task is to reconstruct an arbitrary vector in $\mathbb{R}^N$ from its inner products with $M$ random vectors; of course, $M^*=N$ in this case. There are many tasks possible with data analysis, signal processing, etc., and it’s interesting to see what phase transitions emerge in these cases. The following paper introduces some useful techniques for exactly characterizing phase transitions of tasks involving convex optimization:

Living on the edge: A geometric theory of phase transitions in convex optimization

Dennis Amelunxen, Martin Lotz, Michael B. McCoy, Joel A. Tropp

Along with this paper, I highly recommend watching this lecture by Joel Tropp on the same topic. This blog entry is based on both the paper and the lecture.

# Gordon’s escape through a mesh theorem

In many applications (such as compressed sensing), we use a random projection to capture a subset of $\mathbb{R}^n$. Sometimes, you can show that a random projection performs well by appealing to the Johnson–Lindenstrauss lemma, which says that the pairwise distances of a finite collection of points are nearly preserved (for example, JL implies RIP). In other cases, we might call a projection “good” if its null space avoids a certain subset of $\mathbb{R}^n$ (think the null space property), but how might we show that a random projection is good? By appealing to Gordon’s escape through a mesh theorem, which says that a random subspace avoids a subset (“escapes a mesh”) provided the subset is small in some sense. The purpose of this blog entry is to prove this theorem and provide some intuition.

Throughout, we take $g_k$ to be a $k$-dimensional vector with iid $N(0,1)$ entries, and we denote $a_k:=\mathbb{E}\|g_k\|_2$, which satisfies $k/\sqrt{k+1}. Given a closed subset $S\subseteq\mathbb{R}^n$, then its Gaussian width is defined as

$\displaystyle{w(S):=\mathbb{E}\Big[\max_{x\in S}g_n^\top x\Big]}$.

As we will see, Gaussian width is an important measure of the size of a set. Intuitively, the Gaussian width of $S$ is more or less the expected width of $S$ when measured in a random direction (it’s actually half of the mean width, as defined here), and it plays a crucial role in several results, including the following:

# My research explained, using only the 1000 most common English words

[Inspired by ScottAfonso and Joel, using the Up-Goer Five Text Editor, which in turn was inspired by this xkcd. I actually only use 261 distinct words.]

Let’s say you have a picture, a piece of music, or a movie that you want to store on your computer. You can do this without taking up your entire hard drive, but why? Because there’s a way to look at each of these things so that they appear very simple: Imagine someone is making a movie of you reading this. You’re just sitting there. Maybe a fly is flying around the room, but not much is changing. That means each moment of the movie looks a lot like the one right before, and this makes it very easy to store the entire movie on a computer.

The fact that pictures and such are so simple allows you to do other cool stuff. Let’s say you find your favorite movie in the back of a second-hand store, but when you watch it at home, different marks pop up every now and then. Since movies are so simple, you can use a computer to fill in what you can’t see, and make it good as new.

# Deterministic RIP matrices: Breaking the square-root bottleneck, III

This blog entry is the third in a series which seeks to maximize $\epsilon_0$ such that there exist explicit $M\times N$ matrices of arbitrarily large aspect ratio $N/M$ which are RIP for $K=\Omega(M^{1/2+\epsilon_0})$. Continuing our study of the breakthrough paper by Bourgain et al., this entry will identify what the set $\mathcal{A}$ needs to satisfy, how to construct it, and we will also use this set to complete the proof of the main result. In particular, we will prove a general version of the result which allows for optimizing constants, and then we will perform an optimization routine that improves $\epsilon_0$ by a few orders of magnitude.

— Everything you need to know about $\mathcal{A}$

The main result will use an even number $m$ as a free parameter, which in turn determines another parameter $\alpha=\alpha(m)$. For a given choice of $m$, the proof of the main result requires a subset $\mathcal{A}\subseteq\mathbb{F}_p$ such that (i)

$|\mathcal{A}|\leq p^\alpha \quad (2.3)$

(i.e., $\mathcal{A}$ is sufficiently small), and (ii) for any $a\in\mathcal{A}$, the only way for $a_1,\ldots,a_{2m}\in\mathcal{A}\setminus\{a\}$ to satisfy

$\displaystyle{\sum_{j=1}^m\frac{1}{a-a_j}=\sum_{j=m+1}^{2m}\frac{1}{a-a_j} \quad (2.4)}$

is by taking $(a_1,\ldots,a_m)$ and $(a_{m+1},\ldots,a_{2m})$ to be permutations of each other. Here, division (and addition) is taken in the field $\mathbb{F}_p$.