Deterministic RIP matrices: Breaking the square-root bottleneck

Unless you’ve been living under a rock, you probably know about Polymath8, and how it has been a dramatic success. In May, Yitang Zhang published a proof that there are infinitely many pairs of primes which differ by 70 million or less, and since then, the Polymath project exploded to decrease this number considerably. In a recent blog entry, Terry Tao took some time to reflect on why this Polymath project received so much attention. Here is one of the reasons he points out:

One could summarise the current state of progress by a single natural number $H$, which implied by infinite descent that the project was guaranteed to terminate at some point, but also made it possible to set up a “scoreboard” that could be quickly and easily updated.

In reading this, I was reminded of another big open problem — perhaps the biggest open problem in compressed sensing:

The deterministic RIP matrix problem. Construct an infinite family of deterministic matrices $\{\Phi_M\}$, where $\Phi_M$ is $M\times N(M)$ and $M$ and $N(M)/M$ are both arbitrarily large, such that each $\Phi_M$ satisfies the $(K,\delta)$-restricted isometry property (RIP) with $K=\Omega(M^{1-\epsilon})$ for all $\epsilon>0$ and with $\delta<1/2$.

For those who don’t know, a matrix $\Phi$ is said to be $(K,\delta)$-RIP if for every vector $x$ with at most $K$ nonzero entries, we have

$(1-\delta)\|x\|^2\leq\|\Phi x\|^2\leq(1+\delta)\|x\|^2$.

Matrices which satisfy this property are particularly well suited as compressed sensing matrices, as they allow sparse signals $x$ to be efficiently reconstructed from measurements of the form $y=\Phi x$ (and noisy variants thereof). At the moment, the RIP matrices with the fewest number of rows (i.e., fewest number of measurements) are constructed using random processes: If $\Phi$ is $M\times N$, then $M$ can scale like $(K/\delta^2)\log(N/K)$. However, deterministic matrices are almost invariably shown to be RIP using coherence-based arguments (e.g., using the Gershgorin circle theorem to produce a bound on $\delta$), and these arguments inevitably lead to $M$ scaling like $K^2$ (i.e., many more measurements than in the random case).

Three years after Terry Tao posed this problem on his blog, the problem was partially solved in this breakthrough paper:

Explicit constructions of RIP matrices and related problems

J. Bourgain, S. Dilworth, K. Ford, S. Konyagin, D. Kutzarova

This paper provides an explicit family of RIP matrices (of arbitrarily large aspect ratio) in which $K$ scales like $M^{1/2+\epsilon_0}$ for some $\epsilon_0>0$. Instead of applying the Gershgorin circle theorem to estimate $\delta$ in terms of coherence, they leverage additive combinatorics to demonstrate certain cancellations in the Gram matrix $\Phi^*\Phi$. Today (three years later), this is the only known deterministic matrix construction which breaks the square-root bottleneck.

It appears as though the paper by Bourgain et al. provides the only known techniques to solve the deterministic RIP matrix problem. This leads to three natural questions:

1. What are the techniques?

2. How big is $\epsilon_0$?

and in the spirit of Polymath8:

3. Can we optimize their analysis to increase $\epsilon_0$?

The purpose of this blog entry is to start answering these questions, and perhaps kickstart an online project to increase $\epsilon_0$.

— The big-picture techniques —

Before explaining how Bourgain et al. beat the Gershgorin technique, let’s briefly discuss this technique. First, let $\Phi_\mathcal{K}$ denote the submatrix of $\Phi$ whose columns are indexed by $\mathcal{K}\subseteq\{1,\ldots,N\}$. Then $(K,\delta)$-RIP equivalently states that, for every $\mathcal{K}$ of size $K$, the eigenvalues of $\Phi_\mathcal{K}^*\Phi_\mathcal{K}$ lie in $[1-\delta,1+\delta]$. As such, we can prove that a matrix is RIP by approximating eigenvalues. To this end, if we assume the columns of $\Phi$ have unit norm, and if we let $\mu$ denote the largest off-diagonal entry of $\Phi^*\Phi$ in absolute value (this is the worst-case coherence of the columns of $\Phi$), then the Gershgorin circle theorem implies that $\Phi$ is $(K,(K-1)\mu)$-RIP. Unfortunately, the coherence can’t be too small, due to the Welch bound:

$\displaystyle{\mu\geq\sqrt{\frac{N-M}{M(N-1)}}},$

which is $\Omega(M^{-1/2})$ provided $N\geq cM$ for some $c>1$. Thus, to get $(K-1)\mu=\delta<1/2$, we require $K<1/(2\mu)+1=O(M^{1/2})$. This is much smaller than the random RIP constructions which instead take $K=O(M/\log N)$, thereby revealing the shortcoming of the Gershgorin technique. We’ll now discuss the alternative techniques that Bourgain et al. use. To be clear, this blog entry will not discuss the techniques which are specific to their particular matrix construction — these are where additive combinatorics make an appearance, and will be the subject of a future blog entry.

The main idea is to convert the RIP statement, which concerns all $K$-sparse vectors simultaneously, into a statement about finitely many vectors:

Definition (flat RIP). We say $\Phi=[\varphi_1\cdots\varphi_N]$ satisfies $(K,\theta)$flat RIP if for every disjoint $I,J\subseteq\{1,\ldots,N\}$ of size $\leq K$,

$\displaystyle{\bigg|\bigg\langle\sum_{i\in I}\varphi_i,\sum_{j\in J}\varphi_j\bigg\rangle\bigg|\leq \theta\sqrt{|I||J|}}$.

Lemma A (essentially Lemma 3, cf. Theorem 13 in this paper). If $\Phi$ has $(K,\theta)$-flat RIP and unit-norm columns, then $\Phi$ has $(K,150\theta\log K)$-RIP.

Unlike the coherence argument, flat RIP doesn’t lead to much loss in $K$. In particular, this paper shows that random matrices satisfy $(K,\theta)$-flat RIP with $\theta=O(\delta/\log K)$ when $M=O((K/\delta^2)\log^2 K\log N)$. As such, it makes sense that flat RIP would be a vehicle to break the square-root bottleneck. However, in practice, it’s difficult to control both the left- and right-hand sides of the flat RIP inequality — it would be much easier if we only had to worry about getting cancellations, and not getting different levels of cancellation for different-sized subsets. This leads to the following:

Definition (weak flat RIP). We say $\Phi=[\varphi_1\cdots\varphi_N]$ satisfies $(K,\theta')$weak flat RIP if for every disjoint $I,J\subseteq\{1,\ldots,N\}$ of size $\leq K$,

$\displaystyle{\bigg|\bigg\langle\sum_{i\in I}\varphi_i,\sum_{j\in J}\varphi_j\bigg\rangle\bigg|\leq \theta' K}$.

Lemma B (essentially Lemma 1). If $\Phi$ has $(K,\theta')$-weak flat RIP and worst-case coherence $\mu\leq1/K$, then $\Phi$ has $(K,\sqrt{\theta'})$-flat RIP.

Proof: By the triangle inequality, we have

$\displaystyle{\bigg|\bigg\langle\sum_{i\in I}\varphi_i,\sum_{j\in J}\varphi_j\bigg\rangle\bigg|\leq\sum_{i\in I}\sum_{j\in J}|\langle\varphi_i,\varphi_j\rangle|\leq |I||J|\mu\leq |I||J|/K}.$

Since $\Phi$ also has weak flat RIP, we then have

$\displaystyle{\bigg|\bigg\langle\sum_{i\in I}\varphi_i,\sum_{j\in J}\varphi_j\bigg\rangle\bigg|\leq\min\{\theta'K,|I||J|/K\}\leq\sqrt{\theta'|I||J|}}.\qquad\Box$

Unfortunately, this coherence requirement puts $K$ back in the square-root bottleneck, since $\mu\leq1/K$ is equivalent to $K\leq1/\mu=O(M^{1/2})$. To rectify this, they use a trick in which a modest $K$ with tiny $\delta$ can be converted to a large $K$ with modest $\delta$:

Lemma C (buried in Lemma 3, cf. Theorem 1 in this paper). If $\Phi$ has $(K,\delta)$-RIP, then $\Phi$ has $(sK,2s\delta)$-RIP for all $s\geq1$.

This paper used this trick to get RIP results for larger $K$ when testing RIP for smaller $K$. For the deterministic RIP matrix problem, we are stuck with proving how small $\delta$ is when $K$ on the order of $M^{1/2}$. Note that this trick will inherently exhibit some loss in $K$. Assuming the best possible scaling for all $N$, $K$ and $\delta$ is $M=\Theta((K/\delta^2)\log(N/K))$, then if $N=\mathrm{poly}(M)$, you can get $(M^{1/2},\delta)$-RIP only if $\delta=\Omega((\log^{1/2}M)/M^{1/4})$. In this best-case scenario, you would want to pick $s=M^{1/4-\epsilon}$ for some $\epsilon>0$ and apply Lemma C to get $K=O(M^{3/4-\epsilon})$. In some sense, this is another manifestation of the square-root bottleneck, but it would still be a huge achievement to saturate this bound.

— How big is $\epsilon_0$? —

Before evaluating how big $\epsilon_0$ actually is, first note that the goal of this paper was only to establish that the bottleneck is breakable, i.e., to prove the following theorem:

Theorem (Theorem 1). There is an effective constant $\epsilon_0>0$ and an explicit number $M_0$ such that, for any positive integers $M\geq M_0$ and $M\leq N\leq M^{1+\epsilon_0}$, there is an explicit $M\times N$ matrix which is $(M^{1/2+\epsilon_0},M^{-\epsilon_0})$-RIP.

Due to this difference in objective, the analysis in this paper is perhaps ripe for optimization. But before we start optimizing, it would nice to see what the current record is. Halfway down page 155 of the journal version of the paper, the authors discuss how to use their results to produce the $\epsilon_0$ in the above theorem (note: I think the references to Corollary 1 in this paragraph should actually be references to Lemma 1). The main ingredient here is Lemma 2, which gives that their explicit construction is $(M^{1/2},M^{-\epsilon_1})$-weak flat RIP with

$\displaystyle{\epsilon_1=\frac{c_0\gamma}{20}-\frac{43\alpha}{m}}.$

Here, $c_0=1/10430$ is a constant from Proposition 2 of this paper, we can take $\gamma=\beta/50$ in accordance with Corollary 4, and $\alpha=1/(8m^2)$ and $\beta=1/(16m^2)$ are parameters of the matrix construction. Finally, $m$ is a free variable, constrained to be a positive even integer. To prove Theorem 1, they take $\epsilon_0=\epsilon_1/5$, but since our problem only requires $\delta<1/2$, we will instead take $\epsilon_0=\epsilon_1/2-\epsilon$ for any $\epsilon>0$. Indeed, $(M^{1/2},M^{-\epsilon_1})$-weak flat RIP implies $(M^{1/2},M^{-\epsilon_1/2})$-flat RIP by Lemma B, which in turn implies $(M^{1/2},75M^{-\epsilon_1/2}\log M)$-RIP by Lemma A, and so taking $s=M^{\epsilon_0}$ and applying Lemma C gives $(M^{1/2+\epsilon_0},150M^{-\epsilon}\log M)$-RIP.

At this point, we wish to establish how big $\epsilon_0=\epsilon_1/2-\epsilon$ can be given

$\displaystyle{\epsilon_1=\frac{c_0\gamma}{20}-\frac{43\alpha}{m}=\frac{a}{m^2}-\frac{b}{m^3}},$

where

$\displaystyle{a=\frac{1}{10430\cdot50\cdot20\cdot16},\quad b=\frac{43}{8}}.$

By elementary calculus, we have

$\displaystyle{\max_{m>0}\bigg(\frac{a}{m^2}-\frac{b}{m^3}\bigg)=\frac{4a^3}{27b^2}},$

with the maximum occurring at $m=(3b)/(2a)$. In our case, this maximizer is $m=1,345,470,000$, and the maximum $\epsilon_1$ leads to a maximal $\epsilon_0$ of

$\displaystyle{\epsilon_0=\frac{1}{1,812,606,691,486,752,000,000,000,000}-\epsilon}.$

Overall, the analysis (as is) leads to $\epsilon_0\approx 5.5169\times 10^{-28}$. This is the record to beat.

Considering the calculation of $\epsilon_0$, it is clear that there are several parameters whose improvement would lead to big increases in $\epsilon_0$. By far the biggest contribution to the denominator of $\epsilon_0$ is $c_0$, and the authors remark that optimizing this constant would be interesting in its own right (as a result in additive combinatorics). It’s unclear to me how amenable this constant is to optimization, but it’s certainly worth a try.

At the moment, I think the easiest gains will come from decreasing the $20$ in $a$ and the $43$ in $b$, which come from the proofs of Lemmas 2 and 10. As an example (and a starting point), I’ll use the rest of this blog entry to decrease the $20$ in $a$.

At the end of page 173 of the journal version (in the proof of Lemma 10 after (6.14)), the authors use the estimates

• $\|\lambda_1\|_2\leq\sqrt{m!}p^{-m\gamma/4}$
• $p^{0.44}\leq|B-B|\leq p^{0.78}$

along with the assumption $m\gamma\leq1/3$ to conclude

$\|\lambda_1\|_2+|B-B|^{-1/2}+|B-B|^{1/2}p^{-1/2}\leq\sqrt{m!}p^{-m\gamma/4}+p^{-0.1}\leq p^{-m\gamma/5}$

for sufficiently large $p$. Perhaps a more conservative approach is to first use precisely what the estimates give:

$\|\lambda_1\|_2+|B-B|^{-1/2}+|B-B|^{1/2}p^{-1/2}\leq\sqrt{m!}p^{-m\gamma/4}+p^{-0.22}+p^{-0.11}.$

Then since $m\gamma\leq1/3$ by assumption (we actually don’t require this much, but fine), we have that $\sqrt{m!}p^{-m\gamma/4}$ is the dominant term on the right-hand side. As such, for every $\epsilon>0$,

$\sqrt{m!}p^{-m\gamma/4}+p^{-0.22}+p^{-0.11}\leq p^{-m\gamma/(4+\epsilon)}$

for sufficiently large $p$. In effect, this replaces the $5$‘s in the proof (which are eventually multiplied by $4$ to form the $20$ in $a$) with $(4+\epsilon)$‘s, and so the $20$ in $a$ is effectively replaced with a $16$. Overall, this slight modification increases $\epsilon_0$ by a factor of $(20/16)^3=125/64$, nearly doubling the state of the art’s position beyond the square-root bottleneck. Indeed, we now have

$\displaystyle{\epsilon_0=\frac{1}{928,054,626,041,217,024,000,000,000}-\epsilon},$

or approximately $1.0775\times 10^{-27}$. I’m interested to see how far this can be taken.

UPDATE: Just launched a wiki to keep track of the progress.

1. It just occurred to me that, considering the equation after (2.2), we can afford to change the definition of $\beta$ from $1/(16m^2)$ to $1/(8^\circ m^2)$ — here, the $\circ$ uses Bandeira’s notation:
The point here is that the deterministic RIP matrix problem only requires a family of matrices with arbitrarily large aspect ratio, and we don’t require the rate at which this ratio increases to be related to $\epsilon_0$ (unlike Theorem 1 in Bourgain et al.). This change increases $\epsilon_0$ by a factor of $8=2^3$, so I’ll update the wiki accordingly.