A geometric intuition for the null space property

Solving underdetermined linear systems is impossible unless you’re given additional information about the solution. The goal of compressed sensing is to solve these systems by assuming the desired solution is sparse. In this vein, one common approach is L1 minimization: Find the minimizer of $\|z\|_1$ subject to $Az=Ax$, where $x$ is the true solution (and so $Ax$ is the available data). Here is the standard picture to illustrate why L1 minimization works:

In this case, we applied the sensing matrix $A=[1,3]$, whose null space is the set of scalar multiples of $[3;-1]$. The vector we are sensing is $x=[0;1]$, and the red line denotes the null space of $A$ shifted by $x$ (i.e., the set of all $z$ such that $Az=Ax$). The blue diamond illustrates the smallest L1-ball which intersects this shifted subspace, and we note that this intersection occurs at our signal $x$. As such, L1 minimization encouraged sparsity so well that it recovered the desired signal exactly.

The purpose of this blog entry is to gain a deeper intuition for why L1 minimization works so well. The following (rather obvious) lemma lies at the heart of this intuition:

Lemma. Fix some sensing matrix $A$, and for every signal $x$, consider the corresponding feasible set

$F_x:=\{z:Az=Ax\}$

and ball

$B_x:=\{z:\|z\|_1\leq\|x\|_1\}$.

Then $x$ uniquely minimizes $\|z\|_1$ subject to $Az=Ax$ if and only if $F_x\cap B_x=\{x\}$.

The beauty of this lemma is that it recasts exact recovery from L1 minimization purely in terms of convex geometry: A good null space for $x$ will be oriented in such a way that its shift by $x$ (namely, $F_x$) will kiss $B_x$ uniquely at $x$. We will apply this understanding to relate the following two concepts:

Definition.

(i) An $M\times N$ sensing matrix $A$ is said to satisfy the exact recovery property for $S\subseteq\{1,\ldots,N\}$ (or $S$-ERP) if every $x$ supported on $S$ uniquely minimizes $\|z\|_1$ subject to $Az=Ax$.

(ii) An $M\times N$ sensing matrix $A$ is said to satisfy the null space property for $S\subseteq\{1,\ldots,N\}$ (or $S$-NSP) if every $h\in\mathrm{null}(A)\setminus\{0\}$ satisfies $\|h_S\|_1<\|h_{S^\mathrm{c}}\|_1$.

The reader who is familiar with compressed sensing is probably aware that a matrix is $S$-ERP for every $S$ of size $K$ if and only if it’s $S$-NSP for every $S$ of size $K$. The following theorem is a more specific version of this result:

Theorem. A matrix is $S$-ERP if and only if it’s $S$-NSP.

Before proving this theorem, we note that in our example above, $A$ is evidently $\{2\}$-ERP, and indeed, $A$ is also $\{2\}$-NSP since every member of the null space has the form $h=[3y;-y]$, which satisfies $\|h_{\{2\}}\|_1=|-y|<|3y|=\|h_{\{1\}}\|_1$ whenever $h\neq0$.

Proof: We will obtain the contrapositive of each direction.

($\Rightarrow$) Suppose $A$ is not $S$-NSP. Then there exists $h\in\mathrm{null}(A)\setminus\{0\}$ such that $\|h_S\|_1\geq \|h_{S^\mathrm{c}}\|_1$. In other words, setting $x=h_S$ and $x^*=-h_{S^\mathrm{c}}$ gives that $x^*\in B_x$. Also, $x^*\in F_x$ since $Ax-Ax^*=Ah=0$, and $x^*\neq x$ since $h\neq 0$. The following illustrates this situation:

In this figure, $F_x$ is denoted by a dotted red line. Overall, we have $F_x\cap B_x\neq \{x\}$ (since it must also contain $x^*\neq x$), and so $A$ is not $S$-ERP by the lemma.

($\Leftarrow$) Suppose $A$ is not $S$-ERP. Then by the lemma, there exists a vector $x$ with support in $S$ such that $F_x\cap B_x$ contains some $x^*\neq x$. Pick $h=x^*-x$. Since $x^*\in F_x$, we have that $h$ is in the null space of $A$, and furthermore, $h\neq 0$ since $x^*\neq x$. The following illustrates this situation:

Evidently, $h_{S^\mathrm{c}}$ is smaller than $h_S$, and this can be established in general by adding a convenient form of zero, and then applying the triangle inequality along with the fact that $\|x^*\|_1\leq\|x\|_1$:

$\|h_{S^\mathrm{c}}\|_1=\|x^*_{S^\mathrm{c}}\|_1=\|x^*_{S^\mathrm{c}}\|_1+\|x^*_S-x^*_S+x\|_1-\|x\|_1$

$\leq\|x^*_{S^\mathrm{c}}\|_1+\|x^*_S\|_1+\|x^*_S-x\|_1-\|x\|_1$

$=\|x^*\|_1+\|h_S\|_1-\|x\|_1\leq\|h_S\|_1.$

As such, $A$ is not $S$-NSP.     $\Box$

As an aside, let’s consider how to build $S$-NSP matrices. To this end, the restricted isometry property is commonly used as a sufficient condition for the null space property. As you might expect, there’s an analogous condition for $S$-NSP:

Definition. An $M\times N$ sensing matrix $A$ is said to satisfy the restricted isometry property for $S\subseteq\{1,\ldots,N\}$ and $\delta\in[0,1)$ (or $(S,\delta)$-RIP) if

$(1-\delta)\|x-y\|^2\leq\|Ax-Ay\|^2\leq(1+\delta)\|x-y\|^2$

for every $x$ with support in $S$ and every $y$ with support size at most $\#S$.

In words, $(S,\delta)$-RIP matrices preserve distances between vectors supported on $S$ and other sparse vectors. By taking $x=0$, we see that this version of RIP implies the classical version of RIP with sparsity level $\#S$. As such, this is not much of a weakening of RIP. Regardless, the following result follows from the analysis in the classical proof that RIP implies NSP:

Theorem. If $A$ is $(S,\delta)$-RIP with $\delta<\sqrt{2}-1$, then $A$ is $S$-NSP.

Set $K=\#S$. Since $(S,\delta)$-RIP implies the classical $(K,\delta)$-RIP, a random $M\times N$ matrix will need $M=\Omega(K\log(N/K))$ rows to satisfy $(S,\delta)$-RIP. As such, we’re not saving much in the number of measurements compared to drawing a $(2K,\delta)$-RIP matrix to get NSP (i.e., $S$-NSP for every $S$ of size $K$ simultaneously). However, we can certainly lose the log factor, for example, by measuring with $K$ identity basis elements: $\{\delta_n\}_{n\in S}$. This disparity leads to the following problem:

Problem. Given a collection $\mathcal{S}$ of subsets $S\subseteq\{1,\ldots,N\}$, how large must $M$ be for there to exist an $M\times N$ matrix which is $S$-NSP for every $S\in\mathcal{S}$?

Depending on $\mathcal{S}$, the answer can range from $K$ to $O(K\log(N/K))$, and even though the log factor might be negligible, some (real-world) choices for $\mathcal{S}$ might lead to a much smaller constant (which could be important for certain applications).

39 thoughts on “A geometric intuition for the null space property”

1. Amit Unde says:

\|h_{S^\mathrm{c}}\|_1=\|x^*_{S^\mathrm{c}}\|_1=\|x^*_{S^\mathrm{c}}\|_1+\|x^*_S-x^*_S+x\|_1-\|x\|_1
\leq\|x^*_{S^\mathrm{c}}\|_1+\|x^*_S\|_1+\|x^*_S-x\|_1+\|x\|_1
=\|x^*\|_1+\|h_S\|_1-\|x\|_1\leq\|h_S\|_1.

This part is understood clearly.

2. Amit Unde says:

i am confused completely to sort out the relation between Xs, hs, hs^c, X*s….please help me out

1. x and h are vectors, and S and S^c are sets of indices (S^c is the complement of S). x_S denotes the portion of x which is supported on S, i.e., it equals x on S and otherwise equals zero. The definitions of h_S, etc. are similar.

1. Amit Unde says:

Thank you very much for reply. I am familiar with terminology. I had couple of questions that i am unable to sort out as given in proof of Null Space Property.

1. Why you have taken ||h_sc||= ||x*_sc||?

2. Explain inequalities below that.

2. x is supported on S, so h looks like x^* on S^c. The inequality follows from the fact that x^* has minimal 1-norm, i.e., \|x^*\|_1 \leq \|x\|_1.

3. Amit Unde says:

hey, sorry to disturb you again. But believe me, even i tried hard i am not getting rid of that inequality and the step follow below that. If possible give more explanation about this by solving stepwise. How you applied triangle inequality |a+b|<|a|+|b|?

How ||Xs*-Xs*+X|| – ||X|| is less than or equal to ||Xs*||+||Xs*-X||+||X|| ???

I am from signal processing background, so unable to deal with direct steps.

1. It isn’t — you detected a typo, actually. \|x\|_1 should be negated. I just fixed it — thanks!

4. Amit Unde says:

||Xs*-Xs*+X|| < ||Xs*||+||-Xs*+X||
but as norm is non negative, we can write same thing as,
||Xs*-Xs*+X|| < ||Xs*||+||Xs*-X|| by considering the fact that ||-Xs*+X|| = ||Xs*-X||.

Am I correct?

5. Amit Unde says:

using quasi triangle inequality |a+b|^T Leq |a|^T + |b|^T

how to prove following inequality?

|a+b|^T – |a|^T Geq(>) -|b|^T ??

This inequality used to prove NSP in Lemma1 by Remi Gribonval in the paper Sparse Representations in Unions of Bases. I really appreciate your help.

1. Try the substitution x=a+b and y=-b. Then rearrange

|x+y|^T \leq |x|^T + |y|^T

to get

|x|^T – |x+y|^T \geq -|y|^T.

6. Amit Unde says:

0 < T <= 1 i.e. l1 norm or below that.

7. Amit Unde says:

Wowww!!!!!!!!! One last question from NSP:

Suppose y=AX and X* be a solution of l1 minimization problem. X is k-sparse. S be the set of k largest entries of X. h=X*-X belongs to null space of A. then,

||X*|| Leq ||X||

||X*_s|| + ||X*_sc|| Leq ||X_s|| + ||X_sc||

from triangle inequality,

||X_s|| – ||h_s|| + ||h_sc|| – ||X_sc|| Leq ||X_s|| + ||X_sc|| ????

Again confused about step that uses triangle inequality. Please give explanation if possible.

||X*_sc|| = ||h_sc|| ??

||X_sc|| = 0 ??

||X*_s|| = ||X_s|| – ||h_s|| ?? What is reason for this last equality?

1. Hmm, well first observe that h_S = x*_S – x_S and h_Sc = x*_Sc. Next, reverse triangle inequality gives

||x_S|| – ||h_S||
= ||x_S|| – ||x*_S – x_S||
Leq ||x*_S||.

Thus,

||x_S|| – ||h_S|| + ||h_Sc|| – ||x_Sc||
Leq ||x*_S|| + ||x*_Sc|| – ||x_Sc||
Leq ||x*_S|| + ||x*_Sc||
= ||x*||
Leq ||x||
= ||x_S|| + ||x_Sc||.

8. Amit Unde says:

If possible then make some arrangement to type equation and to attach pdf.

9. Amit Unde says:

In the definition of RIP, it must be (2S, delta) RIP instead of (S,delta) because the support of (x-y) belongs to 2S not S. Right?

If matrix A is (2K,delta) RIP then what it implies: 2K-NSP or K-NSP?
According to your proof “Classical RIP Implies NSP”: (2K,delta) RIP implies K-NSP. Right?

How to prove (K,delta) RIP implies K-NSP?

10. Ahmed Al Hilli says:

Thanks for great explanation. I just need some references for this theorem, I mean the S-NSP and S-RIP conditions. Thank you very much.

1. Ahmed Al Hilli says:

Thank you for your prompt response. I have a question, according to the first paper you posted, it says the sensing matrix should satisfy NSP when equation 3.8 holds for all T in {1,2,…,n}, in your example, it only satisfies 3.8 when T={2} and T^c={1}. What about the other way around? I mean when T={1} and T^c={2}.
Thank you very mcuh

1. Good observation. In my example, the matrix actually doesn’t satisfy the null space property in its entirety. This is an issue with having a 1-dimensional null space in 2 dimensions. However, this issue doesn’t typically occur when the ambient dimension is large (and the null space is appropriately small).

11. stella says:

Thanks for the report. However, I am confused on the k-NSP and 2k-NSP. Shouldn’t it satisfy 2k-NSP to have k-sparsity recovery? Since 2k-NSP suggests that the 2k-sparse x belonging to the null space of A is zero vector? Why here we are suggesting k-NSP?

1. Your intuition is coming from what it takes to ensure that your measurements are different for every k-sparse x. NSP is a condition to further ensure that every k-sparse x can be retrieved by L1 minimization, which certainly requires their measurements to uniquely determine them. Indeed, it is easy to show that Phi doesn’t satisfy k-NSP if there are two different k-sparse vectors x and y with Phi*x=Phi*y.

12. Mariam says:

Dear dustin,

I am new to CS and I am really struggling in some concepts. I will highly appreciate your help! Thanks first for this great article. clarified some things yet I still have some gaps 😦

1- why from the first place we are using the null space ? what does it have to do with the problem.

2- regarding the sensing matrix phi, it should be known at both sender and receiver?

1. 1 – The null space of a matrix A tells you how little you know about the solutions to Ax=b. In the case of compressed sensing, you are told that x is sparse, but if sparse vectors are in the nullspace, you have no hope of recovering them. The nullspace property is a strengthening of this intuition to make L1 minimization work.

2 – Yes, the sensing matrix should be known in order to recover, although there has been some work recently to perform self-calibration in the case where Phi is only known up to certain degrees of freedom. As far as I understand, this is currently the subject of active research. See for example this paper:

https://www.math.ucdavis.edu/~strohmer/papers/2015/SparseLift.pdf

1. Mariam says:

Thanks Dustin for this prompt response 🙂
for question 2, okay I will read the paper
as for question 1, the image is still blurry for me actually
So how does or in what sense is the nullspace property is a strengthening of this intuition to make L1 minimization work?
Seems that I didn’t understand yet the null space property of a matrix
Would you kindly clarify for me ?
and excuse me for disturbance I am really struggling with this point

13. Well, the nullspace property implies that sparse vectors are not in the nullspace. In particular, if a vector h is only supported on S, then h_{S^c}=0, whereas h_S is nonzero. Compare this to the definition of the nullspace property. Also, the nullspace property makes L1 minization work because it implies exact recovery (as we prove in this blog entry).

14. Mariam says:

the null space property definition says that the L1 norm of the non zero set is less than that of the elements of v supoprted of S^c ,
so because of that I guarantee that the L1 norm on S is the minimum?

1. I don’t understand your question. The S-nullspace property is a statement about all vectors in the nullspace, whereas the S-exact recovery property is a statement about all vectors supported on S.

1. Mariam says:

Excuse me I have a big mess in my head… What I am not understanding till now even after reading this blog and other videos is that
1- why L1 guarantees a unique sparsest decomposition.
according to what I have in my mind now, the Line of feasible solutions cannot be in the null space of phi else there will exist more than one solution. (2 points of intersection with the L1 ball)
so x0 (the solution) cannot be in the null space of phi… what’s the relation of this with the definition of NSP and RIP.

2- why the transition from the L0 norm to the L1 norm is valid? that is why if phi obeys RIP and X is sufficiently sparse, then X0 is a common solution to both L0 and L1 minimization problems.

So much thanks for your cooperation.

15. 1- The set of feasible solutions is a shifted version of the nullspace. If there are two sparse solutions, then their difference lies in the nullspace and violates the nullspace property (as well as the restricted isometry property).

2- We talk about L0 minimization as the thing we’d like to do (but can’t, computationally), and then L1 minimization is a thing we can do by linear programming. Intuitively, L1 minimization works because minimizing the L1 norm promotes sparsity (as the first figure in this blog entry illustrates). In terms of guarantees, the nullspace property and restricted isometry properties are sufficient conditions on Phi that make it work, and thankfully, random choices of Phi tend to satisfy these properties.

1. Mariam says:

Dear Dustin,

Finally the image is clear for me now, re-read ur comments and blog and watched this video https://www.youtube.com/watch?v=c6OEZQ3Hhp4

One other question if you don’t mind,
what differs when applying CS to continuous time signals and Discrete time signals? and those that are discrete, they are so by their nature?

1. The theory of compressed sensing has mostly focused on discrete signals, but the applications that I tend to keep in mind are fundamentally continuous: (1) fast MRI scans:

http://statweb.stanford.edu/~donoho/Reports/2007/CSMRI-20071204.pdf

and (2) the single-pixel camera:

http://dsp.rice.edu/cscamera

In order to match the discrete theory with continuous applications, one must wrestle with additional issues such as resolution. For example, how many pixels do you want to represent a given image? In the end, I plan to store the reconstructed image on a computer, so there’s probably no need to worry about continuous-time versions of L1 minimization. (Though there may be some subtleties or other particular applications that I’m failing to consider at the moment.)

16. amit unde says:

What is the actual meaning of M/N in compressive sensing where M represent total number of measurements and n represent length of signal?

1. This is the “compression ratio” in compressed sensing. The closer this is to zero, the more compressive your sensor is.

17. amit unde says:

But is there any relation between the sampling rule and M/N?

1. I’m not sure what you mean by the sampling rule, but the main point of compressed sensing is that you can take M/N to be small provided the signal model is sparse.

2. Mariam says:

I guess he means the nyquist rate.
If this is what you mean, then no there is no relation between CS and Nyquist rate.
In Nyquist, the constraint on the signal is that it should be band limited however in CS, it’s constraint on the signal is different. It only should be sparse as dustin said in his reply