A geometric intuition for the null space property

Solving underdetermined linear systems is impossible unless you’re given additional information about the solution. The goal of compressed sensing is to solve these systems by assuming the desired solution is sparse. In this vein, one common approach is L1 minimization: Find the minimizer of \|z\|_1 subject to Az=Ax, where x is the true solution (and so Ax is the available data). Here is the standard picture to illustrate why L1 minimization works:

In this case, we applied the sensing matrix A=[1,3], whose null space is the set of scalar multiples of [3;-1]. The vector we are sensing is x=[0;1], and the red line denotes the null space of A shifted by x (i.e., the set of all z such that Az=Ax). The blue diamond illustrates the smallest L1-ball which intersects this shifted subspace, and we note that this intersection occurs at our signal x. As such, L1 minimization encouraged sparsity so well that it recovered the desired signal exactly.
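As a quick sanity check, here is a minimal numerical sketch (my own illustration, not from the original post) that carries out this L1 minimization by recasting it as a linear program; the variable split z = u - v and the use of scipy.optimize.linprog are choices made for the sketch:

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 3.0]])     # the sensing matrix from the example
x = np.array([0.0, 1.0])       # the sparse signal being sensed
b = A @ x                      # the available data Ax

M, N = A.shape
# Split z = u - v with u, v >= 0; minimizing sum(u) + sum(v) subject to
# A(u - v) = b then solves the L1 minimization problem.
c = np.ones(2 * N)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b, bounds=[(0, None)] * (2 * N))
z = res.x[:N] - res.x[N:]
print(np.round(z, 6))          # approximately [0, 1], i.e., exact recovery of x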

The purpose of this blog entry is to gain a deeper intuition for why L1 minimization works so well. The following (rather obvious) lemma lies at the heart of this intuition:

Lemma. Fix some sensing matrix A, and for every signal x, consider the corresponding feasible set

F_x:=\{z:Az=Ax\}

and ball

B_x:=\{z:\|z\|_1\leq\|x\|_1\}.

Then x uniquely minimizes \|z\|_1 subject to Az=Ax if and only if F_x\cap B_x=\{x\}.

The beauty of this lemma is that it recasts exact recovery from L1 minimization purely in terms of convex geometry: A good null space for x will be oriented in such a way that its shift by x (namely, F_x) will kiss B_x uniquely at x. We will apply this understanding to relate the following two concepts:

Definition.

(i) An M\times N sensing matrix A is said to satisfy the exact recovery property for S\subseteq\{1,\ldots,N\} (or S-ERP) if every x supported on S uniquely minimizes \|z\|_1 subject to Az=Ax.

(ii) An M\times N sensing matrix A is said to satisfy the null space property for S\subseteq\{1,\ldots,N\} (or S-NSP) if every h\in\mathrm{null}(A)\setminus\{0\} satisfies \|h_S\|_1<\|h_{S^\mathrm{c}}\|_1.

The reader who is familiar with compressed sensing is probably aware that a matrix is S-ERP for every S of size K if and only if it’s S-NSP for every S of size K. The following theorem is a more specific version of this result:

Theorem. A matrix is S-ERP if and only if it’s S-NSP.

Before proving this theorem, we note that in our example above, A is evidently \{2\}-ERP, and indeed, A is also \{2\}-NSP since every member of the null space has the form h=[3y;-y], which satisfies \|h_{\{2\}}\|_1=|-y|<|3y|=\|h_{\{1\}}\|_1 whenever h\neq0.
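This example also lends itself to a quick numerical check. The following sketch is an illustration of my own (it assumes scipy.linalg.null_space to get a basis of the null space); since the null space here is one-dimensional, testing the NSP inequality on a single spanning vector suffices by homogeneity of the L1 norm:

import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 3.0]])
h = null_space(A)[:, 0]        # spans null(A); proportional to [3, -1]

S, Sc = [1], [0]               # S = {2} and S^c = {1} in the post's 1-based indexing
print(np.abs(h[S]).sum() < np.abs(h[Sc]).sum())   # True, so A is {2}-NSP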

Proof: We will obtain the contrapositive of each direction.

(\Rightarrow) Suppose A is not S-NSP. Then there exists h\in\mathrm{null}(A)\setminus\{0\} such that \|h_S\|_1\geq \|h_{S^\mathrm{c}}\|_1. In other words, setting x=h_S (which is supported on S) and x^*=-h_{S^\mathrm{c}} gives that x^*\in B_x. Also, x^*\in F_x since Ax-Ax^*=Ah=0, and x^*\neq x since h\neq 0. The following illustrates this situation:

In this figure, F_x is denoted by a dotted red line. Overall, we have F_x\cap B_x\neq \{x\} (since it must also contain x^*\neq x), and so A is not S-ERP by the lemma.

(\Leftarrow) Suppose A is not S-ERP. Then by the lemma, there exists a vector x with support in S such that F_x\cap B_x contains some x^*\neq x. Pick h=x^*-x. Since x^*\in F_x, we have that h is in the null space of A, and furthermore, h\neq 0 since x^*\neq x. The following illustrates this situation:

Evidently, h_{S^\mathrm{c}} is no larger than h_S in L1 norm, and this can be established in general: since x is supported on S, we have h_{S^\mathrm{c}}=x^*_{S^\mathrm{c}}, and the rest follows by adding a convenient form of zero and then applying the triangle inequality along with the fact that \|x^*\|_1\leq\|x\|_1:

\|h_{S^\mathrm{c}}\|_1=\|x^*_{S^\mathrm{c}}\|_1=\|x^*_{S^\mathrm{c}}\|_1+\|x^*_S-x^*_S+x\|_1-\|x\|_1

\leq\|x^*_{S^\mathrm{c}}\|_1+\|x^*_S\|_1+\|x^*_S-x\|_1-\|x\|_1

=\|x^*\|_1+\|h_S\|_1-\|x\|_1\leq\|h_S\|_1.

As such, A is not S-NSP.     \Box
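To see the construction in the (\Rightarrow) direction in action, here is a small sketch with a hypothetical sensing matrix of my own choosing that fails \{2\}-NSP; it builds x=h_S and x^*=-h_{S^\mathrm{c}} from an NSP-violating h and confirms that x^* is a feasible competitor with no larger L1 norm:

import numpy as np

A = np.array([[3.0, 1.0]])      # hypothetical matrix that fails {2}-NSP
h = np.array([1.0, -3.0])       # A @ h = 0, and ||h_S||_1 = 3 >= 1 = ||h_{S^c}||_1 for S = {2}

S = np.array([False, True])     # boolean mask for S = {2}
x      = np.where(S, h, 0.0)    # x = h_S, supported on S
x_star = np.where(S, 0.0, -h)   # x* = -h_{S^c}

print(np.allclose(A @ x, A @ x_star))            # True: x* lies in F_x
print(np.abs(x_star).sum() <= np.abs(x).sum())   # True: x* lies in B_x
# Since x* != x, the lemma says x is not the unique L1 minimizer, so A is not {2}-ERP.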

As an aside, let’s consider how to build S-NSP matrices. To this end, the restricted isometry property is commonly used as a sufficient condition for the null space property. As you might expect, there’s an analogous condition for S-NSP:

Definition. An M\times N sensing matrix A is said to satisfy the restricted isometry property for S\subseteq\{1,\ldots,N\} and \delta\in[0,1) (or (S,\delta)-RIP) if

(1-\delta)\|x-y\|^2\leq\|Ax-Ay\|^2\leq(1+\delta)\|x-y\|^2

for every x with support in S and every y with support size at most \#S.

In words, (S,\delta)-RIP matrices preserve distances between vectors supported on S and other sparse vectors. By taking x=0, we see that this version of RIP implies the classical version of RIP with sparsity level \#S. As such, this is not much of a weakening of RIP. Regardless, the following result follows from the analysis in the classical proof that RIP implies NSP:

Theorem. If A is (S,\delta)-RIP with \delta<\sqrt{2}-1, then A is S-NSP.

Set K=\#S. Since (S,\delta)-RIP implies the classical (K,\delta)-RIP, a random M\times N matrix will need M=\Omega(K\log(N/K)) rows to satisfy (S,\delta)-RIP. As such, we're not saving much in the number of measurements compared to drawing a (2K,\delta)-RIP matrix to get NSP (i.e., S-NSP for every S of size K simultaneously). However, we can certainly lose the log factor, for example, by measuring with the K identity basis elements \{\delta_n\}_{n\in S}: the null space of this matrix consists precisely of the vectors supported on S^\mathrm{c}, so every nonzero h in the null space satisfies \|h_S\|_1=0<\|h_{S^\mathrm{c}}\|_1. This disparity leads to the following problem:

Problem. Given a collection \mathcal{S} of subsets S\subseteq\{1,\ldots,N\}, how large must M be for there to exist an M\times N matrix which is S-NSP for every S\in\mathcal{S}?

Depending on \mathcal{S}, the answer can range from K to O(K\log(N/K)), and even though the log factor might be negligible, some (real-world) choices for \mathcal{S} might lead to a much smaller constant (which could be important for certain applications).
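As a practical footnote, (S,\delta)-RIP can be certified by brute force when N and \#S are small: every difference x-y in the definition is supported on S\cup T for some T of size \#S, so it suffices to examine the extreme singular values of the corresponding column submatrices. The following sketch is my own illustration (the function name, the random Gaussian example, and the 1/\sqrt{M} scaling are all assumptions of the sketch):

import numpy as np
from itertools import combinations

def restricted_isometry_constant(A, S):
    # Smallest delta with (1 - delta)||x - y||^2 <= ||A(x - y)||^2 <= (1 + delta)||x - y||^2
    # for every x supported on S and every y with support size at most #S.
    N, K = A.shape[1], len(S)
    delta = 0.0
    for T in combinations(range(N), K):
        cols = sorted(set(S) | set(T))
        s = np.linalg.svd(A[:, cols], compute_uv=False)
        delta = max(delta, abs(s[0] ** 2 - 1), abs(s[-1] ** 2 - 1))
    return delta

# Small random example; the resulting delta need not be below the sqrt(2)-1 threshold.
rng = np.random.default_rng(0)
M, N, S = 30, 60, [0, 1]
A = rng.standard_normal((M, N)) / np.sqrt(M)   # columns have unit norm in expectation
print(restricted_isometry_constant(A, S))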


39 thoughts on “A geometric intuition for the null space property”

  1. \|h_{S^\mathrm{c}}\|_1=\|x^*_{S^\mathrm{c}}\|_1=\|x^*_{S^\mathrm{c}}\|_1+\|x^*_S-x^*_S+x\|_1-\|x\|_1
    \leq\|x^*_{S^\mathrm{c}}\|_1+\|x^*_S\|_1+\|x^*_S-x\|_1-\|x\|_1
    =\|x^*\|_1+\|h_S\|_1-\|x\|_1\leq\|h_S\|_1.

    This part is not understood clearly.

    1. x and h are vectors, and S and S^c are sets of indices (S^c is the complement of S). x_S denotes the portion of x which is supported on S, i.e., it equals x on S and otherwise equals zero. The definitions of h_S, etc. are similar.

      1. Thank you very much for the reply. I am familiar with the terminology. I have a couple of questions that I am unable to sort out in the proof of the null space property.

        1. Why have you taken ||h_{S^c}|| = ||x*_{S^c}||?

        2. Could you explain the inequalities below that?

  2. Hey, sorry to disturb you again. Believe me, even though I tried hard, I cannot get past that inequality and the step that follows it. If possible, please give a more detailed, step-by-step explanation. How did you apply the triangle inequality |a+b| ≤ |a|+|b|?

    How is ||x*_S - x*_S + x|| - ||x|| less than or equal to ||x*_S|| + ||x*_S - x|| - ||x||?

    I am from a signal processing background, so I am unable to follow the direct steps.

  3. ||x*_S - x*_S + x|| ≤ ||x*_S|| + ||-x*_S + x||,
    and since ||-x*_S + x|| = ||x*_S - x||, we can write the same thing as
    ||x*_S - x*_S + x|| ≤ ||x*_S|| + ||x*_S - x||.

    Am I correct?

  4. Using the quasi-triangle inequality |a+b|^T ≤ |a|^T + |b|^T,

    how does one prove the following inequality?

    |a+b|^T - |a|^T ≥ -|b|^T ?

    This inequality is used to prove the NSP in Lemma 1 by Remi Gribonval in the paper Sparse Representations in Unions of Bases. I really appreciate your help.

  5. Wow! One last question about the NSP:

    Suppose y = Ax and x* is a solution of the L1 minimization problem, where x is k-sparse. Let S be the set of the k largest entries of x, and let h = x* - x, which belongs to the null space of A. Then

    ||x*|| ≤ ||x||

    ||x*_S|| + ||x*_{S^c}|| ≤ ||x_S|| + ||x_{S^c}||

    and, from the triangle inequality,

    ||x_S|| - ||h_S|| + ||h_{S^c}|| - ||x_{S^c}|| ≤ ||x_S|| + ||x_{S^c}|| ?

    Again I am confused about the step that uses the triangle inequality. Please give an explanation if possible.

    ||x*_{S^c}|| = ||h_{S^c}|| ?

    ||x_{S^c}|| = 0 ?

    ||x*_S|| = ||x_S|| - ||h_S|| ? What is the reason for this last equality?

    1. Hmm, well first observe that h_S = x*_S - x_S and h_{S^c} = x*_{S^c}. Next, the reverse triangle inequality gives

      ||x_S|| - ||h_S||
      = ||x_S|| - ||x*_S - x_S||
      ≤ ||x*_S||.

      Thus,

      ||x_S|| - ||h_S|| + ||h_{S^c}|| - ||x_{S^c}||
      ≤ ||x*_S|| + ||x*_{S^c}|| - ||x_{S^c}||
      ≤ ||x*_S|| + ||x*_{S^c}||
      = ||x*||
      ≤ ||x||
      = ||x_S|| + ||x_{S^c}||.

  6. In the definition of RIP, shouldn't it be (2S, delta)-RIP instead of (S, delta)-RIP, since the support of (x - y) has size up to 2#S, not #S? Right?

    If a matrix A is (2K, delta)-RIP, then what does it imply: 2K-NSP or K-NSP?
    According to your proof “Classical RIP Implies NSP”, (2K, delta)-RIP implies K-NSP. Right?

    How does one prove that (K, delta)-RIP implies K-NSP?

    Please do reply.

  7. Thanks for the great explanation. I just need some references for this theorem, I mean the S-NSP and S-RIP conditions. Thank you very much.

    1. Thank you for your prompt response. I have a question: according to the first paper you posted, the sensing matrix satisfies the NSP when equation 3.8 holds for all T in {1,2,…,n}. In your example, it only satisfies 3.8 when T={2} and T^c={1}. What about the other way around, i.e., when T={1} and T^c={2}?
      Thank you very much.

      1. Good observation. In my example, the matrix actually doesn’t satisfy the null space property in its entirety. This is an issue with having a 1-dimensional null space in 2 dimensions. However, this issue doesn’t typically occur when the ambient dimension is large (and the null space is appropriately small).

  8. Thanks for the post. However, I am confused about k-NSP versus 2k-NSP. Shouldn't the matrix satisfy 2k-NSP to have k-sparse recovery, since 2k-NSP suggests that the only 2k-sparse x belonging to the null space of A is the zero vector? Why are we suggesting k-NSP here?

    Thanks! Please reply. :-)

    1. Your intuition is coming from what it takes to ensure that your measurements are different for every k-sparse x. NSP is a condition to further ensure that every k-sparse x can be retrieved by L1 minimization, which certainly requires their measurements to uniquely determine them. Indeed, it is easy to show that Phi doesn’t satisfy k-NSP if there are two different k-sparse vectors x and y with Phi*x=Phi*y.

  9. Dear Dustin,

    I am new to CS and I am really struggling with some concepts. I would highly appreciate your help! Thanks first for this great article; it clarified some things, yet I still have some gaps. 😦

    1- Why are we using the null space in the first place? What does it have to do with the problem?

    2- Regarding the sensing matrix Phi: should it be known at both the sender and the receiver?

    1. 1 – The null space of a matrix A tells you how little you know about the solutions to Ax=b. In the case of compressed sensing, you are told that x is sparse, but if sparse vectors are in the nullspace, you have no hope of recovering them. The nullspace property is a strengthening of this intuition to make L1 minimization work.

      2 – Yes, the sensing matrix should be known in order to recover, although there has been some work recently to perform self-calibration in the case where Phi is only known up to certain degrees of freedom. As far as I understand, this is currently the subject of active research. See for example this paper:

      https://www.math.ucdavis.edu/~strohmer/papers/2015/SparseLift.pdf

      1. Thanks, Dustin, for this prompt response 🙂
        For question 2, okay, I will read the paper.
        As for question 1, the image is still blurry for me, actually. :/
        In what sense is the null space property a strengthening of this intuition that makes L1 minimization work?
        It seems that I have not yet understood the null space property of a matrix.
        Would you kindly clarify for me? Excuse the disturbance; I am really struggling with this point.

  10. Well, the nullspace property implies that sparse vectors are not in the nullspace. In particular, if a vector h is only supported on S, then h_{S^c}=0, whereas h_S is nonzero. Compare this to the definition of the nullspace property. Also, the nullspace property makes L1 minimization work because it implies exact recovery (as we prove in this blog entry).

  11. The null space property definition says that the L1 norm of v on the nonzero set is less than the L1 norm of the elements of v supported on S^c.
    So because of that, am I guaranteed that the L1 norm on S is the minimum?

    1. I don’t understand your question. The S-nullspace property is a statement about all vectors in the nullspace, whereas the S-exact recovery property is a statement about all vectors supported on S.

      1. Excuse me, I have a big mess in my head… What I am not understanding even now, after reading this blog and other videos, is:
        1- Why does L1 guarantee a unique sparsest decomposition?
        According to what I have in mind now, the line of feasible solutions cannot lie in the null space of Phi, or else there would exist more than one solution (two points of intersection with the L1 ball).
        So x0 (the solution) cannot be in the null space of Phi… What is the relation of this to the definitions of the NSP and RIP?

        2- Why is the transition from the L0 norm to the L1 norm valid? That is, why is it that if Phi obeys the RIP and x is sufficiently sparse, then x0 is a common solution to both the L0 and L1 minimization problems?

        Thank you so much for your cooperation.

  12. 1- The set of feasible solutions is a shifted version of the nullspace. If there are two sparse solutions, then their difference lies in the nullspace and violates the nullspace property (as well as the restricted isometry property).

    2- We talk about L0 minimization as the thing we’d like to do (but can’t, computationally), and then L1 minimization is a thing we can do by linear programming. Intuitively, L1 minimization works because minimizing the L1 norm promotes sparsity (as the first figure in this blog entry illustrates). In terms of guarantees, the nullspace property and restricted isometry properties are sufficient conditions on Phi that make it work, and thankfully, random choices of Phi tend to satisfy these properties.

      1. The theory of compressed sensing has mostly focused on discrete signals, but the applications that I tend to keep in mind are fundamentally continuous: (1) fast MRI scans:

        http://statweb.stanford.edu/~donoho/Reports/2007/CSMRI-20071204.pdf

        and (2) the single-pixel camera:

        http://dsp.rice.edu/cscamera

        In order to match the discrete theory with continuous applications, one must wrestle with additional issues such as resolution. For example, how many pixels do you want to represent a given image? In the end, I plan to store the reconstructed image on a computer, so there’s probably no need to worry about continuous-time versions of L1 minimization. (Though there may be some subtleties or other particular applications that I’m failing to consider at the moment.)

  13. What is the actual meaning of M/N in compressive sensing, where M represents the total number of measurements and N represents the length of the signal?

    1. I guess he means the Nyquist rate.
      If this is what you mean, then no, there is no relation between CS and the Nyquist rate.
      In Nyquist sampling, the constraint on the signal is that it should be band-limited; however, in CS, the constraint on the signal is different: it only needs to be sparse, as Dustin said in his reply.
