Phase retrieval: Stability and recovery guarantees

I wanted to start this year strong, so I asked Aaron A. Nelson (a master’s student at AFIT) to write a blog entry for me:

Given an ensemble of measurement vectors \left\{\varphi_n\right\}_{n=1}^{N}\subseteq\mathbb{R}^M, the problem of recovering a signal x\in\mathbb{R}^M (up to a global sign) from the quadratic data |\langle x,\varphi_n\rangle|^2 is known as phase retrieval.  Naturally, one would like to know when this measurement process is even stable.  This issue is tackled in the following paper:

Phase retrieval: Stability and recovery guarantees

Yonina C. Eldar, Shahar Mendelson

Let \mathcal{A}\colon\mathbb{R}^M/\{\pm1\}\rightarrow\mathbb{R}^N where (\mathcal{A}(x))(n)=|\langle x,\varphi_n\rangle|^2.  Then the notion of stability is a statement about the difference \left\Vert\mathcal{A}(x)-\mathcal{A}(y)\right\Vert_1 for all x\neq\pm y in \mathbb{R}^M.  For this paper, Eldar and Mendelson define the mapping \mathcal{A} to be stable on a set T\subseteq\mathbb{R}^M if there exists a constant C>0 such that

\displaystyle{\left\Vert\mathcal{A}(x)-\mathcal{A}(y)\right\Vert_1\geq C\left\Vert x-y\right\Vert_2\left\Vert x+y\right\Vert_2 \qquad \forall x,y\in T}.

Notice that this definition is a stronger statement than injectivity, which only requires that \left\Vert\mathcal{A}(x)-\mathcal{A}(y)\right\Vert_1>0 for all x\neq\pm y in T.  Basically, Eldar and Mendelson’s notion of stability further says that whenever \mathcal{A}(x) and \mathcal{A}(y) are close in the \ell_1 sense, then x must also be close to either y or -y in the \ell_2 sense.  With this in mind, we see that the constant C plays an important role in stability:  If \mathcal{A}(x) and \mathcal{A}(y) are close and C is large, then x must be really close to either y or -y.  Intuitively speaking, the bigger C is, the more stable \mathcal{A} is.
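
To make this concrete, here's a quick numerical sketch (my own illustration, not from the paper) that estimates the stability constant C for a Gaussian ensemble by computing the ratio \left\Vert\mathcal{A}(x)-\mathcal{A}(y)\right\Vert_1/(\left\Vert x-y\right\Vert_2\left\Vert x+y\right\Vert_2) over many random pairs; the dimensions M and N below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 16, 200  # illustrative signal dimension and number of measurements
Phi = rng.standard_normal((N, M))  # rows are the measurement vectors phi_n

def A(x):
    """Intensity measurements (A(x))(n) = |<x, phi_n>|^2."""
    return (Phi @ x) ** 2

# Empirically estimate C as the smallest observed ratio
# ||A(x) - A(y)||_1 / (||x - y||_2 ||x + y||_2) over random pairs.
ratios = []
for _ in range(1000):
    x, y = rng.standard_normal(M), rng.standard_normal(M)
    ratios.append(np.abs(A(x) - A(y)).sum()
                  / (np.linalg.norm(x - y) * np.linalg.norm(x + y)))
print("smallest observed ratio (proxy for C):", min(ratios))
```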

Eldar and Mendelson’s first main stability result (Theorem 2.4) provides an expression for C in the special case where \left\{\varphi_n\right\}_{n=1}^{N} are independent random vectors with identical isotropic, subgaussian distribution:

\displaystyle{C=\kappa(x-y,x+y)-c\rho_{T,N}},

with some (large) probability depending only on the constant c.  Here, \kappa(u,v)=\mathbb{E}|\langle u/\left\Vert u\right\Vert_2,\varphi\rangle\langle v/\left\Vert v\right\Vert_2,\varphi\rangle|, where \varphi is a random vector with the same distribution as each of the \varphi_n's.  Also, \rho_{T,N}=\ell(T)/\sqrt{N}+\ell^2(T)/N, where \ell(T)=\mathbb{E}\sup_{t\in T}{|\langle t,z\rangle|} is the Gaussian complexity of T with z being a standard Gaussian random vector.  Thus, we get more stability if we construct our measurement ensemble (really, the distribution of \varphi) in a way that makes \kappa large and \rho_{T,N} small.
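
For the special case where \varphi is standard Gaussian (one instance of an isotropic subgaussian distribution), \kappa(u,v) is easy to estimate by Monte Carlo, and it can be checked against the closed form \mathbb{E}|g_1g_2|=(2/\pi)(\sqrt{1-\rho^2}+\rho\arcsin\rho) for standard normals with correlation \rho.  A sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 16

def kappa_mc(u, v, trials=200_000):
    """Monte Carlo estimate of kappa(u, v) = E|<u/||u||, phi><v/||v||, phi>|
    for a standard Gaussian phi."""
    Phi = rng.standard_normal((trials, M))
    a = Phi @ (u / np.linalg.norm(u))
    b = Phi @ (v / np.linalg.norm(v))
    return np.abs(a * b).mean()

u, v = rng.standard_normal(M), rng.standard_normal(M)
rho = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
closed_form = (2 / np.pi) * (np.sqrt(1 - rho**2) + rho * np.arcsin(rho))
print(kappa_mc(u, v), closed_form)  # these should roughly agree
```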

Interestingly, the function \kappa contains some important information about the ensemble \left\{\varphi_n\right\}_{n=1}^{N}.  To be clear, we say \left\{\varphi_n\right\}_{n=1}^{N} has the complement property if for every S\subseteq\{1,\ldots,N\}, either \left\{\varphi_n\right\}_{n\in S} or \left\{\varphi_n\right\}_{n\in S^{\mathrm{c}}} spans \mathbb{R}^{M}.  It was shown by Balan, Casazza and Edidin that \left\{\varphi_n\right\}_{n=1}^{N} has the complement property if and only if \mathcal{A} is injective.  Notice that \left\{\varphi_n\right\}_{n=1}^{N} has the complement property precisely when

\displaystyle{\max_{n\in\{1,\ldots,N\}}|\langle u/\left\Vert u\right\Vert_2,\varphi_n\rangle\langle v/\left\Vert v\right\Vert_2,\varphi_n\rangle|>0 \qquad \forall u,v\in\mathbb{R}^M\setminus\{0\}}.

Furthermore, the maximum is at least the average over n, which in turn is a good empirical proxy for the expected value \kappa(u,v); so having \kappa large intuitively ensures that \mathcal{A} is “very” injective (i.e., stable).  Overall, \kappa provides a sort of numerical version of the complement property.
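
The complement property itself can be checked directly, though only by brute force over subsets, so the following sketch (mine, not the paper's) uses tiny sizes with N = 2M - 1, the smallest N for which generic real ensembles are injective:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
M, N = 3, 5  # tiny sizes; the subset check below is exponential in N

def has_complement_property(Phi, M):
    """Check: for every S, {phi_n}_{n in S} or its complement spans R^M."""
    def spans(rows):
        return len(rows) > 0 and np.linalg.matrix_rank(Phi[rows]) == M
    for r in range(len(Phi) + 1):
        for S in itertools.combinations(range(len(Phi)), r):
            Sc = [n for n in range(len(Phi)) if n not in S]
            if not spans(list(S)) and not spans(Sc):
                return False
    return True

# Generic ensembles satisfy the property once N >= 2M - 1; here N = 2M - 1.
Phi = rng.standard_normal((N, M))
print(has_complement_property(Phi, M))
```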

As for \rho_{T,N}, it is driven by the Gaussian complexity \ell(T), which measures how well T correlates with a random direction and can be thought of as the expected width of T in a randomly selected direction.  For example, if T is taken to be an ellipsoid, then \ell(T) is a sort of average of the lengths of its semi-principal axes.  This makes sense intuitively in the context of Theorem 2.4:  If T has a smaller width, then \rho_{T,N} is smaller and the stability constant C is larger, thereby making the measurement process inherently more stable.
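
For concreteness, \ell(T) is also easy to estimate by Monte Carlo for simple choices of T.  The sketch below (my own example) treats the unit ball, where the supremum is just \left\Vert z\right\Vert_2, and the set of k-sparse unit vectors, where the supremum is the norm of the k largest-magnitude entries of z:

```python
import numpy as np

rng = np.random.default_rng(3)
M, k, trials = 100, 5, 10_000
Z = rng.standard_normal((trials, M))

# T = unit ball: sup_{||t|| <= 1} |<t, z>| = ||z||_2, so l(T) = E||z|| ~ sqrt(M)
ell_ball = np.linalg.norm(Z, axis=1).mean()

# T = k-sparse unit vectors: the sup is the norm of the k largest-magnitude
# entries of z, which scales like sqrt(k log(eM/k)) -- much smaller than sqrt(M)
ell_sparse = np.linalg.norm(np.sort(np.abs(Z), axis=1)[:, -k:], axis=1).mean()

print(ell_ball, np.sqrt(M))  # both roughly 10
print(ell_sparse)            # noticeably smaller
```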

Considering Theorem 2.4, notice that for a fixed set T, the only dependence on the number of intensity measurements N is in \rho_{T,N}.  Thus to get stability, we only need to pick N so that \rho_{T,N} is sufficiently small.  This is precisely what Eldar and Mendelson do – for instance, they determine that stability requires \mathcal{O}(M) measurements when T=\mathbb{R}^M and \mathcal{O}(k\log{(eM/k)}) measurements when T is the set of k-sparse vectors.
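
To see these measurement counts emerge numerically, one can plug the standard estimates \ell(T)\approx\sqrt{M} and \ell(T)\approx\sqrt{2k\log(eM/k)} into \rho_{T,N} and grow N until \rho_{T,N} falls below a threshold; a small sketch, with arbitrary choices of M, k, and threshold:

```python
import numpy as np

def rho(ell_T, N):
    """rho_{T,N} = l(T)/sqrt(N) + l(T)^2/N from Theorem 2.4."""
    return ell_T / np.sqrt(N) + ell_T**2 / N

def measurements_needed(ell_T, eps=0.1):
    """Smallest power of two N with rho_{T,N} <= eps (eps is arbitrary)."""
    N = 1
    while rho(ell_T, N) > eps:
        N *= 2
    return N

M, k = 100, 5
print(measurements_needed(np.sqrt(M)))                             # scales like M
print(measurements_needed(np.sqrt(2 * k * np.log(np.e * M / k))))  # like k log(eM/k)
```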

Now consider the mapping x\mapsto\left\{|\langle x,\varphi_n\rangle|^2\right\}_{n=1}^{N} in the presence of noise.  That is, let w\in\mathbb{R}^N be a random noise vector such that the quadratic data for a signal x is given by \left\{|\langle x,\varphi_n\rangle|^2+w_n\right\}_{n=1}^N.  Typically, the noise entries are modeled as iid mean-zero random variables.  Eldar and Mendelson use a \psi_2 distribution for w; that is, each entry of w is drawn according to a random variable W satisfying

\displaystyle{\left\Vert W\right\Vert_{\psi_p}=\inf{\left\{C>0:\mathbb{E}\exp{(|W/C|^p)}\leq2\right\}}<\infty}
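
The \psi_p norm can be approximated from samples by bisecting on C.  A minimal sketch, assuming Gaussian noise (for which the \psi_2 norm under this "\leq 2" convention works out to \sqrt{8/3}):

```python
import numpy as np

rng = np.random.default_rng(4)

def psi_norm(samples, p=2, lo=0.5, hi=100.0, iters=60):
    """Empirical psi_p norm: the smallest C with E exp(|W/C|^p) <= 2,
    located by bisection on C (geometric midpoints)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if np.exp(np.abs(samples / mid) ** p).mean() > 2:
            lo = mid
        else:
            hi = mid
    return hi

w = rng.standard_normal(100_000)  # Gaussian noise is psi_2 (subgaussian)
print(psi_norm(w, p=2), np.sqrt(8 / 3))  # both approximately 1.63
```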

with p=2 (see Definition 4.2).  The goal of phase retrieval is to determine an estimate \hat{x} sufficiently close to an unknown signal x_0 when only the measurement ensemble \left\{\varphi_n\right\}_{n=1}^{N} and quadratic data \left\{|\langle x_0,\varphi_n\rangle|^2+w_n\right\}_{n=1}^N are known.  In order to use the results of their stability analysis in this arena, Eldar and Mendelson attempt to minimize the product \left\Vert\hat{x}-x_0\right\Vert_2\left\Vert\hat{x}+x_0\right\Vert_2.  As noted by the authors, the appropriate approach to this is to minimize the empirical risk

\displaystyle{\ell_{x}=\frac{1}{N}\sum_{n=1}^{N}{\Big||\langle x,\varphi_n\rangle|^2-(|\langle x_0,\varphi_n\rangle|^2+w_n)\Big|^p}}

for some p>1.  However, since w is most likely nonzero, even \ell_{x_0} will be strictly positive; as such, it’s intuitively reasonable to consider \hat{x} to be a decent estimate for x_0 if \ell_{\hat{x}} is not much larger than \ell_{x_0}.  For this reason, it is convenient to work with the empirical excess risk \mathcal{L}_x=\ell_{x}-\ell_{x_0}.  With this, Eldar and Mendelson’s Definition 4.6 says \hat{x} is a good estimate of the signal x_0 if
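
Here's a small sketch of the empirical risk \ell_x and empirical excess risk \mathcal{L}_x (again my own, with arbitrary dimensions and Gaussian noise); note that \ell_{x_0} is strictly positive while \mathcal{L}_{x_0}=0 by construction:

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, p = 16, 200, 1.5  # p in (1, 2], matching the range used in Theorem 4.8
Phi = rng.standard_normal((N, M))
x0 = rng.standard_normal(M)
w = 0.1 * rng.standard_normal(N)  # Gaussian noise, a psi_2 example
b = (Phi @ x0) ** 2 + w           # noisy quadratic data

def emp_risk(x):
    """l_x = (1/N) sum_n | |<x, phi_n>|^2 - b_n |^p."""
    return np.mean(np.abs((Phi @ x) ** 2 - b) ** p)

def excess_risk(x):
    """L_x = l_x - l_{x0}: zero at x0, and possibly negative nearby."""
    return emp_risk(x) - emp_risk(x0)

print(emp_risk(x0))     # strictly positive, because of the noise
print(excess_risk(x0))  # exactly zero
```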

\displaystyle{\mathcal{L}_{\hat{x}}\leq c(Q_{T,N,W}-\left\Vert|w|^p\right\Vert_{\psi_1}/\sqrt{N})},

where Q_{T,N,W} is defined by

\displaystyle{Q_{T,N,W}=\sup_{t\in T}{\left\Vert t\right\Vert_2}\left(\frac{\ell(T)}{\sqrt{N}}\right)+\frac{\ell^2(T)}{N}+\frac{\left\Vert|w|^p\right\Vert_{\psi_1}}{\sqrt{N}}}.

(Note that this implicitly requires T to be a bounded set.)  The presence of the Gaussian complexity here makes sense since we saw in Theorem 2.4 that stability depends on how small the complexity measure is.  Noticing that Q_{T,N,W} somewhat resembles \rho_{T,N} from Theorem 2.4, we see that minimizing the empirical excess risk is somehow analogous to finding the value of N which ensures stability: just as decreasing \rho_{T,N} made stability stronger in the noiseless case, decreasing Q_{T,N,W} makes it tougher to find an estimate \hat{x} satisfying Definition 4.6, thus making any such \hat{x} a stronger estimate.  The final main result of the paper (Theorem 4.8), in which the product \left\Vert\hat{x}-x_0\right\Vert_2\left\Vert\hat{x}+x_0\right\Vert_2 is bounded above under certain conditions, draws upon this intuition.  If \hat{x} is a good estimate for x_0, then there is a constant C' such that
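
To get a feel for Q_{T,N,W}, here is a sketch that evaluates it for T the unit sphere (so \sup_{t\in T}\left\Vert t\right\Vert_2=1 and \ell(T)\approx\sqrt{M}), with the \psi_1 norm of |w|^p estimated empirically by the same bisection idea as before; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, p = 16, 2000, 1.5
w = 0.1 * rng.standard_normal(N)

def psi_norm(samples, p=1, lo=1e-2, hi=1e2, iters=60):
    """Smallest C with E exp(|W/C|^p) <= 2, by bisection (as before)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if np.exp(np.abs(samples / mid) ** p).mean() > 2 else (lo, mid)
    return hi

# T = unit sphere: sup_t ||t||_2 = 1 and l(T) ~ sqrt(M).
ell_T = np.sqrt(M)
Q = ell_T / np.sqrt(N) + ell_T**2 / N + psi_norm(np.abs(w) ** p) / np.sqrt(N)
print(Q)  # shrinks as N grows, tightening the Definition 4.6 threshold
```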

\displaystyle{\left\Vert\hat{x}-x_0\right\Vert_2\left\Vert\hat{x}+x_0\right\Vert_2\leq C'(Q_{T,N,W})^{1/p}(1/\sqrt{p-1})}

for large enough N and a particular p\in(1,2].  The proof is tedious, but Theorem 4.8 is established by showing that \mathbb{E}\mathcal{L}_{\hat{x}} is bounded above by C'(Q_{T,N,W})^{1/p}(1/\sqrt{p-1}) and below by \left\Vert\hat{x}-x_0\right\Vert_2\left\Vert\hat{x}+x_0\right\Vert_2.  The exact choice of p depends on the set T and the error vector w, and is designed to be as close to 1 as possible in order to ensure that \mathbb{E}\mathcal{L}_{\hat{x}} can be bounded below (see Lemma 4.13).  Using this bound, it is possible to determine the number of measurement vectors needed to guarantee stability.  For instance, Eldar and Mendelson use Theorem 4.8 to show that stability occurs with \mathcal{O}(M\log{M}) measurements when T is the unit sphere and \mathcal{O}((k\log(eM/kd)+kd)(\log{k}+\log{d})) measurements when T is the set of k-block-sparse vectors with blocks of size d and unit norm.

Theorem 4.8 is particularly useful because it does not assume anything about how the estimate \hat{x} is determined.  Instead, suppose you have a reasonable routine to make the empirical risk small.  If you have some idea about the size of the noise w, you can use this as a proxy for \ell_{x_0}.  Then if your routine made the empirical risk sufficiently small compared to w, Theorem 4.8 guarantees that the corresponding estimate is reasonable in some sense.
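
As a toy end-to-end illustration of this workflow, the sketch below drives the p = 2 empirical risk down with a spectral initialization followed by plain gradient descent (a standard heuristic in the spirit of Wirtinger flow, not a method from the paper), then reports the product \left\Vert\hat{x}-x_0\right\Vert_2\left\Vert\hat{x}+x_0\right\Vert_2 that Theorem 4.8 controls; the step size and iteration count are ad hoc choices:

```python
import numpy as np

rng = np.random.default_rng(7)
M, N = 16, 400
Phi = rng.standard_normal((N, M))
x0 = rng.standard_normal(M)
x0 /= np.linalg.norm(x0)                             # unit-norm signal, for simplicity
b = (Phi @ x0) ** 2 + 0.05 * rng.standard_normal(N)  # noisy intensities

# Spectral initialization: leading eigenvector of (1/N) sum_n b_n phi_n phi_n^T,
# scaled by the average intensity (a common heuristic, not from the paper).
vals, vecs = np.linalg.eigh((Phi.T * b) @ Phi / N)
x = vecs[:, -1] * np.sqrt(max(b.mean(), 0.0))

# Plain gradient descent on the p = 2 empirical risk; the theorem is agnostic
# to how the risk is driven down, so any reasonable routine would do here.
for _ in range(2000):
    r = (Phi @ x) ** 2 - b                 # residuals against the data
    x -= 0.02 * (4 / N) * ((r * (Phi @ x)) @ Phi)

print(np.linalg.norm(x - x0) * np.linalg.norm(x + x0))  # small if recovery worked
```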
