# Fundamental Limits of Weak Recovery with Applications to Phase Retrieval

Marco Mondelli recently posted his latest paper on the arXiv (joint work with Andrea Montanari). This paper proves sharp guarantees for weak recovery in phase retrieval. In particular, given phaseless measurements against Gaussian vectors, they demonstrate that a properly tuned spectral estimate exhibits correlation with the ground truth, even when the sampling rate is at the information-theoretic limit. In addition, they show that their spectral estimate empirically performs well even when the measurements follow a more realistic coded diffraction model. I decided to reach out to Marco to learn more, and what follows is my interview. I’ve lightly edited his responses for formatting and hyperlinks:

DGM: Judging by your website, this project in phase retrieval appears to be a departure from your coding theory background. How did this project come about?

MM: Many of the tools employed in information and coding theory are very general, and they prove useful for solving problems in other fields as well, such as compressed sensing, machine learning, and data analysis. This is the general philosophy that motivated my “detour”.

The interest in the specific problem of weak recovery for phase retrieval came about for the following reason. The techniques typically used to derive bounds on the sample complexity of weak recovery are based on matrix concentration inequalities. That is, one computes the expected value of the data matrix and then shows that, when the number of samples is large enough, the matrix concentrates around its mean. However, this procedure does not give tight bounds. So we thought that we could compute the spectrum of the (random) data matrix and, consequently, obtain an exact bound, in the sense that it characterizes exactly the performance of the spectral method. At that point, we wondered whether such a bound is information-theoretically optimal. This motivated us to prove a converse bound, which turned out to match the spectral upper bound.

DGM: I thought the information-theoretic lower bound of $\delta\geq1$ was obvious until I realized that weak recovery is possible in the linear case whenever $\delta>0$. Do you have any intuition for this bound? What are the big ideas in the proof?

MM: Unfortunately, we do not have a good intuition for why the information-theoretic threshold is $1$ in the complex case and $1/2$ in the real case. Indeed, as you point out, weak recovery is possible in the linear model for any $\delta > 0$.

As for the proof, the basic idea consists in bounding the conditional entropy of the received vector via the second moment method. More formally, let $\mathbf{Y}$ be the received vector and $\mathbf{A}$ the measurement matrix. Then, the goal is to evaluate $H(\mathbf{Y} \mid \mathbf{A})$. To do so, we compute the ratio between the second and the first moment of $p(\mathbf{y}\mid \mathbf{A})$. In particular, we prove that for $\delta=n/d < 1$, the quantity

$\displaystyle{\int_{\mathbb{R}^n}\frac{\mathbb{E}_{\mathbf{A}}\left\{\left(p(\mathbf{y}\mid \mathbf{A})\right)^2\right\}}{\mathbb{E}_{\mathbf{A}}\left\{p(\mathbf{y}\mid \mathbf{A})\right\}} \,\mathrm{d}\mathbf{y}}$

is sublinear in $n$. As a result, the conditional entropy is equal, to leading order, to the unconditional one, which means that the received vector carries essentially no information about the unknown signal. This also means that the error of the Bayes-optimal estimator is equal to the error of the trivial estimator that does not use the received signal at all.
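For intuition, here is a sketch of the standard second-moment step that turns this moment ratio into an entropy bound (the paper's argument is more careful). Writing $q(\mathbf{y}) = \mathbb{E}_{\mathbf{A}}\{p(\mathbf{y}\mid \mathbf{A})\}$ for the marginal density, Jensen's inequality gives

$\displaystyle{H(\mathbf{Y}) - H(\mathbf{Y}\mid \mathbf{A}) = \mathbb{E}\left\{\log\frac{p(\mathbf{Y}\mid \mathbf{A})}{q(\mathbf{Y})}\right\} \leq \log \int \frac{\mathbb{E}_{\mathbf{A}}\left\{\left(p(\mathbf{y}\mid \mathbf{A})\right)^2\right\}}{\mathbb{E}_{\mathbf{A}}\left\{p(\mathbf{y}\mid \mathbf{A})\right\}} \,\mathrm{d}\mathbf{y},}$

so as long as the integral grows subexponentially, the mutual information $H(\mathbf{Y}) - H(\mathbf{Y}\mid \mathbf{A})$ is $o(n)$.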

DGM: What’s the intuition behind allowing $\mathcal{T}(y)$ to be negative? Your optimal choice of $\mathcal{T}$ happens to be negative when $y$ is small. Are you somehow penalizing the directions that are nearly orthogonal to $x$?

MM: The optimal pre-processing function is given by the expression

$\displaystyle{\mathcal{T}^*_{\delta}(y) = \frac{y-1}{y+\sqrt{\delta}-1}.}$

Indeed, $\mathcal{T}^*_{\delta}(y)$ is negative when $y$ is small. As you suggested, the intuition is exactly that measurements whose sensing vector is nearly orthogonal to the unknown signal are not informative, so we penalize them.

This is quite different from the spectral methods considered so far (see this, that, another). Earlier works employed pre-processing functions that (i) are positive and (ii) try to extract information from the large values of the data. By contrast, especially when $\delta$ is close to $1$, the function $\mathcal{T}^*_\delta$ has a large negative part for small $y$, and it extracts useful information from the small values of the data.
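As a quick numerical illustration (my own toy sketch, not code from the paper), one can evaluate $\mathcal{T}^*_{\delta}$ and check the sign pattern described above:

```python
import numpy as np

def T_star(y, delta):
    """Optimal pre-processing function T*_delta(y) = (y - 1) / (y + sqrt(delta) - 1)."""
    return (y - 1.0) / (y + np.sqrt(delta) - 1.0)

# For delta > 1 the denominator is positive for all y >= 0, so the sign of
# T*_delta(y) is the sign of y - 1: negative for small y, positive for large y.
y = np.array([0.0, 0.5, 1.0, 2.0, 5.0])
print(T_star(y, delta=2.0))
```

The negative values at small $y$ are precisely the penalty on sensing vectors nearly orthogonal to the signal.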

DGM: Impressively, your spectral estimate works whenever $\delta>1$. How do you leverage free probability along with Lu and Li’s result to demonstrate such a sharp bound?

MM: The analysis of the performance of the spectral method is based on the evaluation of the spectrum of the data matrix $\mathbf{D}= \mathbf{A} \mathbf{Z} \mathbf{A}^*$, where $\mathbf{Z}$ is a diagonal matrix containing the vector $(\mathcal{T}(Y_1), \ldots, \mathcal{T}(Y_n))$ on the diagonal. This computation boils down to the study of the spectrum of a matrix of the form $\mathbf{U} \mathbf{M} \mathbf{U}^*$, where $\mathbf{U}$ has i.i.d. Gaussian entries and $\mathbf{M}$ is independent of $\mathbf{U}$. Furthermore, the spectrum of $\mathbf{M}$ a.s. converges weakly to a bulk, given by the probability law of $\mathcal{T}(Y)$, and its largest eigenvalue converges a.s. to a point outside the bulk. The spectrum of $\mathbf{U} \mathbf{M} \mathbf{U}^*$ also converges weakly a.s. to a bulk. The idea is that if the largest eigenvalue of $\mathbf{U} \mathbf{M} \mathbf{U}^*$ converges a.s. to a point outside this bulk, then the principal eigenvector of $\mathbf{D}$ is correlated with the unknown signal, which means that we have weak recovery.

If $\mathbf{M}$ is PSD, then the spectrum of $\mathbf{U} \mathbf{M} \mathbf{U}^*$ is studied in this paper of Bai and Yao. However, for $\mathbf{M}$ to be PSD, we need the pre-processing function to be positive. This is the approach carried out in the paper by Lu and Li.

In order to remove this assumption, we decompose $\mathbf{M}$ into a positive part $\mathbf{M}^+$ minus a negative part $\mathbf{M}^-$. Both $\mathbf{M}^+$ and $\mathbf{M}^-$ are PSD, so we can apply the results of Bai and Yao to each. The free probability tools come in naturally at this point, as we need to study the spectrum of the sum of two random matrices.

DGM: The fact that the threshold in the real case is half the threshold in the complex case is reminiscent of how the injectivity thresholds are $\delta=2$ and $4$ in the real and complex cases, respectively. (See this, that, and another paper.) Do you have any intuition for this halving phenomenon in weak recovery?

MM: One way to think about this is as follows. In the complex case you have twice as many variables but the same number of equations as in the real case. Hence, it makes sense that the threshold for the complex case is twice the threshold for the real case.

This is only a heuristic argument, and, at a rigorous level, we could not find a mapping between the complex problem and the real problem. Indeed, we have two (quite similar) proofs that handle the real and the complex cases separately.

DGM: What’s next for this line of investigation? Is there any hope of establishing sharp thresholds for $\epsilon$-weak recovery, where the desired correlation in equation (3) is prescribed?

MM: One possible future direction is certainly the study of $\epsilon$-weak recovery. In this regard, I would like to mention the recent arXiv submission by Barbier et al. There, the authors consider the real case and prove that the replica-symmetric formula from statistical physics gives an exact prediction. In this way, they obtain a single-letter formula for the mutual information and for the MMSE. However, it has not yet been proved that this bound can be achieved by a practical algorithm.

Perhaps an even more interesting direction consists in looking at other measurement matrices. Our analysis covers the case in which $\mathbf{A}$ is Gaussian, while in practice people use, for example, Fourier measurement matrices. The rigorous study of Fourier measurement matrices seems quite challenging, so an intermediate step in that direction could be to consider the case in which $\mathbf{A}$ is unitary and Haar-distributed.

Let me also mention that we tried our spectral algorithm in a coded diffraction model, where $\mathbf{A}$ is a Fourier matrix and the unknown signal is a digital image. Our algorithm significantly outperformed existing spectral methods and, even more surprisingly, the theoretical predictions obtained for the Gaussian case matched the numerical simulations quite well. This is still a bit of a mystery for us, especially given that the theoretical predictions worked well only for our optimal choice of the pre-processing function, while they failed for a different choice.