A couple of weeks ago, I was in Berlin for this workshop, and the talks were amazing — the organizers really did a fantastic job. For most of the talks, the slides are available here. There was an interesting mix of speakers; some had big problems in certain applications with the hope that compressed sensing might help, and others presented various generalizations and unifications of the standard compressed sensing theory. Overall, I noticed a few trends in the content of the talks:
- applying CS theory to imaging and video
- CS problems over the continuum or in a coherent dictionary
- nonlinear CS and phase retrieval
- dimension reduction for classification and machine learning
- measurement derandomization
The rest of this blog entry summarizes each talk. Some summaries are more brief than others, mostly according to how fervently I was writing notes during the talk.
Volker Mehrmann asked: Why do car brakes sometimes squeal? (This is clearly of interest to car manufacturers.) He showed us a video that illustrated the sort of squeal that occurs, and then he discussed a finite-element mechanical model of the brake system. This model takes the form of a differential equation which is continuous in time and discrete in space, and it can be formulated as a quadratic eigenvalue problem (QEP) to solve. Apparently, the likelihood of squealing is somehow related to the size of the positive real part of the eigenvalues, so this likelihood should be easy to determine, except the dimension of the operator is huge, and important parameters in the model range in orders of magnitude. This makes dimension reduction tricky, but he proposes a way to do this. Specifically, he projects the QEP into a low-dimensional subspace so that the it captures the important range of eigenvalues. Traditionally, one might take to capture the dominant eigenvectors, but this fails to capture damping and certain small (but important) parameters in the model. He also noted that a certain matrix in the model of particular importance is ill-conditioned, suggesting a poor finite-element model. However, he did find some success in randomly sampling certain eigenvectors to span and evaluating the resulting subspace in terms of the Arnoldi relation. There appears to be a lot of work to be done here.
Remi Gribonval discussed linear inverse problems with an arbitrary signal model. Specifically, he generalized the notion of the null space property (NSP), which characterizes when L1 minimization produces solutions to L0 minimization. Perhaps the most surprising part of this talk is that stable decoding is possible with knowledge of the noise level precisely when stable decoding is possible without knowledge of the noise level. This is because a robust version of NSP characterizes both events. He also talked about L2/L2 instance optimality, and established that lower RIP is necessary for robust NSP (and almost sufficient).
Petros Boufounos gave a particularly interesting talk. The idea is that I want to send a picture to a server, say, to find out the contents of my picture, but it would be rather costly to transmit the entire picture. In fact, it might be rather unnecessary, since the answer to my query certainly doesn’t depend on every single pixel. The idea is to apply a Johnson-Lindenstrauss (JL) projection and quantize for “random features” that will then be sent to the server for classification. But how does quantization- and JL-type distortions interact? He considered some examples using real-world data, and he also provided an alternative “generalized embedding” for quantized features. I think there’s a lot of room for investigation along these lines.
Mark Davenport had two main parts to his talk. (1) Compressed sensing in analog. Consider signals such that the support of the Fourier transform is contained in a few bands. This is difficult to model with an overcomplete DFT due to off-the-grid frequencies, but apparently, discrete prolate spheroidal sequences are perfectly time-limited, and are thus well-suited for this application (expect for certain pathological cases). (2) Compressed sensing in an overcomplete dictionary (this is sometimes called the synthesis model, and from personal experience, this problem is extremely difficult to solve). He discussed how to modify CoSaMP in such a way that solves the reconstruction problem (!), assuming there exists a way to approximate the following quantity for any given :
where denotes the orthogonal projection onto the columns of your dictionary with indices in . [Actually, the hardness of approximation here seems like a fundamental problem, so it would be interesting to see work on this in its own right.]
Ali Pezeshki focused on image inversion (i.e., estimating different image parameters) from compressed sensing measurements. First, he considered the impact of compression on Fisher information and SNR thresholds. The main idea is to derive Cramer-Rao bounds and determine breakdown thresholds (which are essentially phase transitions for noise levels, and aren’t tied to a particular algorithm like L1 minimization). Later, he discussed how a mismatch between the signal model and the signal reality (e.g., using a slightly wrong sparsity basis) can have devastating effects on reconstruction. This has an impact on spectral estimation and misfocus in optical imaging. He concluded by discussing the need for more theory on off-the-grid and analog compressed sensing.
David Gross presented how to derandomize the sensing matrix that’s used for PhaseLift. After building up the intuition of lifting to solve the phase retrieval problem, he introduced a conceptual parameter for a measurement ensemble to succeed with PhaseLift, namely that the ensemble exhibits some degree of agreement with the Haar measure on the sphere. As an example, he discussed -designs and showed that randomly selecting members of a -design suffices as a PhaseLift ensemble.
Christopher Rozell is interested in neuroscience, and his research seems to lie in the intersection of compressed sensing and dynamical systems. He started by discussing a “dynamic” variant of the standard sparsity model in which changes to the state are sparse. He showed an example of compressed sensing for video, where standard Kalman filters are impractical due to large matrix inverses, but his methods found a lower steady-state recovery error. Next, he discussed sensing with a network in a way that mimics short-term memory in a neural system, and he produced empirical phase transitions for such a sensing matrix. Lastly, he discussed locally competitive algorithms and circuit designs to alleviate computational bottlenecks for compressed sensing reconstruction. Perhaps the most eye-opening part of his talk was the explanation of Takens’ embedding theorem, which appears to be a dynamic generalization of the Whitney embedding theorem (and perhaps I’ll use it in the future!).
Waheed Bajwa discussed some work he and I did on subspace unmixing. In this problem, you have a collection of subspaces, and given a combination of members of some of these subspaces, you need to identify which subspaces are active in the combination. This problem may seem rather fundamental, and indeed, it arises in a number of applications, e.g., wireless networks, hyperspectral imaging, and block-sparse compressed sensing. The way we identify active subspaces is simple: project the given combination onto the subspaces, and predict that subspaces are active if the corresponding projection is large. It turns out that this simple algorithm works really well most of the time, provided the subspaces are sufficiently incoherent in a couple of ways. Hopefully, this work will land on the arXiv in the next couple of months.
Guillermo Sapiro presented a really neat idea for classification. Imagine you have points nearly lying in a union of subspaces, each subspace corresponding to a different class. It is helpful to apply a transform that geometrically separates the subspaces before running the classifier, but what is the best transform to use? Intuitively, you might think of this as a reasonable objective function to minimize:
where are matrices whose columns are the vectors in the th class, and . This setup is intuitively pleasing because we are forcing to keep each individual class as low-dimensional as possible while at the same time spreading out the multitude of classes into as many dimensions as possible. Intuitively, rank isn’t a good measure of “spread”, and so the following adjustment is made:
where denotes the nuclear norm. Note that this does not convexify the objective function, but rather it makes the objective truer to our intent. There might be an alternate objective which is easier to optimize and analyze (I have my own ideas), and he is open to such modifications, but he worked with this objective as a proof of concept. Simulations suggest that after learning a transform from some training set, the transform enables particularly good classification for data examples like NIST digits and faces. He also learned transforms for decision trees that the Kinect apparently uses to classify shapes and distances, and he found similar performance gains in this setting as well. [As an alternate objective, I wonder which linear transform maximizes the smallest principle angle between a given collection of subspaces. If the subspaces are one-dimensional, this seems like a generalization of the QR decomposition. It would also be interesting to see how this optimal angle changes as you vary the dimension of the transform’s range space.]
Rebecca Willett discussed the peculiarities of compressive imaging in a photon-limited setting. Here, the sensing model is based on the model that photons arrive according to a Poisson process with rate determined by the light’s intensity. Also, the physical sensing systems in this setting require the sensing matrix to have entries between and — the th entry encodes the proportion of photons from source that get sent to sensor . As such, the matrix needs to be scaled in a certain way to illustrate that it has the same null space as some RIP matrix, leading to certain performance guarantees. To incorporate the Poisson model, she proposed a -sparse localization constant which determines the minimax risk — amazingly, this risk depends on the sparsifying basis, meaning different sparsity bases will find different performance. This was corroborated with numerical simulations, and I suppose this makes some intuitive sense since the Poisson distribution is not isotropic like the Gaussian. Because of the inherent relationship between the rows and columns in this setting (i.e., the sensing matrix entries are nonnegative, and entries in the same column sum to at most 1), she also encountered a peculiar invariant: the number of measurements doesn’t matter much to the reconstruction performance. Also, in the low-intensity regimes, downsampling outperforms compressed sensing. Having spent a lot of time thinking about standard formulation of compressed sensing, it was interesting to see how subtle real-world modifications can lead to counter-intuition.
Volkan Cevher presented a new result regarding composite minimization. Apparently, this sort of optimization finds applications to graphical model selection and Poisson imaging. In this setting, one wishes to minimize an objective function which can be expressed as the sum of two functions, each sufficiently nice in a certain way. Specifically, one of the functions is required to be strongly convex and smooth with a Lipschitz gradient, while the other must be convex, but possibly not smooth. This talk removes the usual Lipschitz requirement by intelligently selecting the step size in the iterations by exploiting self-concordance.
Christof Schutte described the difficulty in modeling protein folding. The main issue is that running the physical models will provide an understanding of protein motion at a timescale that is much smaller than the folding timescale. You should think of folding as a change from one basic protein configuration into another, so in the current modeling timescale, the protein moves around in a fixed configuration, and we never see how it jumps to another configuration. He asks a simple question: Can we speed up the simulations by exploiting sparsity? He spent some time discussing a Markov-type model that describes the random time evolution of the protein’s state, and then he discussed a multi-timescale perspective of the random process. [To me, this had a striking resemblance to diffusion wavelets, which should be able to describe a random walk on a graph, but I’m not sure how applicable those ideas are to this particular stochastic process.]
Babak Hassibi started by briefly reviewing least squares, along with various performance guarantees when the matrix has iid entries. Of course, for the compressed sensing problem, one could perform least squares times and inherit these same guarantees, but this is inefficient. Alternatively, one could use something like LASSO to reconstruct efficiently, but the current guarantees are very loose compared to those corresponding to least squares. To rectify this, he first took a short detour to discuss the concepts like the descent cone, escape through the mesh, and statistical dimension. In the end, he was able to present guarantees using these ideas that matched the least squares guarantees, but with ambient dimension replaced with statistical dimension (this is the price you pay for efficient decoding). Very interesting!
Holger Rauhut discussed how to interpolate a smooth function from random samples given that it has an efficient representation in a finite (but large) collection of orthogonal basis functions. This setting naturally demands performance with respect to the -norm, and interestingly, this requirement naturally imposes a different type of objective function, namely a weighted L1 minimization. This leads to a rabbit hole in weighted sparsity, in which weights (instead of 1s) are added to calculate a weighted zero-norm, and a weighted RIP matrix leads to guarantees for the weighted CS problem, as you’d expect. In the end, he presented guarantees for how well you can expect to interpolate given so many random samples of the function, and the rate corresponds to a Stechkin-type estimate. In addition, he showed some numerical examples which did a great job of illustrating how well the optimization works.
Anders Hansen gave a very interesting talk about sparse inverse problems with Fourier-type measurements. In particular, he challenged the basic tenants of the typical compressed sensing formulation. First, he established that Fourier modes are not sampled uniformly because in practice, it is better to sample more at lower frequencies. Second, he questioned whether sparsity is even the correct model for the inverse problem. To make his point, he fixed an image, took compressed Fourier measurements, and attempted to reconstruct the image using L1 minimization in the wavelet domain (the reconstruction was pretty good, go figure). Next, he took the same image, permuted the locations of its wavelet coefficients, took compressed Fourier measurements, attempted to reconstruct, and then permuted back the locations of its wavelet coefficients. Perhaps surprisingly, the image looked rather terrible, suggesting that the performance of L1 minimization is not adequately explained by uniform recovery guarantees. [Perhaps this can be fixed by using TV-norm minimization instead. After all, this is empirically proven to be a better objective function. How might one implement the same sort of test in this case?] His interpretation is that the world is not sparse, but rather asymptotically sparse, meaning only entries with sufficiently large indices are sparse (this corresponds to a multi-scale intuition of wavelet sparsity). Another point he made is that analog inverse problems are coherent but asymptotically incoherent (e.g., sinusoids of sufficiently high frequency are incoherent with Haar wavelets, and sufficiently localized Haar wavelets are incoherent with the Fourier basis). In the end, he provided an example in which an alternative to random sensing that uses a multilevel Hadamard transform. This allows you to require far less RAM at the expense of universality, but he likened the latter to preparing for nuclear winter, which is perhaps unnecessary.
EDIT: Ben Adcock emailed me the following comment regarding the permute-the-wavelet-coefficients test in the TV-norm case:
Flipping the wavelet coefficients will likely change the TV norm. However, the minimization strategy used was L1, so this is not the point of the test. However, I am highly confident that a similar test will show the same results for TV-norm minimization. Devising a test for this case is a little more fiddly, but essentially, one would want to ‘permute’ the image by bringing its edges closer together. This should do the trick. Moving edges closer together doesn’t change the TV norm, but will alter the Fourier spectrum, and this is what one needs.
Matthew Fickus gave a colorful portrayal of certain connections between coding theory and line packings and their relationship to compressed sensing. In particular, he reviewed the notion of optimal self-complementary codes, and showed how to transform them into equiangular tight frames (like in this blog post). He also reviewed the two ways to construct equiangular tight frames in general (using difference sets, and using Steiner systems). He then related the Steiner construction back to the coding-theoretic construction by leveraging a trick of John Jasper that transforms certain Steiner constructions into constructions whose entries all have equal modulus. With this equivalence, he was able to leverage the low spark of the Steiner construction to reveal that all of the constructions are “shockingly bad” as compressed sensing matrices. This suggests that the constructions lost their higher-level incoherences due to an over-optimization of their pairwise incoherence. This talk was based on a recent paper of ours.
Martin Vetterli discussed a few inverse problems, and highlighted the differences between the discrete formulation of compressed sensing and the continuous formulation of finite rate of innovation. When dealing with continuous inverse problems, compressed sensing finds difficulty at the distretization step (i.e., off-the-grid issues), where as finite rate of innovation tends to lack stability guarantees. The bulk of his talk discussed various successes with finite rate of innovation. First, he showed that it is possible to hear the shape of a room, in particular, if one uses 4 randomly placed microphones to listen to one loudspeaker. The main idea is that echoes can be separated using the fact that euclidean distance matrices have low rank. In the future, he would like to be able to hear one’s location in a room. The next problem he discussed is how to optimally place sensors, and the idea is to find a subcollection of measurement vectors which minimize the frame potential (as a proxy for mean squared error of least-squares reconstruction) by greedily deleting measurements. The last problem he discussed was how to discern the nuclear fallout of Fukushima given data from specific locations around the world (here, access is limited according to international relations). Using a Lagrangian dispersion model for how particles flow around the world, and applying a sparsity assumption for the original location of particles, he was able to determine the original locations and predict the contribution from Fukushima. From here, he can run the Lagrangian dispersion model from this initial condition to estimate how many particles there are around the world. In the end, he provided the following words of wisdom: “Pick your tool given the problem”, suggesting an application-driven approach to solving inverse problems.
Reinhold Schneider discussed a tensor analog to low-rank matrix completion. Specifically, he used a new tensor representation from Hackbusch’s “Tensor spaces and numerical tensor calculus” to simplify the task. Unfortunately, the feasibility region is not convex, but it is a smooth manifold, so he uses a greedy algorithm to iteratively step along the tangent space and project onto the manifold. Overall, he provides local convergence guarantees using ideas from iterative hard thresholding and a tensor restricted isometry property.
Shmuel Friedland introduced the compressed sensing of sparse tensors by first reviewing a bit of compressed sensing for vectors and matrices. He then introduced a serial and parallelizable method for reconstructing tensors and presented simulations that compared their superior performance to a couple of alternatives.
Massimo Fornasier discussed how to perform compressed sensing in cases where the measurements available are nonlinear. First, he indicated that there are multiple notions of nonlinearity, and one must pick a notion before finding an algorithm (i.e., there is no silver bullet algorithm in the nonlinear case). From here, he introduced the notion of quasi-linearity, that is, functions of the form , where is a matrix-valued Lipschitz-continuous function of the vector . He pointed out two applications which exhibit quasi-linearity: asteroseismology, where one is interested in discerning oscillations of pulsating stars, and phase retrieval. For asteroseismology, he was able to produce a convergence guarantee for a greedy algorithm using a generalized notion of the restricted isometry property, and interestingly for phase retrieval, the theory needed to be amended due to the global-phase-factor ambiguity.
Richard Baraniuk started his talk with a humorous infomercial for compressed sensing. Next, he focused his attention on a particular application of compressed sensing, namely compressive video. He introduced an interesting issue with this application: there’s a uncertainty-principle-type trade-off between resolution in time and in space. This is actually rather intuitive — if each frame represents a large exposure time, then you incur blurring over time (due to motion), while static objects will be resolved spatially since you receive enough photons; on the other hand, small exposure times minimize blurring over time at the price of fewer photons per frame (and poorer spatial resolution). As such, given a scene with certain levels of motion, you can tune the video frames’ exposure time so as to minimize error. After clarifying this interesting subtlety in video processing, he moved on to explain how one might do compressive video. The big idea is to exploit some sort of structure to stitch the different frames together and enjoy higher resolution across frames accordingly. To access this structure, he first compressively images a low-resolution version of each frame and uses these images to determine optical flow. Then he compressively images the high-resolution versions and reconstructs using an optimization that encourages each frame to look like a natural image (think sparsity in wavelets or small TV norm) while at the same time exhibiting the optical flow across frames that was estimated from the low-resolution versions. Most importantly, the high-resolution measurements are not adaptive, meaning the low- and high-resolution sensing measurements can be captured at the same time in real life. While this sensing procedure makes a lot of sense, it presents some challenges: (i) there are only two measurement scales, and so objects moving at different velocities are difficult to image at the same time; (ii) since the sensing matrix is designed using random s, there is no fast algorithm to apply and ; (iii) the reconstruction optimization is slow. He then addressed (ii) by focusing on a new type of sensing matrix for compressive video. Specifically, he introduced the Sum-To-One (STOne) transform, which is a matrix gotten by taking Kronecker products of the following matrix:
This matrix has some amazing properties. For example, it’s unitary when suitably scaled (it’s embarrassing how “long” it took me to see this). Also, all of the entries are , which means randomly selected rows will form a great compressed sensing matrix. Due to the Kronecker-based construction, you can implement the transform in steps using a divide-and-conquer method like the fast Walsh-Hadamard transform. But perhaps the most exciting feature of the STOne transform is how it relates to low-pass image filters. To see this, first think about vectorizing a image by ordering the pixels according to a th order Hilbert curve, and for clarity, let’s call “imagizing” the inverse of “vectorizing”. Then imagine you have an image that you vectorize, take the STOne transform of, and then imagize. What happens if you partition the image into blocks, and then replace each pixel value with the sum of its block’s pixel values? Well, by the nature of the Hilbert curve, pixels in a common block come from neighboring rows in the STOne transform matrix, and due to the Kronecker structure (and since the row vectors of the matrix above sum to the all-ones vector) the sum of these rows is an up-sampled version of a row from the STOne transform matrix. This means the following: downsampling a low-pass version of the th order STOne transform of your image is the same as taking the st order STOne transform of a downsampled low-pass version of your image! The fact that STOne “commutes” with multiscale processing is really neat, and it’s an important feature for fast compressive video.
I talked about how to derandomize matrices with the restricted isometry property (RIP). I introduced the subject by discussing the deterministic RIP matrix problem, and reviewing the recent progress on this problem (namely, here, here, here, and here). Next, I discussed a relaxed version of the problem, in which one tries to minimize the number of rows while at the same time minimizing the number of random bits used in the construction. The state of the art here is interesting, as there appears to be a trade-off between the number of rows and the amount of randomness amongst known RIP constructions, and this trade-off barrier might be viewed as an extension of the square-root bottleneck. I then showed how to cross this barrier using serious conjectures from number theory along with ideas from the deterministic RIP paper of Bourgain et al. Hopefully, these ideas will appear in a paper on the arXiv soon, and I might blog about it more then.
Rachel Ward showed how to solve least-squares problems more efficiently using a modified version of the randomized Kaczmarz algorithm. More generally, she explained how stochastic gradient descent grabs a row at random to form a proxy for the gradient, and how you can improve performance by changing the distribution over which you grab rows. In particular, her partially biased sampling technique leads to a convergence rate that exhibits a more favorable dependence on the condition number. She also discussed how one might apply these ideas to optimize any sufficiently nice objective.
Felix Krahmer introduced a cool trick to solve the PCA problem when the objective is to minimize the sum of the th powers of the perpendicular errors for some . There exist routines to hunt for a near-optimal subspace of fixed dimension , but the runtimes scale poorly with the ambient dimension. To resolve this, he uses a really cool trick that he calls greedy least squares: Hunt for a subspace which is optimal for , delete the closest points to the subspace, and repeat. This results in a collection of subspaces whose Minkowski sum will be really close to the original points. In fact, if you deleted points appropriately during greedy least squares, then those points will be close to the subspace in the sense by a pigeon-hole-type argument, and the total number of iterations in greedy least squares will be like log the number of points . As such, you end up with a very small subspace (of dimension ) that nearly contains the original point cloud — this dimensionality reduction will allow one to efficiently find a near-optimal -dimensional subspace of this -dimensional space, thereby producing a decent solution to the PCA problem. [I wonder if similar ideas can be used to efficiently find a deterministic JL projection that depends on any given point cloud.]
Helmut Bolcskei gave a talk with the goal of developing a unified theory for sparse signal recovery algorithms and establishing fundamental performance limits. He discussed a multitude of different problems that he identifies with the sparse signal recovery problem, e.g., signal separation (in which you are given the sum of, say, a vinyl recording and random clicks), superresolution, inpainting, and de-clipping (this is a cool problem: you’re given a function from some known function class, but the instrument used to record the function was not capable of recording function values outside of a certain range). All of these problems can be expressed as
where and are different dictionaries, and are assumed to be sparse, and is the data you receive. To study the fundamental limits of recovery, he first discussed different uncertainty principles related to this problem, and he identified the importance of incoherence between the dictionaries and . Next, he considered an information-theoretic formulation in the style of Wu and Verdu. That is, assuming a random model on and , when does there exist a (not necessarily efficient) decoder such that with probability ? This is similar to the empirical phase transitions we have seen for compressed sensing with convex optimization, except in this case, he is only concerned with the existence of a decoder. Interestingly, his results are in terms of Minkowski dimension, whereas phase transitions from convex optimization are in terms of statistical dimension.
Miguel Rodrigues discussed compressive classification. Here, the idea is to take measurements , where belongs one of classes and is random noise. In his model, the class that belongs to follows some distribution, and given the class , is distributed according to a Gaussian with a class-specific mean and covariance matrix. The goal is to estimate given . He related this problem to problems in wireless communication, in which performance is characterized by concepts like diversity gain and measurement gain. He used this identification to discuss the two-class case, and then he considered the application of compressively classifying faces.
Peter Jung used the application of machine-type sporadic communication to motivate a new type of sparse recovery problem. Typically, you are given , where is data, is the measurement matrix, is a known dictionary, is a sparse unknown, and is error. However, in some cases, you might only know that has a compressible representation, i.e., is a sparse combination of known dictionaries . In this case, you are actually given , where is a bilinear form determined by the dictionaries . With this model, he introduced an RIP-like condition on , and then he showed that if satisfied such a condition, then a random sensing matrix will also satisfy an RIP-like condition. (This was proved using the JL-to-RIP ideas of this paper.) He also used the RIP-to-JL ideas of this paper to find efficient implementations using structured measurements or a random demodulator. Interestingly, while he is able to find RIP-type conditions in a certain regime, this regime does not currently have a convex recovery algorithm with guarantees. He concluded with simulations from the application of sporadic communication.
Philipp Walk discussed an interesting problem motivated by linear time-invariant systems. Specifically, if you send a signal over a channel , the receiver gets . If you want to distinguish multiple signals over a fixed channel , or distinguish multiple channels using a fixed signal , then you seek some notion of injectivity, which is guaranteed if the ‘s and ‘s satisfy something he calls the restricted norm multiplicativity property:
for some . (This is related to the RIP-type property that Peter Jung discussed in the previous talk.) Philipp presented a theorem that sparse functions and over a finite abelian torsion-free group necessarily satisfy the above property, where and depend on the sparsity levels of and . To prove this, he leveraged cool ideas from additive combinatorics, including a recent proof of a conjecture of Konyagin and Lev involving Freiman isomorphisms. He then discussed some implications of his result, including a restricted norm multiplicativity property for sparse zero-padded circular convolutions and an injectivity result for a particular real construction of measurement vectors (which bears a striking resemblance to the construction in this paper).