# Foundations of Data Science Boot Camp, II

This is the second entry to summarize talks in the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Tuesday, we heard talks from Ken Clarkson, Rachel Ward, and Michael Mahoney. Below, I link videos and provide brief summaries of their talks.

Ken Clarkson — Sketching for Linear Algebra: Randomized Hadamard, Kernel Methods

This talk introduced leverage scores and their application in a few settings. Say S is a row sampling matrix if each of its rows is a multiple of a row of the identity matrix. Given a matrix A, we want a random row sampling matrix S such that (with high probability) SAx has approximately the same norm as Ax for every x. As a general construction, assign to each row of A a probability $p_i$, and let S be a random matrix such that $e_i^\top/\sqrt{p_i}$ is a row of S with probability $p_i$ (independently for each i). Notice that S has a random number of rows, and the expected number of rows is $\sum_ip_i$. It’s easy to show that for each x, $\mathbb{E}[\|SAx\|^2]=\|Ax\|^2$. In order to concentrate around this mean, we want to minimize variance, which can be accomplished by tuning the $p_i$‘s to reflect the “importance” of the rows of A. Intuitively, one might select $p_i$ to be proportional to the squared norm of the corresponding row of A. (This works well when $A^\top A$ is a multiple of the identity.) In general, we want to select $p_i$ so as to bound the maximum relative contribution of the $i$th component: $p_i\geq (Ax)_i^2/\|Ax\|^2$ for every x. One may then conclude concentration by passing to Bernstein’s inequality. In fact, if we write A=UR with $U^\top U=I$, then we may take $p_i$ to be the squared norm of the corresponding row of U. These quantities are called leverage scores. Notice that picking $p_i$ in this way makes the expected number of rows in S equal the dimension of x.
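To make this concrete, here is a minimal sketch of leverage score row sampling in NumPy. The test matrix, the oversampling factor, and the tolerance are my own choices for the example; in practice one oversamples beyond $\sum_i p_i = d$ to get concentration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tall matrix A (n >> d) with random Gaussian entries.
n, d = 2000, 10
A = rng.standard_normal((n, d))

# Leverage scores: squared row norms of U, where A = UR with U^T U = I.
U, _ = np.linalg.qr(A)
lev = np.sum(U**2, axis=1)      # nonnegative, sums to d

# Oversample for concentration: keep row i with probability min(1, 20*lev_i).
p = np.minimum(1.0, 20 * lev)
keep = rng.random(n) < p

# Sampled-and-rescaled rows, i.e., SA with rows e_i^T A / sqrt(p_i),
# so that E||SAx||^2 = ||Ax||^2 for every x.
SA = A[keep] / np.sqrt(p[keep])[:, None]

# Spot-check norm preservation on a random test vector x.
x = rng.standard_normal(d)
ratio = np.linalg.norm(SA @ x) / np.linalg.norm(A @ x)
```

Here SA has only a couple hundred rows in expectation, yet the ratio above concentrates near 1.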

Readers of this blog are perhaps familiar with the transformation $A\mapsto U$ as converting to a Parseval frame. Ken referred to the resulting row vectors as being in “isotropic position,” a geometrically natural condition. In addition, the use of leverage scores in sampling appears, for example, in Spielman–Srivastava graph sparsification. Overall, the pursuit of leverage scores is well motivated. Moreover, we can quickly estimate them to within a factor of $1+\epsilon$ by taking the QR decomposition of a sketch of the matrix (as in David Woodruff’s talk). Such a coarse estimation is sufficient for sampling purposes, and recent work shows that higher-precision estimates require much longer runtimes.
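Here is a rough sketch of that estimation idea, assuming a plain Gaussian sketch in place of a fast transform like the subsampled randomized Hadamard transform (the sketch size and error tolerance below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 10
A = rng.standard_normal((n, d))       # hypothetical test matrix

# Exact leverage scores, for comparison only.
U, _ = np.linalg.qr(A)
exact = np.sum(U**2, axis=1)

# Sketch the rows of A, then take a QR decomposition of the sketch.
m = 200
S = rng.standard_normal((m, n)) / np.sqrt(m)
_, R = np.linalg.qr(S @ A)

# Approximate leverage scores: squared row norms of A R^{-1}.
# (Solve R^T X = A^T instead of forming R^{-1} explicitly.)
approx = np.sum(np.linalg.solve(R.T, A.T).T**2, axis=1)
```

Since the sketch preserves the column space of A up to small distortion, the row norms of $AR^{-1}$ approximate those of U, and the coarse multiplicative accuracy is exactly what sampling needs.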

Rachel Ward — First-Order Stochastic Optimization

Rachel motivated the use of stochastic gradient descent (SGD) in machine learning. The main idea is that the objective function is a sum of many component functions, each component corresponding to a different data point. When there are many data points, it is time prohibitive to look at all of them to perform gradient descent. Instead, we “sketch” the gradient by grabbing random component functions, which gives SGD. There are two main theoretical questions: (1) how to pick the step size, and (2) how to select random component functions.

For (1), we note that, intuitively, if the step size is constant, then iterations will eventually bounce around a local minimizer. This intuition can be made formal with appropriate hypotheses: Assuming the component functions are sufficiently consistent (meaning they are all minimized “somewhat simultaneously”) and the overall objective is smooth and strongly convex, then one may show that for any given tolerance $\epsilon>0$, there exists a constant step size (determined by component consistency, smoothness, strong convexity, and $\epsilon$) such that the SGD iterates are expected to be within $\epsilon$ of the optimizer after an appropriate number of iterations (see this paper).
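The “bounce around the minimizer” behavior is easy to see on a least-squares toy problem. The following sketch uses hypothetical random data, with a constant step size I picked small enough for the component smoothness of this instance:

```python
import numpy as np

rng = np.random.default_rng(2)

# Objective f(x) = (1/n) sum_i (a_i^T x - b_i)^2: smooth, strongly convex,
# with nearly consistent components (small noise in b).
n, d = 500, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

x_opt = np.linalg.lstsq(A, b, rcond=None)[0]   # exact minimizer

eta = 0.01          # constant step size, tuned to this instance
x = np.zeros(d)
for _ in range(5000):
    i = rng.integers(n)                         # uniform component
    grad_i = 2 * (A[i] @ x - b[i]) * A[i]       # stochastic gradient
    x -= eta * grad_i

err = np.linalg.norm(x - x_opt)
```

Because the components are nearly consistent and the noise is small, the iterates settle into a small neighborhood of the minimizer; shrinking the tolerance would require a smaller constant step size (and more iterations), as in the theorem.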

For (2), we note that, again intuitively, certain components may be more important than the others when computing the gradient. For example, in the least-squares setting, leverage scores may signal the importance of a given component. In general, the Lipschitz constant of each component seems like a good proxy for importance. Rachel finds that a good choice of weights is a mixture of the uniform distribution and the distribution proportional to the Lipschitz constants. In some cases, this choice makes convergence much faster.
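As an illustration of this weighting on a least-squares instance (my own synthetic data, with a few deliberately heavy rows), one can mix the uniform distribution with the distribution proportional to the component Lipschitz constants, and reweight the sampled gradients to keep the estimate unbiased:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 5
A = rng.standard_normal((n, d))
A[:25] *= 10                      # a few "important" heavy components
b = A @ rng.standard_normal(d)    # consistent system for simplicity
x_opt = np.linalg.lstsq(A, b, rcond=None)[0]

# For f_i(x) = (a_i^T x - b_i)^2, the gradient is 2||a_i||^2-Lipschitz.
L = 2 * np.sum(A**2, axis=1)

# Mixture of uniform weights and Lipschitz-proportional weights.
w = 0.5 / n + 0.5 * L / L.sum()

x = np.zeros(d)
eta = 0.005
for _ in range(20000):
    i = rng.choice(n, p=w)
    grad_i = 2 * (A[i] @ x - b[i]) * A[i]
    x -= eta * grad_i / (n * w[i])   # rescale so the gradient stays unbiased

err = np.linalg.norm(x - x_opt)
```

The uniform part of the mixture keeps any $1/(n w_i)$ rescaling factor bounded, while the Lipschitz part concentrates samples on the heavy components; on this consistent instance the iterates converge to the minimizer.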

Rachel concluded with some open problems. In the real world, you don’t have access to Lipschitz constants, etc., so how can you learn good step sizes and sampling weights for SGD? For the step size problem, Rachel has some recent work along these lines (see this paper).

Michael Mahoney — Sampling for Linear Algebra, Statistics and Optimization

This talk covered several topics with a couple of key themes. One theme: Michael pointed out a difference between “the world” and “the machine.” Specifically, we collect data from “the world” and plug them into “the machine.” When the data is big, we use ideas from randomized numerical linear algebra to sketch the data and estimate quantities about the data, e.g., approximately solving the least squares problem. However, the primary goal of statistics/ML is to make inferences/predictions about the world. As such, it is important to be cognizant of how our computational tools will be used in practice so that we can meet the demand with appropriate performance guarantees, etc. This perspective leads to a second theme of Michael’s talk: Whether one approach is better than another almost always depends on the broader statistical/ML objectives. For example: Is it better to do random projection or PCA? It depends.

Of all of the topics Michael covered, I was most interested in his take on approximate matrix multiplication. Here, let $a_i$ denote the $i$th column of A, and $b_i^\top$ the $i$th row of B. Then $AB=\sum_i a_ib_i^\top$. We can estimate this sum with a sum over a random sample of terms, and we obtain concentration by biasing our sample towards the important terms, e.g., the terms with the largest norm. In the end, one can control the relative error in the implied randomized algorithm. (I wonder if this low-precision estimate can be promoted to a high-precision estimate with a cheap iterative algorithm?)
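Here is a quick sketch of this sampling scheme on synthetic data (the matrices, sample size, and error threshold are my own choices; sampling probabilities proportional to $\|a_i\|\|b_i\|$ are the variance-minimizing choice):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 50, 1000, 40
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
A[:, :10] *= 20       # make a few rank-one terms a_i b_i^T dominate
B[:10, :] *= 20

# Sample term i with probability proportional to ||a_i|| * ||b_i||.
norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
q = norms / norms.sum()

# Draw c terms with replacement; average the rescaled outer products,
# i.e., approx = (1/c) sum_j a_{i_j} b_{i_j}^T / q_{i_j}.
c = 400
idx = rng.choice(n, size=c, p=q)
approx = (A[:, idx] / (c * q[idx])) @ B[idx, :]

rel_err = np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B)
```

With only 400 of the 1000 terms sampled, the biased sample captures the dominant outer products and keeps the relative Frobenius error modest; uniform sampling would mostly miss the heavy terms.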