# Foundations of Data Science Boot Camp

I’m spending the semester at the Simons Institute for the Theory of Computing as part of the program on Foundations of Data Science. This was the first day of the “boot camp” week, which was organized to acquaint program participants with the key themes of the program. Today, we heard talks from Ravi Kannan and David Woodruff. Below, I link videos and provide brief summaries of their talks.

Ravi Kannan — Foundations of Data Science

Both parts of this talk covered various aspects of data science. A more extensive treatment of these aspects are provided in his book with John Hopcroft (available for free here).

Ravi started by describing concentration of measure. Here, he sketched a general (possibly folklore) result: If the first k moments of a mean-zero random variable are appropriately small, then the sum of n independent copies of this random variable is small with appropriately high probability. This struck me as a finite-k version of the inequality (1.3) given here.

Next, he focused on dimensionality reduction. Here, there are two main methods of interest: random projection and PCA. Random projection is of particular interest to theory since it preserves all distances in the data set (unlike PCA), whereas PCA is more popular in practice since the distortion is much lower, although not guaranteed for all pairwise distances. At this point, he sketched the Dasgupta–Gupta proof of JL. Ravi described the k-means problem as a potential application of JL. Next, he motivated PCA with a toy problem: Given a mixture of two gaussians with identity covariance and means 0 and x, where x has norm O(1), how can one estimate x? Here, random projection and k-means clustering (with k=2) both fail miserably, but PCA/SVD works great.

Ravi started the second part of this talk with the application of clustering. How do you estimate the means in a mixture of k gaussians with identity covariance? Distance-based clustering requires a distance of at least $d^{1/4}$, where d is the dimension. As you might expect, you can find the subspace spanned by the gaussian’s means by running PCA/SVD. He also took some time to discuss proximity as a data hypothesis for clustering in lieu of a stochastic data model (see this paper, for example).

Next, Ravi covered randomized algorithms for various linear algebra routines. The key ideas here are to represent a matrix by a sketch of randomly selected rows and columns, and then leverage this sketch to approximate the desired quantities. He stressed the need to randomly sample with a probability distribution that’s proportional to the square of the norm of the row/column vectors (see this paper, for example).

Ravi concluded by discussing Markov chains. Here, he had two main points: First, no one computing a stationary distribution for a Markov chain should care about whether the Markov chain is periodic, since this technicality can be smoothed out by taking running averages. Second, the mixing time to stationarity is inversely proportional to the square of the conductance (or Cheeger constant) of the graph.

David Woodruff — Sketching for Linear Algebra: Basics of Dimensionality Reduction and CountSketch

The purpose of this talk was to illustrate the utility of sketching for linear regression and low-rank matrix approximation. David started by reviewing the basics of linear regression: Linear least squares corresponds to the MLE in the gaussian noise case, enjoys a nice geometric interpretation in terms of orthogonal projections, and also has a nice algebraic expression in terms of the normal equations. However, solving the normal equations can take a while when there are many data points to fit.

To overcome this runtime bottleneck, one can consider a sketch-and-solve approach: Use randomness to pass to a smaller instance of the same problem, solve that problem exactly, and then conclude that the corresponding solution to the original problem is approximately optimal with high probability. For example, instead of finding a least-squares solution to Ax=b, one may instead solve SAx=Sb, where S is short and fat but SA is still tall and skinny. If S has iid gaussian entries, then one may exploit JL over an epsilon net to conclude that S preserves the norms of every point in the column space of A (plus the span of b), meaning $\|SAx-Sb\|_2$ is within a factor of $1+\epsilon$ of $\|Ax-b\|_2$ for all $x$, and so the minimizer of the first is within a factor of $1+\epsilon$ of optimal for the second. Sadly, picking S in this way does not lead to speedups since multiplying SA takes too long. Instead, one may take S to be a subsampled randomized Hadamard transform (see this paper).

In the case where A is sparse, one might want the runtime to scale with the number of nonzero entries of A. In this case, one should instead design S according to CountSketch. Here, if A is $n\times d$, then S is chosen to be $k\times n$ with $k=O(d^2/\epsilon^2)$, with each column being drawn uniformly from the columns of the $k\times k$ identity matrix, and randomly signed. Here, k scales like $d^2$ for birthday paradox reasons.

The second part of David’s talk started by discussing a high-precision version of the regression result. In particular, one may convert the $1/\epsilon$ factor in the runtime to $\log(1/\epsilon)$. To do this, he solves least squares by gradient descent, and to do this, he first leverages sketching to find a good initial point, and then he leverages sketching again to precondition A, thereby making the gradient descent iterations converge appropriately quickly. To precondition, leverage the QR factorization of SA and note that the condition number of $AR^{-1}$ is at most $(1+\epsilon)/(1-\epsilon)$. (To see this, compare the norm of $SAR^{-1}x=Qx$ to the norm of $AR^{-1}x$.)

The end of his talk discussed low-rank approximation. Here, the goal is to use sketching to speed up SVD. To do so, David first computes SA, then projects the rows of A onto the rowspace of SA, and then runs SVD on the resulting vectors. Here, the bottleneck is projection, which can be encoded as an instance of least squares, meaning the above technology transfers to this setting. He concluded his talk by explaining why you can expect a nearly-optimal solution to reside in the rowspace of SA when S is something called an affine embedding (a generalization of subspace embedding).