# Foundations of Data Science Boot Camp, V

This is the fifth (and final) entry to summarize talks in the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Friday, we heard talks from Ilya Razenshteyn and Michael Kapralov. Below, I link videos and provide brief summaries of their talks.

Ilya Razenshteyn — Nearest Neighbor Methods

The nearest neighbor search problem first preprocesses n points P in some metric space with a distance scale r>0, and then when queried a new point q in the metric space, the output should be a member of P that is within r of q. To solve this problem, the best known solution leans on substantial preprocessing that requires exponential space. However, many settings don’t require the output to be within r of q, but within cr of q for some approximation parameter c>1. This motivates approximate nearest neighbor (ANN) search. In his talk, Ilya discussed data-oblivious methods and various data-aware methods. The talk moved from the Hamming metric space, to $\ell_1$, to $\ell_2$, to $\ell_\infty$, and finally to general metric spaces.

To solve ANN for the Hamming metric space, randomly sample k coordinates, and consider all points that exactly match q in these coordinates. We can scale k so that there are O(1) points that match this specification, and then we determine which of these points is closest to q. We may repeat this process $O(n^{1/c})$ times to succeed with probability 0.99. In general, one might consider a locality-sensitive hash (LSH), which is a random partition of the metric space such that nearby points collide with high probability and far-away points are separated with high probability. The above coordinate-sampling hash is an example of LSH, and can be extended to $\ell_1$. For the sphere in $\ell_2$, one may partition according to random hyperplanes (producing intuitive, but sub-optimal, results), or draw points at random from the unit sphere and then carving out space according to their Voronoi regions.

For a data-dependent alternative, we first observe that Voronoi LSH works best when the data points are well distributed on the sphere. If instead, the data has clustered portions of data, we may remove those clusters until the remaining data set is well distributed on the sphere and apply Voronoi LSH. Then we can recurse on the clusters. See this paper for more information.

For more general metrics, one is inclined to leverage a bi-Lipschitz embedding into $\ell_1$ or $\ell_2$, but this is not feasible for many metric spaces. Sometimes, it’s easier to embed into $\ell_\infty$, and for this space, there is a data-dependent ANN algorithm for $c=O(\log\log d)$ (see this paper). This is based on a fundamental dichotomy: for every n-point dataset, there is either a cube of size $O(r\log\log d)$ containing $\Omega(n)$ points, or there exists a coordinate that splits the dataset into balanced parts. This dichotomy suggests a recursive ANN algorithm.

The above dichotomy can be replicated for general metric spaces. Define the cutting modulus to be the smallest number $\Xi$ such that for every n-vertex graph embedding into the space such that edges have length at most K, either there is a ball of radius $K\Xi$ containing $\Omega(n)$ vertices, or the graph has an $\epsilon$-sparse cut. While the cutting modulus is difficult to compute for general metric spaces, it brings a nice intuition: ANN is “easy” for metric spaces that don’t contain large expanders. See this paper for more information. Unfortunately, this result uses the cell-probe model of computation, which can substantially underestimate runtime. In order to move beyond this model, one would need to somehow ensure that there exist “nice cuts” at each iteration, e.g., the coordinate-based cuts in the original $\ell_\infty$ case.

Michael Kapralov — Data streams

In many applications, the dataset is so large that we can only look at the data once, and we are stuck with much smaller local memory. What sort of statistics about the data can be approximated with such constraints?

The first problem of this sort that Michael discussed was the distinct elements problem. Here, we are told that every member of the data lies in $[n]=\{1,\ldots,n\}$, and given a single pass of the data with only $\mathrm{polylog}~n$ storage, we are expected to approximate the number of distinct elements seen (i.e., the size of the support of the histogram of the data) within a factor of $1\pm\epsilon$ and with success probability $\geq1-\delta$. To solve this, first pass to a decision version of the problem: Can you tell whether the number bigger than $(1+\epsilon)T$ or smaller than $(1-\epsilon)T$ for some T? To solve this, just pick $S\subseteq[n]$ with $\mathrm{Pr}(i\in S)=1/T$ iid. Then we can maintain a count of how much of the data lies in S. (A positive count is highly unlikely if the desired number is less than T.) We can boost this signal by taking $O((1/\epsilon^2)\log(1/\delta))$ independent choices of S.

Now consider a problem in which we stream edges in a graph, and after seeing all of the edges, we are asked to approximate the size of some cut in the graph. (The cut query comes after seeing the edges!) Consider the $\binom{n}{2}\times n$ matrix B with rows indexed by pairs of vertices and columns indexed by vertices. Say the row of B indexed by $\{u,v\}$ is $\pm(e_u-e_v)$ if $\{u,v\}$ is an edge in the graph, and zero otherwise (the overall sign doesn’t matter here). Then for every subcollection $C\subseteq[n]$ of vertices, the number of edges in the graph between $C$ and $[n]\setminus C$ equals the squared 2-norm of $B1_C$, which can be maintained with the help of a JL projection S with random $\pm1$ entries (known as the AMS sketch): $\|SB1_C\|_2^2\approx\|B1_C\|_2^2$. As such, we may maintain the $(\mathrm{polylog}~n)\times n$ matrix SB through one pass of the data, and then apply $1_C$ to the result after receiving a cut query C.

The second half of the talk covered sketching methods for quantitative versions of these problems. For example, suppose you wanted to keep track of the k “heavy hitters” in the data’s histogram, i.e., the k members of the support of the histogram (under the assumption that they account for the bulk of the histogram’s $\ell_2$ energy). Then one may sketch using the Count Sketch algorithm. Going back to graphs, suppose you wanted to compute a graph sparsifer. Michael discussed how one can leverage graph sketches to iteratively improve crude sparsifiers by estimating “heavy hitter” edges in terms of effective resistance.