Click here for a draft of my lecture notes.

The current draft consists of a chapter on convex optimization. I will update the above link periodically. Feel free to comment below.

**UPDATE #1:** Lightly edited Chapter 1 and added a chapter on probability.

**1. SqueezeFit: Label-aware dimensionality reduction by semidefinite programming.**

Suppose you have a bunch of points in high-dimensional Euclidean space, some labeled “cat” and others labeled “dog,” say. Can you find a low-rank projection such that after projection, cats and dogs remain separated? If you can implement such a projection as a sensor, then that sensor collects enough information to classify cats versus dogs. This is the main idea behind compressive classification.

At its heart, this problem concerns linear dimensionality reduction. For the sake of illustration, suppose we want to find an appropriate projection for this dataset:

The gut-instinct method of dimensionality reduction is PCA, but this delivers poor results:

Of course, PCA ignores labels. Instead, you could run PCA on the differences between points of different labels, but in this case, you’d still get a dominant z-component, so this doesn’t help. Alternatively, you could run LDA, which projects onto the difference between class centroids (times an inverse covariance matrix). This also produces poor results:

Intuitively, we want to thumb through all possible projections to find a good one. This is what **SqueezeFit** does:

In particular, SqueezeFit is an SDP relaxation of the problem “find the minimum-rank orthogonal projection that keeps points of different labels separated.” In the paper, we prove some performance guarantees before applying SqueezeFit to real data. Overall, SqueezeFit provides an improvement over the standard approaches for linear dimensionality reduction. We’re excited to apply variants of SqueezeFit to various important settings.
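To make the contrast concrete, here is a toy numpy sketch. This is not the SDP from the paper; it brute-forces the same "keep differently labeled points separated" objective over unit directions in 2-D, with invented data and parameters, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D data: coordinate 0 carries large label-independent variance, while
# coordinate 1 separates the classes by a small margin.
cats = np.column_stack([rng.normal(0, 10, 100), rng.normal(-1, 0.1, 100)])
dogs = np.column_stack([rng.normal(0, 10, 100), rng.normal(1, 0.1, 100)])
X = np.vstack([cats, dogs])

# PCA keeps the direction of largest variance -- the uninformative coordinate 0.
_, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_dir = evecs[:, -1]

# SqueezeFit-style objective, brute-forced: among all unit directions, keep the
# one that maximizes the minimum separation between differently labeled points
# after projecting.
diffs = (cats[:, None, :] - dogs[None, :, :]).reshape(-1, 2)
angles = np.linspace(0, np.pi, 1000, endpoint=False)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
min_sep = np.abs(diffs @ dirs.T).min(axis=0)  # worst-case separation per direction
squeeze_dir = dirs[np.argmax(min_sep)]
```

Here `pca_dir` ends up along the noisy axis, while `squeeze_dir` ends up along the axis that keeps cats and dogs separated; the paper's actual algorithm replaces this brute-force search with an SDP over projection matrices.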

**2. Utility Ghost: Gamified redistricting with partisan symmetry.**

There has been a lot of effort lately to use mathematical tools to help detect partisan gerrymandering. However, any detection procedure requires a technical definition of “excessively favoring one party over the other.” Any choice of definition can be perceived as arbitrary, or even sociological gobbledygook.

Considering this difficulty of fighting partisan gerrymandering in the courts, one might instead prevent gerrymandering from happening in the first place. Almost half of the states in the country use some sort of redistricting commission to draw a new map after the decennial census. Gamified redistricting offers a protocol for a bipartisan redistricting commission that leads to provably beneficial results. For example, the I-cut-you-freeze protocol is a modification of the I-cut-you-choose solution to the fair cake-cutting problem that provides a beautiful votes–seats curve in the limit as the number of districts goes to infinity (see Figure 1 in the paper).

Sadly, for smaller numbers of districts (which we frequently encounter in the real world), I-cut-you-freeze gives significant advantage to the first player. As an alternative, we propose **Utility Ghost**, which is a modification of the word game Ghost in which players take turns assigning precincts to districts. In the paper, we prove that in an idealized setting, if both players have the same number of votes, then under optimal play, they end up with the same number of seats.

We also show that Utility Ghost performs well in real-world settings. For example, consider the case of New Hampshire, which is made up of 10 counties and two U.S. congressional districts. If we don’t split counties, there are seven ways to partition New Hampshire into two districts with roughly the same sized population:

Here, proto-districts are colored according to the 2016 presidential election returns, where in New Hampshire, Hillary Clinton received 47.62 percent of the vote and Donald Trump received 47.25 percent. Since there are only two districts, the I-cut-you-freeze protocol is not helpful: the first player becomes *de facto* map maker, while the second player has no say in the matter. In particular, if the Democrats play first, they get to select the map in which they win both seats. This seems unfair, considering half of the voters are Republican. Alternatively, Utility Ghost avoids this map under optimal play, regardless of who plays first. Time will tell whether such gamified redistricting will be incorporated in protocols for bipartisan commissions following Census 2020.

**3. Derandomizing compressed sensing with combinatorial design.**

In compressed sensing, we encode sparse signals with random measurements, and then reconstruct the signals using L1 minimization. Here, the number of random measurements scales roughly linearly with the sparsity level. There has been some work to replicate this encoding performance with deterministic measurements, but the best theory to date requires a number of measurements that scales almost quadratically with the sparsity level.
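As a hedged illustration of the decoding step, here is the standard linear-programming formulation of L1 minimization (basis pursuit) with a random Gaussian sensing matrix; the dimensions and sparsity level are arbitrary choices for the demo:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m, k = 80, 40, 4                    # ambient dimension, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian measurements
b = A @ x

# L1 minimization (basis pursuit) as a linear program: write x = u - v with
# u, v >= 0, and minimize sum(u) + sum(v) subject to A(u - v) = b.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]
```

With these parameters the number of measurements comfortably exceeds the roughly linear-in-sparsity threshold, so the LP recovers `x` exactly (up to solver tolerance).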

Instead, one might attempt to minimize the number of random bits needed to accomplish the desired linear scaling. To this end, a previous paper leveraged pseudorandom properties of the Legendre symbol to derandomize sensing matrices composed of $\pm 1$ entries. Our new paper provides a more general treatment of derandomization for compressed sensing. As a special instance of our result, we can accommodate $\pm 1$ entries by sampling rows from an orthogonal array. The resulting sensing matrix uses slightly fewer measurements and slightly more randomness than the Legendre symbol–based construction. Our methods also provide derandomization by sampling from mutually unbiased bases.

In practice, reconstruction performance is identical to the fully random counterparts:

Still, our sensing matrices require a number of random bits that fails to break the “Johnson–Lindenstrauss bottleneck” identified in this paper. Is this a fundamental barrier to derandomized compressed sensing?

Let $n \geq 2$. Suppose you have $n^2$ distinct numbers in some field. Is it necessarily possible to arrange the numbers into an $n \times n$ matrix of full rank?

Boris’s problem was originally inspired by a linear algebra exam problem at Princeton: Is it possible to arrange four distinct prime numbers in a rank-deficient $2 \times 2$ matrix? (The answer depends on exactly which numbers you consider to be prime.) Recently, Boris reminded me of his email, and I finally bothered to solve it. His hint: Apply the combinatorial nullstellensatz. The solve was rather satisfying, and if you’re reading this, I highly recommend that you stop reading here and enjoy the solve yourself.

**Theorem.** Pick any field $\mathbb{F}$, fix $n \geq 2$, and let $S \subseteq \mathbb{F}$ be any subset of size $n^2$. Then there exists a full-rank matrix $M \in \mathbb{F}^{n \times n}$ such that every member of $S$ appears as an entry of $M$.

**Proof:** Consider the homogeneous polynomial $f$ in the variables $\{x_{ij}\}_{i,j=1}^n$ defined by

$f(X) = \det(X) \cdot \prod_{(i,j) \prec (k,l)} (x_{kl} - x_{ij}),$

where $(i,j) \prec (k,l)$ means that either $i < k$, or $i = k$ but $j < l$. Notice that $f(X) \neq 0$ precisely when $X$ is invertible with all distinct entries. We will identify a term $c \prod_{i,j} x_{ij}^{T_{ij}}$ of $f$ such that $\sum_{i,j} T_{ij} = \deg f$ and $T_{ij} \leq n^2 - 1$ for every $(i,j)$, so that the result follows immediately from the combinatorial nullstellensatz.

First, we observe that $\prod_{(i,j) \prec (k,l)} (x_{kl} - x_{ij})$ is a Vandermonde polynomial in the $n^2$ variables, and so the Leibniz formula for the corresponding determinant gives

$\prod_{(i,j) \prec (k,l)} (x_{kl} - x_{ij}) = \sum_{\pi} \operatorname{sgn}(\pi) \prod_{i,j} x_{ij}^{\pi(i,j) - 1},$

where $\pi$ ranges over all bijections from the matrix positions to $\{1, \ldots, n^2\}$.

Next, consider the term of $\det(X)$ corresponding to the identity permutation, namely $\prod_i x_{ii}$, and the term of the Vandermonde polynomial corresponding to any bijection $\pi$ satisfying $\pi(i,i) \leq n$ for every $i$. The product of these terms has exponent matrix $T$ with $T_{ij} = \pi(i,j) - 1 + I_{ij}$, where $I$ denotes the matrix representation of the identity permutation (namely, the identity matrix).

We claim that a permutation $\sigma$ of $\{1,\ldots,n\}$ and a bijection $\pi'$ produce this same exponent matrix $T$ only if $\sigma = \mathrm{id}$ and $\pi' = \pi$, meaning there exists a nonzero coefficient $c$ such that $c \prod_{i,j} x_{ij}^{T_{ij}}$ is a term of $f$. To see this, first note that $T$ is uniquely minimized at the diagonal position $(i_1, i_1)$ with $\pi(i_1, i_1) = 1$, thereby forcing $\sigma(i_1) = i_1$ and $\pi'(i_1, i_1) = 1$. Next, the remainder of $T$ is uniquely minimized at the diagonal position $(i_2, i_2)$ with $\pi(i_2, i_2) = 2$, thereby forcing $\sigma(i_2) = i_2$ and $\pi'(i_2, i_2) = 2$. Continuing in this way, we obtain $\sigma(i_k) = i_k$ for every $k < n$. Since $\sigma$ fixes $i_k$ for every $k < n$, this forces $\sigma(i_n) = i_n$, meaning $\sigma = \mathrm{id}$, and so $\pi'(i,j) = T_{ij} + 1 - I_{ij}$, i.e., $\pi' = \pi$.

Finally, since $\sum_{i,j} T_{ij} = n + \binom{n^2}{2} = \deg f$ and $T_{ij} \leq n^2 - 1 < |S|$ for every $(i,j)$, the combinatorial nullstellensatz produces a matrix $M$ with entries in $S$ such that $f(M) \neq 0$; its $n^2$ entries are then distinct members of $S$, so every member of $S$ appears, and so we are done.
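A brute-force sanity check of the theorem for $n = 2$ (the helper names below are mine; exhaustive search over arrangements is feasible only for tiny $n$):

```python
import itertools

def det(M):
    """Exact determinant of a small integer matrix by cofactor expansion."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def full_rank_arrangement(nums, n):
    """Search all arrangements of n^2 distinct numbers for a full-rank n x n
    matrix; the theorem guarantees one exists."""
    for perm in itertools.permutations(nums):
        M = [list(perm[i * n:(i + 1) * n]) for i in range(n)]
        if det(M) != 0:
            return M
    return None

print(full_rank_arrangement([2, 3, 5, 7], 2))   # e.g. [[2, 3], [5, 7]]
```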

Here’s a brief summary of the progress made in the previous thread:

– Let w(k) denote the supremum of w such that is k-colorable. Then of course and for every . Furthermore,

Colorings that produce these lower bounds are depicted here. The upper bound for k=3 is given here.

– The largest known k-colorable disks for k=2,3,4,5 are depicted here.

Presumably, we can obtain decent upper bounds on w(4) by restricting (a finite subset of) the ring to an infinite strip.

**Ilya Razenshteyn — Nearest Neighbor Methods**

In the nearest neighbor search problem, we first preprocess a set P of n points in some metric space with a distance scale r>0; then, when queried with a new point q in the metric space, we must output a member of P that is within r of q (whenever one exists). The best known solution to this exact problem leans on substantial preprocessing that requires exponential space. However, many settings don’t require the output to be within r of q, but within cr of q for some approximation parameter c>1. This motivates approximate nearest neighbor (ANN) search. In his talk, Ilya discussed data-oblivious methods and various data-aware methods. The talk moved from the Hamming metric space, to $\ell_1$ and $\ell_2$, to $\ell_\infty$, and finally to general metric spaces.

To solve ANN for the Hamming metric space, randomly sample k coordinates, and consider all points that exactly match q in these coordinates. We can scale k so that there are O(1) points that match this specification, and then we determine which of these points is closest to q. We may repeat this process (with fresh random coordinates) a number of times to succeed with probability 0.99. In general, one might consider a locality-sensitive hash (LSH), which is a random partition of the metric space such that nearby points collide with high probability and far-away points are separated with high probability. The above coordinate-sampling hash is an example of LSH, and it can be extended to $\ell_1$. For the sphere in $\ell_2$, one may partition according to random hyperplanes (producing intuitive, but sub-optimal, results), or according to the Voronoi regions of points drawn at random from the unit sphere.
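A minimal sketch of the coordinate-sampling scheme in Python (the parameters `dim`, `k`, and `reps` are illustrative; a serious implementation would tune k to the distance scale as described above):

```python
import random

def make_hash(dim, k, rng):
    """One LSH function for the Hamming cube: read off k random coordinates."""
    coords = rng.sample(range(dim), k)
    return lambda p: tuple(p[i] for i in coords)

def build(points, dim, k, reps, rng):
    """Hash all points into reps independent tables."""
    tables = []
    for _ in range(reps):
        h = make_hash(dim, k, rng)
        table = {}
        for p in points:
            table.setdefault(h(p), []).append(p)
        tables.append((h, table))
    return tables

def query(tables, q):
    """Return the closest point among all hash collisions with q (if any)."""
    best, best_d = None, None
    for h, table in tables:
        for p in table.get(h(q), []):
            d = sum(a != b for a, b in zip(p, q))
            if best is None or d < best_d:
                best, best_d = p, d
    return best

rng = random.Random(0)
dim = 100
points = [tuple(rng.randrange(2) for _ in range(dim)) for _ in range(50)]
q = list(points[0])
for i in rng.sample(range(dim), 5):   # plant a query 5 flips away from points[0]
    q[i] ^= 1
q = tuple(q)
ans = query(build(points, dim, k=10, reps=20, rng=rng), q)
```

With k=10, a point at Hamming distance 5 collides with q in a given table with probability roughly $(0.95)^{10} \approx 0.6$, so 20 independent tables find it with overwhelming probability, while far-away random points rarely collide.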

For a data-dependent alternative, we first observe that Voronoi LSH works best when the data points are well distributed on the sphere. If instead the data contains clusters, we may remove those clusters until the remaining data set is well distributed on the sphere and apply Voronoi LSH. Then we can recurse on the clusters. See this paper for more information.

For more general metrics, one is inclined to leverage a bi-Lipschitz embedding into $\ell_1$ or $\ell_2$, but this is not feasible for many metric spaces. Sometimes, it’s easier to embed into $\ell_\infty$, and for this space, there is a data-dependent ANN algorithm (see this paper). This is based on a fundamental dichotomy: for every n-point dataset, either there is a relatively small cube containing many of the points, or there exists a coordinate that splits the dataset into balanced parts. This dichotomy suggests a recursive ANN algorithm.

The above dichotomy can be replicated for general metric spaces. Define the cutting modulus to be the smallest number K such that, for every n-vertex graph embedded into the space with edges of length at most K, either there is a ball of bounded radius containing a constant fraction of the vertices, or the graph has a sparse cut. While the cutting modulus is difficult to compute for general metric spaces, it brings a nice intuition: ANN is “easy” for metric spaces that don’t contain large expanders. See this paper for more information. Unfortunately, this result uses the cell-probe model of computation, which can substantially underestimate runtime. In order to move beyond this model, one would need to somehow ensure that there exist “nice cuts” at each iteration, e.g., the coordinate-based cuts in the $\ell_\infty$ case.

**Michael Kapralov — Data streams**

In many applications, the dataset is so large that we can only look at the data once, and we are stuck with much smaller local memory. What sort of statistics about the data can be approximated with such constraints?

The first problem of this sort that Michael discussed was the distinct elements problem. Here, we are told that every member of the data lies in $\{1, \ldots, n\}$, and given a single pass over the data with only a small amount of storage, we are expected to approximate the number of distinct elements seen (i.e., the size of the support of the histogram of the data) within a prescribed factor and with a prescribed success probability. To solve this, first pass to a decision version of the problem: Can you tell whether the number of distinct elements is bigger than or smaller than a given threshold T? To solve this, just pick a random subset S of the universe by including each element independently with probability 1/T. Then we can maintain a count of how much of the data lies in S. (A positive count is highly unlikely if the desired number is much less than T.) We can boost this signal by taking independent choices of S.
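A toy one-pass implementation of the decision version, using Python's built-in `hash` as a stand-in for a truly random set S (an assumption for the demo, not what one would use in production):

```python
def hits(stream, T, seed):
    """One pass: count how much of the stream lands in a pseudo-random set S
    with Pr[x in S] = 1/T. A positive count suggests >= T distinct elements."""
    count = 0
    for x in stream:
        if hash((seed, x)) % T == 0:   # cheap stand-in for random membership in S
            count += 1
    return count

def fraction_positive(stream, T, trials=80):
    """Boost the decision by trying many independent choices of S."""
    return sum(hits(stream, T, s) > 0 for s in range(trials)) / trials
```

If the stream has D distinct elements, each trial is positive with probability $1 - (1 - 1/T)^D$, which is near 1 when $D \gg T$ and near $D/T$ when $D \ll T$, so the boosted fraction cleanly separates the two cases.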

Now consider a problem in which we stream edges in a graph, and after seeing all of the edges, we are asked to approximate the size of some cut in the graph. (The cut query comes after seeing the edges!) Consider the matrix B with rows indexed by pairs of vertices and columns indexed by vertices. Say the row of B indexed by $\{u,v\}$ equals $e_u - e_v$ if $\{u,v\}$ is an edge in the graph, and zero otherwise (the overall sign doesn’t matter here). Then for every subcollection C of vertices, the number of edges in the graph between C and its complement equals the squared 2-norm of $B 1_C$, which can be maintained with the help of a JL projection S with random $\pm 1$ entries (known as the AMS sketch): $\|S B 1_C\|_2^2 \approx \|B 1_C\|_2^2$. As such, we may maintain the matrix SB through one pass of the data, and then apply the result to $1_C$ after receiving a cut query C.
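A small numpy sketch of this cut sketch. For clarity, S is stored explicitly here; in an actual stream its entries would be generated pseudo-randomly on the fly and SB updated row-by-row as edges arrive:

```python
import numpy as np

rng = np.random.default_rng(2)
V = 30
edges = [(u, v) for u in range(V) for v in range(u + 1, V) if rng.random() < 0.3]

# Edge-incidence matrix B: one row per edge, equal to e_u - e_v.
B = np.zeros((len(edges), V))
for r, (u, v) in enumerate(edges):
    B[r, u], B[r, v] = 1.0, -1.0

# AMS-style sketch: S has iid +-1/sqrt(m) entries; we only ever need SB.
m = 2000
S = rng.choice([-1.0, 1.0], size=(m, len(edges))) / np.sqrt(m)
SB = S @ B   # maintainable one edge (row of B) at a time during the stream

# Answer a cut query C after the stream: ||SB 1_C||^2 estimates the cut size.
C = np.zeros(V)
C[:10] = 1.0
true_cut = int(np.sum((B @ C) ** 2))
est_cut = float(np.sum((SB @ C) ** 2))
```

The estimate is unbiased with relative standard deviation about $\sqrt{2/m}$, so m = 2000 sketch rows give a few percent accuracy on this toy graph.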

The second half of the talk covered sketching methods for quantitative versions of these problems. For example, suppose you wanted to keep track of the k “heavy hitters” in the data’s histogram, i.e., the k members of the support of the histogram with the largest counts (under the assumption that they account for the bulk of the histogram’s energy). Then one may sketch using the Count Sketch algorithm. Going back to graphs, suppose you wanted to compute a graph sparsifier. Michael discussed how one can leverage graph sketches to iteratively improve crude sparsifiers by estimating “heavy hitter” edges in terms of effective resistance.

**Santosh Vempala — High Dimensional Geometry and Concentration**

This talk covered multiple topics, including volume distribution, the Brunn–Minkowski inequality, the Prekopa–Leindler inequality, Levy concentration, Dvoretzky’s theorem, and isoperimetry.

The discussion of volume distribution revolved around two facts.

**Fact 1.** Most points in a convex body reside near the boundary.

To see this, observe that for a convex body $K \subseteq \mathbb{R}^n$ of positive volume and containing 0, the volume of $(1-\epsilon)K$ equals $(1-\epsilon)^n$ times the volume of $K$, meaning a $1 - (1-\epsilon)^n \geq 1 - e^{-\epsilon n}$ fraction of the volume resides within a multiplicative factor of $1-\epsilon$ of the boundary.
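A quick Monte Carlo check of Fact 1 for the unit Euclidean ball (the dimension, sample size, and $\epsilon$ are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, eps = 50, 20000, 0.1

# Sample uniformly from the unit Euclidean ball in R^n: normalize a Gaussian
# to the sphere, then push inward by U^(1/n) so that P(radius <= r) = r^n.
g = rng.standard_normal((N, n))
pts = g / np.linalg.norm(g, axis=1, keepdims=True) * rng.random(N)[:, None] ** (1 / n)

# Fraction within eps of the boundary; should match 1 - (1 - eps)^n.
frac = np.mean(np.linalg.norm(pts, axis=1) > 1 - eps)
```

With n = 50 and $\epsilon$ = 0.1, the predicted fraction $1 - 0.9^{50} \approx 0.995$: essentially the whole ball hugs the boundary.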

**Fact 2.** Most of a ball is near a central hyperplane.

For the unit Euclidean ball, all but an exponentially small fraction of the points lie within $O(1/\sqrt{n})$ of any central hyperplane. For the unit $\ell_\infty$ ball, a similar fraction of the points lie within O(1) of a random central hyperplane.

Next, Santosh proved the Brunn–Minkowski inequality:

**Theorem.** Let $A, B \subseteq \mathbb{R}^n$ be nonempty and compact. Then $\mathrm{vol}(A+B)^{1/n} \geq \mathrm{vol}(A)^{1/n} + \mathrm{vol}(B)^{1/n}$.

(Here, A+B denotes the Minkowski sum of A and B.)

See these notes for a proof. Here’s a sketch: Since A and B can each be expressed as the closure of a union of disjoint open boxes (aligned with the coordinate axes), it suffices to induct on the total number of boxes. When A and B are each composed of one box, the result follows from AM–GM. When there are more boxes, there exists an axis-parallel hyperplane H such that each side of H contains at least one box. Then one may shift A or B relative to this hyperplane so that the proportion of volume from each is equal on both sides. The induction argument then goes through since each side has the same proportion with strictly fewer boxes.
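For completeness, the one-box base case can be spelled out: if A and B are boxes with side lengths $(a_1, \ldots, a_n)$ and $(b_1, \ldots, b_n)$, then A+B is a box with side lengths $(a_i + b_i)$, and

```latex
\[
\frac{\mathrm{vol}(A)^{1/n} + \mathrm{vol}(B)^{1/n}}{\mathrm{vol}(A+B)^{1/n}}
= \prod_{i=1}^n \Big(\frac{a_i}{a_i+b_i}\Big)^{1/n}
+ \prod_{i=1}^n \Big(\frac{b_i}{a_i+b_i}\Big)^{1/n}
\leq \frac{1}{n} \sum_{i=1}^n \frac{a_i}{a_i+b_i}
+ \frac{1}{n} \sum_{i=1}^n \frac{b_i}{a_i+b_i}
= 1,
\]
```

where the inequality is AM–GM applied to each product.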

Next, take any unit vector v and let A(x) denote the cross-section of points in K whose inner product with v equals x. Define the radius function $r(x) := \mathrm{vol}_{n-1}(A(x))^{1/(n-1)}$. Then the Brunn–Minkowski inequality implies that the radius function is concave on its support.

**Theorem** (Grünbaum’s inequality). Let K be a convex body. Then any half-space that contains the centroid of K contains at least a 1/e fraction of the volume of K.

See these notes for a proof. The idea is to symmetrize using the radius function, and then argue that the radius function is linear in the worst-case scenario. As such, in every dimension, the worst-case body is a cone.

Next, we covered the Prekopa–Leindler inequality, which is an (equivalent) function version of Brunn–Minkowski:

**Theorem.** Take $\lambda \in (0,1)$ and measurable $f, g, h : \mathbb{R}^n \to [0, \infty)$ such that $h(\lambda x + (1-\lambda) y) \geq f(x)^\lambda g(y)^{1-\lambda}$ for all $x, y \in \mathbb{R}^n$.

Then $\int h \geq \left( \int f \right)^\lambda \left( \int g \right)^{1-\lambda}$.

As an application, the marginals of any log-concave distribution are log-concave. Also, convolution preserves log-concavity. The proof is by induction on n. For n=1, write $\int h = \int_0^\infty |\{h \geq t\}| \, dt$, where $|\cdot|$ denotes Lebesgue measure. Observe that $\{h \geq t\}$ contains $\lambda \{f \geq t\} + (1-\lambda) \{g \geq t\}$ and proceed with AM–GM. For n>1, consider a marginal and combine the induction hypotheses from dimension n-1 and dimension 1.

Next, we discussed Levy concentration:

**Theorem.** Suppose $f : S^{n-1} \to \mathbb{R}$ is $L$-Lipschitz. If $x$ is drawn uniformly from $S^{n-1}$, then $\mathbb{P}\{|f(x) - \mathbb{E} f(x)| > t\} \leq 2 e^{-c n t^2 / L^2}$.

We did not prove this theorem (a proof of a weaker form can be found here). Instead, we proved related results. Let $\gamma_n$ denote the standard Gaussian measure on $\mathbb{R}^n$.

**Theorem.** Take a measurable set $A \subseteq \mathbb{R}^n$ and denote $d(x, A) := \inf_{y \in A} \|x - y\|$. Then $\gamma_n(\{x : d(x,A) \geq t\}) \leq e^{-t^2/4} / \gamma_n(A)$ for every $t > 0$.

The indicator function of $\{x : d(x,A) \geq t\}$ satisfies $1_{\{d(\cdot,A) \geq t\}}(x) \leq e^{(d(x,A)^2 - t^2)/4}$ for all x, so it suffices to show

$\int e^{d(x,A)^2/4} \, d\gamma_n(x) \leq \frac{1}{\gamma_n(A)}.$ This follows from Prekopa–Leindler by taking h to be the standard Gaussian density $\varphi$, $\lambda = 1/2$, $f = e^{d(\cdot,A)^2/4} \varphi$, and $g = 1_A \varphi$. When checking this consequence, one must use the fact that $d(x,A) \leq \|x - y\|$ when $y \in A$.

Next, we discussed Dvoretzky’s theorem. Here, we consider symmetric convex bodies, meaning K will be compact with nonzero volume and closed under negation. Let $\|x\|_K$ be the smallest $t \geq 0$ for which x resides in tK. Note that the triangle inequality for $\|\cdot\|_K$ follows from the fact that $(s+t)K$ contains $sK + tK$ for $s, t \geq 0$, by convexity.

**Theorem** (John). For every symmetric convex body $K \subseteq \mathbb{R}^n$, there exists an ellipsoid $E$ such that $E \subseteq K \subseteq \sqrt{n} \, E$.

Here, $\sqrt{n}$ is tight, considering the case $K = [-1,1]^n$.

**Theorem.** Given a symmetric convex body $K \subseteq \mathbb{R}^n$ and $\epsilon > 0$, let $k(K,\epsilon)$ denote the largest dimension k for which there exists a k-dimensional subspace E such that $r(B \cap E) \subseteq K \cap E \subseteq (1+\epsilon) \, r(B \cap E)$ for some r>0. Then

(a) $k(K,\epsilon) \geq c(\epsilon) \log n$,

(b) $k(K,\epsilon) \geq c(\epsilon) \, n (M/b)^2$, and

(c) $k(K,\epsilon) \geq c(\epsilon) \, (M_g/b)^2$.

Here, $M := \mathbb{E} \|X\|_K$ for X drawn uniformly from the unit sphere, $M_g := \mathbb{E} \|X\|_K$ for X drawn from the standard Gaussian distribution, and b denotes the smallest constant such that $\|x\|_K \leq b \|x\|$ for all x.

(Note that (b) and (c) are equivalent.) For example, the cube $K = [-1,1]^n$ has $M/b \asymp \sqrt{(\log n)/n}$ and so $k(K,\epsilon) \asymp_\epsilon \log n$, which is the worst-case scenario by (a). The proof of (b) follows by applying Levy concentration to an epsilon net.

Santosh concluded by discussing isoperimetry. He posed the “avocado cutting problem” (while distributing avocados to the audience!). When you eat part of an avocado, you put the remainder in the fridge, and the next time you enjoy the avocado, all of the avocado on the boundary is bad, so you have to scrape it off. How do you cut the avocado so as to minimize the surface area? The KLS conjecture states that the Cheeger constant of any log-concave density is achieved to within a universal constant factor by a hyperplane cut. See this survey for more information.

**Ilias Diakonikolas — Algorithmic High Dimensional Robust Statistics**

How do you perform parameter estimation when a fraction of the data is unreliable? This is the motivating question of robust statistics, and there are many applications (model misspecification, outlier removal, reliable/adversarial/secure ML). There are several models for how the data can become unreliable, but this talk focused on two models: (1) Huber’s contamination model is a mixture between a “good” distribution and some arbitrary “bad” distribution (this might be appropriate in the model misspecification setting), and (2) one may consider an adversarial model in which someone looks at the current data and your estimation algorithm and then adds data to ruin your estimation (this might be appropriate in the data poisoning setting, for example).

This talk primarily considered the problem of robustly estimating the mean of a spherical Gaussian, which was essentially solved in 2016 (see this and that). In the 1-dimensional case, the median serves as a robust estimator, but in higher-dimensional cases, the best known estimators (before 2016) either had suboptimal error rates or required super-polynomial runtimes. The solutions to this problem take signal from the covariance matrix to discern whether the sample mean is a good estimator. Intuitively, if the covariance matrix is spectrally close to the identity, then we can expect the empirical mean to be close to the population mean. This holds even after adversarially adding and/or removing a fraction of data. As such, if there is corrupt data, one could remove data points until the covariance matrix is close to the identity. For example, one may appeal to the Gaussian annulus theorem to remove outliers, or iteratively project the data onto the leading eigenvector of the sample covariance matrix to identify points whose removal would move the covariance matrix closer to the identity. These are the key ideas used in these algorithms.

The only potential sub-optimality in the existing solutions is in the adversarial noise model, since the error rate is “only” optimal for sub-Gaussian distributions. However, there are statistical query lower bounds for reaching the information-theoretic optimal error rate, suggesting this problem is hard. Ilias indicated that (despite the scope of his talk) we don’t actually need to know the covariance in order to robustly estimate the mean. He also posed the open (but soon-to-be-closed) problem of robustly learning a mixture of two arbitrary Gaussians, and he concluded with a much broader problem: How can we approach a general algorithmic theory of robustness?

This is the third entry to summarize talks in the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Wednesday, we heard talks from Fred Roosta and Will Fithian. Below, I link videos and provide brief summaries of their talks.

**Fred Roosta — Stochastic Second-Order Optimization Methods**

While first-order optimization makes use of (an approximation to) the gradient (e.g., gradient descent), second-order optimization also makes use of (an approximation to) the Hessian (e.g., Newton’s method). For second-order optimization, the iterations converge faster, but the per-iteration cost is greater. Even gradient descent is time prohibitive in the “big data” regime, since the objective F(x) to minimize is a sum of very many components. Recall that stochastic gradient descent estimates the gradient at each iteration by sampling the components. One may also sample components to estimate the Hessian. In fact, you can provably get away with merely each (smooth) component being convex, provided the overall objective is strongly convex. See this paper for more information. In the convex but non-smooth case, you can apply a proximal Newton-type method (see this paper), and in the convex but semi-smooth case, this can be modified using Clarke’s generalized Jacobian. (I’m not sure how much is known about stochastic versions of these methods.)

If you haven’t seen it, check out the secant method for approximating roots of a function. The idea is to modify Newton’s root-finding algorithm by replacing derivatives with finite differences of successive iterates, producing cheaper iterations at the expense of the convergence rate (the golden ratio appears!). Quasi-Newton methods use this same idea to speed up Newton’s method, the most popular variant being BFGS. Other second-order methods leverage other estimates of the Hessian. In scientific computing, the objective function frequently takes the form $F(x) = f(h(x))$, where f is convex. In this case, the Hessian of F can be approximated by $J^\top H J$, where H is the Hessian of f and J is the Jacobian of h; this approximation is used in Gauss–Newton. In machine learning, it is apparently common to use a Fisher information matrix as a proxy for the Hessian, resulting in an algorithm called natural gradient descent.
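A minimal implementation of the secant method (the stopping rule on |f| is a simplification):

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Find a root of f by replacing the derivative in Newton's method with a
    finite difference of successive iterates; convergence is superlinear with
    order equal to the golden ratio (~1.618)."""
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if abs(f1) < tol:
            return x1
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
    return x1

root = secant(lambda x: x * x - 2.0, 1.0, 2.0)   # converges to sqrt(2)
```

Each iteration costs a single new function evaluation and no derivatives, which is the trade Fred described: cheaper iterations at a slightly slower (but still superlinear) rate.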

At the end, Fred posed an open problem: What is the average-case complexity (or smoothed complexity) of global convergence of Newton’s method with a line search? (He observes that real-world instances are always much faster than the standard worst-case guarantees suggest.)

**Will Fithian — Statistical Inference**

The purpose of this talk was to give a broad overview of the basics of statistical inference. Given a random variable from some parameterized distribution, an estimator is a function of the random variable that is intended to be close to some function of the distribution’s parameter. This closeness is measured by a loss function (such as the square of the difference), and the expected value of this loss (computed over the measure determined by a given choice of parameter) is called risk. Each estimator determines a risk function that varies with the parameter. To compare estimators, one may summarize this function with a scalar, either by averaging over a prior distribution on the parameter space (resulting in Bayes risk), or by computing the worst-case risk (called minimax risk). Another way to select an estimator is to restrict to unbiased estimators, which have the property that the expected value of the estimator equals the desired quantity. The modern approach is to manage bias rather than insist on unbiased estimators, since by the bias–variance tradeoff, accepting a small bias may buy a large reduction in variance. The Bayesian approach to estimation incorporates a prior “belief” of what the parameter is, and the minimizer of Bayes risk (in terms of MSE loss) is simply the posterior mean. In some applications, it is difficult to identify a useful prior.

When designing an estimator for a hypothesis test, the first priority is to minimize Type I error in the worst case (quantified by the significance level), and the second priority is to minimize Type II error. Minimizing Type II error is equivalent to maximizing power, but the power should be thought of as a function of the true parameter. Will pointed out that confidence intervals are more informative than hypothesis tests, and you can view a confidence interval as the outcome of infinitely many simultaneous hypothesis tests. This can be phrased in terms of a duality between confidence sets and hypothesis tests. Will concluded by discussing the maximum likelihood estimator, sketching an argument for its consistency and asymptotic normality.

This is the second entry to summarize talks in the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Tuesday, we heard talks from Ken Clarkson, Rachel Ward, and Michael Mahoney. Below, I link videos and provide brief summaries of their talks.

**Ken Clarkson — Sketching for Linear Algebra: Randomized Hadamard, Kernel Methods**

This talk introduced leverage scores and their application in a few settings. Say S is a row sampling matrix if each row is a multiple of a row of the identity matrix. Given a matrix A, we want a random row sampling matrix S such that (with high probability) SAx has approximately the same norm as Ax for every x. As a general construction, assign to each row of A a probability $p_i$, and let S be a random matrix such that $e_i^\top / \sqrt{p_i}$ is a row of S with probability $p_i$ (independently for each i). Notice that S has a random number of rows, and the expected number of rows is $\sum_i p_i$. It’s easy to show that for each x, $\mathbb{E} \|SAx\|^2 = \|Ax\|^2$. In order to concentrate around this mean, we want to minimize variance, which can be accomplished by tuning the $p_i$’s to reflect the “importance” of the rows of A. Intuitively, one might select $p_i$ to be proportional to the squared norm of the corresponding row of A. (This works well when $A^\top A$ is a multiple of the identity.) In general, we want to select $p_i$ so as to bound the maximum relative contribution of the $i$th component: $(a_i^\top x)^2 \leq C \, p_i \, \|Ax\|^2$ for every x. Then one may conclude concentration by passing to Bernstein’s inequality. In fact, if we write A=UR with U having orthonormal columns, then we may take $p_i$ to be the squared norm of the corresponding row of U. These quantities are called leverage scores. Notice that picking $p_i$ in this way makes it so that the average number of rows in S equals the dimension of x.
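The sampling scheme can be sketched in a few lines of numpy (the oversampling factor 100 and the problem sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20000, 10
A = rng.standard_normal((n, d)) * np.geomspace(1, 100, n)[:, None]  # uneven rows

U, _ = np.linalg.qr(A)           # A = UR with U having orthonormal columns
lev = np.sum(U ** 2, axis=1)     # leverage scores; they sum to d

# Keep row i with probability p_i (oversampled leverage scores, capped at 1)
# and rescale by 1/sqrt(p_i) so that E||SAx||^2 = ||Ax||^2 for every x.
p = np.minimum(1.0, 100 * lev)
keep = rng.random(n) < p
SA = A[keep] / np.sqrt(p[keep])[:, None]
```

With the factor 100, the sketch keeps roughly 100·d of the 20000 rows, yet the Gram matrix of SA stays spectrally close to that of A, which is exactly the concentration the Bernstein argument delivers.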

Readers of this blog are perhaps familiar with the transformation $A \mapsto A (A^\top A)^{-1/2}$ as converting the rows of A to a Parseval frame. Ken referred to the resulting row vectors as exhibiting “isotropic position,” which appears to be geometrically natural. In addition, the use of leverage scores in sampling appears, for example, in Spielman–Srivastava graph sparsification. Overall, the pursuit of leverage scores is well motivated. Moreover, we can quickly estimate them to within a constant factor by taking the QR decomposition of a sketch of the matrix (as in David Woodruff’s talk). Such a coarse estimation is sufficient for sampling purposes, and recent work shows that higher-precision estimates require much longer runtimes. Ken concluded by sketching Tropp’s analysis of the subsampled randomized Hadamard transform in terms of leverage scores.

**Rachel Ward — First-Order Stochastic Optimization**

Rachel motivated the use of stochastic gradient descent (SGD) in machine learning. The main idea is that the objective function is a sum of many component functions, each component corresponding to a different data point. When there are many data points, it is time prohibitive to look at all of them to perform gradient descent. Instead, we “sketch” the gradient by grabbing random component functions, which gives SGD. There are two main theoretical questions: (1) how to pick the step size, and (2) how to select random component functions.

For (1), we note that, intuitively, if the step size is constant, then the iterates will eventually bounce around a local minimizer. This intuition can be made formal with appropriate hypotheses: Assuming the component functions are sufficiently consistent (meaning they are all minimized “somewhat simultaneously”) and the overall objective is smooth and strongly convex, then for any given tolerance $\epsilon > 0$, there exists a constant step size (determined by the component consistency, smoothness, strong convexity, and $\epsilon$) such that the SGD iterates are expected to be within $\epsilon$ of the optimizer after an appropriate number of iterations (see this paper).

For (2), we note that, again intuitively, certain components may be more important than the others when computing the gradient. For example, in the least-squares setting, leverage scores may signal the importance of a given component. In general, the Lipschitz constant of each component seems like a good proxy for importance. Rachel finds that a good choice of weights is a mixture of the uniform distribution and the distribution proportional to the Lipschitz constants. In some cases, this choice makes convergence much faster.
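As a sketch of Lipschitz-weighted sampling in the simplest setting: for a consistent least-squares problem, sampling components with probability proportional to $\|a_i\|^2$ (the component Lipschitz constants) and taking the natural step recovers randomized Kaczmarz; the problem sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 5
A = rng.standard_normal((n, d)) * rng.uniform(0.1, 10, n)[:, None]  # uneven rows
x_true = rng.standard_normal(d)
b = A @ x_true                       # consistent linear system

# SGD on the least-squares objective, sampling component i with probability
# proportional to its gradient Lipschitz constant L_i = ||a_i||^2; the step
# 1/L_i turns each update into a projection onto the i-th equation.
L = np.sum(A ** 2, axis=1)
x = np.zeros(d)
for i in rng.choice(n, size=5000, p=L / L.sum()):
    x += (b[i] - A[i] @ x) / L[i] * A[i]
```

In this consistent setting the iterates converge linearly to the solution; mixing these weights with the uniform distribution, as in Rachel's work, guards against the cases where pure Lipschitz weighting is suboptimal.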

Rachel concluded with some open problems. In the real world, you don’t have access to Lipschitz constants, etc., so how can you learn good step sizes and sampling weights for SGD? For the step size problem, Rachel has some recent work along these lines (see this paper).

**Michael Mahoney — Sampling for Linear Algebra, Statistics and Optimization**

This talk covered several topics with a couple of key themes. One theme: Michael pointed out a difference between “the world” and “the machine.” Specifically, we collect data from “the world” and plug them into “the machine.” When the data is big, we use ideas from randomized numerical linear algebra to sketch the data and estimate quantities about the data, e.g., approximately solving the least squares problem. However, the primary goal of statistics/ML is to make inferences/predictions about the world. As such, it is important to be cognizant of how our computational tools will be used in practice so that we can meet the demand with appropriate performance guarantees, etc. This perspective leads to a second theme of Michael’s talk: Whether one approach is better than another almost always depends on the broader statistical/ML objectives. For example: Is it better to do random projection or PCA? It depends.

Of all of the topics Michael covered, I was most interested in his take on approximate matrix multiplication. Here, let $a_k$ denote the $k$th column of A and $b_k^\top$ the $k$th row of B. Then $AB = \sum_k a_k b_k^\top$. We can estimate this sum with a sum over a random sample of terms, and we obtain concentration by biasing our sample towards the important terms, e.g., the terms with the largest norm. In the end, one can control the relative error in the implied randomized algorithm. (I wonder if this low-precision estimate can be promoted to a high-precision estimate with a cheap iterative algorithm?)
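A numpy sketch of the sampling estimator. The decaying column norms are chosen to make importance sampling visibly worthwhile; for columns of equal norm the weights reduce to uniform sampling:

```python
import numpy as np

rng = np.random.default_rng(5)
K = 300
A = rng.standard_normal((40, K)) * np.geomspace(1, 0.01, K)            # decaying columns
B = rng.standard_normal((K, 40)) * np.geomspace(1, 0.01, K)[:, None]   # decaying rows

# AB = sum_k a_k b_k^T. Sample s terms with probability proportional to
# ||a_k|| ||b_k|| and rescale so the estimate is unbiased; biasing the sample
# towards large-norm terms is what controls the variance.
norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
p = norms / norms.sum()
s = 2000
est = np.zeros((40, 40))
for k in rng.choice(K, size=s, p=p):
    est += np.outer(A[:, k], B[k]) / (s * p[k])

rel_err = np.linalg.norm(est - A @ B) / np.linalg.norm(A @ B)
```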

Here’s a brief summary of the progress made in the previous thread:

– We have new results in the probabilistic formulation, namely, Proposition 36 and Lemmas 38 and 39.

– Jaan Parts refined Pritikin’s analysis to prove that every unit-distance graph with at most 24 vertices is 5-colorable, and every such graph with at most 6906 vertices is 6-colorable.

– Domotor, Frankl and Hubai showed that, if there exists a k-chromatic unit-distance graph with a bichromatic origin such that all its neighbors are in the upper half-plane and all their coordinates are rational, then CNP is at least k.

At the moment, there are a few outstanding SAT instances that we’d like to resolve:

– Philip Gibbs proposed a couple of graphs as possibly 6-chromatic (see this and that). Is either of these 5-colorable?

– Can we find a tile-based 6-coloring of the plane? Such a coloring must satisfy several necessary conditions.

– Aubrey has suggested a new take on Boris’ pixelated colorings by SAT solver. I have heard offline that the details of this approach are evolving.

**Ravi Kannan — Foundations of Data Science**

Both parts of this talk covered various aspects of data science. A more extensive treatment of these aspects is provided in his book with John Hopcroft (available for free here).

Ravi started by describing concentration of measure. Here, he sketched a general (possibly folklore) result: If the first k moments of a mean-zero random variable are appropriately small, then the sum of n independent copies of this random variable is small with appropriately high probability. This struck me as a finite-k version of the inequality (1.3) given here.

Next, he focused on dimensionality reduction. Here, there are two main methods of interest: random projection and PCA. Random projection is of particular interest to theory since it preserves all distances in the data set (unlike PCA), whereas PCA is more popular in practice since the distortion is much lower, although not guaranteed for all pairwise distances. At this point, he sketched the Dasgupta–Gupta proof of JL. Ravi described the k-means problem as a potential application of JL. Next, he motivated PCA with a toy problem: Given a mixture of two gaussians with identity covariance and means 0 and x, where x has norm O(1), how can one estimate x? Here, random projection and k-means clustering (with k=2) both fail miserably, but PCA/SVD works great.
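
The JL phenomenon sketched above is easy to demo numerically. Here is a minimal illustration (all dimensions are made up) of a Gaussian random projection preserving every pairwise distance in a point set up to small distortion:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 400
X = rng.standard_normal((n, d))  # n points in d dimensions

# Gaussian random projection, scaled so squared norms are preserved in expectation
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P

# Distortion of every pairwise distance
ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(n), 2)]
max_distortion = max(max(ratios), 1 / min(ratios))
```

Note that the target dimension $k$ only needs to scale with $\log n$, not with $d$, which is the content of the JL lemma.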

Ravi started the second part of this talk with the application of clustering. How do you estimate the means in a mixture of k gaussians with identity covariance? Distance-based clustering requires the means to be separated by a distance of at least $\Omega(d^{1/4})$, where d is the dimension. As you might expect, you can find the subspace spanned by the gaussians’ means by running PCA/SVD. He also took some time to discuss proximity as a data hypothesis for clustering in lieu of a stochastic data model (see this paper, for example).

Next, Ravi covered randomized algorithms for various linear algebra routines. The key ideas here are to represent a matrix by a sketch of randomly selected rows and columns, and then leverage this sketch to approximate the desired quantities. He stressed the need to randomly sample with a probability distribution that’s proportional to the square of the norm of the row/column vectors (see this paper, for example).

Ravi concluded by discussing Markov chains. Here, he had two main points: First, no one computing a stationary distribution for a Markov chain should care about whether the Markov chain is periodic, since this technicality can be smoothed out by taking running averages. Second, the mixing time to stationarity is inversely proportional to the square of the conductance (or Cheeger constant) of the graph.
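
The point about periodicity is easy to see on a toy chain. Here is a small sketch (the chain is made up) in which the state distribution oscillates forever, while its running average converges to the stationary distribution anyway:

```python
import numpy as np

# A period-2 Markov chain on 4 states: a bipartite cycle
P = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])

mu = np.array([1.0, 0.0, 0.0, 0.0])  # start in state 0
avg = np.zeros(4)
T = 1000
for _ in range(T):
    avg += mu
    mu = mu @ P   # mu keeps hopping between the two parity classes
avg /= T          # ...but the running average settles at uniform
```
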

**David Woodruff — Sketching for Linear Algebra: Basics of Dimensionality Reduction and CountSketch**

The purpose of this talk was to illustrate the utility of sketching for linear regression and low-rank matrix approximation. David started by reviewing the basics of linear regression: Linear least squares corresponds to the MLE in the gaussian noise case, enjoys a nice geometric interpretation in terms of orthogonal projections, and also has a nice algebraic expression in terms of the normal equations. However, solving the normal equations can take a while when there are many data points to fit.

To overcome this runtime bottleneck, one can consider a sketch-and-solve approach: Use randomness to pass to a smaller instance of the same problem, solve that problem exactly, and then conclude that the corresponding solution to the original problem is approximately optimal with high probability. For example, instead of finding a least-squares solution to Ax=b, one may instead solve SAx=Sb, where S is short and fat but SA is still tall and skinny. If S has iid gaussian entries, then one may exploit JL over an epsilon net to conclude that S preserves the norms of every point in the column space of A (plus the span of b), meaning $\|S(Ax-b)\|_2$ is within a factor of $1 \pm \epsilon$ of $\|Ax-b\|_2$ for every $x$, and so the minimizer of the first is within a factor of $1+O(\epsilon)$ of optimal for the second. Sadly, picking S in this way does not lead to speedups since multiplying SA takes too long. Instead, one may take S to be a subsampled randomized Hadamard transform (see this paper).
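
Here is a minimal sketch-and-solve demo with a Gaussian sketch (no speedup for the reason above, but it shows the accuracy guarantee; all sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Exact least-squares solution
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sketch: S is short and fat, but SA is still tall and skinny
m = 500
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

res_exact = np.linalg.norm(A @ x_exact - b)
res_sketch = np.linalg.norm(A @ x_sketch - b)
```

The sketched solution is computed from a 500-row problem instead of a 5000-row one, yet its residual on the original problem is nearly optimal.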

In the case where A is sparse, one might want the runtime to scale with the number of nonzero entries of A. In this case, one should instead design S according to CountSketch. Here, if A is $n \times d$, then S is chosen to be $k \times n$, with each column being drawn uniformly from the columns of the $k \times k$ identity matrix, and randomly signed. Here, k scales like $d^2$ for birthday paradox reasons.
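
Since each column of S has a single nonzero entry, S can be applied in time proportional to the number of nonzeros of A without ever forming S. A hedged sketch (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10000, 10
A = rng.standard_normal((n, d))

k = 2000  # scales like d^2 for birthday-paradox reasons
# Column i of S is a random signed standard basis vector:
# a row assignment h(i) and a sign s(i)
h = rng.integers(0, k, size=n)
s = rng.choice([-1.0, 1.0], size=n)

# Apply S to A in O(nnz(A)) time: row i of A is added (with sign) into row h(i)
SA = np.zeros((k, d))
np.add.at(SA, h, s[:, None] * A)

# Subspace-embedding check: norms in the column space of A are roughly preserved
x = rng.standard_normal(d)
ratio = np.linalg.norm(SA @ x) / np.linalg.norm(A @ x)
```
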

The second part of David’s talk started by discussing a high-precision version of the regression result. In particular, one may convert the $\mathrm{poly}(1/\epsilon)$ factor in the runtime to $\log(1/\epsilon)$. To do this, he solves least squares by gradient descent, and to do this, he first leverages sketching to find a good initial point, and then he leverages sketching again to precondition A, thereby making the gradient descent iterations converge appropriately quickly. To precondition, leverage the QR factorization $SA = QR$ and note that the condition number of $AR^{-1}$ is at most $\frac{1+\epsilon}{1-\epsilon}$. (To see this, compare the norm of $AR^{-1}x$ to the norm of $SAR^{-1}x = Qx$.)
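
The preconditioning step can be demonstrated numerically. Here is a hedged sketch (made-up sizes, with singular values planted to force ill-conditioning) showing that $AR^{-1}$ is well-conditioned even when A is not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 30
# Build an ill-conditioned A with singular values from 1 to 1e5
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
A = U * np.logspace(0, 5, d)

# Sketch, then take the QR factorization of SA
m = 600
S = rng.standard_normal((m, n)) / np.sqrt(m)
Q, R = np.linalg.qr(S @ A)

cond_A = np.linalg.cond(A)
cond_pre = np.linalg.cond(A @ np.linalg.inv(R))  # condition number after preconditioning
```

Gradient descent on the preconditioned system then converges in a number of iterations depending only on $\log(1/\epsilon)$, not on the conditioning of A.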

The end of his talk discussed low-rank approximation. Here, the goal is to use sketching to speed up SVD. To do so, David first computes SA, then projects the rows of A onto the rowspace of SA, and then runs SVD on the resulting vectors. Here, the bottleneck is projection, which can be encoded as an instance of least squares, meaning the above technology transfers to this setting. He concluded his talk by explaining why you can expect a nearly-optimal solution to reside in the rowspace of SA when S is something called an affine embedding (a generalization of subspace embedding).
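
The sketch-then-project pipeline for low-rank approximation can be prototyped in a few lines. A hedged sketch (made-up sizes, a planted low-rank-plus-noise matrix, and a Gaussian S in place of a fast sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 1000, 400, 5
# A low-rank signal plus small noise
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) \
    + 0.01 * rng.standard_normal((n, d))

k = 40
S = rng.standard_normal((k, n)) / np.sqrt(k)
SA = S @ A

# Project the rows of A onto the row space of SA, then run SVD there
Q, _ = np.linalg.qr(SA.T)   # orthonormal basis for the row space of SA
B = A @ Q                   # coordinates of the projected rows (n x k)
U, sig, Vt = np.linalg.svd(B, full_matrices=False)
A_approx = (U[:, :r] * sig[:r]) @ (Vt[:r] @ Q.T)  # rank-r approximation of A

rel_err = np.linalg.norm(A - A_approx) / np.linalg.norm(A)
```

The SVD here is of an $n \times k$ matrix rather than the original $n \times d$ one, which is the source of the speedup.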
