Foundations of Data Science Boot Camp, IV

This is the fourth entry summarizing talks from the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Thursday, we heard talks from Santosh Vempala and Ilias Diakonikolas. Below, I link videos and provide brief summaries of their talks.

Santosh Vempala — High Dimensional Geometry and Concentration

This talk covered multiple topics, including volume distribution, the Brunn–Minkowski inequality, the Prekopa–Leindler inequality, Levy concentration, Dvoretzky’s theorem, and isoperimetry.

The discussion of volume distribution revolved around two facts.

Fact 1. Most points in a convex body reside near the boundary.

To see this, observe that for convex K\subseteq\mathbb{R}^n of positive volume containing 0, the volume of (1-\epsilon)K equals (1-\epsilon)^n times the volume of K. Taking \epsilon=c/n, this means a constant fraction of the volume resides outside (1-c/n)K, that is, within a multiplicative factor 1-O(1/n) of the boundary.
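
Here is a quick numeric sanity check of this calculation in Python (a minimal sketch; the constant c and the dimensions are arbitrary choices):

```python
# Fraction of a convex body K lying outside (1 - c/n)K, i.e., within a
# multiplicative factor 1 - c/n of the boundary: this equals 1 - (1 - c/n)^n,
# which tends to 1 - e^{-c} as n grows.
import math

c = 1.0  # arbitrary constant; epsilon = c/n
for n in [10, 100, 1000, 10000]:
    print(n, round(1 - (1 - c / n) ** n, 4), "limit:", round(1 - math.exp(-c), 4))
```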

Fact 2. Most of a ball is near a central hyperplane.

For the unit Euclidean ball, all but a small fraction of the points are within O(1/\sqrt{n}) of any central hyperplane. For the unit infinity ball, all but a small fraction of the points are within O(1) of a random central hyperplane.
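
Both claims are easy to check by Monte Carlo (a sketch; the dimension, sample size, and the thresholds 3/\sqrt{n} and 2 are arbitrary choices):

```python
# Monte Carlo check of Fact 2 for the unit Euclidean ball and the unit
# infinity ball (the cube [-1,1]^n).
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 10000  # dimension and number of samples

# Uniform samples from the unit Euclidean ball: normalize Gaussian vectors,
# then rescale by U^(1/n) so the radius has the correct distribution.
g = rng.standard_normal((m, n))
ball = g / np.linalg.norm(g, axis=1, keepdims=True) * rng.random((m, 1)) ** (1 / n)
# Distance to the central hyperplane {x_1 = 0} is |x_1|.
print("ball:", np.mean(np.abs(ball[:, 0]) <= 3 / np.sqrt(n)))

# Uniform samples from the cube, with a uniformly random central hyperplane.
cube = rng.uniform(-1, 1, (m, n))
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
# Distance to the hyperplane orthogonal to v is |<x, v>|.
print("cube:", np.mean(np.abs(cube @ v) <= 2))
```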

Next, Santosh proved the Brunn–Minkowski inequality:

Theorem. Let A,B\subseteq\mathbb{R}^n be nonempty and compact. Then

\displaystyle\Big(\mathrm{vol}(A+B)\Big)^{1/n}\geq\Big(\mathrm{vol}(A)\Big)^{1/n}+\Big(\mathrm{vol}(B)\Big)^{1/n}.

(Here, A+B denotes the Minkowski sum of A and B.)

See these notes for a proof. Here’s a sketch: Since A and B can each be expressed as the closure of a union of disjoint open axis-aligned boxes, it suffices to induct on the total number of boxes. When A and B are each a single box, the result follows from AM–GM. When there are more boxes, there exists an axis-parallel hyperplane H such that each side of H contains at least one box. Then one may shift A or B relative to this hyperplane so that the proportion of volume from each is equal on both sides. The induction argument then goes through, since each side has the same proportions with strictly fewer boxes.
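
For the base case, the computation is explicit: if A and B are boxes with side lengths a_i and b_i, then A+B is a box with side lengths a_i+b_i, and the inequality follows from applying AM–GM to the ratios a_i/(a_i+b_i) and b_i/(a_i+b_i). Here is a small numeric check (a sketch with random side lengths):

```python
# One-box base case of Brunn-Minkowski: for axis-aligned boxes with side
# lengths a_i and b_i, vol(A+B) = prod(a_i + b_i), and the inequality
# vol(A+B)^(1/n) >= vol(A)^(1/n) + vol(B)^(1/n) reduces to AM-GM.
import numpy as np

rng = np.random.default_rng(1)
n = 6
for _ in range(5):
    a, b = rng.random(n) + 0.1, rng.random(n) + 0.1  # random side lengths
    lhs = np.prod(a + b) ** (1 / n)
    rhs = np.prod(a) ** (1 / n) + np.prod(b) ** (1 / n)
    print(round(lhs, 4), ">=", round(rhs, 4), bool(lhs >= rhs))
```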

Next, take any unit vector v and let A(x) denote the cross-section of points in K whose inner product with v equals x. Define the radius function r(x):=\mathrm{vol}(A(x))^{1/(n-1)}. Then the Brunn–Minkowski inequality (applied in dimension n-1 to these cross-sections) implies that the radius function is concave on its support.

Theorem (Grunbaum’s inequality). Let K be a convex body. Then any half-space that contains the centroid of K contains at least a 1/e fraction of the volume of K.

See these notes for a proof. The idea is to symmetrize using the radius function, and then argue that the radius function is linear in the worst-case scenario. As such, in every dimension, the worst-case body is a cone.
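
To see where the constant comes from, consider a cone in \mathbb{R}^n with apex at 0 and base at height 1, so that the cross-sectional volume at height x is proportional to x^{n-1}. Its centroid sits at height n/(n+1), and the half-space on the apex side of the centroid carries a (n/(n+1))^n fraction of the volume, which decreases to 1/e. A quick computation:

```python
# Grunbaum's worst case: the apex-side half of a cone cut at its centroid
# carries a (n/(n+1))^n fraction of the volume, which decreases toward 1/e.
import math

for n in [1, 2, 5, 10, 100, 1000]:
    print(n, round((n / (n + 1)) ** n, 4), "  1/e =", round(1 / math.e, 4))
```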

Next, we covered the Prekopa–Leindler inequality, which is an equivalent functional version of Brunn–Minkowski:

Theorem. Take f,g,h\colon\mathbb{R}^n\to\mathbb{R}_{\geq0} and \lambda\in[0,1] such that

\displaystyle h\Big(\lambda x+(1-\lambda)y\Big)\geq f(x)^\lambda g(y)^{1-\lambda}\qquad\text{for all }x,y\in\mathbb{R}^n.

Then \int_{\mathbb{R}^n} h\geq(\int_{\mathbb{R}^n}f)^\lambda(\int_{\mathbb{R}^n}g)^{1-\lambda}.

As an application, the marginals of any log-concave distribution are log-concave. Also, convolution preserves log-concavity. The proof is by induction on n. For n=1, write \int h=\int_0^\infty \mathrm{vol}(L_h(t))\,dt, where L_h(t)=\{x:h(x)\geq t\}. Observe that L_h(t) contains \lambda L_f(t)+(1-\lambda)L_g(t), so that \mathrm{vol}(L_h(t))\geq\lambda\,\mathrm{vol}(L_f(t))+(1-\lambda)\,\mathrm{vol}(L_g(t)) in one dimension, and then finish with AM–GM. For n>1, consider a marginal and combine the induction hypotheses from dimension n-1 and dimension 1.
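
As a quick numeric illustration of the convolution claim, one can discretize two log-concave densities, convolve them, and check that the log of the result has nonpositive second differences (a sketch; the grid and the choice of densities are arbitrary):

```python
# Check that convolution preserves log-concavity: sample a Gaussian and a
# Laplace density on a grid, convolve, and verify that the second differences
# of log(h) are nonpositive (up to floating-point noise).
import numpy as np

dx = 0.05
x = np.arange(-8, 8 + dx, dx)
f = np.exp(-x**2 / 2)        # Gaussian shape
g = np.exp(-np.abs(x))       # Laplace shape
h = np.convolve(f, g) * dx   # Riemann-sum approximation of the convolution

logh = np.log(h[h > 1e-6 * h.max()])  # ignore the numerically tiny tails
second_diff = logh[:-2] - 2 * logh[1:-1] + logh[2:]
print("max second difference of log h:", second_diff.max())  # ~0 or negative
```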

Next, we discussed Levy concentration:

Theorem. Suppose f\colon S^{n-1}\to\mathbb{R} is L-Lipschitz. If X is drawn uniformly from S^{n-1}, then \mathrm{Pr}(|f(X)-\mathbb{E}[f(X)]|>t)\leq e^{-nt^2/(2L^2)}.

We did not prove this theorem (a proof of a weaker form can be found here). Instead, we proved related results. Let \gamma denote the standard Gaussian measure.

Theorem. Take a measurable set A\subseteq\mathbb{R}^n and denote A_t=A+tB_2. Then \gamma(A)\gamma(A_t^c)\leq e^{-t^2/4}.

The indicator function 1_{A_t^c} satisfies 1_{A_t^c}(x)\leq e^{-t^2/4}e^{\mathrm{dist}(x,A)^2/4} for all x (since \mathrm{dist}(x,A)\geq t whenever x\in A_t^c), so it suffices to show

\displaystyle\gamma(A)\int_{\mathbb{R}^n}e^{\mathrm{dist}(x,A)^2/4}d\gamma(x)\leq 1.

This follows from Prekopa–Leindler by taking h to be the Gaussian density, f=1_A\cdot h, g(x)=e^{\mathrm{dist}(x,A)^2/4}h(x), and \lambda=1/2. When checking this consequence, one must use the fact that \|x-y\|\geq \mathrm{dist}(y,A) when x\in A.
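
For a concrete sanity check of the theorem, take A to be the half-space \{x:x_1\leq 0\}, where both Gaussian measures reduce to one-dimensional tail probabilities (a sketch):

```python
# Gaussian concentration for a half-space: gamma(A) = 1/2 and gamma(A_t^c)
# is a one-dimensional Gaussian tail, so the bound is easy to evaluate.
import math

gamma_A = 0.5
for t in [0.5, 1.0, 2.0, 4.0]:
    gamma_Atc = 0.5 * math.erfc(t / math.sqrt(2))  # gamma({x : x_1 > t})
    print(t, gamma_A * gamma_Atc, "<=", math.exp(-t**2 / 4))
```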

Next, we discussed Dvoretzky’s theorem. Here, we consider symmetric convex bodies, meaning K will be compact with nonzero volume and closed under negation. Let \|x\|_K be the smallest t for which x resides in tK. Note that \|x\|_2\leq\|x\|_1 follows from the fact that B_2 contains B_1.

Theorem (John). For every symmetric convex body K\subseteq\mathbb{R}^n, there exists an ellipsoid \mathcal{E} such that \mathcal{E}\subseteq K\subseteq\sqrt{n}\mathcal{E}.

Here, \sqrt{n} is tight, considering the case K=B_1.

Theorem. Given a symmetric convex body K and \epsilon>0, let k(K,\epsilon) denote the largest dimension k for which there exists a k-dimensional subspace E such that (1-\epsilon)r(B_2\cap E)\subseteq K\cap E\subseteq(1+\epsilon)r(B_2\cap E) for some r>0. Then

(a) k(K,\epsilon) \geq c_1 \cdot\frac{\epsilon^2}{\log(1/\epsilon)}\cdot\log n,

(b) k(K,\epsilon) \geq c_2\cdot\frac{\epsilon^2}{\log(1/\epsilon)}\cdot n \cdot (\frac{M}{b})^2, and

(c) k(K,\epsilon) \geq c_3\cdot\frac{\epsilon^2}{\log(1/\epsilon)}\cdot (\frac{\ell}{b})^2.

Here, M=\mathbb{E}[\|X\|_K] for X drawn uniformly from S^{n-1}, \ell=\mathbb{E}[\|X\|_K^2]^{1/2} for X drawn from the standard Gaussian, and \|x\|_K\leq b\|x\|_2 for all x.

(Note that (b) and (c) are equivalent.) For example, k(B_1,\epsilon)\geq \Omega_\epsilon(n) and k(B_\infty,\epsilon)\geq\Omega_\epsilon(\log n), which is the worst-case scenario by (a). The proof of (b) follows by applying Levy concentration to an epsilon net.
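
One can estimate these quantities by Monte Carlo; the sketch below estimates M for K=B_1 (where b=\sqrt{n}) and K=B_\infty (where b=1) and prints the quantity n(M/b)^2 from bound (b), which grows linearly in n for B_1 and like \log n for B_\infty (dimensions and sample size are arbitrary choices):

```python
# Monte Carlo estimate of M = E[||X||_K] for X uniform on the sphere, and the
# quantity n * (M/b)^2 from bound (b), for K = B_1 and K = B_infinity.
import numpy as np

rng = np.random.default_rng(2)
for n in [100, 500, 2000]:
    g = rng.standard_normal((2000, n))
    X = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on S^{n-1}
    M1 = np.abs(X).sum(axis=1).mean()     # M for K = B_1        (b = sqrt(n))
    Minf = np.abs(X).max(axis=1).mean()   # M for K = B_infinity (b = 1)
    print(n, " B_1:", round(n * (M1 / np.sqrt(n)) ** 2, 1),
          " B_inf:", round(n * Minf ** 2, 1), " 2 log n =", round(2 * np.log(n), 1))
```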

Santosh concluded by discussing isoperimetry. He posed the “avocado cutting problem” (while distributing avocados to the audience!). When you eat part of an avocado, you put the remainder in the fridge, and the next time you enjoy the avocado, all of the avocado on the boundary is bad, so you have to scrape it off. How do you cut the avocado so as to minimize the surface area? The KLS conjecture states that the Cheeger constant of any log-concave density is achieved, up to a universal constant factor, by a hyperplane cut. See this survey for more information.

Ilias Diakonikolas — Algorithmic High Dimensional Robust Statistics

How do you perform parameter estimation when a fraction of the data is unreliable? This is the motivating question of robust statistics, and there are many applications (model misspecification, outlier removal, reliable/adversarial/secure ML). There are several models for how the data can become unreliable, but this talk focused on two: (1) in Huber’s contamination model, the data is drawn from a mixture of a “good” distribution and an arbitrary “bad” distribution (this might be appropriate in the model misspecification setting), and (2) in an adversarial model, someone looks at the current data and your estimation algorithm and then adds data to ruin your estimate (this might be appropriate in the data poisoning setting, for example).

This talk primarily considered the problem of robustly estimating the mean of a spherical Gaussian, which was essentially solved in 2016 (see this and that). In the 1-dimensional case, the median serves as a robust estimator, but in higher-dimensional cases, the best known estimators (before 2016) either had suboptimal error rates or required super-polynomial runtimes. The solutions to this problem take signal from the covariance matrix to discern whether the sample mean is a good estimator. Intuitively, if the covariance matrix is spectrally close to the identity, then we can expect the empirical mean to be close to the population mean. This holds even after adversarially adding and/or removing a fraction of data. As such, if there is corrupt data, one could remove data points until the covariance matrix is close to the identity. For example, one may appeal to the Gaussian annulus theorem to remove outliers, or iteratively project the data onto the leading eigenvector of the sample covariance matrix to identify points whose removal would move the covariance matrix closer to the identity. These are the key ideas used in these algorithms.
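
Here is a heavily simplified sketch of that filtering idea in Python (the stopping threshold, the fraction of points dropped per round, and the toy example are ad hoc choices for illustration, not the actual algorithms or guarantees from the papers above):

```python
# Simplified filtering sketch for robustly estimating the mean of a spherical
# Gaussian: while the sample covariance has an eigenvalue much larger than 1,
# project onto the top eigenvector and discard the most extreme points.
import numpy as np

def filtered_mean(X, eig_tol=1.5, drop_frac=0.01, max_iter=50):
    X = X.copy()
    for _ in range(max_iter):
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        if eigvals[-1] <= eig_tol:       # covariance is close to the identity
            break
        v = eigvecs[:, -1]               # direction of largest variance
        scores = np.abs((X - np.median(X, axis=0)) @ v)
        keep = scores <= np.quantile(scores, 1 - drop_frac)
        X = X[keep]                      # drop the most extreme points
    return X.mean(axis=0)

# Toy example: 10% of the points are planted far away in one direction.
rng = np.random.default_rng(3)
n, d = 2000, 50
good = rng.standard_normal((n, d))                    # true mean is 0
bad = rng.standard_normal((n // 10, d)) + 8 * np.eye(d)[0]
X = np.vstack([good, bad])
print("naive mean error:   ", np.linalg.norm(X.mean(axis=0)))
print("filtered mean error:", np.linalg.norm(filtered_mean(X)))
```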

The only potential sub-optimality in the existing solutions is in the adversarial noise model, since the error rate is “only” optimal for sub-Gaussian distributions. However, there are statistical query lower bounds for reaching the information-theoretic optimal error rate, suggesting this problem is hard. Ilias indicated that (despite the scope of his talk) we don’t actually need to know the covariance in order to robustly estimate the mean. He also posed the open (but soon-to-be-closed) problem of robustly learning a mixture of two arbitrary Gaussians, and he concluded with a much broader problem: How can we approach a general algorithmic theory of robustness?
