## MATH 8610: Mathematics of Data Science

This spring, I’m teaching a graduate-level special topics course called “Mathematics of Data Science” at the Ohio State University. This will be a research-oriented class, and in lecture, I plan to cover some of the important ideas from convex optimization, probability, dimensionality reduction, clustering, and sparsity.

The current draft consists of a chapter on convex optimization. I will update the above link periodically. Feel free to comment below.

UPDATE #1: Lightly edited Chapter 1 and added a chapter on probability.

UPDATE #2: Lightly edited Chapter 2 and added a section on PCA.

UPDATE #3: Added a section on random projection.

UPDATE #4: Lightly edited Chapter 3. The semester is over, so I don’t plan to update these notes again until I teach a complementary special topics course next year.

UPDATE #5: As mentioned above, I’m teaching a complementary installment of this class this semester. I fixed several typos throughout, and I added a new section on embeddings from pairwise data.

UPDATE #6: Added a section on the clique problem.

UPDATE #7: Added a section on the Lovasz number.

UPDATE #8: Added a section on planted clique.

UPDATE #9: Added sections on maximum cut and minimum normalized cut.

UPDATE #10: Added a section on k-means clustering.

UPDATE #11: Started a chapter on compressed sensing.

UPDATE #12: Started a section on uniform guarantees.

UPDATE #13: Started a chapter on matrix analysis.

UPDATE #14: Started a section on matrix representations.

UPDATE #15: Started a section on spectral theory.

UPDATE #16: Added to the section on spectral theory.

UPDATE #17: Added more to the section on spectral theory.

UPDATE #18: Added even more to the section on spectral theory.

UPDATE #19: Finished the section on spectral theory and added a section on tensors.

UPDATE #20: Finished the section on tensors.

UPDATE #21: Added a section on random graphs.

## A few paper announcements

This last semester, I was a long-term visitor at the Simons Institute for the Theory of Computing. My time there was rather productive, resulting in a few (exciting!) arXiv preprints, which I discuss below.

1. SqueezeFit: Label-aware dimensionality reduction by semidefinite programming.

Suppose you have a bunch of points in high-dimensional Euclidean space, some labeled “cat” and others labeled “dog,” say. Can you find a low-rank projection such that after projection, cats and dogs remain separated? If you can implement such a projection as a sensor, then that sensor collects enough information to classify cats versus dogs. This is the main idea behind compressive classification.

## Foundations of Data Science Boot Camp, V

This is the fifth (and final) entry to summarize talks in the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Friday, we heard talks from Ilya Razenshteyn and Michael Kapralov. Below, I link videos and provide brief summaries of their talks.

Ilya Razenshteyn — Nearest Neighbor Methods

## Foundations of Data Science Boot Camp, IV

This is the fourth entry to summarize talks in the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Thursday, we heard talks from Santosh Vempala and Ilias Diakonikolas. Below, I link videos and provide brief summaries of their talks.

Santosh Vempala — High Dimensional Geometry and Concentration

## Foundations of Data Science Boot Camp, III

This is the third entry to summarize talks in the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Wednesday, we heard talks from Fred Roosta and Will Fithian. Below, I link videos and provide brief summaries of their talks.

Fred Roosta — Stochastic Second-Order Optimization Methods

## Foundations of Data Science Boot Camp, II

This is the second entry to summarize talks in the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Tuesday, we heard talks from Ken Clarkson, Rachel Ward, and Michael Mahoney. Below, I link videos and provide brief summaries of their talks.

Ken Clarkson — Sketching for Linear Algebra: Randomized Hadamard, Kernel Methods

## Foundations of Data Science Boot Camp

I’m spending the semester at the Simons Institute for the Theory of Computing as part of the program on Foundations of Data Science. This was the first day of the “boot camp” week, which was organized to acquaint program participants with the key themes of the program. Today, we heard talks from Ravi Kannan and David Woodruff. Below, I link videos and provide brief summaries of their talks.

Ravi Kannan — Foundations of Data Science

## Monte Carlo approximation certificates for k-means clustering

This week, I visited Afonso Bandeira at NYU to give a talk in the MaD seminar on the semidefinite relaxation of k-means. Here are the slides. The last part of the talk is very new; I worked it out with Soledad Villar while she visited me a couple weeks ago, and our paper just hit the arXiv. In this blog entry, I’ll briefly summarize the main idea of the paper.

Suppose you are given data points $\{x_i\}_{i\in T}\subseteq\mathbb{R}^m$, and you are tasked with finding the partition $C_1\sqcup\cdots\sqcup C_k=T$ that minimizes the k-means objective

$\displaystyle{\frac{1}{|T|}\sum_{t\in[k]}\sum_{i\in C_t}\bigg\|x_i-\frac{1}{|C_t|}\sum_{j\in C_t}x_j\bigg\|^2\qquad(T\text{-IP})}$

(Here, we normalize the objective by $|T|$ for convenience later.) To do this, you will likely run MATLAB’s built-in implementation of k-means++, which randomly selects $k$ of the data points (with an intelligent choice of random distribution), and then uses these data points as proto-centroids to initialize Lloyd’s algorithm. In practice, this works very well: After running it a few times, you generally get a very nice clustering. But when do you know to stop looking for an even better clustering?

## Talks from the Summer of ’17

This summer, I participated in several interesting conferences. This entry documents my slides and describes a few of my favorite talks from the summer. Here are links to my talks:

UPDATE: SIAM AG17 just posted a video of my talk.

Now for my favorite talks from FoCM, ILAS, SIAM AG17 and SPIE:

Ben RechtUnderstanding deep learning requires rethinking generalization

In machine learning, you hope to fit a model so as to be good at prediction. To do this, you fit to a training set and then evaluate with a test set. In general, if a simple model fits a large training set pretty well, you can expect the fit to generalize, meaning it will also fit the test set. By conventional wisdom, if the model happens to fit the training set exactly, then your model is probably not simple enough, meaning it will not fit the test set very well. According to Ben, this conventional wisdom is wrong. He demonstrates this by presenting some observations he made while training neural nets. In particular, he allowed the number of parameters to far exceed the size of the training set, and in doing so, he fit the training set exactly, and yet he still managed to fit the test set well. He suggested that generalization was successful here because stochastic gradient descent implicitly regularizes. For reference, in the linear case, stochastic gradient descent (aka the randomized Kaczmarz method) finds the solution of minimal 2-norm, and it converges faster when the optimal solution has smaller 2-norm. Along these lines, Ben has some work to demonstrate that even in the nonlinear case, fast convergence implies generalization.

## Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk

Vlad Voroninski recently posted an arXiv preprint with Paul Hand that provides compressed sensing guarantees using a neural net-based generative signal model. This offers some theoretical justification for the shocking empirical results presented in the “Compressed sensing using generative models” paper, which demonstrates compressed sensing with 10 times fewer measurements than conventional compressed sensing (the source code is available here). I was especially excited to see this paper, having recently read Michael Elad’s editorial on deep learning. To learn more, I interviewed Vlad (see below); I’ve lightly edited his responses for formatting and hyperlinks:

DGM: What is the origin story of this project? Were you and Paul inspired by the “Compressed sensing using generative models” paper?

VV: I have been working extensively with applied deep learning for the last year or so, and have been inspired by recent applications of deep generative image priors to classical inverse problems, such as the super resolution work by Fei Fei Li et al. Moreover, recent work on regularizing with deep generative priors for synthesizing the preferred inputs to neural activations, by Yosinski et al., made me optimistic that GAN-based generative priors are capturing sophisticated natural image structure (the synthetic images obtained in this paper look incredibly realistic).