# Recent advances in mathematical data science

Part of the experience of giving a talk at Oberwolfach is documentation. First, they ask you to handwrite the abstract of your talk into a notebook of sorts for safekeeping. Later, they ask you to tex up an extended abstract for further documentation. This time, I gave a longer version of my SPIE talk (here are the slides). Since I posted my extended abstract on my blog last time, I figured I’d do it again:

This talk describes recent work on three different problems of interest in mathematical data science, namely, compressive classification, $k$-means clustering, and deep learning. (Based on three papers: one, two, three.)

First, compressive classification is a problem that comes on the heels of compressive sensing. In compressive sensing, one exploits the underlying structure of a signal class in order to exactly reconstruct any signal from the class given very few linear measurements of the signal. However, many applications do not require an exact reconstruction of the image, but rather a classification of that image (for example, is this a picture of a cat, or of a dog?). As such, it makes intuitive sense that the classification task might succeed given far fewer measurements than are necessary for compressive sensing.

Much like compressive sensing, compressive classification must exploit some notion of simplicity of the data set. For this talk, we consider data sets which are linearly separable, that is, one may distinguish two classes of points $A,B\subseteq\mathbb{R}^n$ by thresholding an inner product:

$x\in A \quad \Longrightarrow \quad \langle x,v\rangle < \theta,$

$x\in B \quad \Longrightarrow \quad \langle x,v\rangle > \theta.$

In particular, given sets $A$ and $B$ and tolerance $\eta$, we seek the smallest $m$ such that $PA$ and $PB$ are linearly separable for a random $m\times n$ matrix $P$ with probability $\geq1-\eta$. This establishes a minimal number of measurements necessary to classify in the compressed domain. To approach this problem, we note that $PA$ and $PB$ are linearly separable if and only if the null space of $P$ trivially intersects the cone generated by the Minkowski difference $A-B$. As such, we may leverage Gordon’s theory of escape through a mesh, or more recently, the approximate kinematic formula from conic integral geometry. With these tools, we find estimates in the special cases where $A$ and $B$ are Euclidean balls, and more generally, when they are ellipsoids (see this paper for details).

The second problem we discuss is $k$-means clustering. Here, given a point cloud $P\subseteq\mathbb{R}^n$, one is asked to partition the points into $k$ clusters $C_1,\ldots,C_k$ such that the corresponding cluster centers exhibit the smallest possible sum of squared error:

$\displaystyle{\min \quad \sum_{t=1}^k\sum_{x\in C_t}\bigg\|x-\frac{1}{|C_t|}\sum_{y\in C_t}y\bigg\|^2 \quad \text{s.t.} \quad C_1\sqcup\cdots\sqcup C_k=P}$

Unfortunately, minimizing the so-called $k$-means objective is NP-hard in general. However, Lloyd’s algorithm, which alternates between finding cluster centers for a proto-partition and reassigning points to the nearest cluster center, performs well in practice. Unfortunately, there is currently no guarantee that such an algorithm finds the $k$-means-optimal clustering.

In this talk, we consider a semidefinite relaxation of the $k$-means problem. Denoting $p=|P|$, we define the $p\times p$ matrix $D$ by $D_{ij}:=\|x_i-x_j\|^2$. Next, letting $1_{C_t}$ denote the $p$-dimensional indicator vector of $C_t$, then a straightforward manipulation gives

$\displaystyle{\sum_{t=1}^k\sum_{x\in C_t}\bigg\|x-\frac{1}{|C_t|}\sum_{y\in C_t}y\bigg\|^2 = \frac{1}{2}\mathrm{Tr}\bigg(D\sum_{t=1}^k\frac{1}{|C_t|}1_{C_t}1_{C_t}^\top\bigg).}$

Writing $X:=\sum_{t=1}^k\frac{1}{|C_t|}1_{C_t}1_{C_t}^\top$, we identify several convex constraints that $X$ satisfies, and we may optimize subject to these constraints in order to produce a polytime-solvable program:

$\min\quad\mathrm{Tr}(DX)\quad\text{s.t.}\quad\mathrm{Tr}(X)=k,~X1=1,~X\geq0,~X\succeq0.$

In general, we can expect the value of this relaxed program to be smaller than the value of the original $k$-means program. However, if the relaxed optimizer happens to be integral, meaning the optimal $X$ has the form $\sum_{t=1}^k\frac{1}{|C_t|}1_{C_t}1_{C_t}^\top$, then we may conclude that the corresponding clustering $C_1,\ldots,C_k$ is $k$-means optimal. Amazingly, when the data is drawn randomly from a reasonable model (the so-called stochastic ball model, introduced in this paper), one may show that the relaxed optimizer is integral with high probability (see this paper and that paper). It remains to find faster-than-SDP algorithms which enjoy a similar performance guarantee.

The last problem we consider is deep learning. Today, deep learning is the state-of-the-art technique for several important classification tasks. For each of these problems, given a labeled training set, one is tasked with producing a labeling function that not only matches the training set, but is also simple enough to generalize well to a test set. For deep learning, the labeling function is implemented by a deep neural network, which amounts to a large circuit of neurons, where each neuron linearly combines the outputs of its parent neurons, and then outputs a nonlinear function of this combination. To learn this labeling function, practitioners locally optimize the function parameters so as to fit the training set. To date, there is little theory to explain why this should perform as well as it does in practice.

Our approach to deep learning is motivated by an analogy with Boolean circuits. These circuits are similar to neural nets, except neurons are replaced by Boolean gates (such as AND, OR, or threshold gates). Interestingly, Boolean circuits of simple structure tend to implement Boolean functions $f\colon\{\pm1\}^n\rightarrow\{\pm1\}$ of concentrated spectra (see this paper and that paper), that is, $\{a_S\}_{S\subseteq[n]}$ is nearly sparse, where

$\displaystyle{f(x_1,\ldots,x_n)=\sum_{S\subseteq[n]}a_S\prod_{i\in S}x_i\qquad\forall (x_1,\ldots,x_n)\in\{\pm1\}^n.}$

Passing through the analogy, we hypothesize that learnable neural nets are necessarily simple, and furthermore, Boolean functions that are well approximated by such neural nets necessarily have concentrated spectra. If this hypothesis is true, then the deep learning problem can be relaxed to a sparse approximation problem. We test this hypothesis by classifying the zeros and ones in the MNIST database of handwritten digits by way of sparse approximation (indeed, deep neural nets currently hold the record for classifying MNIST digits). In our experiment, we managed to obtain a misclassification rate of 0.74%, thereby proving the concept of classification by sparse approximation (see this paper for details). Unfortunately, our naive method of sparse approximation does not scale well, and so it remains to find a scalable alternative to implement on other instances of binary classification.