# A relaxation of deep learning?

Jesse Peterson and I recently arxiv’d our paper for Wavelets and Sparsity XVI at SPIE this year. This paper focuses on learning functions $f\colon\{\pm1\}^n\rightarrow\{\pm1\}$ of the form

$\displaystyle{f(x_1,\ldots,x_n) = \mathrm{sign}\bigg(\sum_{i=1}^ka_i\prod_{j\in S_i}x_j\bigg)}, \qquad (*)$

where $k$ is small, $a_1,\ldots,a_k\in\mathbb{R}$, and $S_1,\ldots,S_k\subseteq\{1,\ldots,n\}$. Notice that any such Boolean function can be viewed as a labeling function of strings of $n$ bits, and so learning the function from labeled instances amounts to a binary classification problem.
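To make the setup concrete, here is a minimal sketch (my own illustration, not code from the paper) of how a function of the form $(*)$ evaluates a bit string; the example coefficients are hypothetical:

```python
import numpy as np

def evaluate(x, terms):
    """Evaluate f(x) = sign(sum_i a_i * prod_{j in S_i} x_j).

    x     -- sequence of +/-1 bits of length n
    terms -- list of (a_i, S_i) pairs, with S_i a set of 0-based indices
    """
    total = sum(a * np.prod([x[j] for j in S]) for a, S in terms)
    return 1 if total >= 0 else -1

# Hypothetical example with n = 3, k = 2: f = sign(2*x0*x1 - x2)
terms = [(2.0, {0, 1}), (-1.0, {2})]
print(evaluate([1, -1, 1], terms))   # sign(2*(-1) - 1) = -1
```

A labeled training set for the binary classification problem is then just a list of bit strings paired with the values this function assigns them.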

If we identify $\{\pm1\}^n$ with $\mathbb{Z}_2^n$, then the $a_i$'s are essentially the large entries of the Walsh–Hadamard transform of $f$, and these entries are indexed by the $S_i$'s. As such, functions of the form $(*)$ are essentially the Boolean functions with concentrated spectra. These functions have been shown to well approximate the Boolean functions with sufficiently simple circuit implementations (e.g., see one, two, three), and given the strong resemblance between Boolean circuits and neural networks, the following hypothesis seems plausible:
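For intuition about spectral concentration, one can compute the Walsh–Hadamard coefficients of a small Boolean function by brute force. The sketch below (again my own illustration, not from the paper) does this for 3-bit majority, whose spectrum sits entirely on the three singletons and the full set:

```python
import itertools
import numpy as np

def walsh_hadamard_coeffs(f, n):
    r"""Brute-force Walsh-Hadamard (Fourier) coefficients of f: {+-1}^n -> {+-1},
    returned as a dict mapping each subset S (a frozenset) to
    \hat{f}(S) = E_x[ f(x) * prod_{j in S} x_j ]."""
    points = list(itertools.product([1, -1], repeat=n))
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            chi = [np.prod([x[j] for j in S]) for x in points]
            coeffs[frozenset(S)] = sum(f(x) * c for x, c in zip(points, chi)) / len(points)
    return coeffs

# 3-bit majority: each singleton coefficient is 1/2, the coefficient on
# {0,1,2} is -1/2, and every other coefficient vanishes -- a concentrated
# spectrum with k = 4 nonzero terms.
maj = lambda x: 1 if sum(x) > 0 else -1
coeffs = walsh_hadamard_coeffs(maj, 3)
```

This brute-force computation takes $4^n$ work, which is exactly why one wants the sparse transforms discussed below.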

Hypothesis. Boolean functions that are well approximated by learnable deep neural networks also enjoy a highly concentrated Walsh–Hadamard transform.

This hypothesis motivated our study, as it suggests that one doesn’t need to learn individual neurons, or even guess the underlying structure of the network (which has become quite an art form), but rather find a function of the form $(*)$ that matches the training set. Of course, this won’t work unless these functions satisfy a few miracles:

1. Simple. The set $C_{n,k}$ of functions of the form $(*)$ needs to be sufficiently simple in order to generalize from the training set. To address this, we prove that the VC dimension of $C_{n,k}$ is $O(nk)$.

2. Admits fast optimization. We need a fast algorithm that finds a member of $C_{n,k}$ that fits a given training set. This problem amounts to a mixture between the sparse fast Walsh–Hadamard transform and one-bit compressed sensing. Finding a scalable algorithm remains open, but we propose an algorithm that seems to do okay in small dimensions.

3. Models reality. We need $C_{n,k}$ to fit real-world labeling functions. Considering the broad applicability of deep neural networks, it suffices that our hypothesis holds. In the paper, we also ran a convincing experiment, which I discuss below.
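To illustrate what "finding a member of $C_{n,k}$ that fits a training set" might look like, here is a toy baseline — emphatically not the algorithm from the paper — that empirically estimates the low-degree Walsh–Hadamard coefficients from the samples, keeps the $k$ largest in magnitude, and classifies with their sign:

```python
import itertools
import numpy as np

def fit_sparse_sign(X, y, k, max_degree=2):
    """Toy baseline (NOT the paper's algorithm): empirically estimate the
    low-degree Walsh-Hadamard coefficients of the labeling function,
    keep the k largest in magnitude, and classify with their sign.

    X -- (m, n) array of +/-1 bits; y -- length-m array of +/-1 labels
    """
    n = X.shape[1]
    sets, estimates = [], []
    for r in range(max_degree + 1):
        for S in itertools.combinations(range(n), r):
            chi = np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X))
            sets.append(S)
            estimates.append(float(np.mean(y * chi)))  # empirical coefficient for S
    top = np.argsort(np.abs(estimates))[::-1][:k]
    model = [(estimates[i], sets[i]) for i in top]

    def predict(x):
        total = sum(a * np.prod([x[j] for j in S]) for a, S in model)
        return 1 if total >= 0 else -1
    return predict

# Demo on a function that is exactly of the form (*): f = sign(2*x0*x1 + x2)
X = np.array(list(itertools.product([1, -1], repeat=3)))
y = np.array([1 if 2 * x[0] * x[1] + x[2] >= 0 else -1 for x in X])
predict = fit_sparse_sign(X, y, k=3)
```

This baseline enumerates all low-degree subsets, so it scales poorly in $n$ — which is precisely the scalability obstacle described in requirement 2.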

To test part 3 above, we applied our algorithm to the MNIST database of handwritten digits. Since our algorithm is slower in higher dimensions, we first processed the database by blurring, downsampling, and thresholding to produce $5\times 5$ images of on-and-off pixels:
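A preprocessing step along these lines might look as follows; the crop window and threshold below are my guesses, not the paper's exact parameters:

```python
import numpy as np

def binarize_28x28(img, threshold=0.5):
    """Reduce a 28x28 grayscale digit (values in [0,1]) to a 5x5 image of
    +/-1 pixels by cropping, block-averaging (which also blurs), and
    thresholding. The crop window and threshold are guesses, not the
    paper's exact preprocessing parameters.
    """
    img = np.asarray(img, dtype=float)
    cropped = img[1:26, 1:26]                               # central 25x25 window
    blocks = cropped.reshape(5, 5, 5, 5).mean(axis=(1, 3))  # mean of each 5x5 block
    return np.where(blocks > threshold, 1, -1)
```

Flattening the resulting $5\times 5$ array gives a point of $\{\pm1\}^{25}$, i.e., an input to a member of $C_{25,k}$.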

Next, we ran our algorithm to find a member of $C_{25,150}$ that fits a training set of 8000 different zeros and ones, and then we tested this function on another collection of 3800 zeros and ones. We ended up with a misclassification rate of 0.74% after 160 seconds of training. The following depicts the 28 misclassified digits:
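As a quick sanity check on the numbers, 28 errors out of 3800 test digits indeed gives the reported rate:

```python
# 28 misclassified digits out of 3800 test digits
errors, test_size = 28, 3800
rate = 100 * errors / test_size
print(f"misclassification rate: {rate:.2f}%")   # 0.74%
```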

Can you distinguish the zeros from the ones? (See the paper for the solution.)

This experiment suggests a few conclusions. First, the function class $C_{n,k}$ meets requirements 1 and 3 from above in practice. Considering the $5\times 5$ depictions of misclassified digits, the classifier probably needs more resolution before it can perform better. This in turn requires an algorithm that scales better with $n$. As such, it remains to tackle requirement 2 from above, that is, we need faster algorithms. See the conclusion of the paper for a few more open problems.