A relaxation of deep learning?

Jesse Peterson and I recently arxiv’d our paper for Wavelets and Sparsity XVI at SPIE this year. This paper focuses on learning functions f\colon\{\pm1\}^n\rightarrow\{\pm1\} of the form

\displaystyle{f(x_1,\ldots,x_n) = \mathrm{sign}\bigg(\sum_{i=1}^ka_i\prod_{j\in S_i}x_j\bigg)}, \qquad (*)

where k is small, a_1,\ldots,a_k\in\mathbb{R}, and S_1,\ldots,S_k\subseteq\{1,\ldots,n\}. Notice that any such Boolean function can be viewed as a labeling function of strings of n bits, and so learning the function from labeled instances amounts to a binary classification problem.
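For concreteness, here is a minimal sketch (in Python, with function and variable names of my own choosing) of how a function of the form (*) is evaluated on a given bit string:

```python
import numpy as np

def evaluate_parity_combination(x, coeffs, supports):
    """Evaluate f(x) = sign(sum_i a_i * prod_{j in S_i} x_j) for x in {+1,-1}^n.

    coeffs   -- list of real weights a_1, ..., a_k
    supports -- list of index sets S_1, ..., S_k (subsets of {0, ..., n-1})
    """
    total = 0.0
    for a, S in zip(coeffs, supports):
        parity = 1
        for j in S:
            parity *= x[j]          # product of the selected +/-1 coordinates
        total += a * parity
    return 1 if total >= 0 else -1

# Example: f(x) = sign(0.6*x_0*x_1 - 0.3*x_2), a member of C_{3,2}
x = np.array([1, -1, 1])
print(evaluate_parity_combination(x, [0.6, -0.3], [[0, 1], [2]]))  # -> -1
```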

If we identify \{\pm1\}^n with \mathbb{Z}_2^n, then the a_i’s are essentially the big entries of the Walsh–Hadamard transform of f, and these entries are indexed by the S_i’s. As such, functions of the form (*) are essentially the Boolean functions with concentrated spectra. These functions have been shown to well approximate the Boolean functions with sufficiently simple circuit implementations (e.g., see one, two, three), and given the strong resemblance between Boolean circuits and neural networks, the following hypothesis seems plausible:

Hypothesis. Boolean functions that are well approximated by learnable deep neural networks also enjoy a highly concentrated Walsh–Hadamard transform.
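To unpack the transform in question: for f\colon\{\pm1\}^n\rightarrow\{\pm1\}, the Walsh–Hadamard coefficient of f indexed by S\subseteq\{1,\ldots,n\} is the correlation \mathbb{E}_x[f(x)\prod_{j\in S}x_j]. The following brute-force sketch (mine, not the sparse transform alluded to in the paper; it tabulates all 2^n values, so it is only for small n) computes every coefficient with the fast Walsh–Hadamard transform:

```python
import numpy as np

def walsh_hadamard_coefficients(f, n):
    """Return the 2^n Walsh-Hadamard (Fourier) coefficients of f: {+1,-1}^n -> {+1,-1}.

    Coefficient index b (read as a bitmask) corresponds to the subset
    S = {j : bit j of b is 1} and equals E_x[ f(x) * prod_{j in S} x_j ].
    """
    # Tabulate f on all 2^n inputs, encoding bit b_j = 1 as x_j = -1.
    table = np.empty(2 ** n)
    for b in range(2 ** n):
        x = [(-1) ** ((b >> j) & 1) for j in range(n)]
        table[b] = f(x)

    # In-place fast Walsh-Hadamard transform, O(n 2^n).
    h = 1
    while h < len(table):
        for i in range(0, len(table), 2 * h):
            for j in range(i, i + h):
                table[j], table[j + h] = table[j] + table[j + h], table[j] - table[j + h]
        h *= 2
    return table / 2 ** n   # normalize so the coefficients are expectations

# Example: the parity x_0*x_1 has a single nonzero coefficient, at bitmask 0b11.
coeffs = walsh_hadamard_coefficients(lambda x: x[0] * x[1], n=3)
print(np.flatnonzero(np.abs(coeffs) > 1e-9))  # -> [3]
```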

This hypothesis motivated our study, as it suggests that one doesn’t need to learn individual neurons, or even guess the underlying structure of the network (which has become quite an art form), but rather find a function of the form (*) that matches the training set. Of course, this won’t work unless these functions satisfy a few miracles:

1. Simple. The set C_{n,k} of functions of the form (*) needs to be sufficiently simple in order to generalize from the training set. To address this, we prove that the VC dimension of C_{n,k} is O(nk).

2. Admits fast optimization. We need a fast algorithm that finds a member of C_{n,k} that fits a given training set. This problem amounts to a mixture between the sparse fast Walsh–Hadamard transform and one-bit compressed sensing. Finding a scalable algorithm remains open, but we propose an algorithm that seems to do okay in small dimensions (a naive baseline for this fitting problem is sketched after this list).

3. Models reality. We need C_{n,k} to fit real-world labeling functions. Given the broad applicability of deep neural networks, it would suffice for our hypothesis above to hold. In the paper, we also ran a convincing experiment, which I discuss below.
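To make the fitting problem in item 2 concrete, here is a naive spectral baseline of my own (not the algorithm from the paper): estimate the empirical Walsh–Hadamard coefficients of the labels over all subsets up to a small degree, keep the k largest in magnitude, and classify with the sign of the resulting combination.

```python
import itertools
import numpy as np

def fit_sparse_parity_model(X, y, k, max_degree=2):
    """Naive spectral baseline: estimate empirical Walsh-Hadamard coefficients
    over all subsets of size <= max_degree and keep the k largest in magnitude.

    X -- (m, n) array with entries in {+1, -1}
    y -- (m,) array of labels in {+1, -1}
    Returns (coeffs, supports) defining x -> sign(sum_i a_i prod_{j in S_i} x_j).
    """
    n = X.shape[1]
    candidates = []
    for d in range(max_degree + 1):
        for S in itertools.combinations(range(n), d):
            chi = np.prod(X[:, S], axis=1) if S else np.ones(len(y))
            candidates.append((np.mean(y * chi), S))   # empirical coefficient
    candidates.sort(key=lambda t: abs(t[0]), reverse=True)
    top = candidates[:k]
    return [a for a, _ in top], [list(S) for _, S in top]

def predict(X, coeffs, supports):
    scores = np.zeros(len(X))
    for a, S in zip(coeffs, supports):
        scores += a * (np.prod(X[:, S], axis=1) if S else 1.0)
    return np.where(scores >= 0, 1, -1)

# Hypothetical usage on +/-1-encoded data:
# coeffs, supports = fit_sparse_parity_model(X_train, y_train, k=150)
# error_rate = np.mean(predict(X_test, coeffs, supports) != y_test)
```

Note that this baseline enumerates every candidate subset up to the degree cutoff, so it scales poorly with n and with the degree; this is exactly the bottleneck that requirement 2 asks a faster algorithm to remove.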

To test part 3 above, we applied our algorithm to the MNIST database of handwritten digits. Since our algorithm is slower in higher dimensions, we first processed the database by blurring, downsampling, and thresholding to produce 5\times 5 images of on-and-off pixels:

[Figure: examples of the resulting 5\times 5 on-and-off pixel digits]
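For readers who want to reproduce the preprocessing, the following sketch is one plausible version of the blur/downsample/threshold pipeline; the blur width, crop, and threshold below are my own guesses rather than the values used in the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def binarize_to_5x5(image_28x28, sigma=1.0, threshold=0.35):
    """Blur, downsample, and threshold a 28x28 grayscale digit to a 5x5 +/-1 image.

    The blur width and threshold are illustrative guesses, not the paper's values.
    """
    img = gaussian_filter(image_28x28.astype(float) / 255.0, sigma=sigma)
    img = img[1:26, 1:26]                                 # crop to a central 25x25 window
    blocks = img.reshape(5, 5, 5, 5).mean(axis=(1, 3))    # average over 5x5 blocks
    return np.where(blocks > threshold, 1, -1).ravel()    # 25 entries in {+1, -1}
```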

Next, we ran our algorithm to find a member of C_{25,150} that fits a training set of 8000 different zeros and ones, and then we tested this function on another collection of 3800 zeros and ones. We ended up with a misclassification rate of 0.74% after 160 seconds of training. The following depicts the 28 misclassified digits:

[Figure: the 28 misclassified digits]

Can you distinguish the zeros from the ones? (See the paper for the solution.)

This experiment suggests a few conclusions. First, the function class C_{n,k} appears to meet requirements 1 and 3 above in practice. Second, judging from the 5\times 5 depictions of the misclassified digits, the classifier probably needs more resolution before it can perform much better, and that in turn requires an algorithm that scales better with n. As such, it remains to tackle requirement 2 above: we need faster algorithms. See the conclusion of the paper for a few more open problems.
