An intuition for concentration inequalities

I use concentration inequalities a lot in my research, and it’s about time I write a post about them. Throughout this post, $\{X_i\}$ will denote independent realizations of a common random variable $X$ taking values in the real line. We will take $\mu:=\mathbb{E}X<\infty$, $\sigma^2:=\mathrm{Var}(X)<\infty$, and

$\displaystyle{\overline{X}_n:=\frac{1}{n}\sum_{i=1}^nX_i.}$

The central limit theorem (CLT) essentially states that

$\displaystyle{\lim_{n\rightarrow\infty}\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)=2Q(t)\leq e^{-t^2/2} \qquad \forall t\geq0,}$

where the Q-function $Q(\cdot)$ denotes the tail probability of the standard normal distribution. Unfortunately, this only ensures pointwise convergence for each $t$, meaning CLT provides no uniform tail bound for any fixed $n$. The purpose of a concentration inequality is to provide such a bound.

A crucial ingredient for concentration phenomena is control over rare events. Indeed, if $X$ significantly deviates from its mean with sizable probability (where significance of deviation is measured relative to the standard deviation), then $X$ exhibits less concentration. For example, consider the lazy coin toss random variable:

$\displaystyle{X=\left\{\begin{array}{rcc}+1&\mbox{w.p.}&p/2\\ 0&\mbox{w.p.}&1-p\\ -1&\mbox{w.p.}&p/2 \end{array}\right.}$

Then $\mu=0$ and $\sigma^2=p$, and so taking $t=\sqrt{n}/\sigma$ gives

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)=\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq1\bigg)=p^n=e^{-(p\log(1/p))t^2}.}$

Notice that as $p$ gets small, $\sigma$ also gets small, and so any deviation from the mean becomes particularly large relative to $\sigma$. In such cases, any uniform tail bound of the form $\leq e^{-ct^2}$ (i.e., the subgaussian decay in $t$ we expect considering CLT) must have $c\leq p\log(1/p)$, signaling less concentration when $p$ is smaller.

Observe that this is not a problem when $X$ is distributed $N(\mu,\sigma^2)$, since in this case $\overline{X}_n$ is distributed $N(\mu,\sigma^2/n)$, and so

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)=2Q(t)\leq e^{-t^2/2} \qquad \forall t\geq0,}$

regardless of $n$. This suggests that the normal distribution exhibits enough decay to render any significant deviation sufficiently improbable. Considering CLT, for any $X$ and any corresponding tail bound $f(t)$ which is valid for arbitrarily large $n$, we necessarily have $f(t)\geq 2Q(t)$ for all $t\geq0$. As such, the Gaussian case is optimal since it allows $f=2Q$. In some sense, concentration inequalities leverage certain qualities that a given random variable $X$ has in common with the Gaussian.

Since random variables of finite variance satisfy CLT, it is perhaps not surprising that something can be said about concentration in this general case. Here, we note that $\mathrm{Var}(\overline{X}_n)=\sigma^2/n$, and so Chebyshev’s inequality gives

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)\leq\frac{1}{t^2} \qquad \forall t>0.}$

Of course, the decay here is terrible compared to CLT, but this inequality is tight in some sense. Indeed, for every $t\geq1$, there exists a random variable $X$ such that equality above is achieved with $n=1$ (just take the lazy coin toss defined above with $p=1/(2t^2)$). Granted, concentration concerns larger values of $n$, but as we saw with the lazy coin toss, one cannot expect a subgaussian tail bound $\leq e^{-ct^2}$ with a fixed $c$ without further assumptions.

— Confined to an interval —

Since we want to control rare events, it is reasonable to consider random variables $X$ for which there exists an interval $I$ of width $w$ such that $X\in I$ almost surely. Intuitively, it will be important to recognize how big $w$ is relative to $\sigma$, but for now, we will consider a concentration bound in terms of $w$. To this end, Hoeffding’s inequality gives

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{sw}{\sqrt{n}}\bigg)\leq2e^{-2s^2} \qquad \forall s\geq0.}$

We seek finite-$n$ analogs to CLT, and so we wish to replace $w$ with $\sigma$. First, note that $I=[-w/2,w/2]$ without loss of generality, in which case

$\sigma^2\leq\mathbb{E}X^2\leq\mathrm{ess\,sup}_{\omega\in\Omega}X(\omega)^2=(w/2)^2,$

meaning $\sigma\leq w/2$. This is tight considering the random variable which is $\pm w/2$ with equal probabilities. Now consider any random variable $X$ such that $\rho:=\sigma/w\in(0,1/2]$. Then taking $t=s/\rho$ gives

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)\leq2e^{-2\rho^2t^2} \qquad \forall t\geq0.}$

As $\rho$ gets small, $w$ is large relative to $\sigma$, meaning significant deviations may be probable, thereby leading to less concentration (indeed, the coefficient of $t^2$ in the tail bound is smaller).

Another bound, called Bernstein’s inequality, does a much better job of accounting for $\rho$:

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)\leq2\mathrm{exp}\bigg(-\frac{t^2/2}{1+t/(3\rho\sqrt{n})}\bigg) \qquad \forall t\geq0.}$

This bound is easier on the eyes after applying the estimate $1+t/(3\rho\sqrt{n})\leq2\max\{1,t/(3\rho\sqrt{n})\}$ and changing variables for the latter case of the maximization:

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)\leq2e^{-t^2/4} \qquad \forall t\in[0,3\rho\sqrt{n}],}$

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\epsilon w\bigg)\leq2e^{-3\epsilon n/4} \qquad \forall \epsilon\in[3\rho^2,1].}$

Note that $t=3\rho\sqrt{n}$ corresponds to $\epsilon=3\rho^2$. While Hoeffding did a terrible job of producing good decay when $\rho=\sigma/w$ is small, Bernstein says that a small $\rho$ can be overcome by taking $n\gg t^2/\rho^2$. However, larger values of $t$ do not enjoy the subgaussian tail bound; such values of $t$ make the deviation $|\overline{X}_n-\mu|$ comparable to the width $w$ of $I$ (which is the maximum possible size of the deviation). To see that the subgaussian tail is not possible for these larger values of $t$, consider the lazy coin toss with $p=1/n$ and $t=\sqrt{n}/\sigma=n$:

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)=p^n=e^{-n\log n},}$

which is $\gg e^{-ct^2}$ for any fixed $c$. (This doesn’t contradict Hoeffding since $\rho$ vanishes as $n$ gets large in this case.) Overall, Bernstein is essentially two bounds, each acting in a different regime. The second regime is comparable to Hoeffding, which equivalently gives $\leq 2e^{-2\epsilon^2 n}$. In fact, Hoeffding is stronger than Bernstein when $\epsilon\in(3/8,1]$, though this regime is perhaps less interesting. The following figure illustrates the regimes over which the different bounds hold:

Specifically, the first part of Bernstein is given in red, the second part is given in green, and Hoeffding is given in blue. The solid portions on the lines indicate which bounds hold over which regimes. Observe that Bernstein improves over Hoeffding by establishing more concentration over a slightly smaller range of $t$ values.

Oftentimes, one is dealt an extreme version of a bounded random variable: $X$ is supported on the boundary of an interval. Without loss of generality, $X$ takes values in $\{0,1\}$, and if $p$ is the probability that $X=1$, we may apply Bernstein with $\mu=p$ and $\sigma^2=p(1-p)$. Alternatively, we may apply the Chernoff bound, which behaves similarly, but is sharper.

— Truncation —

What if $X$ isn’t confined to any interval, but its distribution decays rapidly? In this case, one may use a trick called truncation. We will illustrate this trick with an example.

Suppose $X$ has the following Laplace distribution function:

$f(x)=\frac{1}{2}e^{-|x|}.$

Then $\mu=0$ and $\sigma^2=2$. Let’s truncate this distribution at some threshold $\tau\geq3$ to be specified later. The “small” portion will be

$\displaystyle{S:=\left\{\begin{array}{ll}X&\mbox{if }|X|\leq\tau\\ 0&\mbox{otherwise,}\end{array}\right.}$

and the “large” portion will be $L:=X-S$. Clearly, $\mathbb{E}S=\mathbb{E}L=0$, and it’s straightforward to verify that

$\displaystyle{\mathrm{Var}(S)=\mathbb{E}S^2=\int_0^\tau x^2e^{-x}dx\in(1,2),}$

and

$\displaystyle{\mathrm{Var}(L)=\mathbb{E}L^2=\int_\tau^\infty x^2e^{-x}dx\leq 2\tau^2 e^{-\tau}.}$

Since $|\overline{X}_n-\mu|\leq|\overline{S}_n|+|\overline{L}_n|$, the union bound gives

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)\leq\mathrm{Pr}\bigg(\big|\overline{S}_n\big|\geq\frac{\sqrt{2}t}{2\sqrt{n}}\bigg)+\mathrm{Pr}\bigg(\big|\overline{L}_n\big|\geq\frac{\sqrt{2}t}{2\sqrt{n}}\bigg).}$

By Bernstein, the first term is $\leq2e^{-t^2/8}$ provided $t\leq 3\rho\sqrt{n}$, and since $\rho\geq1/(2\tau)$, this occurs when $t\leq 3\sqrt{n}/(2\tau)$. We use Chebyshev to conclude that the second term is $\leq(4\tau^2/t^2)e^{-\tau}\leq(4/t^2)e^{-\tau/4}$. At this point, picking $\tau=n^{1/3}$ gives

$\displaystyle{\mathrm{Pr}\bigg(\big|\overline{X}_n-\mu\big|\geq\frac{t\sigma}{\sqrt{n}}\bigg)\leq\bigg(2+\frac{4}{t^2}\bigg)e^{-t^2/9} \qquad \forall t\in(0,3n^{1/6}/2).}$

This is not necessarily the best concentration inequality one can derive for this distribution, but considering CLT, it is certainly optimal up to constants when $1\ll t\ll n^{1/6}$.