I use concentration inequalities a lot in my research, and it’s about time I write a post about them. Throughout this post, will denote independent realizations of a common random variable taking values in the real line. We will take , , and
The central limit theorem (CLT) essentially states that
where the Q-function denotes the tail probability of the standard normal distribution. Unfortunately, this only ensures pointwise convergence for each , meaning CLT provides no uniform tail bound for any fixed . The purpose of a concentration inequality is to provide such a bound.
A crucial ingredient for concentration phenomena is control over rare events. Indeed, if significantly deviates from its mean with sizable probability (where significance of deviation is measured relative to the standard deviation), then exhibits less concentration. For example, consider the lazy coin toss random variable:
Then and , and so taking gives
Notice that as gets small, also gets small, and so any deviation from the mean becomes particularly large relative to . In such cases, any uniform tail bound of the form (i.e., the subgaussian decay in we expect considering CLT) must have , signaling less concentration when is smaller.
Observe that this is not a problem when is distributed , since in this case is distributed , and so
regardless of . This suggests that the normal distribution exhibits enough decay to render any significant deviation sufficiently improbable. Considering CLT, for any and any corresponding tail bound which is valid for arbitrarily large , we necessarily have for all . As such, the Gaussian case is optimal since it allows . In some sense, concentration inequalities leverage certain qualities that a given random variable has in common with the Gaussian.
— No additional assumptions —
Since random variables of finite variance satisfy CLT, it is perhaps not surprising that something can be said about concentration in this general case. Here, we note that , and so Chebyshev’s inequality gives
Of course, the decay here is terrible compared to CLT, but this inequality is tight in some sense. Indeed, for every , there exists a random variable such that equality above is achieved with (just take the lazy coin toss defined above with ). Granted, concentration concerns larger values of , but as we saw with the lazy coin toss, one cannot expect a subgaussian tail bound with a fixed without further assumptions.
— Confined to an interval —
Since we want to control rare events, it is reasonable to consider random variables for which there exists an interval of width such that almost surely. Intuitively, it will be important to recognize how big is relative to , but for now, we will consider a concentration bound in terms of . To this end, Hoeffding’s inequality gives
We seek finite- analogs to CLT, and so we wish to replace with . First, note that without loss of generality, in which case
meaning . This is tight considering the random variable which is with equal probabilities. Now consider any random variable such that . Then taking gives
As gets small, is large relative to , meaning significant deviations may be probable, thereby leading to less concentration (indeed, the coefficient of in the tail bound is smaller).
Another bound, called Bernstein’s inequality, does a much better job of accounting for :
This bound is easier on the eyes after applying the estimate and changing variables for the latter case of the maximization:
Note that corresponds to . While Hoeffding did a terrible job of producing good decay when is small, Bernstein says that a small can be overcome by taking . However, larger values of do not enjoy the subgaussian tail bound; such values of make the deviation comparable to the width of (which is the maximum possible size of the deviation). To see that the subgaussian tail is not possible for these larger values of , consider the lazy coin toss with and :
which is for any fixed . (This doesn’t contradict Hoeffding since vanishes as gets large in this case.) Overall, Bernstein is essentially two bounds, each acting in a different regime. The second regime is comparable to Hoeffding, which equivalently gives . In fact, Hoeffding is stronger than Bernstein when , though this regime is perhaps less interesting. The following figure illustrates the regimes over which the different bounds hold:
Specifically, the first part of Bernstein is given in red, the second part is given in green, and Hoeffding is given in blue. The solid portions on the lines indicate which bounds hold over which regimes. Observe that Bernstein improves over Hoeffding by establishing more concentration over a slightly smaller range of values.
Oftentimes, one is dealt an extreme version of a bounded random variable: is supported on the boundary of an interval. Without loss of generality, takes values in , and if is the probability that , we may apply Bernstein with and . Alternatively, we may apply the Chernoff bound, which behaves similarly, but is sharper.
— Truncation —
What if isn’t confined to any interval, but its distribution decays rapidly? In this case, one may use a trick called truncation. We will illustrate this trick with an example.
Suppose has the following Laplace distribution function:
Then and . Let’s truncate this distribution at some threshold to be specified later. The “small” portion will be
and the “large” portion will be . Clearly, , and it’s straightforward to verify that
Since , the union bound gives
By Bernstein, the first term is provided , and since , this occurs when . We use Chebyshev to conclude that the second term is . At this point, picking gives
This is not necessarily the best concentration inequality one can derive for this distribution, but considering CLT, it is certainly optimal up to constants when .