This is the third entry to summarize talks in the “boot camp” week of the program on Foundations of Data Science at the Simons Institute for the Theory of Computing, continuing this post. On Wednesday, we heard talks from Fred Roosta and Will Fithian. Below, I link videos and provide brief summaries of their talks.
Fred Roosta — Stochastic Second-Order Optimization Methods
While first-order optimization makes use of (an approximation to) the gradient (e.g., gradient descent), second-order optimization also makes use of (an approximation to) the Hessian (e.g., Newton’s method). For second-order optimization, the iterations converge faster, but the per-iteration cost is greater. Even gradient descent is time prohibitive in the “big data” regime, since the objective F(x) to minimize is a sum of so many components . Recall that stochastic gradient descent estimates the gradient at each iteration by sampling the components. One may also sample components to estimate the Hessian. In fact, you can provably get away with each (smooth) component being convex provided the overall objective is strongly convex. See this paper for more information. In the convex but non-smooth case, you can apply a proximal Newton-type method (see this paper), and in the convex but semi-smooth case, this can be modified using Clarke’s generalized Jacobian. (I’m not sure how much is known about stochastic versions of these methods.)
If you haven’t seen it, check out the secant method for approximating roots of a function. The idea is to modify Newton’s root-finding algorithm by replacing derivatives with finite differences of successive iterations, producing cheaper iterations at the expense of the convergence rate (the golden ratio appears!). Quasi-Newton methods use this same idea to speed up Newton’s method, the most popular variant being BFGS. Other second-order methods leverage other estimates of the Hessian. In scientific computing, the objective function frequently takes the form , where f is convex. In this case, the Hessian of F can be approximated as , where H is the Hessian of f and J is the Jacobian of h; this approximation is used in Gauss–Newton. In machine learning, it is apparently common to use a Fisher information matrix as a proxy to the Hessian, resulting in an algorithm called natural gradient descent.
At the end, Fred posed an open problem: What is the average-case complexity (or smoothed complexity) of global convergence of Newton’s method with a line search? (He observes that real-world instances are always much faster than the standard worst-case guarantees suggest.)
Will Fithian — Statistical Inference
The purpose of this talk was to give a broad overview of the basics of statistical inference. Given a random variable from some parameterized distribution, an estimate is a function of the random variable that is intended to be close to some function of the distribution’s parameter. This closeness is measured by a loss function (such as the square of the difference), and the expected value of this loss (computed over the measure determined by a given choice of parameter) is called risk. Each estimator determines a risk function that varies with the parameter. To compare estimators, one may summarize this function with a scalar, either by averaging over a prior distribution over the parameter space (resulting in Bayes risk), or by computing the worst-case risk (called minimax risk). Another way to select an estimator is to restrict to unbiased estimators, which have the property that the expected value of the estimator equals the desired quantity. The modern approach is manage bias instead focusing on unbiased estimators, since by the bias–variance tradeoff, one may obtain a low-variance alternative. The Bayesian approach to estimation incorporates a prior “belief” of what the parameter is, and the minimizer of Bayes risk (in terms of the MSE loss) is simply the posterior mean. In some applications, it is difficult to identify a useful prior.
When designing an estimator for a hypothesis test, the first priority is to minimize Type I error in the worst case (quantified by the significance level), and the second priority is to minimize Type II error. Minimizing Type II error is equivalent to maximizing power, but the power should be thought of as a function of the true parameter. Will pointed out that confidence intervals are more informative than hypothesis tests, and you can view it as the outcome of infinitely many simultaneous hypothesis tests. This can be phrased in terms of a duality between confidence sets and hypothesis tests. Will concluded by discussing the maximum likelihood estimator, sketching an argument for its consistency and asymptotic normality.