In the context of Bayesian inference, the length P of the parameter vector θ determines the dimension of the posterior distribution that a sampler must explore, and conventional random‐walk Metropolis–Hastings proposals mix increasingly slowly as this dimension grows.
For these reasons, Hamiltonian Monte Carlo (HMC) [27] has become a popular algorithm for fitting Bayesian models with large numbers of parameters. Like M‐H, HMC uses an accept step (Equation 2). Unlike M‐H, HMC takes advantage of additional information about the target distribution in the form of the log‐posterior gradient. HMC works by doubling the state space dimension with an auxiliary Gaussian “momentum” variable $p \sim \mathrm{Normal}_P(0, M)$, assigning the augmented state the energy function (Hamiltonian)
\[ H(\theta, p) = -\log \pi(\theta \mid X) + \tfrac{1}{2}\, p^{\top} M^{-1} p \,, \]
and we produce proposals by simulating the system according to Hamilton's equations
\[ \frac{\mathrm{d}\theta}{\mathrm{d}t} = \frac{\partial H}{\partial p} = M^{-1} p \,, \qquad \frac{\mathrm{d}p}{\mathrm{d}t} = -\frac{\partial H}{\partial \theta} = \nabla_{\theta} \log \pi(\theta \mid X) \,. \]
Thus, the momentum of the system moves in the direction of steepest ascent for the log‐posterior, forming an analogy with first‐order optimization. The cost is repeated gradient evaluations that may become a new computational bottleneck, but the result is effective MCMC for tens of thousands of parameters [21, 28]. The success of HMC has inspired research into other methods that leverage gradient information to generate better MCMC proposals when P is large.
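To make these mechanics concrete, the sketch below implements a single HMC transition using the common leapfrog discretization of Hamilton's equations. It is an illustrative sketch rather than a reference implementation from the text: it assumes an identity mass matrix M = I, placeholder step‐size and path‐length settings, and a hypothetical user‐supplied function log_post_and_grad returning the log‐posterior and its gradient.

```python
import numpy as np

def hmc_step(theta, log_post_and_grad, step_size=0.1, n_leapfrog=20, rng=None):
    """One HMC transition targeting log pi(theta | X), assuming identity mass matrix M = I."""
    rng = np.random.default_rng() if rng is None else rng
    logp, grad = log_post_and_grad(theta)
    p = rng.standard_normal(theta.shape)          # momentum draw p ~ Normal_P(0, I)
    current_H = -logp + 0.5 * p @ p               # Hamiltonian: potential + kinetic energy

    theta_new, p_new, grad_new, logp_new = theta.copy(), p.copy(), grad, logp
    # Leapfrog integration of Hamilton's equations; each step costs a gradient evaluation.
    for _ in range(n_leapfrog):
        p_new = p_new + 0.5 * step_size * grad_new   # half step for momentum
        theta_new = theta_new + step_size * p_new    # full step for position (M^{-1} p = p)
        logp_new, grad_new = log_post_and_grad(theta_new)
        p_new = p_new + 0.5 * step_size * grad_new   # second half step for momentum

    proposed_H = -logp_new + 0.5 * p_new @ p_new
    # Metropolis accept step (cf. Equation 2): corrects for discretization error.
    if np.log(rng.uniform()) < current_H - proposed_H:
        return theta_new                             # accept the proposal
    return theta                                     # reject and keep the current state
```

For example, hmc_step(np.zeros(10), lambda th: (-0.5 * th @ th, -th)) takes one step on a 10‐dimensional standard normal target.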
2.3 Big M
Global optimization, or the problem of finding the minimum of a function with arbitrarily many local minima, is NP‐complete in general [30], meaning – in layman's terms – it is impossibly hard. In the absence of a tractable theory by which one might prove that a given global optimization procedure works, brute‐force grid and random searches, along with heuristic methods such as particle swarm optimization [31] and genetic algorithms [32], have been popular. Due to the overwhelming difficulty of global optimization, a large portion of the optimization literature has focused on the particularly well‐behaved class of convex functions [33, 34], for which every local minimum is a global minimum. Since Fisher introduced his “maximum likelihood” in 1922 [35], statisticians have thought in terms of maximization, but convexity theory still applies after a trivial negation of the objective function. Nonetheless, most statisticians could safely ignore such concerns during the twentieth century: exponential family log‐likelihoods are concave, so Newton–Raphson and Fisher scoring are guaranteed to find the global maximum in the context of GLMs [12, 34].
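As a small illustration of why concavity matters here, the following sketch (with an assumed full‐rank design matrix X, binary response y, and illustrative iteration settings, none of which appear in the original text) runs Fisher scoring for logistic regression; under the canonical logit link, Fisher scoring coincides with Newton–Raphson, and concavity of the log‐likelihood means the iterations head toward the unique global maximum.

```python
import numpy as np

def fisher_scoring_logistic(X, y, n_iter=25, tol=1e-8):
    """Fisher scoring (= Newton-Raphson under the canonical logit link) for logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted success probabilities
        W = mu * (1.0 - mu)                    # GLM working weights (variance function)
        score = X.T @ (y - mu)                 # gradient of the concave log-likelihood
        info = X.T @ (X * W[:, None])          # expected (Fisher) information matrix
        step = np.linalg.solve(info, score)    # Newton / Fisher-scoring update direction
        beta = beta + step
        if np.max(np.abs(step)) < tol:         # stop once the update is negligible
            break
    return beta
```

No restarts or global search heuristics are needed precisely because the objective admits a single, global maximum.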
Nearing the end of the twentieth century, multimodality and nonconvexity became more important for statisticians considering high‐dimensional regression, that is, regression with many covariates (big P).
where