Although HMC is a powerful algorithm that has played a critical role in the emergence of general-purpose Bayesian inference software, the challenges involved in its practical deployment also demonstrate how an algorithm – no matter how versatile and efficient at its best – is not necessarily useful unless it can be made easy for practitioners to use. It is also unlikely that any single algorithm works well in all situations. In fact, there are many distributions on which HMC performs poorly [83, 85, 86]. Additionally, HMC is incapable of handling discrete distributions in a fully general manner, despite the progress made in extending HMC to such situations [87, 88].
But broader applicability comes with its own challenges. Among sampling-based approaches to Bayesian inference, the Gibbs sampler [89, 90] is arguably the most versatile of the MCMC methods. The algorithm simplifies the task of dealing with a complex multidimensional posterior distribution by factorizing the posterior into simpler conditional distributions for blocks of parameters and iteratively updating parameters from their conditionals. Unfortunately, the efficiency of an individual Gibbs sampler depends on its specific factorization and on the degree of dependence between its blocks of parameters. Without careful design, or when no effective factorization exists, a Gibbs sampler's performance may therefore lag behind alternatives such as HMC [91].
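To make these mechanics concrete, the following toy sketch (illustrative only, and not drawn from the works cited above) implements a two-block Gibbs sampler for a bivariate normal target with correlation rho, where each full conditional is itself a univariate normal; the model, variable names, and settings are assumptions chosen purely for exposition.

```cpp
// Minimal two-block Gibbs sampler for a toy bivariate normal target with
// correlation rho: x | y ~ N(rho * y, 1 - rho^2) and y | x ~ N(rho * x, 1 - rho^2).
// The target and settings are illustrative assumptions, not a general recipe.
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
  const double rho = 0.9;                         // dependence between the two blocks
  const double sd = std::sqrt(1.0 - rho * rho);   // conditional standard deviation
  const int n_iterations = 10000;

  std::mt19937 rng(1);
  std::normal_distribution<double> std_normal(0.0, 1.0);

  double x = 0.0, y = 0.0;                        // initial state
  std::vector<double> x_samples, y_samples;

  for (int t = 0; t < n_iterations; ++t) {
    // Update each block from its full conditional given the current
    // value of the other block.
    x = rho * y + sd * std_normal(rng);
    y = rho * x + sd * std_normal(rng);
    x_samples.push_back(x);
    y_samples.push_back(y);
  }

  std::cout << "last draw: (" << x << ", " << y << ")\n";
  return 0;
}
```

When rho approaches 1, consecutive draws become highly autocorrelated, which is precisely the dependence-driven efficiency loss described above.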
On the other hand, Gibbs samplers often require little tuning and can take advantage of highly optimized algorithms for each conditional update, as done in the examples of Section 3. A clear advantage of the Gibbs sampler is that it tends to make software implementation quite modular; for example, each conditional update can be replaced with the latest state-of-the-art sampler as it appears [92], and adding a new feature may amount to no more than adding a single conditional update [75]. In this way, an algorithm may not work in a completely model-agnostic manner but, with a broad enough scope, can serve as a valuable recipe or meta-algorithm for building model-specific algorithms and software. The same is true for optimization methods. Even though its "E"-step requires a derivation (by hand) for each new model, the EM algorithm [93] enables maximum-likelihood estimation for a wide range of models. Similarly, variational inference (VI) for approximate Bayes requires manual derivations but provides a general framework that turns posterior computation into an optimization problem [94]. As meta-algorithms, both EM and VI expand their breadth of use by replacing analytical derivations with Monte Carlo estimators but suffer losses in statistical and computational efficiency [95, 96]. Indeed, such trade-offs will continue to haunt the creation of fast, flexible, and friendly statistical algo-ware well into the twenty-first century.
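To illustrate the meta-algorithm idea in code, the sketch below (again an expository assumption rather than an implementation from the literature) runs EM for a two-component Gaussian mixture with unit variances: the hand-derived E-step computes responsibilities in closed form, and the M-step updates the mixing weight and component means.

```cpp
// EM for a toy two-component Gaussian mixture with unit variances,
// estimating the mixing weight w and component means mu1, mu2.
// Model, data, and starting values are illustrative assumptions only.
#include <cmath>
#include <iostream>
#include <vector>

double normal_pdf(double x, double mu) {
  const double kSqrt2Pi = 2.5066282746310002;
  return std::exp(-0.5 * (x - mu) * (x - mu)) / kSqrt2Pi;
}

int main() {
  std::vector<double> data = {-2.1, -1.8, -2.4, -1.9, 1.7, 2.2, 1.9, 2.5, 2.0};
  double w = 0.5, mu1 = -1.0, mu2 = 1.0;   // initial guesses

  for (int iter = 0; iter < 100; ++iter) {
    // E-step: posterior probability (responsibility) that each observation
    // belongs to component 1, derived by hand for this particular model.
    std::vector<double> r(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
      const double p1 = w * normal_pdf(data[i], mu1);
      const double p2 = (1.0 - w) * normal_pdf(data[i], mu2);
      r[i] = p1 / (p1 + p2);
    }
    // M-step: weighted averages maximize the expected complete-data
    // log-likelihood.
    double sum_r = 0.0, sum_rx = 0.0, sum_qx = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
      sum_r += r[i];
      sum_rx += r[i] * data[i];
      sum_qx += (1.0 - r[i]) * data[i];
    }
    w = sum_r / data.size();
    mu1 = sum_rx / sum_r;
    mu2 = sum_qx / (data.size() - sum_r);
  }
  std::cout << "w = " << w << ", mu1 = " << mu1 << ", mu2 = " << mu2 << "\n";
  return 0;
}
```

Swapping the closed-form responsibilities for Monte Carlo estimates would widen the recipe's scope to models without tractable E-steps, at the efficiency cost noted above.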
4.2 Hardware‐Optimized Inference
But successful statistical inference software must also interact with computational hardware in an optimal manner. Growing datasets require the computational statistician to give more and more thought to how the computer implements any statistical algorithm. To effectively leverage computational resources, the statistician must (i) identify the routine's computational bottleneck (Section 2.1) and (ii) algorithmically map this rate‐limiting step to available hardware such as a multicore or vectorized CPU, a many‐core GPU, or – in the future – a quantum computer. Sometimes, the first step is clear theoretically: a naive implementation of the high‐dimensional regression example of Section 3.1 requires an order
Multicore CPU processing is effective for parallel completion of multiple, mostly independent tasks that do not require intercommunication. One might generate 2 to, say, 72 independent Markov chains on a desktop computer or shared cluster. A positive aspect is that the tasks do not have to involve the same instruction sets at all; a negative is latency, that is, the slowest process dictates overall runtime. It is possible to further speed up CPU computing with single instruction, multiple data (SIMD) or vector processing. A small number of vector processing units (VPUs) in each CPU core can carry out a single set of instructions on data stored within an extended-length register. Intel's streaming SIMD extensions (SSE), advanced vector extensions (AVX), and AVX-512 allow operations on 128-, 256-, and 512-bit registers, respectively. In the context of 64-bit double precision, the theoretical speedups for SSE, AVX, and AVX-512 are two-, four-, and eightfold. For example, if a computational bottleneck exists within a for-loop, one can unroll the loop and perform operations on, say, four consecutive loop bodies at once using AVX [21, 22]. Conveniently, frameworks such as OpenMP [97] make SIMD loop optimization transparent to the user [98]. Importantly, SIMD and multicore optimization play well together, providing multiplicative speedups.
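As a hedged sketch of the loop pattern just described (not code from the cited references), the snippet below combines multicore and SIMD parallelism through OpenMP directives to accumulate a Gaussian log-likelihood; the data are placeholders, and realized speedups depend on the compiler, its flags, and the hardware.

```cpp
// Combining multicore and SIMD parallelism with OpenMP to accumulate a
// Gaussian log-likelihood over a large vector. Compile with, e.g.,
// g++ -O3 -fopenmp -mavx2 loglik.cpp. The data here are placeholders.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  const std::size_t n = 1 << 22;
  std::vector<double> x(n, 0.5);        // placeholder observations
  const double mu = 0.0, sigma = 1.0;
  const double log_2pi = std::log(2.0 * 3.14159265358979323846);

  double loglik = 0.0;
  // "parallel for" splits iterations across CPU cores; "simd" asks the
  // compiler to vectorize each core's chunk, so the gains multiply.
  #pragma omp parallel for simd reduction(+ : loglik)
  for (std::size_t i = 0; i < n; ++i) {
    const double z = (x[i] - mu) / sigma;
    loglik += -0.5 * z * z - std::log(sigma) - 0.5 * log_2pi;
  }

  std::cout << "log-likelihood: " << loglik << "\n";
  return 0;
}
```

With AVX registers holding four doubles, the vectorized inner chunk can approach the fourfold theoretical speedup on top of the multicore gain, consistent with the multiplicative behavior noted above.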
While a CPU may have tens of cores, GPUs accomplish fine-grained parallelization with thousands of cores that apply a single instruction set to distinct data within smaller workgroups of tens or hundreds of cores. Quick communication and shared cache memory within each workgroup balance full parallelization across groups, and dynamic on- and off-loading of the many tasks hides the latency that is so problematic for multicore computing. Originally designed for efficient parallelization of the matrix calculations arising in image rendering and transformation, GPUs easily speed up tensor-multiplication-intensive tasks such as deep learning [99], but general-purpose GPU applications abound. Holbrook et al. [21] provide a broader review of parallel computing within computational statistics. The same paper reports a GPU providing 200-fold speedups over single-core processing and 10-fold speedups over 12-core AVX processing for the likelihood and gradient calculations involved in sampling from a Bayesian multidimensional scaling posterior with HMC at scale. Holbrook et al. [22] report similar speedups for inference based on spatiotemporal Hawkes processes. Neither application involves matrix or tensor manipulations.
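For a flavor of fine-grained GPU parallelization in the same OpenMP idiom used above (an expository choice, not the approach of the cited papers), the sketch below offloads a data-parallel log-likelihood evaluation to an accelerator with OpenMP target directives; it assumes a compiler built with offload support and otherwise falls back to running on the host.

```cpp
// Offloading a data-parallel log-likelihood evaluation to a GPU with
// OpenMP target directives (OpenMP 4.5+). Requires a compiler built with
// offload support; otherwise the region runs on the host. Placeholder data.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  const std::size_t n = 1 << 22;
  std::vector<double> x(n, 0.5);        // placeholder observations
  const double mu = 0.0, sigma = 1.0;
  const double* xp = x.data();

  double loglik = 0.0;
  // "target" moves execution to the device, "teams distribute" spreads the
  // iterations over workgroups, and "parallel for" spreads them over the
  // threads within each workgroup.
  #pragma omp target teams distribute parallel for \
      map(to: xp[0:n]) map(tofrom: loglik) reduction(+ : loglik)
  for (std::size_t i = 0; i < n; ++i) {
    const double z = (xp[i] - mu) / sigma;
    loglik += -0.5 * z * z;
  }

  std::cout << "unnormalized log-likelihood: " << loglik << "\n";
  return 0;
}
```

The directive-based version trades some control over workgroup sizing and memory movement for portability; dedicated GPU frameworks expose those knobs explicitly, which is where the large application-specific speedups reported above come from.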
A quantum computer acts on complex data vectors of magnitude 1 called qubits with gates that are mathematically equivalent to unitary operators [100]. Assuming that engineers overcome the tremendous difficulties involved in building a practical quantum computer (where practicality entails simultaneous use of many quantum gates with little additional noise), twenty‐first century statisticians might have access to quadratic or even exponential speedups for extremely specific statistical tasks. We are particularly interested in the following four quantum algorithms: quantum search [101], or finding a single 1 amid a collection of 0s, only requires