ellipsis comma x Subscript upper N Baseline right-parenthesis"/>. We specify our model for the data with a likelihood function
and use a prior distribution with density function
to characterize our belief about the value of the
‐dimensional parameter vector
a priori. The target of Bayesian inference is the posterior distribution of
conditioned on
(1)
The denominator's multidimensional integral quickly becomes impractical as grows large, so we choose to use the MetropolisHastings (M–H) algorithm to generate a Markov chain with stationary distribution [19, 20]. We begin at an arbitrary position and, for each iteration , randomly generate the proposal state from the transition distribution with density . We then accept proposal state with probability
(2)
The ratio on the right no longer depends on the denominator in Equation (1), but one must still compute the likelihood and its terms .
It is for this reason that likelihood evaluations are often the computational bottleneck for Bayesian inference. In the best case, these evaluations are , but there are many situations in which they scale [21, 22] or worse. Indeed, when is large, it is often advantageous to use more advanced MCMC algorithms that use the gradient of the log‐posterior to generate better proposals. In this situation, the log‐likelihood gradient may also become a computational bottleneck [21].
2.2 Big P
One of the simplest models for big problems is ridge regression [23], but computing can become expensive even in this classical setting. Ridge regression estimates the coefficient by minimizing the distance between the observed and predicted values and along with a weighted square norm of :
For illustrative purposes, we consider the following direct method for computing .4 We can first multiply the design matrix by its transpose at the cost of and subsequently invert the matrix at the cost of . The total complexity shows that (i) a large number