2.2 Big P
One of the simplest models for big $P$ problems is ridge regression [23], but computing can become expensive even in this classical setting. Ridge regression estimates the coefficient vector $\theta$ by minimizing the squared distance between the observed values $\mathbf{y}$ and the predicted values $\mathbf{X}\theta$ along with a weighted square norm of $\theta$:

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \left\{ \|\mathbf{y} - \mathbf{X}\theta\|_2^2 + \lambda \|\theta\|_2^2 \right\} = \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_P\right)^{-1} \mathbf{X}^\top \mathbf{y},$$

where $\mathbf{X}$ is the $N \times P$ design matrix and $\lambda > 0$ weights the penalty.
For illustrative purposes, we consider the following direct method for computing $\hat{\theta}$.⁴ We can first multiply the design matrix $\mathbf{X}$ by its transpose at a cost of $\mathcal{O}(NP^2)$ and subsequently invert the resulting $P \times P$ matrix at a cost of $\mathcal{O}(P^3)$. The total $\mathcal{O}(NP^2 + P^3)$ complexity shows that (i) a large number of parameters $P$ is often sufficient for making even the simplest of tasks infeasible and (ii) a moderate $P$ can render a task impractical when the number of observations $N$ is large. These two insights extend to more complicated models: the same complexity analysis holds for the fitting of generalized linear models (GLMs) as described in McCullagh and Nelder [12].
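To make the arithmetic concrete, the following NumPy sketch (our own illustration; the toy dimensions, variable names, and penalty value are assumptions, not the chapter's code) carries out this direct method:

```python
import numpy as np

rng = np.random.default_rng(0)

N, P = 10_000, 500          # observations and parameters (toy sizes)
lam = 1.0                   # ridge penalty weight

X = rng.standard_normal((N, P))   # design matrix
y = rng.standard_normal(N)        # response vector

# Direct method: form X^T X at O(N P^2) cost ...
XtX = X.T @ X
# ... then solve the P x P system (X^T X + lam I) theta = X^T y,
# which costs O(P^3); solving the system is preferred in practice
# to forming the explicit inverse.
theta_hat = np.linalg.solve(XtX + lam * np.eye(P), X.T @ y)
```

Both terms in the $\mathcal{O}(NP^2 + P^3)$ bound are visible here: the first in forming `XtX`, the second in the solve.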
In the context of Bayesian inference, the length $P$ of the parameter vector $\theta$ dictates the dimension of the MCMC state space. For the M‐H algorithm (Section 2.1) with a $P$‐dimensional Gaussian target and proposal, Gelman et al. [25] show that the proposal distribution's covariance should be scaled by a factor inversely proportional to $P$. Hence, as the dimension of the state space grows, it behooves one to propose states that are closer to the current state of the Markov chain, and one must greatly increase the number of MCMC iterations. At the same time, an increasing $P$ often slows down rate‐limiting likelihood calculations (Section 2.1). Taken together, one must generate many more, much slower MCMC iterations. The wide applicability of latent variable models [26] (Sections 3.1 and 3.2), for which each observation has its own parameter set (e.g., a latent $\theta_n$ for each observation $n$), means M‐H simply does not work for a huge class of models popular with practitioners.
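For intuition, here is a minimal random‐walk M‐H sketch (our own, not the authors' code) that targets a $P$‐dimensional standard Gaussian and scales the proposal standard deviation as $2.38/\sqrt{P}$, the Gaussian‐target heuristic associated with the scaling result of Gelman et al. [25]:

```python
import numpy as np

def rw_metropolis(log_target, theta0, n_iter=5000, rng=None):
    """Random-walk Metropolis with proposal variance scaled as 2.38^2 / P."""
    rng = rng or np.random.default_rng()
    P = theta0.size
    scale = 2.38 / np.sqrt(P)      # step size shrinks as dimension grows
    theta = theta0.copy()
    log_p = log_target(theta)
    chain = np.empty((n_iter, P))
    for i in range(n_iter):
        prop = theta + scale * rng.standard_normal(P)
        log_p_prop = log_target(prop)
        # Symmetric proposal, so the accept ratio is just a ratio of target densities
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = prop, log_p_prop
        chain[i] = theta
    return chain

# Toy usage: as P grows, each move is smaller and more iterations are needed.
P = 100
chain = rw_metropolis(lambda t: -0.5 * t @ t, np.zeros(P))
```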
For these reasons, Hamiltonian Monte Carlo (HMC) [27] has become a popular algorithm for fitting Bayesian models with large numbers of parameters. Like M‐H, HMC uses an accept step (Equation 2). Unlike M‐H, HMC takes advantage of additional information about the target distribution in the form of the log‐posterior gradient. HMC works by doubling the state space dimension with an auxiliary Gaussian "momentum" variable $\mathbf{p}$ independent of the "position" variable $\theta$. The constructed Hamiltonian system has energy function given by the negative logarithm of the joint distribution

$$H(\theta, \mathbf{p}) = -\log \pi(\theta, \mathbf{p}) = -\log \pi(\theta) + \frac{1}{2}\mathbf{p}^\top \mathbf{p},$$

and we produce proposals by simulating the system according to Hamilton's equations

$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \frac{\partial H}{\partial \mathbf{p}} = \mathbf{p}, \qquad \frac{\mathrm{d}\mathbf{p}}{\mathrm{d}t} = -\frac{\partial H}{\partial \theta} = \nabla_\theta \log \pi(\theta).$$
Thus, the momentum of the system moves in the direction of steepest ascent for the log‐posterior, forming an analogy with first‐order optimization. The cost is repeated gradient evaluations that may become a new computational bottleneck, but the result is effective MCMC for tens of thousands of parameters [21, 28]. The success of HMC has inspired research into other methods leveraging gradient information to generate better MCMC proposals when $P$ is large [29].
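The following sketch of a single HMC transition (our own illustration, assuming an identity mass matrix and a fixed leapfrog step size) shows how the gradient enters each proposal:

```python
import numpy as np

def hmc_step(log_post, grad_log_post, theta, step_size=0.1, n_leapfrog=20, rng=None):
    """One HMC transition: simulate Hamilton's equations with the leapfrog
    integrator, then accept/reject as in M-H (Equation 2)."""
    rng = rng or np.random.default_rng()
    p = rng.standard_normal(theta.size)          # Gaussian momentum, identity mass
    theta_new, p_new = theta.copy(), p.copy()

    # Leapfrog integration: momentum moves along the log-posterior gradient
    p_new += 0.5 * step_size * grad_log_post(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += step_size * p_new
        p_new += step_size * grad_log_post(theta_new)
    theta_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_post(theta_new)

    # Accept with probability min(1, exp(H(theta, p) - H(theta_new, p_new)))
    log_accept = (log_post(theta_new) - 0.5 * p_new @ p_new) - (
        log_post(theta) - 0.5 * p @ p
    )
    return theta_new if np.log(rng.uniform()) < log_accept else theta

# Toy usage: standard Gaussian "posterior" in P = 1000 dimensions.
P = 1000
theta = np.zeros(P)
for _ in range(100):
    theta = hmc_step(lambda t: -0.5 * t @ t, lambda t: -t, theta)
```

Each of the `n_leapfrog` inner steps requires a fresh gradient evaluation, which is exactly the repeated-gradient cost noted above.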