3.1.2 Conjugate gradient sampler for structured high‐dimensional Gaussians

The conjugate gradient (CG) sampler of Nishimura and Suchard [57], combined with their prior‐preconditioning technique, overcomes this seemingly inevitable growth of the computational cost. Their algorithm is based on a novel application of the CG method [59, 60], which belongs to a family of iterative methods in numerical linear algebra. Despite its first appearance in 1952, CG received little attention for the next few decades, only making its way into major software packages such as MATLAB in the 1990s [61]. With its ability to solve a large and structured linear system $\Phi \beta = b$ via a small number of matrix–vector multiplications $v \to \Phi v$, without ever explicitly inverting $\Phi$, however, CG has since emerged as an essential and prototypical algorithm for modern scientific computing [62, 63].
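To make this concrete, the entire CG iteration touches $\Phi$ only through such matrix–vector products. The following minimal sketch (ours, in Python with NumPy; names such as `conjugate_gradient` and `matvec` are illustrative and not from [57]) shows the textbook, unpreconditioned version of the algorithm.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Solve Phi @ x = b for symmetric positive-definite Phi,
    accessing Phi only through the callable `matvec`: v -> Phi @ v."""
    x = np.zeros_like(b)
    r = b - matvec(x)               # residual of the current iterate
    p = r.copy()                    # search direction
    rs_old = r @ r
    for _ in range(max_iter or len(b)):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)   # exact step size along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:   # residual small enough: stop early
            break
        p = r + (rs_new / rs_old) * p   # next Phi-conjugate direction
        rs_old = rs_new
    return x
```

For a well‐conditioned (or well‐preconditioned) system, the loop terminates after far fewer iterations than the dimension of $b$, which is the source of CG's efficiency on large structured problems.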

Despite its earlier rise to prominence in other fields, CG did not find practical applications in Bayesian computation until rather recently [57, 64]. We can offer at least two explanations for this. First, being an algorithm for solving a deterministic linear system, it is not obvious how CG would be relevant to Monte Carlo simulation, such as sampling from $\beta \sim \mathcal{N}(\mu, \Phi^{-1})$; ostensibly, such a task requires computing a “square root” $L$ of the precision matrix $\Phi$, with $L L^\top = \Phi$, so that $\beta = \mu + L^{-\top} z$ for $z \sim \mathcal{N}(0, I_p)$. Secondly, unlike direct linear algebra methods, iterative methods such as CG have a variable computational cost that depends critically on the user's choice of a preconditioner and thus cannot be used as “black‐box” algorithms.6 More generally, this novel application of CG to Bayesian computation is a reminder that powerful ideas from other computationally intensive fields may remain untapped by the statistical computing community; knowledge transfers will likely be facilitated by having more researchers working at the intersections of different fields.
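For contrast, the conventional direct sampler computes exactly this kind of square root via a Cholesky factorization, at a cost that grows cubically in the number of covariates. A minimal sketch (ours; the function name is illustrative) is given below.

```python
import numpy as np

def direct_gaussian_sample(Phi, mu, rng):
    """Draw beta ~ N(mu, Phi^{-1}) the classical way, by factorizing Phi."""
    L = np.linalg.cholesky(Phi)        # "square root": L @ L.T == Phi, O(p^3)
    z = rng.standard_normal(len(mu))
    # Solving L.T @ x = z gives x = L^{-T} z, with Var(x) = Phi^{-1}.
    return mu + np.linalg.solve(L.T, z)
```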

Nishimura and Suchard [57] turn CG into a viable algorithm for Bayesian sparse regression problems by realizing that (i) we can obtain a Gaussian vector $b \sim \mathcal{N}(X^\top \Omega \tilde{y}, \Phi)$ by first generating $z \sim \mathcal{N}(0, I_n)$ and $\zeta \sim \mathcal{N}(0, I_p)$ and then setting $b = X^\top \Omega \tilde{y} + X^\top \Omega^{1/2} z + \tau^{-1} \Lambda^{-1} \zeta$, and (ii) subsequently solving $\Phi \beta = b$ yields a sample $\beta$ from the distribution (4). The authors then observe that the mechanism through which a shrinkage prior induces sparsity of the $\beta_j$'s also induces a tight clustering of the eigenvalues of the prior‐preconditioned matrix $\tau^2 \Lambda \Phi \Lambda$. This fact makes it possible for prior‐preconditioned CG to solve the system $\Phi \beta = b$ in $K$ matrix–vector operations of the form $v \to \Phi v$, where $K$ roughly represents the number of significant $\beta_j$'s that are distinguishable from zero under the posterior. For $\Phi$ having a structure as in (4), $\Phi v$ can be computed via matrix–vector multiplications of the form $v \to X v$ and $w \to X^\top w$, so each $v \to \Phi v$ operation requires only a fraction of the computational cost of directly computing $\Phi$ and then factorizing it.
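Putting the two observations together, a schematic implementation might look as follows. This is our own sketch in Python/SciPy, not the authors' code: it assumes $\Omega$ is diagonal (stored as the vector `Omega_diag`), $\Lambda = \mathrm{diag}(\mathtt{lam})$, and a working response `y_tilde`, and it stands in for the full prior‐preconditioning machinery of [57] with SciPy's generic preconditioned CG solver.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def cg_gaussian_sample(X, Omega_diag, y_tilde, tau, lam, rng):
    """Draw beta from N(Phi^{-1} X' Omega y_tilde, Phi^{-1}), where
    Phi = X' Omega X + tau^{-2} Lambda^{-2}, without ever forming Phi."""
    n, p = X.shape

    # (i) b ~ N(X' Omega y_tilde, Phi), assembled from two standard normals.
    z = rng.standard_normal(n)
    zeta = rng.standard_normal(p)
    b = (X.T @ (Omega_diag * y_tilde)
         + X.T @ (np.sqrt(Omega_diag) * z)
         + zeta / (tau * lam))

    # v -> Phi v, built only from v -> X v and w -> X' w.
    def phi_matvec(v):
        return X.T @ (Omega_diag * (X @ v)) + v / (tau**2 * lam**2)

    Phi_op = LinearOperator((p, p), matvec=phi_matvec)

    # Prior preconditioner: approximate Phi by its prior term tau^{-2} Lambda^{-2};
    # CG only needs its inverse, the cheap diagonal map v -> tau^2 Lambda^2 v.
    M = LinearOperator((p, p), matvec=lambda v: tau**2 * lam**2 * v)

    # (ii) Solving Phi beta = b yields a draw from the target distribution.
    beta, info = cg(Phi_op, b, M=M)
    return beta
```

Because the shrinkage prior concentrates the eigenvalues of the preconditioned system, the solve typically terminates after on the order of $K$ iterations rather than $p$, in line with the discussion above.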

Prior‐preconditioned CG demonstrates an order‐of‐magnitude speedup in posterior computation when applied to a comparative effectiveness study of atrial fibrillation treatment involving $n$ patients and $p$ covariates [57]. Though unexplored in their work, the algorithm's heavy use of matrix–vector multiplications provides avenues for further acceleration. Technically, the algorithm's complexity may be characterized as $\mathcal{O}(nKp)$, for the $K$ matrix–vector multiplications by $X$ and $X^\top$, but the theoretical complexity is only a part of the story. Matrix–vector multiplications are amenable to a variety of hardware optimizations, which in practice can make orders‐of‐magnitude differences in speed (Section 4.2). In fact, given how arduous manually optimizing computational bottlenecks can be, designing algorithms to take advantage of common routines (such as those in Level 3 BLAS) and their readily optimized implementations has been recognized as an effective principle in algorithm design [65].
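The practical gap between the $\mathcal{O}(nKp)$ matrix–vector approach and the form‐then‐factorize alternative is easy to see in a toy benchmark; the sketch below (ours, with arbitrary problem sizes) times the two dominant costs side by side.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 5000, 2000, 50
X = rng.standard_normal((n, p))
v = rng.standard_normal(p)

# Direct route: form X'X in O(n p^2) flops, then factorize in O(p^3).
t0 = time.perf_counter()
Phi = X.T @ X + np.eye(p)          # identity stands in for the prior term
np.linalg.cholesky(Phi)
t_direct = time.perf_counter() - t0

# CG route: K products of the form v -> X'(X v), O(n K p) flops in total.
t0 = time.perf_counter()
for _ in range(K):
    v = X.T @ (X @ v)
    v /= np.linalg.norm(v)         # rescale so the iterates stay bounded
t_matvec = time.perf_counter() - t0

print(f"form-and-factorize: {t_direct:.2f} s | {K} matvecs: {t_matvec:.2f} s")
```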
