Читать книгу Computational Statistics in Data Science - Группа авторов - Страница 24
4.1 Fast, Flexible, and Friendly Statistical Algo‐Ware
ОглавлениеTo accommodate the greatest range of models while remaining simple enough to encourage easy implementation, inference methods should rely solely on the quantities that can be computed algorithmically for any given model. The log‐likelihood (or log‐density in the Bayesian setting) is one such quantity, and one can employ the computational graph framework [77, 78] to evaluate conditional log‐likelihoods for any subset of model parameters as well as their gradients via backpropagation [79]. Beyond being efficient in terms of the first three Core Challenges, an algorithm should demonstrate robust performance on a reasonably wide range of problems without extensive tuning if it is to lend itself to successful software deployment.
HMC (Section 2.2) is a prominent example of a general‐purpose algorithm for Bayesian inference, only requiring the log‐density and its gradient. The generic nature of HMC has opened up possibilities for complex Bayesian modeling as early as Neal [80], but its performance is highly sensitive to model parameterization and its three tuning parameters, commonly referred to as trajectory length, step size, and mass matrix [27]. Tuning issues constitute a major obstacle to the wider adoption of the algorithm, as evidenced by the development history of the popular HMC‐based probabilistic programming software Stan [81], which employs the No‐U‐Turn sampler (NUTS) of Hoffman and Gelman [82] to make HMC user‐friendly by obviating the need to tune its trajectory length. Bayesian software packages such as Stan empirically adapt the remaining step size and mass matrix [83]; this approach helps make the use of HMC automatic though is not without issues [84] and comes at the cost of significant computational overhead.
Although HMC is a powerful algorithm that has played a critical role in the emergence of general‐purpose Bayesian inference software, the challenges involved in its practical deployment also demonstrate how an algorithm – no matter how versatile and efficient at its best – is not necessarily useful unless it can be made easy for practitioners to use. It is also unlikely that one algorithm works well in all situations. In fact, there are many distributions on which HMC performs poorly [83, 85, 86]. Additionally, HMC is incapable of handling discrete distributions in a fully general manner despite the progresses made in extending HMC to such situations [87, 88].
But broader applicability comes with its own challenges. Among sampling‐based approaches to Bayesian inference, the Gibbs sampler [89, 90] is, arguably, the most versatile of the MCMC methods. The algorithm simplifies the task of dealing with a complex multidimensional posterior distribution by factorizing the posterior into simpler conditional distributions for blocks of parameters and iteratively updating parameters from their conditionals. Unfortunately, the efficiency of an individual Gibbs sampler depends on its specific factorization and the degree of dependence between its blocks of parameters. Without a careful design or in the absence of effective factorization, therefore, Gibbs samplers' performance may lag behind alternatives such as HMC [91].
On the other hand, Gibbs samplers often require little tuning and can take advantage of highly optimized algorithms for each conditional update, as done in the examples of Section 3. A clear advantage of the Gibbs sampler is that it tends to make software implementation quite modular; for example, each conditional update can be replaced with the latest state‐of‐the‐art samplers as they appear [92], and adding a new feature may amount to no more than adding a single conditional update [75]. In this way, an algorithm may not work in a completely model‐agnostic manner but with a broad enough scope can serve as a valuable recipe or meta‐algorithm for building model‐specific algorithms and software. The same is true for optimization methods. Even though its “E”‐step requires a derivation (by hand) for each new model, the EM algorithm [93] enables maximum‐likelihood estimation for a wide range of models. Similarly, variational inference (VI) for approximate Bayes requires manual derivations but provides a general framework to turn posterior computation into an optimization problem [94]. As meta‐algorithms, both EM and VI expand their breadth of use by replacing analytical derivations with Monte Carlo estimators but suffer losses in statistical and computational efficiency [95, 96]. Indeed, such trade‐offs will continue to haunt the creation of fast, flexible, and friendly statistical algo‐ware well into the twenty‐first century.
