Читать книгу Computational Statistics in Data Science - Группа авторов - Страница 22

3.2 Phylogenetic Reconstruction

Оглавление

While big and big regression adapts a classical statistical task to contemporary needs, the twenty‐first century is witnessing the application of computational statistics to the entirety of applied science. One such example is the tracking and reconstruction of deadly global viral pandemics. Molecular phylogenetics has become an essential analytical tool for understanding the complex patterns in which rapidly evolving pathogens propagate throughout and between countries, owing to the complex travel and transportation patterns evinced by modern economies [66], along with other factors such as increased global population and urbanization [67]. The advance in sequencing technology is generating pathogen genomic data at an ever‐increasing pace, with a trend to real time that requires the development of computational statistical methods that are able to process the sequences in a timely manner and produce interpretable results to inform national/global public health organizations.

The previous three Core Challenges are usually interwound such that the increase in the sample size (big N) and the number of traits (big P) for each sample usually happen simultaneously and lead to increased heterogeneity that requires more complex models (big M). For example, recent studies in viral evolution have seen a continuing increase in the sample size that the West Nile virus, Dengue, HIV, and Ebola virus studies involve 104, 352, 465, and 1610 sequences [68–71], and the GISAID database has collected COVID‐19 genomic sequences by the end of August 2020 [72].

To accommodate the increasing size and heterogeneity in the data and be able to apply the aforementioned efficient gradient‐based algorithms, Ji et al. [73] propose a linear‐time algorithm for calculating an ‐dimensional gradient on a tree w.r.t. the sequence evolution. The linear‐time gradient algorithm calculates each branch‐specific derivative through a preorder traversal that complements the postorder traversal from the likelihood calculation of the observed sequence data at the tip of the phylogeny by marginalizing over all possible hidden states on the internal nodes. The pre‐ and postorder traversals complete the Baum's forward–backward algorithm in a phylogenetic framework [74]. The authors then apply the gradient algorithm with HMC (Section 2.2) samplers to learn the branch‐specific viral evolutionary rates.

Thanks to these advanced computational methods, one can employ more flexible models that lend themselves to more realistic reconstructions and uncertainty quantification. Following a random‐effects relaxed clock model, they model the evolutionary rate of branch on a phylogeny as the product of a global treewise mean parameter and a branch‐specific random effect . They model the random‐effect s as independent and identically distributed from a lognormal distribution such that has mean 1 and variance under a hierarchical model where is the scale parameter. To accommodate the difference in scales of the variability in the parameter space for the HMC sampler, the authors adopt preconditioning with adaptive mass matrix informed by the diagonal entries of the Hessian matrix. More precisely, the nonzero diagonal elements of the mass matrix truncate the values from the first HMC iterations of so that the matrix remains positive‐definite and numerically stable. They estimate the treewise (fixed‐effect) mean rate with posterior mean 4.75 ( Bayesian credible interval: ) substitutions per site per year with rate variability characterized by scale parameter with posterior mean for serotype 3 of Dengue virus with a sample size of 352 [69]. Figure 1 illustrates the estimated maximum clade credible evolutionary tree of the Dengue virus dataset.

The authors report relative speedup in terms of the effective sample size per second (ESS/s) of the HMC samplers compared to a univariate transition kernel. The “vanilla” HMC sampler with an identity mass matrix gains speedup for the minimum ESS/s and speedup for the median ESS/s, whereas the “preconditioned” HMC sampler gains and speedups, respectively. Critically, the authors make these performance gains available to scientists everywhere through the popular, open‐source software package for viral phylogenetic inference Bayesian evolutionary analysis by sampling trees (BEAST) [75]. In Section 4.1, we discuss how software package such as BEAST addresses Core Challenge 4, the creation of fast, flexible, and friendly statistical algo‐ware.


Figure 1 A nontraditional and critically important application in computational statistics is the reconstruction of evolutionary histories in the form of phylogenetic trees. Here is a maximum clade credible tree of the Dengue virus example. The dataset consists of sequences of the serotype of the Dengue virus. Branches are coded by the posterior means of the branch‐specific evolutionary rates according to the gradient bar on the top left. The concentric circles indicate the timescale with the year numbers. The outer ring indicates the geographic locations of the samples by the color code on the bottom left. ‘I’ and ‘II’ indicate the two Brazilian lineages as in the original study.

Computational Statistics in Data Science

Подняться наверх