4.2 Hardware‐Optimized Inference
But successful statistical inference software must also interact with computational hardware in an optimal manner. Growing datasets require the computational statistician to give more and more thought to how the computer implements any statistical algorithm. To effectively leverage computational resources, the statistician must (i) identify the routine's computational bottleneck (Section 2.1) and (ii) algorithmically map this rate-limiting step to available hardware such as a multicore or vectorized CPU, a many-core GPU, or – in the future – a quantum computer. Sometimes, the first step is clear theoretically: for $N$ observations and $P$ covariates, a naive implementation of the high-dimensional regression example of Section 3.1 requires an order $\mathcal{O}(NP^2)$ matrix multiplication followed by an order $\mathcal{O}(P^3)$ Cholesky decomposition. Other times, one can use an instruction-level program profiler, such as Intel VTune (Windows, Linux) or Instruments (OSX), to identify a performance bottleneck. Once the bottleneck is identified, one must choose between computational resources, or some combination thereof, based on their relative strengths and weaknesses as well as the natural parallelism of the target task.
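The point can be made concrete with a toy sketch (hypothetical code, not taken from the chapter): in a naive normal-equations solve of the regression problem, essentially all floating-point work sits in two loops, forming the Gram matrix $\mathbf{X}^\top\mathbf{X}$ at $\mathcal{O}(NP^2)$ cost and factorizing it at $\mathcal{O}(P^3)$ cost. These are the loops a profiler such as Intel VTune would flag.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a naive normal-equations solve for least squares,
// marking where the two computational bottlenecks live.
using Matrix = std::vector<std::vector<double>>;

Matrix gram(const Matrix& X) {             // X is N x P
    const std::size_t N = X.size(), P = X[0].size();
    Matrix G(P, std::vector<double>(P, 0.0));
    for (std::size_t i = 0; i < N; ++i)    // O(N * P^2): dominates when N >> P
        for (std::size_t j = 0; j < P; ++j)
            for (std::size_t k = 0; k < P; ++k)
                G[j][k] += X[i][j] * X[i][k];
    return G;
}

Matrix cholesky(const Matrix& G) {         // G is P x P, symmetric positive definite
    const std::size_t P = G.size();
    Matrix L(P, std::vector<double>(P, 0.0));
    for (std::size_t j = 0; j < P; ++j) {  // O(P^3): dominates when P is large
        double d = G[j][j];
        for (std::size_t k = 0; k < j; ++k) d -= L[j][k] * L[j][k];
        L[j][j] = std::sqrt(d);
        for (std::size_t i = j + 1; i < P; ++i) {
            double s = G[i][j];
            for (std::size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
    }
    return L;
}
```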
Multicore CPU processing is effective for parallel completion of multiple, mostly independent tasks that do not require intercommunication. One might generate 2 to, say, 72 independent Markov chains on a desktop computer or shared cluster. A positive aspect is that the tasks need not involve the same instruction sets at all; a negative is latency, that is, the slowest process dictates overall runtime. It is possible to further speed up CPU computing with single instruction, multiple data (SIMD) or vector processing. A small number of vector processing units (VPUs) in each CPU core can carry out a single set of instructions on data stored within an extended-length register. Intel's streaming SIMD extensions (SSE), advanced vector extensions (AVX), and AVX-512 allow operations on 128-, 256-, and 512-bit registers. In the context of 64-bit double-precision arithmetic, the theoretical speedups for SSE, AVX, and AVX-512 are therefore two-, four-, and eightfold, respectively. For example, if a computational bottleneck exists within a for-loop, one can unroll the loop and perform operations on, say, four consecutive loop bodies at once using AVX [21, 22]. Conveniently, languages such as OpenMP [97] make SIMD loop optimization transparent to the user [98]. Importantly, SIMD and multicore optimization play well together, providing multiplicative speedups.
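As a minimal sketch of this idea (hypothetical code, not drawn from the chapter, and assuming a C++ compiler with OpenMP support), the reduction below combines both levels of CPU parallelism: the parallel for clause splits loop iterations across cores, while the simd clause asks the compiler to vectorize each core's chunk with SSE/AVX instructions, so the two speedups multiply.

```cpp
#include <omp.h>
#include <cmath>
#include <cstddef>
#include <vector>

// Gaussian log-likelihood as a loop-heavy reduction over n observations.
// "parallel for" distributes chunks of the loop across CPU cores;
// "simd" vectorizes the loop body within each core.
double gaussian_loglik(const std::vector<double>& y, double mu, double sigma2) {
    constexpr double kTwoPi = 6.283185307179586;
    const std::size_t n = y.size();
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i) {
        const double r = y[i] - mu;
        sum += r * r;
    }
    return -0.5 * (static_cast<double>(n) * std::log(kTwoPi * sigma2) + sum / sigma2);
}
```

A typical build would enable the relevant flags, for example -fopenmp -O3 -mavx2 with GCC or Clang, so that the simd clause can actually emit AVX instructions.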
While a CPU may have tens of cores, GPUs accomplish fine-grained parallelization with thousands of cores that apply a single instruction set to distinct data within smaller workgroups of tens or hundreds of cores. Quick communication and shared cache memory within each workgroup balance full parallelization across groups, and dynamic on- and off-loading of the many tasks hides the latency that is so problematic for multicore computing. Originally designed for the efficiently parallelized matrix math arising from image rendering and transformation, GPUs easily speed up tensor-multiplication-intensive tasks such as deep learning [99], but general-purpose GPU applications abound. Holbrook et al. [21] provide a broader review of parallel computing within computational statistics. The same paper reports a GPU providing 200-fold speedups over single-core processing and 10-fold speedups over 12-core AVX processing for likelihood and gradient calculations while sampling from a Bayesian multidimensional scaling posterior using HMC at scale. Holbrook et al. [22] report similar speedups for inference based on spatiotemporal Hawkes processes. Neither application involves matrix or tensor manipulations.
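A minimal sketch in the same spirit (hypothetical code, assuming a compiler and runtime with OpenMP target-offload support for the installed GPU) maps directly onto the workgroup structure described above: each OpenMP team corresponds roughly to a GPU workgroup, and the inner parallel loop spreads per-observation work across the cores within a team.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Evaluate one log-likelihood term per observation on an attached GPU.
// "teams distribute" maps groups of iterations to GPU workgroups;
// "parallel for" maps iterations to the cores within each workgroup.
double offload_loglik(const std::vector<double>& y, double mu, double sigma2) {
    constexpr double kTwoPi = 6.283185307179586;
    const std::size_t n = y.size();
    const double* data = y.data();
    double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+ : sum) \
        map(to : data[0 : n]) map(tofrom : sum)
    for (std::size_t i = 0; i < n; ++i) {
        const double r = data[i] - mu;
        sum += r * r;
    }
    return -0.5 * (static_cast<double>(n) * std::log(kTwoPi * sigma2) + sum / sigma2);
}
```

The applications cited above rely on purpose-built GPU kernels rather than this directive-based shortcut, but the decomposition of work into workgroups is the same.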
A quantum computer acts on complex data vectors of magnitude 1 called qubits with gates that are mathematically equivalent to unitary operators [100]. Assuming that engineers overcome the tremendous difficulties involved in building a practical quantum computer (where practicality entails simultaneous use of many quantum gates with little additional noise), twenty-first century statisticians might have access to quadratic or even exponential speedups for extremely specific statistical tasks. We are particularly interested in the following four quantum algorithms: quantum search [101], or finding a single 1 amid a collection of 0s, requires only $\mathcal{O}(\sqrt{N})$ queries for a collection of size $N$, delivering a quadratic speedup over classical search; quantum counting [102], or finding the number of 1s amid a collection of 0s, requires only $\mathcal{O}(\sqrt{N/M})$ queries (where $M$ is the number of 1s) and could be useful for generating p-values within Monte Carlo simulation from a null distribution (Section 2.1); to obtain the gradient of a function (e.g., the log-likelihood for Fisher scoring or HMC) with a quantum computer, one only needs to evaluate the function once [103] as opposed to $\mathcal{O}(P)$ times for numerical differentiation of a $P$-dimensional function, and there is nothing stopping the statistician from using, say, a GPU for this single function call; and finally, the HHL algorithm [104] obtains the scalar value $\mathbf{q}^\top \mathbf{M} \mathbf{q}$ for the $P$-vector $\mathbf{q}$ satisfying $\mathbf{A}\mathbf{q} = \mathbf{b}$, for $P$-vector $\mathbf{b}$ and $P \times P$ matrices $\mathbf{A}$ and $\mathbf{M}$, in time logarithmic in $P$, delivering an exponential speedup over classical methods. Technical caveats exist [105], but HHL may find use within high-dimensional hypothesis testing (big $P$). Under the null hypothesis, one can rewrite the score test statistic

$$\nabla \ell(\hat{\boldsymbol{\theta}}_0)^\top \, \boldsymbol{\mathcal{I}}^{-1}(\hat{\boldsymbol{\theta}}_0) \, \nabla \ell(\hat{\boldsymbol{\theta}}_0) = \left[\boldsymbol{\mathcal{I}}^{-1}(\hat{\boldsymbol{\theta}}_0)\,\nabla \ell(\hat{\boldsymbol{\theta}}_0)\right]^\top \boldsymbol{\mathcal{I}}(\hat{\boldsymbol{\theta}}_0) \left[\boldsymbol{\mathcal{I}}^{-1}(\hat{\boldsymbol{\theta}}_0)\,\nabla \ell(\hat{\boldsymbol{\theta}}_0)\right]$$

for $\boldsymbol{\mathcal{I}}(\hat{\boldsymbol{\theta}}_0)$ and $\nabla \ell(\hat{\boldsymbol{\theta}}_0)$, the Fisher information and log-likelihood gradient evaluated at the maximum-likelihood solution under the null hypothesis. Letting $\mathbf{A} = \mathbf{M} = \boldsymbol{\mathcal{I}}(\hat{\boldsymbol{\theta}}_0)$ and $\mathbf{b} = \nabla \ell(\hat{\boldsymbol{\theta}}_0)$, one may write the test statistic as $\mathbf{q}^\top \mathbf{M} \mathbf{q}$ and obtain it in time logarithmic in $P$. When the model design matrix is sufficiently sparse – a common enough occurrence in large-scale regression – to render $\boldsymbol{\mathcal{I}}(\hat{\boldsymbol{\theta}}_0)$ itself sparse, the last criterion for the application of the HHL algorithm is met.
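Spelling out why this mapping returns the score statistic (a restatement using the symbols defined above, not additional material from the chapter): the solution of the linear system, substituted into HHL's quadratic-form output, collapses back to the classical statistic.

```latex
% With A = M = I(theta_hat_0) and b = grad l(theta_hat_0), the quantity
% q^T M q returned by HHL equals the score test statistic.
\begin{aligned}
\mathbf{A}\mathbf{q} = \mathbf{b}
  \;&\Longrightarrow\;
  \mathbf{q} = \boldsymbol{\mathcal{I}}^{-1}(\hat{\boldsymbol{\theta}}_0)\,
               \nabla \ell(\hat{\boldsymbol{\theta}}_0),\\
\mathbf{q}^\top \mathbf{M}\,\mathbf{q}
  \;&=\;
  \nabla \ell(\hat{\boldsymbol{\theta}}_0)^\top
  \boldsymbol{\mathcal{I}}^{-1}(\hat{\boldsymbol{\theta}}_0)\,
  \boldsymbol{\mathcal{I}}(\hat{\boldsymbol{\theta}}_0)\,
  \boldsymbol{\mathcal{I}}^{-1}(\hat{\boldsymbol{\theta}}_0)\,
  \nabla \ell(\hat{\boldsymbol{\theta}}_0)
  \;=\;
  \nabla \ell(\hat{\boldsymbol{\theta}}_0)^\top
  \boldsymbol{\mathcal{I}}^{-1}(\hat{\boldsymbol{\theta}}_0)\,
  \nabla \ell(\hat{\boldsymbol{\theta}}_0).
\end{aligned}
```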