Computational Statistics in Data Science
3.2 C++
C++ is a general‐purpose, high‐performance programming language. Unlike scripting languages commonly used for statistics, such as R and Python, C++ is a compiled language, which adds complexity (such as manual memory management) and strict syntax requirements. As a result, C++'s design may complicate prototyping, and data scientists typically turn to C++ to optimize or scale an already developed algorithm at the production level.
C++'s standard libraries lack many mathematical and statistical operations. However, since C++ can be compiled cross‐platform, developers often call C++ functions from other languages (e.g., R and Python). Thus, C++ can be used to develop libraries that serve several languages while offering impressive computing performance.
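As a minimal sketch of how a C++ routine can be exposed to other languages, the block below wraps a C++ implementation behind a C‐compatible entry point. The names `sample_mean` and `mean_impl`, and the convention of returning 0.0 for empty input, are hypothetical choices made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Ordinary C++ implementation: free to use the standard library internally.
static double mean_impl(const std::vector<double>& x) {
    assert(!x.empty());
    return std::accumulate(x.begin(), x.end(), 0.0) /
           static_cast<double>(x.size());
}

// C-compatible entry point. extern "C" disables C++ name mangling, so the
// symbol is exported as plain "sample_mean" and can be located by a foreign
// function interface. Only C types (a pointer and a length) cross the boundary.
extern "C" double sample_mean(const double* data, std::size_t n) {
    if (data == nullptr || n == 0) {
        return 0.0;  // hypothetical convention for empty input
    }
    return mean_impl(std::vector<double>(data, data + n));
}
```

Compiled into a shared library, a function of this shape could be loaded from Python with the `ctypes` module, for example. Packages such as Rcpp automate much of this glue for R.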
To enable analysis, developers created mathematical and statistical libraries in C++. These packages often employ BLAS (basic linear algebra subprograms) libraries, which are written in C/Fortran and offer numerous low‐level, high‐performance linear algebra operations on numbers, vectors, and matrices. Some popular BLAS‐compatible libraries include Intel Math Kernel Library (MKL) [12], automatically tuned linear algebra software (ATLAS) [13], OpenBLAS [14], and linear algebra package (LAPACK) [15].
Among the C++ libraries for mathematics and statistics built on top of BLAS, we detail three popular, well‐maintained libraries below: Eigen [16], Armadillo [17], and Blaze [18].
Eigen is a high‐level, header‐only library developed by Guennebaud et al. [16]. Eigen provides classes for vectors, arrays, and dense, sparse, and large matrices. It also supports matrix decompositions and geometry features. Eigen uses single instruction, multiple data (SIMD) vectorization and avoids dynamic memory allocation where it can, for example for small fixed‐size objects. Eigen also implements further optimizations, including loop unrolling and processor‐cache utilization. Eigen itself takes little advantage of parallel hardware, currently supporting parallel processing only for general matrix–matrix products. However, since Eigen can work with BLAS‐compatible libraries, users can pair Eigen with an external BLAS library for parallel computing. Python and R users can call Eigen functions using the minieigen and RcppEigen packages, respectively.
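To illustrate how compile‐time dimensions help a library avoid dynamic memory allocation and give the compiler room to unroll loops, here is a small standard‐library sketch. It is not Eigen's implementation: `FixedMatrix` and `multiply` are hypothetical names, and the storage layout is an assumption made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size matrix: the dimensions are template parameters, so
// the storage is a plain array inside the object and no heap allocation occurs.
template <std::size_t Rows, std::size_t Cols>
struct FixedMatrix {
    std::array<double, Rows * Cols> data{};  // row-major, zero-initialized

    double& operator()(std::size_t r, std::size_t c) {
        assert(r < Rows && c < Cols);
        return data[r * Cols + c];
    }
    double operator()(std::size_t r, std::size_t c) const {
        assert(r < Rows && c < Cols);
        return data[r * Cols + c];
    }
};

// Matrix product. Incompatible shapes fail at compile time because the inner
// dimension K must match in both argument types. All loop bounds are
// compile-time constants, which lets the compiler unroll small cases.
template <std::size_t M, std::size_t K, std::size_t N>
FixedMatrix<M, N> multiply(const FixedMatrix<M, K>& a,
                           const FixedMatrix<K, N>& b) {
    FixedMatrix<M, N> out;
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t k = 0; k < K; ++k) {
            for (std::size_t j = 0; j < N; ++j) {
                out(i, j) += a(i, k) * b(k, j);
            }
        }
    }
    return out;
}
```

With this design, multiplying a `FixedMatrix<2, 3>` by a `FixedMatrix<2, 3>` is rejected by the compiler rather than failing at run time.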
National ICT Australia (NICTA) developed the open‐source library Armadillo to facilitate science and engineering [17]. Armadillo provides a fast, easy‐to‐use matrix library with MATLAB‐like syntax. Armadillo employs template meta‐programming techniques to avoid unnecessary operations and increase library performance. Further, Armadillo supports three‐dimensional objects and provides numerous utilities for matrix manipulation and decomposition. Armadillo automatically uses open multiprocessing (OpenMP) [19] to increase speed. Developers designed Armadillo to balance speed and ease of use. Armadillo is widely used in ML, pattern recognition, signal processing, and bioinformatics. R users may call Armadillo functions through the RcppArmadillo package.
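One common template meta‐programming technique for avoiding unnecessary operations is the expression template, in which an arithmetic expression builds a lightweight object that is evaluated only on assignment. The sketch below shows the general idea with hypothetical `Vec` and `Sum` types. It is not Armadillo's code, and it handles only addition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lazy node representing "lhs + rhs". Nothing is computed until operator[]
// is called, so building the node allocates no temporary vector.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double value = 0.0) : data(n, value) {}

    // Assigning from an expression evaluates the whole tree in a single pass.
    template <typename L, typename R>
    Vec& operator=(const Sum<L, R>& expr) {
        assert(expr.size() == data.size());
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = expr[i];
        }
        return *this;
    }

    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// a + b yields a lazy node rather than a new Vec.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

// (expression) + c extends the tree, still without computing anything.
template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) {
    return {a, b};
}

// Example use: r = a + b + c runs one loop and creates no intermediate Vec.
inline double demo_third_element() {
    Vec a(3, 1.0), b(3, 2.0), c(3, 3.0), r(3);
    r = a + b + c;
    return r[2];
}
```

A naive `operator+` that returns a full vector would allocate one temporary per addition. Here the nodes hold only references, which is why the expression must be consumed within the statement that creates it.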
Blaze is a high‐performance math library for dense/sparse arithmetic developed by Iglberger et al. [18]. Blaze extensively uses LAPACK functions for various computing tasks, such as matrix decomposition and inversion, providing high‐performance computing. Blaze supports high‐performance ParalleX (HPX) [20] and OpenMP to enable parallel computing.
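For readers unfamiliar with OpenMP, the following sketch shows the kind of loop‐level parallelism it offers, using a dot product with a reduction. It illustrates OpenMP itself, not the internals of any library discussed here, and the function name `dot` is a hypothetical choice. With GCC or Clang, OpenMP is enabled by the `-fopenmp` flag; without it, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product with an OpenMP reduction: each thread accumulates a private
// partial sum, and the partial sums are combined when the loop finishes.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        sum += x[k] * y[k];
    }
    return sum;
}
```

Because floating‐point addition is not associative, a parallel reduction may differ from the serial result in the last digits for general inputs.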
The difficulty of developing C++ programs limits the language's use as a primary statistical software package. Yet C++ appeals when a fast, production‐quality program is desired. Therefore, R and Python developers may find C++ knowledge beneficial for optimizing their code prior to distribution. We see C/C++ as the standard for speed and, as such, an attractive tool for big data problems.