2.2 Python
Created by Guido van Rossum and released in 1991, Python is a hugely popular programming language [4]. Python features readable code, an interactive workflow, and an object‐oriented design. Python's architecture affords rapid application development from prototyping to production. Additionally, many tools integrate nicely with Python, facilitating complex workflows. Python also delivers speed, as most of its high‐performance libraries are implemented in C/C++.
Python's core distribution lacks statistical features, prompting developers to create supplementary libraries. Below, we detail four well‐supported statistical and mathematical libraries: NumPy [5], SciPy [6], Pandas [7], and Statsmodels [8].
NumPy is a general and fundamental package for scientific computing [5]. NumPy provides functions for operations on large arrays and matrices, optimized for speed via a C implementation. The package features a dense, homogeneous array type called ndarray, which provides computational efficiency and flexibility. Developers consider NumPy a low‐level tool, since only foundational functions are available. To enhance capabilities, other statistical libraries and packages use NumPy to provide richer features.
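To illustrate, here is a minimal sketch of working with ndarray objects; the array values are chosen purely for illustration:

```python
import numpy as np

# Create a homogeneous two-dimensional ndarray from nested lists
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Vectorized, element-wise operations run in compiled C code
y = np.exp(x) + 1.0

# Matrix product and basic reductions
xtx = x.T @ x                # 3 x 3 matrix
col_means = x.mean(axis=0)   # mean of each column
print(xtx, col_means)
```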
One widely used higher level package, SciPy, employs NumPy to enable engineering and data science [6]. SciPy contains modules addressing standard problems in scientific computing, such as mathematical integration, linear algebra, optimization, statistics, clustering, and image and signal processing.
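As a brief sketch, the following applies two of these modules, scipy.integrate and scipy.optimize, to toy problems of our own choosing:

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: integrate exp(-x^2) from 0 to infinity
value, abs_err = integrate.quad(lambda x: np.exp(-x**2), 0, np.inf)

# Unconstrained optimization: minimize a simple quadratic
result = optimize.minimize(lambda v: (v[0] - 1.0)**2 + (v[1] + 2.0)**2,
                           x0=[0.0, 0.0])
print(value, result.x)
```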
Another higher level Python package built upon NumPy, Pandas, is designed particularly for data analysis, providing standard models and cohesive frameworks [7]. Pandas implements a data type named DataFrame – a concept similar to the data.frame object in R. DataFrame's structure features efficient methods for data sorting, slicing, merging, grouping, and indexing. Pandas implements robust input/output tools – supporting flat files, Excel files, databases, and HDF files. Additionally, Pandas provides visualization methods via Matplotlib [9].
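The sketch below illustrates a typical DataFrame workflow in the split-apply-combine style; the column names and file path are hypothetical:

```python
import pandas as pd

# Build a DataFrame from a dictionary of columns
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.5, 2.0, 3.5, 4.0],
})

# Group, aggregate, and sort
summary = df.groupby("group")["value"].mean().sort_values()

# Round-trip through a flat file (path is illustrative)
df.to_csv("example.csv", index=False)
df2 = pd.read_csv("example.csv")
print(summary)
```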
Lastly, the package Statsmodels facilitates data exploration, estimation, and statistical testing [8]. Built at an even higher level than the other packages discussed, Statsmodels employs NumPy, SciPy, Pandas, and Matplotlib. Statsmodels implements many statistical models, such as linear regression, generalized linear models, probability distributions, and time series. See http://www.statsmodels.org/stable/index.html for the full feature list.
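As a small illustration, the sketch below fits an ordinary least squares regression with Statsmodels; the simulated data and coefficient values are our own assumptions:

```python
import numpy as np
import statsmodels.api as sm

# Simulate data for an ordinary least squares regression
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

# Add an intercept column and fit the linear model
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.summary())   # coefficient estimates, standard errors, tests
```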
In addition to the four libraries discussed above, Python features numerous other packages, each bespoke to a particular task. For ML, the TensorFlow and PyTorch packages are widely used, and for Bayesian inference, Pyro and NumPyro are becoming popular (see more on these packages in Section 4). For big data computations, PySpark provides scalable tools to handle memory and computation time issues. For advanced data visualization, pyplot, seaborn, and plotnine may be worth adopting for a Python‐inclined data scientist.
Python's easy‐to‐learn syntax, speed, and versatility make it a favorite among programmers. Moreover, the packages listed above transform Python into a well‐developed vehicle for data science. We see Python's popularity only increasing in the future. Some believe that Python will eventually eliminate the need for R. However, we feel that the immediate future lies in a Python + R paradigm. Thus, R users may well consider exploring what Python offers as the languages have complementary features.