Preface
This book is an elementary introduction to applied statistics using Python. It for the most part assumes no prior knowledge of statistics or data analysis, though a prior introductory course is desirable. It can be appropriately used in a 16-week course in statistics or data analysis at the advanced undergraduate or beginning graduate level in fields such as psychology, sociology, biology, forestry, education, nursing, chemistry, business, law, and other areas where making sense of data is a priority rather than formal theoretical statistics as one may have in a more specialized program in a statistics department. Mathematics used in the book is minimal and where math is used, every effort has been made to unpack and explain it as clearly as possible. The goal of the book is to obtain results using software rather quickly, while at the same time not completely dismissing important conceptual and theoretical features. After all, if you do not understand what the computer is producing, then the output will be quite meaningless. For deeper theoretical accounts, the reader is encouraged to consult other sources, such as the author’s more theoretical book, now in its second edition (Denis, 2021), or a number of other books on univariate and multivariate analysis (e.g., Izenman, 2008; Johnson and Wichern, 2007). The book you hold in your hands is merely meant to get your foot in the door, and so long as that is understood from the outset, it will be of great use to the newcomer or beginner in statistics and computing. It is hoped that you leave the book with a feeling of having better understood simple to relatively advanced statistics, while also experiencing a little bit of what Python is all about.
Python is used in performing and demonstrating data analyses throughout the book, but it should be emphasized that the book is not a specialty text on Python itself. In this respect, the book does not contain a deep introduction to the software, nor does it go into the language that makes up Python computing to any significant degree. Rather, the book is much more “hands-on” in that the code used is a starting point for generating useful results. That is, the code employed is that which worked for the problem under consideration and which the user can amend or adjust afterward when performing additional analyses. When it comes to coding with Python, there are usually several ways of accomplishing similar goals. In places, we also cite code used by others, assigning proper credit. There already exists a plethora of Python texts and user manuals that feature the software in much greater depth. Those users wishing to learn Python from scratch and become efficient, general-purpose programmers should consult those sources (e.g., see Guttag, 2013). For those who want some introductory exposure to Python for generating data-analytic results and wish to understand what the software is producing, it is hoped that the current book will be of great use.
In a book such as this, limited by a fixed number of pages, it is an exceedingly difficult endeavor to instruct on both statistics and software simultaneously. Attempting to cover univariate, bivariate, and multivariate techniques in a book of this size in any kind of respectable depth or completeness in coverage is, well, an impossibility. Combine this with including software options and the impossibility factor increases! However, such is the nature of books like this one that attempt to survey a wide variety of techniques: one has to include only the most essential information to get the reader “going” on the techniques and advise him or her to consult other sources for further details. Targeting the right mix of theory and software in a book like this is the most challenging part, but so long as the reader (and instructor) recognizes that this book is but a foot-in-the-door to get students “started,” then I hope it will fall in the confidence band of a reasonable expectation. The reader wishing to better understand a given technique or principle will naturally find many narratives incomplete, while the reader hoping to find more details on Python will likewise find the book incomplete. On average, however, it is hoped that the current “mix” is of introductory use for the newcomer. It can be exceedingly difficult to enter the world of statistics and computing. This book will get you started. In many places, references are provided on where to go next.
Unfortunately, many available books on the market for Python are nothing more than slaps in the face to statistical theory while presenting a bunch of computer code that otherwise masks a true understanding of what the code actually accomplishes. Though data science is a welcome addition to the mathematical and applied scientific disciplines, and software advancements have made leaps and bounds in the area of quantitative analysis, it is also an unfortunate trend that an actual understanding of statistical theory and methods is sometimes taking a back seat to what we will otherwise call “generating output.” The goal of research and science is not to generate software output. The goal is, or at least should be, to understand in a deeper way whatever output is generated. Code can be looked up far more easily than statistical understanding can be acquired. Hence, the goal of the book is to understand what the code represents (at least the important code on which techniques are run) and, to some extent at least, the underlying mathematical and philosophical mechanisms of one’s analysis. We comment further on this important distinction a bit later in this preface. Each chapter of this book could easily be expanded and developed into a deeper book spanning more than three to four times the size of the entire book.
The objective of this book is to provide a pragmatic introduction to data analysis and statistics using Python, providing the reader with a foot-in-the-door starting point for understanding elementary to advanced statistical concepts while affording him or her the opportunity to apply some of these techniques using the Python language.
The book is the fourth in a series of books published by the author, all with Wiley. Readers wishing a deeper discussion of the topics treated in this book are encouraged to consult the author’s first book, now in its second (and better) edition, titled Applied Univariate, Bivariate, and Multivariate Statistics: Understanding Statistics for Social and Natural Scientists, with Applications in SPSS and R (2021). That book encompasses a much more thorough overview of many of the techniques featured in the current book, featuring the use of both R and SPSS software. Readers wishing a book similar to this one, but instead focusing exclusively on R or SPSS, are encouraged to consult the author’s other two books, Univariate, Bivariate, and Multivariate Statistics Using R: Quantitative Tools for Data Analysis and Data Science and SPSS Data Analysis for Univariate, Bivariate, and Multivariate Statistics. Each of these two texts is far less theory-driven and is more similar to the current book in this regard, focusing on getting results quickly and interpreting findings for research reports, dissertations, or publication. Hence, depending on which software is preferred, readers (and instructors) can select the text best suited to their needs. Many of the data sets repeat themselves across texts. It should be emphasized, however, that all of these books are still at a relatively introductory level, even if surveying relatively advanced univariate and multivariate statistical techniques.
Features used in the book to help channel the reader’s focus:
Bullet points appear throughout the text. They are used primarily to detail and interpret output generated by Python. Understanding and interpreting output is a major focus of the book.
“Don’t Forget!”
Brief “don’t forget” summaries serve to emphasize and reinforce that which is most pertinent to the discussion and to aid in learning these concepts. They also serve to highlight material that can be easily misunderstood or misapplied if care is not practiced. Scattered throughout the book, these boxes help the reader review and emphasize essential material discussed in the chapters.
Each chapter concludes with a brief set of exercises. These include both conceptually-based problems that are targeted to help in mastering concepts introduced in the chapter, as well as computational problems using Python.
Most concepts are implicitly defined throughout the book by introducing them in the context of how they are used in scientific and statistical practice. This is most appropriate for a short book such as this, where time and space to unpack definitions in their entirety are lacking. “Dictionary definitions” are usually grossly incomplete anyway, and one could even argue that definitions in even good textbooks often fail to capture the “essence” of the concept. It is only in seeing the term used in its proper context that one better appreciates how it is employed, and, in this sense, the reader is able to unpack the deeper intended meaning of the term. For example, defining a population as the set of objects of ultimate interest to the researcher is not enlightening. Using the word in the context of a scientific example is much more meaningful. Every effort in the book is made to accurately convey deeper conceptual understanding rather than rely on superficial definitions.
Most of the book was written at the beginning of the COVID-19 pandemic of 2020 and hence it seemed appropriate to feature examples of COVID-19 in places throughout the book where possible, not so much in terms of data analysis, but rather in examples of how hypothesis-testing works and the like. In this way, it is hoped examples and analogies “hit home” a bit more for readers and students, making the issues “come alive” somewhat rather than featuring abstract examples.
Python code is “unpacked” and explained in many, though not all, places. Many existing books on the market contain explanations of statistical concepts (to varying degrees of precision) and then plop down a bunch of code the reader is expected to simply implement and understand. While we do not avoid this entirely, for the most part we guide the reader step-by-step through both the concepts and the Python code used. The goal of the book is in understanding how statistical methods work, not in arming you with a bunch of code whose workings you do not understand. Principal components code, for instance, is meaningless if you do not first understand and appreciate to some extent what components analysis is about.
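To give a flavor of what “unpacking” code can look like, the following is a minimal sketch only, not code taken from the book's chapters. It assumes a small, made-up data set of two variables (the values in `x` and `y` are hypothetical) and uses only the Python standard library. It computes the two principal component variances (the eigenvalues of the covariance matrix) by hand, so that each step of the computation is visible rather than hidden inside a library call.

```python
import math

# Hypothetical toy data: two correlated variables measured on six cases.
x = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
n = len(x)

# Step 1: means of each variable.
mean_x = sum(x) / n
mean_y = sum(y) / n

# Step 2: sample variances and covariance (dividing by n - 1).
var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Step 3: eigenvalues of the 2 x 2 covariance matrix
#   [[var_x, cov_xy],
#    [cov_xy, var_y]]
# using the closed-form solution for a symmetric 2 x 2 matrix.
half_trace = (var_x + var_y) / 2
spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
eig_1 = half_trace + spread  # variance of the first component
eig_2 = half_trace - spread  # variance of the second component

# Step 4: proportion of total variance accounted for by each component.
total_variance = var_x + var_y
prop_1 = eig_1 / total_variance
prop_2 = eig_2 / total_variance

print("Component variances:", eig_1, eig_2)
print("Proportions of variance:", prop_1, prop_2)
```

Note that the two component variances sum to the total variance of the original variables. Components analysis does not create or remove variance; it redistributes it so that the first component accounts for as much as possible. In practice, one would typically rely on a numerical library for data with more than two variables, but the idea is the same.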