Introduction to Python Programming for Business and Social Science Applications - Frederick Kaefer
Preface
Why Python?
This textbook focuses on the Python programming language, but some readers might ask, “Why use Python instead of more established statistical packages such as IBM® SPSS® (Statistical Package for the Social Sciences) and SAS® (Statistical Analysis System) or other programming languages such as R?” Compared with SPSS and SAS, Python has the advantages of being open source and running on many platforms: users are free to copy, distribute, and even change the software. R, like Python, is an open-source programming language used for data analytics, and both are used widely in business, academia, and research.
We focus on Python because we see a need for a text that presents Python programming specifically for those in the social sciences and business who want to develop applications for data analytics. Whereas some may prefer R for statistical analysis and plotting charts, Python is a general-purpose scripting language that can be used to develop applications with graphical user interfaces (GUIs), and many favor Python when working with text-based data. Although GUI applications can be developed in R, doing so in the way we approach them in this book is cumbersome.
In our experience, covering the development of GUIs using the Python tkinter package (Chapter 11) right after control logic (Chapter 4) has piqued the interest of students through the creation of interactive, user-friendly applications. Although working with the tkinter package has some challenges, we have found that students who have successfully developed GUI-based applications are well prepared to work with other packages. Users of our textbook will become comfortable working with packages, and since there are over 200,000 Python packages available in the Python package repository, this is likely the most valuable takeaway from our textbook.
Pedagogy
Our book does not assume that the reader has any background in programming or statistics. This book has detailed code examples specified with line numbers to point out specific elements of Python as we cover them. Examples in each chapter focus on two large-scale data sets (Chicago Taxi Trips and the General Social Survey), building from the basics in the beginning chapters to the actual usage and analysis of the data sets in later chapters. These code examples are all available on a companion website at study.sagepub.com/researchmethods/statistics/kaefer-intro-to-python so that readers can execute the code examples to “learn by doing” as we introduce them in the material. We also include exercises in each chapter for readers to apply the concepts as they learn them.
Each chapter contains the following features:
Introduction
Key terms bolded when first defined
Tables
Figures with code examples to point out usage
Stop, Code, and Understand! exercises after topics are covered
Chapter summary
Glossary with key terms from the chapter
End-of-chapter exercises
We also provide an appendix at the back of the book with solutions to all Stop, Code, and Understand! exercises and resources for further reference.
Approach
This book is uniquely suited to people in the fields of business and social science who are learning programming for data analysis applications. Business and social science students learning programming need meaningful examples that are relevant to their field and that let them see the value of the software applications they develop. Three essential components are (1) the extraction of data from a database and/or the web, (2) the statistical analysis and visualization of the data to support decision making, and (3) the development of a graphical user interface that both makes applications more inviting for users and limits possible errors. Through our careful presentation and explanation of these components, students will be more motivated to learn Python and inspired to delve deeper into details that we do not present in depth.
One of our primary goals is that students using this textbook will develop skills in using a variety of modules and packages. Students will see the tremendous appeal of Python through working with Python modules in packages (including matplotlib, NumPy, Pandas, scikit-learn, SciPy, seaborn, and tkinter), learning the benefits of an integrated development environment (IDLE), and using a package manager (pip). We begin by developing a simple module and using it in Python code in Chapter 2. After covering basic Python features in Chapters 3 and 4, we progress to using Python built-in modules and then modules that are available in installed packages. A table of modules and packages used in the book (and the corresponding textbook figure in which they first appear) immediately follows this preface. After working with the variety of modules and packages covered in this book, readers will be able to use Python modules created for other purposes, both those that exist today and those developed in the future.
The primary market for this book is any social science or business undergraduate-level or graduate-level introductory course in Python programming. This book is for courses that focus on the development of applications using Python, particularly business and social science applications. This textbook assumes no prerequisite knowledge or coursework in computer programming or statistics. The intended course is the first technical course in a data science certificate or MBA-level program. We use data from two very large real-world data sets (the General Social Survey data set and the City of Chicago’s Taxi Trips data set) systematically throughout the book. By the end of a course using this textbook, students will be able to work with large data sets to build statistical models and visualize results. Novice learners following our approach will find it easy to build their technical knowledge and motivation. Our focus on the use of Python modules and packages facilitates students’ learning and prepares students to leverage Python for future purposes.
After taking a course using this textbook, students will be prepared for research and for more advanced courses that require data analysis and statistics. They will be prepared to conduct analysis on large data sets using Python, learning from our mix of explanation and examples. Finally, students will have a solid foundation for continuing to build on the technical abilities they developed from this book.
Data Sets
This textbook develops examples and applications using two data sets that are publicly available and represent real-world data science problems. The first data set, the General Social Survey (GSS), is appealing to those with an interest in social sciences. The National Data Program for the Social Sciences has run the GSS since 1972 (http://www.gss.norc.org/About-The-GSS). You can explore the data online using a data explorer or download the complete data sets (http://www.gss.norc.org/Get-The-Data). We downloaded the data sets in SPSS format and used SPSS to import the data and export them to a CSV file. The full data set has over 5,800 variables covering a wide range of survey questions asked, with more than 62,000 responses from over four decades. We will not explore every variable or response but will investigate patterns and trends in the data.
The second data set, Chicago Taxi Trips, is appealing to those interested in business applications. The Taxi Trips data set is publicly available through the City of Chicago. The data set has more than 100 million records of taxi trips, with 26 fields (variables) per record, including duration, fare, tips, and the GPS coordinates of pickups and drop-offs. More information on this data set is available at https://digital.cityofchicago.org/index.php/chicago-taxi-data-released/. We present examples analyzing these data in many ways, including predictions of trip fares based on miles traveled and length in minutes. We also access data directly from the City of Chicago’s application programming interface (API) in Chapter 7.
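The kind of fare prediction mentioned above can be sketched as an ordinary least-squares fit on two predictors, miles and minutes. The following is a minimal illustration that uses only the Python standard library. The trip values are invented for this example (they are not records from the Taxi Trips data set), and the function names fit_fare_model and predict_fare are hypothetical; the examples in the book itself may rely on different packages and methods.

```python
from statistics import mean


def fit_fare_model(miles, minutes, fares):
    """Fit fare = intercept + b_miles * miles + b_minutes * minutes by least squares."""
    m_bar, t_bar, f_bar = mean(miles), mean(minutes), mean(fares)
    # Center each variable so the intercept drops out of the normal equations.
    dm = [m - m_bar for m in miles]
    dt = [t - t_bar for t in minutes]
    df = [f - f_bar for f in fares]
    # Sums of squares and cross-products of the centered variables.
    s_mm = sum(a * a for a in dm)
    s_tt = sum(b * b for b in dt)
    s_mt = sum(a * b for a, b in zip(dm, dt))
    s_mf = sum(a * c for a, c in zip(dm, df))
    s_tf = sum(b * c for b, c in zip(dt, df))
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s_mm * s_tt - s_mt ** 2
    if det == 0:
        raise ValueError("miles and minutes are perfectly collinear")
    b_miles = (s_mf * s_tt - s_tf * s_mt) / det
    b_minutes = (s_tf * s_mm - s_mf * s_mt) / det
    intercept = f_bar - b_miles * m_bar - b_minutes * t_bar
    return intercept, b_miles, b_minutes


def predict_fare(model, miles, minutes):
    """Apply a fitted (intercept, b_miles, b_minutes) model to one trip."""
    intercept, b_miles, b_minutes = model
    return intercept + b_miles * miles + b_minutes * minutes


# Invented trips (NOT real taxi data); the fares follow a made-up rule of
# 3.00 + 2.00 per mile + 0.50 per minute so the fit is easy to check.
sample_miles = [1.0, 2.5, 4.0, 6.0, 8.5]
sample_minutes = [6, 10, 18, 20, 35]
sample_fares = [8.0, 13.0, 20.0, 25.0, 37.5]

model = fit_fare_model(sample_miles, sample_minutes, sample_fares)
estimate = predict_fare(model, miles=5.0, minutes=15)
print(f"Estimated fare for a 5-mile, 15-minute trip: {estimate:.2f}")
```

Because the invented fares follow the made-up rule exactly, the fitted coefficients should recover that rule; real trip records are noisy, so a fit on actual data would only approximate the underlying relationship.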
We use samples from each of the two data sets in examples and exercises throughout this book. Chapter 1 introduces both data sets, and Chapter 2 has beginning code examples that use data resembling those in the taxi trips data set. In Chapter 3, GSS data illustrate tuples and the use of dictionaries to look up the value corresponding to a key. In Chapter 4, examples from both data sets illustrate control logic, including list comprehension. In Chapter 5, a CSV file with GSS data illustrates working with specific columns in a CSV file. Also in Chapter 5, a Microsoft Access database based on the taxi trip data illustrates working with data in a relational database file using Structured Query Language (SQL) with the pyodbc package. In Chapter 6, we use taxi trip data to illustrate features of the NumPy and Pandas packages, and a data set from the GSS to illustrate data cleaning and preparation using the Pandas package. In Chapter 7, we use the BeautifulSoup package to show that obtaining data by scraping a web page (here, the GSS website) is not always as easy as one might expect. In Chapter 7, we also use REST API queries to obtain data from the taxi trips data set directly from the Chicago Data Portal website. In Chapter 8, variables from both the GSS and taxi trips data sets illustrate statistical analysis. In Chapter 9, data from both data sets demonstrate how the matplotlib package visualizes data. In Chapter 10, both the GSS data set and the taxi trip data set illustrate different machine learning classification techniques. In Chapter 11, we develop a graphical user interface using the tkinter package with data from the taxi trips data set. Two tables that detail the examples presented, by data set, throughout the textbook immediately follow this preface.
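To give a flavor of the early-chapter features listed above, the sketch below looks up a value in a dictionary, filters a list with a list comprehension, and reads selected columns from CSV text. All names and values are hypothetical stand-ins chosen for illustration (they are not taken from the GSS or Taxi Trips files), and the CSV is held in memory so the example runs without any downloads.

```python
import csv
import io

# Hypothetical code-to-label mapping, in the style of a survey codebook.
degree_labels = {
    0: "Less than high school",
    1: "High school",
    2: "Junior college",
    3: "Bachelor",
    4: "Graduate",
}


def label_for(code):
    """Look up the label for a numeric code; fall back to 'Unknown'."""
    return degree_labels.get(code, "Unknown")


# Invented trip durations in minutes (not real taxi records).
trip_minutes = [4, 12, 27, 9, 41]

# List comprehension: keep only trips longer than 10 minutes.
long_trips = [m for m in trip_minutes if m > 10]

# A tiny in-memory CSV with made-up rows and column names.
csv_text = "year,age,degree\n2016,34,3\n2016,58,1\n2018,41,4\n"


def read_columns(text, columns):
    """Return one dictionary per CSV row, keeping only the named columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: row[name] for name in columns} for row in reader]


rows = read_columns(csv_text, ["year", "degree"])

# CSV values arrive as strings, so convert the code before the lookup.
labels = [label_for(int(row["degree"])) for row in rows]

print(long_trips)
print(rows)
print(labels)
```

Reading from a file on disk would follow the same pattern, with the io.StringIO object replaced by a file opened with open(path, newline="").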
Digital Resources
Visit study.sagepub.com/researchmethods/statistics/kaefer-intro-to-python for downloadable study resources to accompany this text. Resources include Python code files, data sets, and Stop, Code, and Understand! exercises and solutions.