Practical Data Analysis with JMP, Third Edition - Robert Carver
About This Book
What Does This Book Cover?
Purpose: Learning to Reason Statistically
We live in a world of uncertainty. Today more than ever before, we have vast resources of data available to shed light on crucial questions. But at the same time, the sheer volume and complexity of the “data deluge” can distract and overwhelm us. The goal of applied statistical analysis is to work with data to calibrate, cope with, and sometimes reduce uncertainty. Business decisions, public policies, scientific research, and news reporting are all shaped by statistical analysis and reasoning. Statistical thinking is an essential part of the boom in “big data analytics” in numerous professions. This book will help you use and discriminate among some fundamental techniques of analysis, and it will also help you engage in statistical thinking by analyzing real problems. You will come to see statistical investigations as an iterative process and will gain experience in the major phases of that process.
To be an effective analyst or consumer of other people’s analyses, you must know how to use these techniques, when to use them, and how to communicate their implications. Knowing how to use these techniques involves mastery of computer software like JMP. Knowing when to use these techniques requires an understanding of the theory underlying the techniques and practice with applications of the theory. Knowing how to effectively communicate with consumers of an analysis or with other analysts requires a clear understanding of the theory and techniques, as well as clarity of expression, directed toward one’s audience.
There was a time when a first course in statistics emphasized abstract theory, laborious computation, and small sets of artificial data—but not practical data analysis or interpretation. Those days are thankfully past, and now we can address all three of the skill sets just cited.
Scope and Structure of This Book
As a discipline, statistics is large and growing; the same is true of JMP. One paperback book must limit its scope, and the content boundaries of this book are set intentionally along several dimensions.
First, this book provides considerable training in the basic functions of JMP 15. JMP is a full-featured, highly interactive, visual, and comprehensive package. The book assumes that you have the software at your school or office. The software’s capabilities extend far beyond an introductory course, and this book makes no attempt to “cover” the entire program. The book introduces students to its major platforms and essential features and should leave students with sufficient background and confidence to continue exploring on their own. Fortunately, the Help system and accompanying manuals are quite extensive, as are the learning resources available online at http://www.jmp.com.
Second, the chapters largely follow a traditional sequence, making the book compatible with many current texts. As such, instructors and students will find it easy to use the book as a companion volume in an introductory course. Chapters are organized around core statistical concepts rather than software commands, menus, or features. Several chapters include topics that some instructors might view as “advanced”—typically when the output from JMP makes it a natural extension of a more elementary topic. This is one way in which software can redefine the boundaries of introductory statistics.
Third, nearly all the data sets in the book are real and are drawn from those disciplines whose practitioners are the primary users of JMP software. Inasmuch as most undergraduate programs now require coursework in statistics, the examples span major areas in which statistical analysis is an important path to knowledge. Those areas include engineering, life sciences, business, and economics.
Fourth, each chapter invites students to practice the habits of thought that are essential to statistical reasoning. Long after readers forget the details of a particular procedure or the options available in a specific JMP analysis platform, this book may continue to resonate with valuable lessons about variability, uncertainty, and the logic of inference.
Each chapter concludes with a set of “Application Scenarios,” which lay out a problem-solving or investigative context that is in turn supported by a data table. Each scenario includes a set of questions that implicitly require the application of the techniques and concepts presented in the chapter.
New in the Third Edition
This edition preserves much of the content and approach of the earlier editions, while updating examples and introducing new JMP features. As in the second edition, there are three review chapters (Chapters 5, 9, and 17) that pause to recap concepts and techniques. One of the perennial challenges in learning statistics is that it is easy to lose sight of major themes as a course progresses through a series of seemingly disconnected techniques and topics. Some readers should find the review chapters to be helpful in this respect. The review chapters share a single large data set of World Development Indicators, published by the World Bank.
The scope and sequence of chapters are essentially the same as in the prior edition. There is some new material about the importance of documenting one’s work with an eye toward reproducibility of analyses, as well as production of presentation-ready reporting. The second edition was based on JMP 11, and since that time, platforms have been added or modified, and some functionality has been relocated in the menu system. This edition captures those changes.
Some of the updated data tables are considerably larger than their counterparts in earlier editions. This creates the opportunity to demonstrate methods for meaningful graphs when data density and overplotting become issues. I also use some of the larger data tables to introduce machine learning practices like partitioning a data set into training and validation sets.
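As a rough illustration of the partitioning idea just mentioned, here is a minimal Python sketch that randomly splits the rows of a data table into training and validation sets. It is not JMP code and not the book's own procedure; the function name `partition_rows`, the 70/30 proportion, and the seed value are all assumptions chosen for illustration.

```python
import random


def partition_rows(n_rows, train_fraction=0.7, seed=2020):
    """Randomly split row indices 0..n_rows-1 into training and validation lists.

    Hypothetical helper for illustration; the 70/30 default is an assumption.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)      # fixed seed so the same split can be reproduced
    indices = list(range(n_rows))
    rng.shuffle(indices)           # shuffle so the split ignores the table's row order
    n_train = round(n_rows * train_fraction)
    train = sorted(indices[:n_train])
    validation = sorted(indices[n_train:])
    return train, validation


# Example: split a 10-row table and report the size of each set.
train_rows, validation_rows = partition_rows(10)
print(len(train_rows), len(validation_rows))
```

Fixing the seed is a deliberate choice here: it makes the partition repeatable, which fits the emphasis on reproducible analyses elsewhere in this edition.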
JMP Projects are introduced in Chapter 2 and used throughout the book. Projects are a way to organize, preserve, and document multiple analyses using multiple data tables. They naturally support a logical and reproducible workflow. Using projects is a way for newcomers to establish good habits and for JMP veterans to be more efficient.
Other additions and amendments include:
● Early introduction of more data types, Header Graphs, and JMP Public.
● Expanded use of Subset, Global and Local Data Filters, and Animate. In the prior editions, for example, the set of data tables included some subsets of larger tables. Because data preparation is such an important part of the analytical cycle, readers now learn to perform filtering and subsetting functions on their own.
● The Recode command has evolved since JMP 11, as have the lessons using Recode. Readers will learn why and how to recode a column.
● In the Regression chapters, coverage of the Profiler has expanded, and I have added the Partition Platform to the discussion of variable selection. The Fit Curve platform also makes its first appearance, as do temporary variable transformations.
● For JMP Pro users, there is a brief treatment of the Formula Depot to facilitate comparison of models.
● In Chapter 21 on Design of Experiments, we meet Definitive Screening Designs.
● In Chapter 22, Variability Charts have been added.
● Simulators and calculators supplied as JSL scripts in earlier editions have been bundled among JMP’s teaching demonstrations in the Help system. The text now reflects this very useful change.
Is This Book for You?
Intended Audience
This book is intended to supplement an introductory college-level statistics course with real investigations of some important and engaging problems. Each chapter presents a set of self-paced exercises to help students learn the skills of quantitative reasoning by performing the types of analyses that typically form the core of a first course in applied statistics. Students can learn and practice the software skills outside of class. Instructors can devote class time to statistics and statistical reasoning, rather than to rudimentary software instruction. Both students and teachers can direct their energies to the practice of data analysis in ways that inform students’ understanding of the world through investigations of problems that matter in various fields of study.
Though written with undergraduate and beginning graduate students in mind, some practitioners might find the book helpful on the job and are well-advised to read the book selectively to address current tasks or projects. Chapters 1 and 2 form a good starting point before reading later sections. Appendix B (online for this edition) covers several data management topics that might be helpful for readers who undertake projects involving disparate data sources.
Prerequisites
No prior statistical knowledge is presumed. A basic grounding in algebra and some familiarity with the Mac OS or Windows environment are all you need in advance. An open, curious mind is also helpful.
A Message for Instructors
I assume that most teachers view class time as a scarce resource. One of my goals in writing this book was to strive for clarity throughout so that students can be expected to work through the book on their own and learn through their encounters with the examples and exercises. This book may be especially welcome for instructors using an inverted, or flipped, classroom approach.
Instructors might selectively use exercises as in-class demonstrations or group activities, interspersing instruction or discussion with computer work. More often, the chapters and scenarios can serve as homework exercises or assignments, either to prepare for other work, to acquire skills and understanding, or to demonstrate progress and mastery. Finally, some instructors might want to assign a chapter in connection with an independent analysis project. Several of the data tables contain additional variables that are not used within chapters. These variables might form the basis for original analyses or explorations.
The bibliography may also aid instructors seeking additional data sources or background material for exercises and assignments. Tips for classroom use of JMP are also available at the book’s website, accessible through the author’s page at support.sas.com/carver.
A Message for Students
Remember that the primary goal of this book is to help you understand the concepts and techniques of statistical analysis. JMP provides an ideal software environment to do just that. Naturally, each chapter is “about” the software and at times you will find yourself focusing on the details of a JMP analysis platform and its options. If you become entangled in the specifics of a problem, step back and try to refocus on the main statistical ideas rather than software issues.
This book should augment, but not replace, your primary textbook or your classroom time. To get the maximum benefit from the book, work mindfully and carefully. Read through a chapter before you sit down at the computer. Each chapter will require approximately 30 minutes of computer time; work at your own pace and take your time. Remember that variability is omnipresent, so expect that the time you need to complete a chapter may be more or less than 30 minutes.
The Application Scenarios at the end of each chapter are designed to reinforce and extend what you have learned in the chapter. The questions in this section are designed to challenge you. Sometimes, it is obvious how to proceed with your analysis; sometimes, you will need to think a bit before you issue your first command. The idea is to engage in statistical thinking, integrating what you have learned throughout your course. There is much more to data analysis than finding a numerical answer, and these questions provide an opportunity to do realistic analysis. Because the examples use real data, don’t expect to find neat “pat” results; computations won’t typically come out to nice round numbers.
JMP is a large program designed for diverse user needs. Many of the features of the software are beyond the scope of an introductory course, and therefore this book does not discuss them. However, if you are curious or adventurous, you should explore the menus and Help system as well as the JMP website. You might find a quicker, more intuitive, or more interesting way to approach a problem. For most of the topics addressed in the book, you will see only an introduction; there is almost always more to know.
What Should You Know about the Examples?
Real statistical investigations begin with pressing, important, or interesting questions, rather than with a set of techniques. Researchers do not begin a study by saying “Today is a good day to compute some standard deviations.” Instead, they pose questions that can be pursued by analyzing data and follow a relatively straightforward protocol to refine the question, generate or gather suitable data, apply appropriate methods, and interpret their findings. The chapters in this book present questions that I hope you will find interesting, and then rely on the data tables provided to search for answers. The questions and analyses become progressively more challenging through the book.
Software Used to Develop the Book’s Content
The book was developed using pre-production versions of JMP 15 Pro. The essential examples work with the standard version of JMP. Whenever a section illustrates JMP Pro functionality, that fact is clearly announced.
Example Data
As previously noted, each of the data tables referenced within the book contains real data, much of it downloaded from public websites. There are 45 different data tables, most of which have been updated for this edition. Readers should download all of the JMP data tables via the author page at support.sas.com/carver. Appendix A describes each file and its source. Many of the tables include columns (variables) in addition to those featured in exercises and examples. These variables might be useful for projects or other assignments.
Where Are the Exercise Solutions?
Solutions to the scenario questions are available via the author page at support.sas.com/carver. Instructors who adopt the book will be able to access all solutions. Students and other readers can find solutions to the even-numbered problems at the same site.
Thanks and Acknowledgments
The first edition of this book began at the urging of Curt Hinrichs, the Academic Program Manager for JMP. This led to conversations with Julie Palmieri, Editor-in-Chief at SAS Press at the time, after which the project started to take shape. I have had the great good fortune to work with a different editor for each edition: Stephenie Joyner, Sian Roberts, and, most recently, Catherine Connolly have kept this little trolley on the tracks.
At SAS Press, so many people have contributed to the planning and execution of the book in its development. For this edition, Sian Roberts is now Publisher. Suzanne Morgen handled the copyediting, Denise Jones the production, Robert Harris the cover design, and Missy Hannah the marketing effort.
Earlier editions were shaped and tended by Shelley Sessoms, Stacey Hamilton, Shelly Goodin, Mary Beth Steinbach, Cindy Puryear, Brenna Leath, Brad Kellam, Candy Farrell, Patrice Cherry, and Jennifer Dilley. My enduring thanks go to them all.
Many other professionals at JMP have influenced and informed the content of this book at critical points along the way. I am very grateful to John Sall, Xan Gregg, Jon Weisz, Brad Jones, Brady Brady, Jonathan Gatlin, Jeff Perkinson, Ian Cox, Chuck Pirrello, Brian Corcoran, Christopher Gotwalt, Curt Hinrichs, Mia Stephens, Volker Kraft, Julian Parris, Ruth Hummel, Kathleen Watts, Mary Loveless, Gail Massari, Lori Harris, Holly McGill, Peng Liu, and Eric Hill for encouraging me, answering my questions, setting me straight, and listening to my thoughts. To this group I add a special shout-out to JMP Senior Systems Engineer Rob Lievense, who has been a consistent advocate and supporter of this work.
I am especially thankful for the care and attention of those people who have reviewed this and the prior editions. Technical reviews of the current edition were provided by Mark Bailey, Duane Hayes, and Kristen Bradford. Mark has been the constant among reviewers, having made invaluable recommendations to all three editions. Performing double duty on the first two editions were Tonya Mauldin and Sue Walsh. Fang Chen, Paul Marovich, and Volker Kraft rounded out the many superb reviewers. Collectively, their critiques tightened and improved this book, and whatever deficiencies may remain are entirely mine.
Naturally, the completion of a book requires time, space, and an amenable environment. I want to express public thanks to three institutions that provided facilities, time, and atmospherics suitable for steady work on this project. My home institution, Stonehill College, was exceptionally supportive, particularly through the efforts of Provost Joe Favazza, my chairperson Debra Salvucci, and Department Administrative Assistant Carolyn McGuinness. Colleagues Dick Gariepy and Michael Salé generously tested several chapters and problems in their classrooms, and Jan Harrison and Susan Wall of our IT Department eased several technical aspects of this project as well.
Colleagues and students at the International Business School at Brandeis University sharpened my pedagogy and inspired numerous examples found in the book. During a sabbatical leave from Stonehill, Babson College was good enough to offer a visiting position and a wonderful place to write the first edition. For that opportunity, thanks go to Provost Shahid Ansari, former chairperson Norean Radke Sharpe, then-chair Steve Ericksen, and colleagues John McKenzie and George Recck.
During the summer of 2013, the Stonehill Undergraduate Research Experience (SURE) program provided a grant to support this work with time, space, and finances. Carolyn Moodie (Class of 2015) was a superior and self-directed research collaborator, assisting in the critical phases of problem formulation, data identification, data cleaning, and exploratory analysis. Carolyn also brought a keen eye to editorial tasks, and willingly gave her feedback on which topics student readers might find engaging. Thanks also to Bonnie Troupe for her skillful administration of the SURE program. During the spring and summer of 2013, Stonehill students Dan Doherty, Erin Hollander, and Tate Molaghan also pitched in with editorial and research assistance.
Special acknowledgment also goes to former Stonehill students from BUS207 (Intermediate Statistics) who “road tested” several chapters, and very considerable thanks to three students who assisted greatly in shaping prose and examples, as well as developing solutions to scenario problems: Frank Groccia, Dan Bouchard, and Matt Arey. Later students in BUS206 (Quantitative Analysis for Business) at Stonehill also class-tested several chapters and exercises.
Several of the data tables came through the gracious permission of their original authors and compilers. I gratefully acknowledge the permission granted by my good friend George Aronson for the Maine SW table; by Prof. Max A. Little for the Parkinson’s disease vocal data; by Prof. Jesper Rydén for the Sonatas data table (from which the Haydn and Mozart tables were extracted); by Prof. John Holcomb for the North Carolina birth weight data; and by Prof. I-Cheng Yeh for the Concrete table and the two subsets from that data.
In recent years, my thoughts about what is important in statistics education have been radically reshaped by colleagues in the ISOSTAT listserv and the Consortium for the Advancement of Undergraduate Statistics Education (CAUSE) and the U.S. Conference on Teaching Statistics that CAUSE organizes every two years. The May 2013 CAUSE-sponsored workshop “Teaching the Statistical Investigation Process with Randomization-Based Inference” given by Beth Chance, Allan Rossman, and Nathan Tintle influenced some of the changes in my presentation of inference. Over an even longer period, our local group of New England Isolated Statisticians and the great work of the ASA’s Section on Statistics Education influence me daily in the classroom and at the keyboard.
Finally, it is a pleasure to thank my family. My sons, Sam and Ben, keep me modest and regularly provide inspiration and insight. My wife Donna—partner, friend, wordsmith extraordinaire—has my love and thanks for unflagging encouragement, support, and warmth. This book is dedicated to them.
We Want to Hear from You
SAS Press books are written by SAS Users for SAS Users. We welcome your participation in their development and your feedback on SAS Press books that you are using. Please visit sas.com/books to do the following:
● Sign up to review a book
● Recommend a topic
● Request information on how to become a SAS Press author
● Provide feedback on a book
Do you have questions about a SAS Press book that you are reading? Contact the author through saspress@sas.com or https://support.sas.com/author_feedback.
SAS has many resources to help you find answers and expand your knowledge. If you need additional help, see our list of resources: sas.com/books.