The Informed Company - Dave Fowler
About This Book
Why Write This Book
Most comprehensive books on analytics architecture that we've found are over a decade old, most of them pre‐cloud. Because there really isn't a modern equivalent to Kimball's seminal The Data Warehouse Toolkit, today's data teams have to reinvent the principles of building a data stack. Too often, they do this without guidance. To solve this problem, we have created a best‐practices guide for bootstrapping and nurturing a technologically current data warehouse.
Who This Book Is For
We wrote this book for whoever values data and believes that informed companies are competitive. It's a book for the working professional who is creating a practical, modern data stack. It's for the lone analyst or the professional embedded in a team. It's for anyone interested in what design practices underlie robust data architecture, the kind that equips entire companies with business intelligence insights. At its heart, this book is written with collaboration in mind (Figure A.1).
Figure A.1 Data management is a collaborative process.
Who This Book Is Not For
This book is not written for “big data” professionals. To be clear, even large corporations like DoorDash, Discord, and the owners of The Financial Times and The New York Times (all previous customers of ours) do not qualify as big data companies. As a rule of thumb, the big data label applies to data architectures with raw input that exceeds 100 GB per day.
No doubt, many elements of this text map onto the big data workflow, especially since warehouses support all sorts of tables, not just, say, event streams. However, our aim is to focus on the central pillars of a modern data stack, so that the widest set of readers can readily benefit from the information herein. In this spirit, we forgo recommendations for mega‐scale architectures.
This book is not for AI‐enabled teams and does not cover AI workflows, machine learning models, or real‐time operational use cases. Instead, its goal is to provide best practices for building and maintaining a robust data analytics stack (i.e. the analytics foundation on which an AI workflow can be built).
If you are a small business that can run everything with QuickBooks and Excel, that is great. Data is important for all companies, but if these tools are already serving you well, the book may not offer helpful guidance. If you start exceeding the data capacity of Excel or bring in a data source that needs to be in a database to be analyzed, then keep reading.
Who Wrote the Book
This book was written by Dave Fowler and Matt David.
Dave Fowler has worked in BI for over a decade, and has always looked for ways to JOIN teams ON data. He wants to enable any working professional (not just data analysts) to explore and understand their data. As the founder and CEO of Chartio, Dave has spent the last 11 years leading the development of a self‐service BI product that aims to do just that. Chartio's suite of tools makes it easy for anyone at a data‐driven business to browse their schemas, merge various data sources, and produce beautiful dashboards. In March 2021, Atlassian acquired Chartio and is integrating it into their platform.
Matt David has worked in product management and education for eight years. He previously worked at Udacity and General Assembly, teaching analytics. As data becomes a necessary skill for more and more jobs, he passionately advocates for data literacy among the workforce. As the current head of The Data School, he oversees the production of free, online resources focused on leveraging data within companies. Recent book topics include SQL optimization, data governance, and common analysis biases. Dave started The Data School, and together he and Matt have grown it into an important free resource for the data community.
Dave and Matt decided to co‐write this book after seeing how many people struggle when constructing data stacks and then trying to use them. This book was created with the support of many employees at Chartio. They graciously provided insights into how customers model their data and collected frequently asked data‐infrastructure questions. Their contributions guided the production of this text.
Who Edited the Book
This book was reviewed and edited by Emilie Schario, Mila Page, and David Yerrington. Emilie is the head of data at Netlify and previously helped build GitLab's entire data organization. She regularly writes and speaks on all things related to modern data. Mila is a developer relations advocate at dbt Labs, the makers of dbt (data build tool). She helps data professionals learn and apply modern analytics‐engineering practices, and is an organizer for Coalesce, the dbt Community’s annual conference. David is a data science consultant and was the Global Lead Data Science Instructor at General Assembly. He helps people around the world better leverage their data. Emilie, Mila, and David have shaped the narrative and content of this book. Their (sometimes) line‐by‐line feedback has ensured that we can proudly stand behind our recommendations.
Influences
We've drawn on several sources of information and opinion when writing this text. While at Chartio, we worked with hundreds of modern cloud‐based customers. We've collected, implemented, and refined these practices ourselves, and through writing this book, vetted them further with partners and customers. We've also learned from the data community through dataschool.com, blogs like Tristan Handy's, and data‐focused Slack communities.
And lastly, it's worth noting and thanking some classic books that informed the previous generation of warehousing toolkits. We honor them by echoing their terminology and best practices wherever possible:
Agile Data Warehouse Design by Lawrence Corr
The Data Warehouse Toolkit by Ralph Kimball
Information Dashboard Design by Stephen Few
How This Book Was Written
This book originates in part from a project within The Data School (Figure A.2), a collection of free online books and interactive tutorials on managing and leveraging data (see dataschool.com). These resources are always expanding, much like the articles of Wikipedia: each round of updates sees our ebooks cover additional topics, go deeper on established ideas, share more real‐world examples, and better deliver that content. Our goal is to maintain and improve these resources and keep them modern.
Source: The Data School
Few are complete “experts” in all of the areas of modern data governance, and the landscape is changing all of the time. If you have a story to share, or a chapter you think is missing, or a new idea, email us. Even if you don't know what specifically to share, but you don't mind sharing your story, please reach out as we are particularly interested in adding real‐world experiences and insights.
There is already too much jargon in the data world, often created by talented vendor marketing teams. We try to stick with the most common and straightforward words that are already in use. For any jargon we do find necessary, we include a definition.
There are many books for the old ways of working with data. We're highlighting current best practices here, so we ignore outdated terminology and techniques. In a few cases where it is beneficial to talk about industry evolution—like the change from ETL to ELT—we teach ELT and discuss the choice in a separate chapter.
Almost every part of this book could be contentious to someone, in some use case or to some vendor. In writing this book, it is tempting to bring up the caveats everywhere and write what would ultimately be a very defensive and overly explained book. We believe this type of book is way less useful for people seeking straightforward advice. Where we have a strong opinion, we don't argue it; we just go with it. Where we think the user has a legitimate choice to make, we pose those options.
This book aims to provide a broad overview and general guidelines on how to set up a data stack. We intentionally gloss over the details of launching a Redshift instance, writing SQL, or using various BI products. That would clutter the text, repeat what's already on the internet, and make the read quite stale.
How to Read This Book
The book starts with a quick overview and decision charts that describe the stages and help you determine which stage is appropriate for you. The book has a section for each of the four stages, and if you'd like, you can jump ahead to the stage you're at.
Not every company needs the entirety of this book. As a growing company's data needs expand, more and more of the book becomes valuable. Note, though, that each best practice is introduced at the stage where it first becomes relevant. To avoid redundancy, we present it only once and assume it remains useful from that point in the book onward. So it may benefit you to at least skim the earlier stages even if you and your company are further ahead.
At the end of the book we have a section where we describe what has changed in the data world that makes this new architecture relevant and performant. In the main text, we avoid explaining how our recommendations differ from previous practices like Kimball dimensional modeling so as not to clutter the experience. Such discussions are necessary, however, and we've put them in this last section of the book.
Lastly, throughout the book you will see the following icons:
Definitions
Definitions explain a term found on the same page. For example, on this page, the term “data lake” is mentioned. A data lake is a staging area for several data sources.
Protips
Protips expand on an idea or provide additional information about a topic related to what you read within a given chapter.