The Informed Company
Реклама. ООО «ЛитРес», ИНН: 7719571260.
Оглавление
Dave Fowler. The Informed Company
Table of Contents
List of Tables
List of Illustrations
Guide
Pages
The Informed Company. How to Build Modern Agile Data Stacks that Drive Winning Insights
About This Book. Why Write This Book
Who This Book Is For
Who This Book Is Not For
Who Wrote the Book
Who Edited the Book
Influences
How This Book Was Written
How to Read This Book
Foreword
Introduction
Merging Business Context with Data Information
The Four Stages of Agile Data Organization
Stage 1. Sources
Stage 2. Lake
Stage 3. Warehouse (Single Source of Truth)
Warehouse
Stage 4. Marts
Mart
STAGE 1 SOURCE aka Siloed Data
Source Stage Overview
Chapter One Starting with Source Data
Common Options for Analyzing Source Data
Get to Know Data with Application Dashboards
Ask Interesting Questions with Excel
Query Databases with an SQLIDE
Visualize Data in the Office with Dashboards
Toying Around with Analytics Products
Look at Data Through a Business Intelligence Tool
Chapter Two The Need to Replicate Source Data
Replicate Sources
Create Read‐Only Access
Chapter Three Source Data Best Practices
Keep a Complexity Wiki Page
Snippet Dictionary
Use a BI Product
Double Check Results
Keep Short Dashboards
Design Before Building
STAGE 2 DATA LAKE aka Data Combined
Lake Stage Overview
Chapter Four Why Build a Data Lake?
What Is a Data Lake?
Reasons to Build a Data Lake Summarized. One Technology
JOINing
Robust Query Writing
Performance
Future Development
Chapter Five Choosing an Engine for the Data Lake
Modern Columnar Warehouse Engines
Columnar Storage Data Lakes
Deciding Factors
Modern Warehouse Engine Products
Amazon Redshift. Pros
Cons
Google BigQuery. Pros
Cons
Snowflake. Pros
Cons
Database Engines
PostgreSQL. Pros
Cons
Recommendation
Chapter Six Extract and Load (EL) Data
ETL versus ELT
EL/ETL Vendors
Extract Options
Complete Extraction
Incremental Extraction
Load Options
Push Changes
Store Separate
Multiple Schemas
Other Extract and Load Routes. Traditional ETL
DIY
Chapter Seven Data Lake Security
Access in Central Place
Permission Tiers
Chapter Eight Data Lake Maintenance
Why SQL?
Data Sources
Adding New Data Sources
Data Source Updates
Fixing Broken Connections
Performance
Optimize Individual Queries
Caching
Create Limits
Scheduling
Upgrade Snippets to Views
STAGE 3 DATA WAREHOUSE aka the Single Source of Truth
Warehouse Stage Overview
ZenHubInterAppForceDB
Agile Warehouse Building with Layers
Chapter Nine The Power of Layers and Views
Make Readable Views
Layer Views on Views
Start with a Single View
Chapter Ten Staging Schemas
Orient to the Schemas
Pick a Table and Clean It
Eliminate Columns
Ensure Table and Column Names Are Understandable
Ensure Values Are Readable
Standardize Values
Extract Fields from Structured Columns (JSON)
Extract Data Using Wildcards
Decide What to Do about Null Fields
Filter Unwanted Rows
Put It All Together
Other Staging Modeling Considerations. NULLs
Ways of Replacing Nulls. String Values
Numbers
Dates
When Not to Replace
Date and Time
SQL Comments
Run the Model
Building on Top of Staging Schemas
Chapter Eleven Model Data with dbt
Version Control
Modularity and Reusability
Package Management
Organizing Files
Macros
Incremental Tables
Testing
Uniqueness
Not Null
Relationships
Accepted Values
Chapter Twelve Deploy Modeling Code
Branch Using Version Control Software
Commit Message
Test Locally
Code Review
Schedule Runs
Chapter Thirteen Implementing the Data Warehouse
Manage Dependencies
Combine Tables Within Schemas
Combine Tables Across Schemas
Keep the Grain Consistent
Create Business Metrics
Adding Calculated Fields to Tables
Avoid Aggregated Metric Tables
Create a Standard Metrics Dashboard
Keeping Accurate History
Snapshots
Keep a Data Log
Chapter Fourteen Managing Data Access
How to Secure Sensitive Data in the Data Warehouse
Read‐Only
Custom User Groups
Hash Sensitive Data
Pseudo‐Anonymize Data
How to Secure Sensitive Data in a BI Tool
Consistent Account Audit/Cleanup
Chapter Fifteen Maintaining the Source of Truth
Track New Metrics
Three Considerations for Adding New Calculated Fields
1. Where It Should Live
2. How to Backfill It
3. What to Call It
Deprecate Old Metrics
Explicitly Deprecate Data
Staging Schemas Make Source Changes Easy
Deprecate Old Schemas
Resolve Conflicting Numbers
Handling Ongoing Requests and Ongoing Feedback
Updating Modeling Code
Manage Access
How to Handle Permissions Correctly
Tuning to Optimize
Add Clusters
Materialize Expensive Views
Code Review All Modeling
Maintenance Checklist
STAGE 4 DATA MARTS aka Data Democratized
Mart Stage Overview
Stage 4: Data Marts
Overview of Building Data Marts
Center Around Business Use Cases
Chapter Sixteen Data Mart Implementation
Views on the Data Warehouse
Segment Tables
Access Update
Chapter Seventeen Data Mart Maintenance
Educate Team
Identifies Issues
Spikes
No Spikes
No New Data
All Queries on a Data Source Are Erroring Out
Slow Queries
Identify New Needs
Help Track Success
Chapter Eighteen Modern versus Traditional Data Stacks: What's Changed? What's Changed?
Storage and Compute Continues to Drop in Price Rapidly
Data Lakes Were Added to the Stack and ELT Replaced ETL
Data Is Increasingly Democratized
Chapter Nineteen Row‐ versus Column‐Oriented Database
Row‐Oriented Databases
Writing to Row Store Databases
Reading from Row Store Databases
Extra Data into Memory
Column‐Oriented Databases
Writing to a Column Store Databases
Reading from a Column Store Database
Coding the Data into More Compact Forms
Compressing the Data
Ordering the Data
Summary
Chapter Twenty Style Guide Example
Simplify
Only Include Fields That Have an Apparent Analytical Purpose
Extract Relevant Data from Complex Data Tpes
Change Flags and Cryptic Abbreviations to Meaningful Values
Denormalize Where Beneficial
Clean
Attempt to Eliminate NULLs
Fix Common Data Inconsistencies
Remove Irrelevant Data
Standardize Data Types
Naming Conventions
Plural Table Names
Singular Column Names
id as Primary Key
Lowercase, Underscored Naming
Share It
Chapter Twenty-One Building an SST Example
First Attempt—Same Tables with Prefixes
What Tables Did We Keep?
What Tables Did We Not Include?
What Columns Did We Keep?
What Was Not Ideal About It?
What Was the Feedback?
Second Attempt—Operational Schema (Source Agnostic)
What Tables Did We Keep?
How Did We Make These Decisions?
What Was Not Ideal About It?
What Was the Feedback?
Third Attempt—Application Separate, Other Sources Smashed
What Tables Did We Keep?
What Columns Did We Keep?
What Was the Feedback?
Less Planning, More Implementing
Acknowledgments and Contributions
Thank‐yous
Index
WILEY END USER LICENSE AGREEMENT
Отрывок из книги
Dave FowlerMatt David
— Dave Fowler
.....
This book aims to provide a broad overview and general guidelines on how to set up a data stack. We intentionally gloss over the details of launching a Redshift instance, writing SQL, or using various BI products. That would clutter the text, repeat what's already on the internet, and make the read quite stale.
The book starts with a quick overview and decision charts about what the stages are and what stage is appropriate for you. This book is structured with a section for each of the four stages, and if you'd like, you can jump ahead to the stage you're at.
.....