Data Lakes For Dummies
Contents
Alan R. Simon. Data Lakes For Dummies
Data Lakes For Dummies®
To view this book's Cheat Sheet, simply go to www.dummies.com and search for “Data Lakes For Dummies Cheat Sheet” in the Search box.
Table of Contents
List of Tables
List of Illustrations
Guide
Pages
Introduction
About This Book
Foolish Assumptions
Icons Used in This Book
Beyond the Book
Where to Go from Here
Getting Started with Data Lakes
Jumping into the Data Lake
What Is a Data Lake?
Rock-solid water
A really great lake
Expanding the data lake
More than just the water
Different types of data
Structured data: Staying in your own lane
Unstructured data: A picture may be worth ten million words
Semi-structured data: Stuck in the middle of the lake
Different water, different data
Refilling the data lake
Everyone visits the data lake
The Data Lake Olympics
The bronze zone
The silver zone
The gold zone
LINKING THE DATA LAKE ZONES TOGETHER
The sandbox
Data Lakes and Big Data
THE THREE (OR FOUR OR FIVE OR MORE) VS OF BIG DATA AND DATA LAKES
The Data Lake Water Gets Murky
BACK TO THE FUTURE WITH NAME CHANGES
Planning Your Day (and the Next Decade) at the Data Lake
Carpe Diem: Seizing the Day with Big Data
Managing Equal Opportunity Data
BACK TO THE FUTURE, PART 2
Building Today’s — and Tomorrow’s — Enterprise Analytical Data Environment
Constructing a bionic data environment
Strengthening the analytics relationship between IT and the business
Reducing Existing Stand-Alone Data Marts
Dealing with the data fragmentation problem
Decision point: Retire, isolate, or incorporate?
Data mart retirement
Data mart isolation
Data mart incorporation
Eliminating Future Stand-Alone Data Marts
Establishing a blockade
Providing a path of least resistance
Establishing a Migration Path for Your Data Warehouses
Sending a faithful data warehouse off to a well-deserved retirement
Resettling a data warehouse into your data lake environment
Aligning Data with Decision Making
Deciding what your organization wants out of analytics
Mapping your analytics needs to your data lake road map
Building the best data pipelines inside your data lake
Addressing future gaps and shortfalls
Speedboats, Canoes, and Lake Cruises: Traversing the Variable-Speed Data Lake
Managing Overall Analytical Costs
Break Out the Life Vests: Tackling Data Lake Challenges
That’s Not a Data Lake, This Is a Data Lake!
Dealing with conflicting definitions and boundaries
Data lake cousins
The cloud database
A nice house at the lake
Data hubs
Data fabric and data mesh
Exposing Data Lake Myths and Misconceptions
Misleading data lake campaign slogans
The single-platform misconception
No upfront data analysis required
The false tale of the tortoise and the data lake
Navigating Your Way through the Storm on the Data Lake
Building the Data Lake of Dreams
DATA DUMP OR DATA SWAMP?
Performing Regular Data Lake Tune-ups — Or Else!
Technology Marches Forward
Building the Docks, Avoiding the Rocks
Imprinting Your Data Lake on a Reference Architecture
Playing Follow the Leader
Guiding Principles of a Data Lake Reference Architecture
A Reference Architecture for Your Data Lake Reference Architecture
Incoming! Filling Your Data Lake
Supporting the Fleet Sailing on Your Data Lake
Objects floating in your data lake
SOME SCOPING FOR ADLS
Mixing it up
The Old Meets the New at the Data Lake
Keeping the shiny parts of the data warehouse
Flooding the data warehouse
Using your data lake as a supersized staging layer
Split-streaming your inbound data along two paths
Which is the bigger breadbox?
Bringing Outside Water into Your Data Lake
Streaming versus batch external data feeds
Ingestion versus as-needed external data access
FISHING IN THE AWS DATA EXCHANGE
Playing at the Edge of the Lake
Anybody Hungry? Ingesting and Storing Raw Data in Your Bronze Zone
Ingesting Data with the Best of Both Worlds
Row, row, row your data, gently down the stream
Supplementing your streaming data with batch data
The gray area between streaming and batch
Joining the Data Ingestion Fraternity
Following the Lambda architecture
Using the Kappa architecture
Storing Data in Your Bronze Zone
Implementing a monolithic bronze zone
Building a multi-component bronze zone
Coordinating your bronze zone with your silver and gold zones
Just Passing Through: The Cross-Zone Express Lane
Taking Inventory at the Data Lake
Bringing Analytics to Your Bronze Zone
Turning your experts loose
Taking inventory in the bronze zone
Getting a leg up on data governance
Your Data Lake’s Water Treatment Plant: The Silver Zone
Funneling Data further into the Data Lake
Sprucing up your raw data
Refining your raw data
Enriching your raw data
Bringing Master Data into Your Data Lake
Impacting the Bronze Zone
Deciding whether to leave a forwarding address
Deciding whether to retain your raw data
Getting Clever with Your Storage Options
Working Hand-in-Hand with Your Gold Zone
Bottling Your Data Lake Water in the Gold Zone
Laser-Focusing on the Purpose of the Gold Zone
Looking Inside the Gold Zone
Object stores
Databases
Persistent streaming data
Specialized data stores
Deciding What Data to Curate in Your Gold Zone
Seeing What Happens When Your Curated Data Becomes Less Useful
Playing in the Sandbox
Developing New Analytical Models in Your Sandbox
Comparing Different Data Lake Architectural Options
Experimenting and Playing Around with Data
Fishing in the Data Lake
Starting with the Latest Guidebook
Setting up role-based data lake access
Setting up usage-style data lake access
Taking It Easy at the Data Lake
Staying in Your Lane
Doing a Little Bit of Exploring
Putting on Your Gear and Diving Underwater
Rowing End-to-End across the Data Lake
Keeping versus Discarding Data Components
Getting Started with Your Data Lake
Shifting Your Focus to Data Ingestion
Breaking through the ingestion congestion
Cranking up the data refinery
Adding to your data pipelines
Finishing Up with the Sandbox
Evaporating the Data Lake into the Cloud
A Cloudy Day at the Data Lake
Rushing to the Cloud
The pendulum swings back and forth
CLOUD DATA LAKES IN THE DISCO ERA (SORT OF)
Dealing with the challenges of on-premises hosting
The case for the cloud
Running through Some Cloud Computing Basics
Public, private, and hybrid clouds
Different “as a service” models
The Big Guys in the Cloud Computing Game
Building Data Lakes in Amazon Web Services
The Elite Eight: Identifying the Essential Amazon Services
Amazon S3
AWS Glue
AWS Lake Formation
Amazon Kinesis Data Streams
Amazon Kinesis Data Firehose
Amazon Athena
Amazon Redshift
Amazon Redshift Spectrum
Looking at the Rest of the Amazon Data Lake Lineup
AWS Lambda
Amazon EMR
Amazon SageMaker
Amazon Aurora
Amazon DynamoDB
Even more AWS databases
WHY SO MANY AWS DATABASES?
Building Data Pipelines in AWS
Building Data Lakes in Microsoft Azure
Setting Up the Big Picture in Azure
The Azure infrastructure
The 50,000-foot view of Azure data lakes
The Magnificent Seven, Azure Style
Azure Data Lake Storage Gen 2
BEWARE THE BLOB!
Azure Data Factory
Azure Databricks
Azure Event Hubs
Azure IoT Hub
Azure Cosmos DB
Azure ML
Filling Out the Azure Data Lake Lineup
Azure Stream Analytics
Microsoft Azure SQL Database
SQL Server Integration Services
Azure Analysis Services
Power BI
Azure HDInsight
Assembling the Building Blocks
General IoT analytics
Predictive maintenance for industrial IoT
DATA LAKES AND BUSINESS PROCESSES
Defect analysis and prevention
Rideshare company forecasting
Cleaning Up the Polluted Data Lake
Figuring Out If You Have a Data Swamp Instead of a Data Lake
Designing Your Report Card and Grading System
Looking at the Raw Data Lockbox
Knowing What to Do When Your Data Lake Is Out of Order
Too Fast, Too Slow, Just Right: Dealing with Data Lake Velocity and Latency
Dividing the Work in Your Component Architecture
Tallying Your Scores and Analyzing the Results
Defining Your Data Lake Remediation Strategy
Setting Your Key Objectives
Going back to square one
Determining your enterprise analytics goals
Doing Your Gap Analysis
Identifying shortfalls and hot spots
Prioritizing issues and shortfalls
Identifying Resolutions
Knowing where your data lake needs to expand
Repairing the data lake boat docks
Linking analytics to data lake improvements
Establishing Timelines
Identifying critical business deadlines
Sequencing your upcoming data lake repairs
Looking for dependency and resource clashes
Defining Your Critical Success Factors
What does “success” mean?
What must be in place to enable success?
Refilling Your Data Lake
The Three S’s: Setting the Stage for Success
Refining and Enriching Existing Raw Data
Starting slowly
Adding more complexity
Making Better Use of Existing Refined Data
Building New Pipelines with Newly Ingested Raw Data
Making Trips to the Data Lake a Tradition
Checking Your GPS: The Data Lake Road Map
Getting an Overhead View of the Road to the Data Lake
Assessing Your Current State of Data and Analytics
Snorkeling through your enterprise analytics
Scoring your analytics continuum
Grading your breadth of data usage
Writing data-driven prescriptions
Receiving your final grades
Diving deep into your data architecture and governance
Scoring your analytical data landscape
Checking off the rules and regulations
Tallying up the score
Putting Together a Lofty Vision
Hot off the presses, straight from the lake: Writing a press release
Designing a slick sales brochure
Polishing the lenses of your data lake vision
Building Your Data Lake Architecture
Conceptual architecture
Implementation architecture
Deciding on Your Kickoff Activities
Expanding Your Data Lake
Booking Future Trips to the Data Lake
Searching for the All-in-One Data Lake
ACID EATS AWAY AT YOUR DATA CHALLENGES
Spreading Artificial Intelligence Smarts throughout Your Data Lake
Lining up your data
Shining a light into your analytics innards
Playing traffic cop
The Part of Tens
Top Ten Reasons to Invest in Building a Data Lake
Supporting the Entire Analytics Continuum
Bringing Order to Your Analytical Data throughout Your Enterprise
Retiring Aging Data Marts
Bringing Unfulfilled Analytics Ideas out of Dry Dock
Laying a Foundation for Future Analytics
Providing a Region for Experimentation
Improving Your Master Data Efforts
Opening Up New Business Possibilities
Keeping Up with the Competition
Getting Your Organization Ready for the Next Big Thing
Ten Places to Get Help for Your Data Lake
Cloud Provider Professional Services
Major Systems Integrators
Smaller Systems Integrators
Individual Consultants
Training Your Internal Staff
Industry Analysts
Data Lake Bloggers
Data Lake Groups and Forums
Data-Oriented Associations
Academic Resources
Ten Differences between a Data Warehouse and a Data Lake
Types of Data Supported
Data Volumes
Different Internal Data Models
Architecture and Topology
ETL versus ELT
Data Latency
Analytical Uses
Incorporating New Data Sources
User Communities
Hosting
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
Y
Z
About the Author
Dedication
Author’s Acknowledgments
WILEY END USER LICENSE AGREEMENT
Excerpt from the book
In December 1995, I wrote an article for Database Programming & Design magazine entitled “I Want a Data Warehouse, So What Is It Again?” A few months later, I began writing Data Warehousing For Dummies (Wiley), building on the article’s content to help readers make sense of first-generation data warehousing.
Fast-forward a quarter of a century, and I could very easily write an article entitled “I Want a Data Lake, So What Is It Again?” This time, I’m cutting right to the chase with Data Lakes For Dummies. To quote a famous former baseball player named Yogi Berra, it’s déjà vu all over again!
.....
The operators of the resort could’ve said, “What the heck, let’s just have a free-for-all out on the lake and hope for the best.” Instead, they wisely established different zones for different purposes, resulting in orderly, peaceful vacations (hopefully!) rather than chaos.
A data lake is also divided into different zones. The exact number of zones may vary from one organization’s data lake to another’s, but you’ll always find at least three zones in use — bronze, silver, and gold — and sometimes a fourth zone, the sandbox.
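If it helps to see the zones written down, here is a minimal Python sketch of the idea. It is an illustration only, not something taken from the book: the one-line zone descriptions paraphrase the chapter titles, and the `zone_path` folder layout, the `missing_zones` helper, and the `s3://example-lake` root are all hypothetical names made up for this example.

```python
from enum import Enum


class Zone(Enum):
    """Data lake zones named in the text; descriptions paraphrase the chapter titles."""
    BRONZE = "raw data as ingested"
    SILVER = "refined and enriched data"
    GOLD = "curated data"
    SANDBOX = "experimentation"  # the optional fourth zone


# The three zones the text says you will always find.
REQUIRED_ZONES = (Zone.BRONZE, Zone.SILVER, Zone.GOLD)


def zone_path(lake_root: str, zone: Zone, dataset: str) -> str:
    """Build a storage path for a dataset inside a zone (hypothetical layout)."""
    return f"{lake_root.rstrip('/')}/{zone.name.lower()}/{dataset}"


def missing_zones(configured: set[Zone]) -> list[Zone]:
    """Return the required zones that a lake configuration lacks, in zone order."""
    return [zone for zone in REQUIRED_ZONES if zone not in configured]
```

For example, `zone_path("s3://example-lake/", Zone.BRONZE, "orders")` builds a path under a `bronze` folder, and `missing_zones({Zone.BRONZE, Zone.SANDBOX})` reports that the silver and gold zones still need to be set up. Keeping the sandbox out of `REQUIRED_ZONES` mirrors the point that it is only sometimes present.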
.....