Official Google Cloud Certified Professional Data Engineer Study Guide
Table of Contents
Dan Sullivan. Official Google Cloud Certified Professional Data Engineer Study Guide
Acknowledgments
About the Author
About the Technical Editor
CONTENTS
List of Tables
List of Illustrations
Introduction
What Does This Book Cover?
Interactive Online Learning Environment and TestBank
Additional Resources
Objective Map
Assessment Test
Answers to Assessment Test
Chapter 1 Selecting Appropriate Storage Technologies
From Business Requirements to Storage Systems
Ingest
Application Data
Streaming Data
Batch Data
Store
Data Access Patterns
Access Controls
Time to Store
Process and Analyze
Data Transformations
Data Analysis
Explore and Visualize
Technical Aspects of Data: Volume, Velocity, Variation, Access, and Security
Volume
Velocity
Variation in Structure
Data Access Patterns
Security Requirements
Types of Structure: Structured, Semi-Structured, and Unstructured
Structured: Transactional vs. Analytical
Semi-Structured: Fully Indexed vs. Row Key Access
Fully Indexed, Semi-Structured Data
Row Key Access
Unstructured Data
Google’s Storage Decision Tree
Schema Design Considerations
Relational Database Design
OLTP
OLAP
NoSQL Database Design
Key-Value Data Stores
Document Databases
Wide-Column Databases
Graph Databases
Exam Essentials
Review Questions
Chapter 2 Building and Operationalizing Storage Systems
Cloud SQL
Configuring Cloud SQL
Improving Read Performance with Read Replicas
Importing and Exporting Data
Cloud Spanner
Configuring Cloud Spanner
Replication in Cloud Spanner
Database Design Considerations
Importing and Exporting Data
Cloud Bigtable
Configuring Bigtable
Database Design Considerations
Importing and Exporting
Cloud Firestore
Cloud Firestore Data Model
Indexing and Querying
Importing and Exporting
BigQuery
BigQuery Datasets
Loading and Exporting Data
Clustering, Partitioning, and Sharding Tables
Streaming Inserts
Monitoring and Logging in BigQuery
BigQuery Cost Considerations
Tips for Optimizing BigQuery
Cloud Memorystore
Cloud Storage
Organizing Objects in a Namespace
Storage Tiers
Cloud Storage Use Cases
Data Retention and Lifecycle Management
Unmanaged Databases
Exam Essentials
Review Questions
Chapter 3 Designing Data Pipelines
Overview of Data Pipelines
Data Pipeline Stages
Ingestion
Transformation
Storage
Analysis
Types of Data Pipelines
Data Warehousing Pipelines
Extract, Transformation, and Load
Extract, Load, and Transformation
Extraction and Load
Change Data Capture
Stream Processing Pipelines
Event Time and Processing Time
Sliding and Tumbling Windows
Late Arriving and Watermarks
Hot Path and Cold Path Ingestion
Machine Learning Pipelines
GCP Pipeline Components
Cloud Pub/Sub
Working with Messaging Queues
Open Source Alternative: Kafka
Cloud Dataflow
Cloud Dataflow Concepts
Jobs and Templates
Cloud Dataproc
Managing Data in Cloud Dataproc
Configuring a Cloud Dataproc Cluster
Submitting a Job
Cloud Composer
Migrating Hadoop and Spark to GCP
Exam Essentials
Review Questions
Chapter 4 Designing a Data Processing Solution
Designing Infrastructure
Choosing Infrastructure
Compute Engine
Kubernetes Engine
App Engine
Cloud Functions
Availability, Reliability, and Scalability of Infrastructure
Making Compute Resources Available, Reliable, and Scalable
Compute Engine
Kubernetes Engine
App Engine and Cloud Functions
Making Storage Resources Available, Reliable, and Scalable
Making Network Resources Available, Reliable, and Scalable
Hybrid Cloud and Edge Computing
Analytics Hybrid Cloud
Edge Cloud
Designing for Distributed Processing
Distributed Processing: Messaging
Message Brokers
Message Queues
Event Processing Models
Distributed Processing: Services
Service-Oriented Architectures
Microservices
Serverless Functions
Migrating a Data Warehouse
Assessing the Current State of a Data Warehouse
Technical Requirements
Business Benefits
Designing the Future State of a Data Warehouse
Migrating Data, Jobs, and Access Controls
Validating the Data Warehouse
Exam Essentials
Review Questions
Chapter 5 Building and Operationalizing Processing Infrastructure
Provisioning and Adjusting Processing Resources
Provisioning and Adjusting Compute Engine
Provisioning Single VM Instances
Provisioning Managed Instance Groups
Adjusting Compute Engine Resources to Meet Demand
Provisioning and Adjusting Kubernetes Engine
Overview of Kubernetes Architecture
Provisioning a Kubernetes Engine Cluster
Adjusting Kubernetes Engine Resources to Meet Demand
Autoscaling Applications in Kubernetes Engine
Autoscaling Clusters in Kubernetes Engine
Kubernetes YAML Configurations
Provisioning and Adjusting Cloud Bigtable
Provisioning Bigtable Instances
Replication in Bigtable
Provisioning and Adjusting Cloud Dataproc
Configuring Cloud Dataflow
Configuring Managed Serverless Processing Services
Configuring App Engine
Configuring Cloud Functions
Monitoring Processing Resources
Stackdriver Monitoring
Stackdriver Logging
Stackdriver Trace
Exam Essentials
Review Questions
Chapter 6 Designing for Security and Compliance
Identity and Access Management with Cloud IAM
Predefined Roles
Custom Roles
Using Roles with Service Accounts
Access Control with Policies
Using IAM with Storage and Processing Services
Cloud Storage and IAM
Cloud Bigtable and IAM
BigQuery and IAM
Cloud Dataflow and IAM
Data Security
Encryption
Encryption at Rest
Encryption in Transit
Key Management
Default Key Management
Customer-Managed Encryption Keys
Customer-Supplied Encryption Keys
Ensuring Privacy with the Data Loss Prevention API
Detecting Sensitive Data
Running Data Loss Prevention Jobs
Inspection Best Practices
Legal Compliance
Health Insurance Portability and Accountability Act (HIPAA)
Children’s Online Privacy Protection Act
FedRAMP
General Data Protection Regulation
Exam Essentials
Review Questions
Chapter 7 Designing Databases for Reliability, Scalability, and Availability
Designing Cloud Bigtable Databases for Scalability and Reliability
Data Modeling with Cloud Bigtable
Designing Row-keys
Row-key Design Best Practices
Antipatterns for Row-key Design
Key Visualizer
Designing for Time Series
Use Replication for Availability and Scalability
Designing Cloud Spanner Databases for Scalability and Reliability
Relational Database Features
Interleaved Tables
Primary Keys and Hotspots
Database Splits
Secondary Indexes
Query Best Practices
Use Query Parameters
Use EXPLAIN PLAN to Understand Execution Plans
Avoid Long Locks
Designing BigQuery Databases for Data Warehousing
Schema Design for Data Warehousing
Types of Analytical Datastores
Projects, Datasets, and Tables
Clustered and Partitioned Tables
Partitioning
Clustering
Querying Data in BigQuery
External Data Access
Querying Cloud Bigtable Data from BigQuery
Querying Cloud Storage Data from BigQuery
Querying Google Drive Data from BigQuery
BigQuery ML
Exam Essentials
Review Questions
Chapter 8 Understanding Data Operations for Flexibility and Portability
Cataloging and Discovery with Data Catalog
Searching in Data Catalog
Tagging in Data Catalog
Data Preprocessing with Dataprep
Cleansing Data
Discovering Data
Enriching Data
Importing and Exporting Data
Structuring and Validating Data
Visualizing with Data Studio
Connecting to Data Sources
Visualizing Data
Sharing Data
Exploring Data with Cloud Datalab
Jupyter Notebooks
Managing Cloud Datalab Instances
Adding Libraries to Cloud Datalab Instances
Orchestrating Workflows with Cloud Composer
Airflow Environments
Creating DAGs
Airflow Logs
Exam Essentials
Review Questions
Chapter 9 Deploying Machine Learning Pipelines
Structure of ML Pipelines
Data Ingestion
Batch Data Ingestion
Streaming Data Ingestion
Data Preparation
Data Exploration
Data Transformation
Feature Engineering
Data Segregation
Training Data
Validation Data
Test Data
Model Training
Feature Selection
Underfitting, Overfitting, and Regularization
Model Evaluation
Individual Evaluation Metrics
K-Fold Cross Validation
Confusion Matrices
Bias and Variance
Model Deployment
Model Monitoring
GCP Options for Deploying Machine Learning Pipeline
Cloud AutoML
BigQuery ML
Kubeflow
Spark Machine Learning
Exam Essentials
Review Questions
Chapter 10 Choosing Training and Serving Infrastructure
Hardware Accelerators
Graphics Processing Units
Tensor Processing Units
Choosing Between CPUs, GPUs, and TPUs
Distributed and Single Machine Infrastructure
Single Machine Model Training
Distributed Model Training
Serving Models
Edge Computing with GCP
Edge Computing Overview
Edge Computing Components and Processes
Edge TPU
Cloud IoT
Exam Essentials
Review Questions
Chapter 11 Measuring, Monitoring, and Troubleshooting Machine Learning Models
Three Types of Machine Learning Algorithms
Supervised Learning
Classification
Regression
Unsupervised Learning
Anomaly Detection
Reinforcement Learning
Deep Learning
Engineering Machine Learning Models
Model Training and Evaluation
Data Collection and Preparation
Feature Engineering
Training Models
Evaluating Models
Accuracy
Precision
Recall
F1 Score
Operationalizing ML Models
Deploying Models
Model Serving
Monitoring
Retraining
Common Sources of Error in Machine Learning Models
Data Quality
Unbalanced Training Sets
Types of Bias
Exam Essentials
Review Questions
Chapter 12 Leveraging Prebuilt Models as a Service
Sight
Vision AI
Video AI
Conversation
Dialogflow
Cloud Text-to-Speech API
Cloud Speech-to-Text API
Language
Translation
Natural Language
Structured Data
Recommendations AI API
Cloud Inference API
Exam Essentials
Review Questions
Appendix Answers to Review Questions
Chapter 1: Selecting Appropriate Storage Technologies
Chapter 2: Building and Operationalizing Storage Systems
Chapter 3: Designing Data Pipelines
Chapter 4: Designing a Data Processing Solution
Chapter 5: Building and Operationalizing Processing Infrastructure
Chapter 6: Designing for Security and Compliance
Chapter 7: Designing Databases for Reliability, Scalability, and Availability
Chapter 8: Understanding Data Operations for Flexibility and Portability
Chapter 9: Deploying Machine Learning Pipelines
Chapter 10: Choosing Training and Serving Infrastructure
Chapter 11: Measuring, Monitoring, and Troubleshooting Machine Learning Models
Chapter 12: Leveraging Prebuilt Models as a Service
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W–Z
Online Test Bank
WILEY END USER LICENSE AGREEMENT
Excerpt from the Book
Dan Sullivan
Carole Jelen, vice president of Waterside Productions, and Jim Minatel, associate publisher at John Wiley & Sons, continue to lead the effort to create Google Cloud certification guides. It was a pleasure to work with Gary Schwartz, project editor, who managed the process that got us from outline to a finished manuscript. Thanks to Christine O’Connor, senior production editor, for making the last stages of book development go as smoothly as they did.
.....
Storage systems differ in the granularity of their access controls. Cloud Storage, for example, supports access controls at the bucket level and the object level, but no finer: anyone who has access to a file in Cloud Storage has access to all the data in that file. If some users should have access only to a subset of a dataset, the data can instead be stored in a relational database, with a view created that includes only the data those users are allowed to access.
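The view-based approach can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module only so that it runs anywhere; the `orders` table, the `emea_orders` view, and the sample rows are hypothetical and are not taken from the book. SQLite has no user permissions, so the sketch shows only what the view exposes. In a managed relational database, you would additionally grant users access to the view and not to the underlying table.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a base table and a restricted view."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical base table holding every region plus a sensitive column.
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, region TEXT, "
        "customer_email TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [
            (1, "EMEA", "a@example.com", 120.0),
            (2, "APAC", "b@example.com", 75.5),
            (3, "EMEA", "c@example.com", 42.0),
        ],
    )
    # The view exposes only EMEA rows and omits the customer_email column.
    conn.execute(
        "CREATE VIEW emea_orders AS "
        "SELECT order_id, amount FROM orders WHERE region = 'EMEA'"
    )
    return conn


def rows_visible_through_view(conn: sqlite3.Connection) -> list:
    """Return the rows a user limited to the view would be able to read."""
    cursor = conn.execute(
        "SELECT order_id, amount FROM emea_orders ORDER BY order_id"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    print(rows_visible_through_view(connection))
```

A user who can query only `emea_orders` sees two of the three rows and never sees `customer_email`, which is the kind of subset access that object-level controls on a file cannot express.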
Encryption at rest is an important requirement for many use cases; fortunately, all Google Cloud storage services encrypt data at rest by default.
.....