Data Science
Table of Contents
Field Cady. Data Science
Data Science: The Executive Summary. A Technical Book for Non-Technical Professionals
Copyright
1 Introduction
1.1 Why Managers Need to Know About Data Science
1.2 The New Age of Data Literacy
1.3 Data‐Driven Development
1.4 How to Use this Book
2 The Business Side of Data Science
2.1 What Is Data Science?
2.1.1 What Data Scientists Do
2.1.2 History of Data Science
2.1.3 Data Science Roadmap
2.1.4 Demystifying the Terms: Data Science, Machine Learning, Statistics, and Business Intelligence
2.1.4.1 Machine Learning
2.1.4.2 Statistics
2.1.4.3 Business Intelligence
2.1.5 What Data Scientists Don't (Necessarily) Do
2.1.5.1 Working Without Data
2.1.5.2 Working with Data that Can't Be Interpreted
2.1.5.3 Replacing Subject Matter Experts
2.1.5.4 Designing Mathematical Algorithms
2.2 Data Science in an Organization
2.2.1 Types of Value Added
2.2.1.1 Business Insights
2.2.1.2 Intelligent Products
2.2.1.3 Building Analytics Frameworks
2.2.1.4 Offline Batch Analytics
2.2.2 One‐Person Shops and Data Science Teams
2.2.3 Related Job Roles
2.2.3.1 Data Engineer
2.2.3.2 Data Analyst
2.2.3.3 Software Engineer
2.3 Hiring Data Scientists
2.3.1 Do I Even Need Data Science?
2.3.2 The Simplest Option: Citizen Data Scientists
2.3.3 The Harder Option: Dedicated Data Scientists
2.3.4 Programming, Algorithmic Thinking, and Code Quality
2.3.5 Hiring Checklist
2.3.6 Data Science Salaries
2.3.7 Bad Hires and Red Flags
2.3.8 Advice with Data Science Consultants
2.4 Management Failure Cases
2.4.1 Using Them as Devs
2.4.2 Inadequate Data
2.4.3 Using Them as Graph Monkeys
2.4.4 Nebulous Questions
2.4.5 Laundry Lists of Questions Without Prioritization
Glossary
3 Working with Modern Data
3.1 Unstructured Data and Passive Collection
3.2 Data Types and Sources
3.3 Data Formats
3.3.1 CSV Files
3.3.2 JSON Files
3.3.3 XML and HTML
3.4 Databases
3.4.1 Relational Databases and Document Stores
3.4.2 Database Operations
3.5 Data Analytics Software Architectures
3.5.1 Shared Storage
3.5.2 Shared Relational Database
3.5.3 Document Store + Analytics RDB
3.5.4 Storage + Parallel Processing
Glossary
Notes
4 Telling the Story, Summarizing Data
4.1 Choosing What to Measure
4.2 Outliers, Visualizations, and the Limits of Summary Statistics: A Picture Is Worth a Thousand Numbers
4.3 Experiments, Correlation, and Causality
4.4 Summarizing One Number
4.5 Key Properties to Assess: Central Tendency, Spread, and Heavy Tails
4.5.1 Measuring Central Tendency
4.5.1.1 Mean
4.5.1.2 Median
4.5.1.3 Mode
4.5.2 Measuring Spread
4.5.2.1 Standard Deviation
4.5.2.2 Percentiles
4.5.3 Advanced Material: Managing Heavy Tails
4.6 Summarizing Two Numbers: Correlations and Scatterplots
4.6.1 Correlations
4.6.1.1 Pearson Correlation
4.6.1.2 Ordinal Correlations
4.6.2 Mutual Information
4.7 Advanced Material: Fitting a Line or Curve
4.7.1 Effects of Outliers
4.7.2 Optimization and Choosing Cost Functions
4.8 Statistics: How to Not Fool Yourself
4.8.1 The Central Concept: The p‐Value
4.8.2 Reality Check: Picking a Null Hypothesis and Modeling Assumptions
4.8.3 Advanced Material: Parameter Estimation and Confidence Intervals
4.8.4 Advanced Material: Statistical Tests Worth Knowing
4.8.4.1 χ²‐Test
4.8.4.2 T‐test
4.8.4.3 Fisher's Exact Test
4.8.4.4 Multiple Hypothesis Testing
4.8.5 Bayesian Statistics
4.9 Advanced Material: Probability Distributions Worth Knowing
4.9.1 Probability Distributions: Discrete and Continuous
4.9.2 Flipping Coins: Bernoulli Distribution
4.9.3 Adding Coin Flips: Binomial Distribution
4.9.4 Throwing Darts: Uniform Distribution
4.9.5 Bell‐Shaped Curves: Normal Distribution
4.9.6 Heavy Tails 101: Log‐Normal Distribution
4.9.7 Waiting Around: Exponential Distribution and the Geometric Distribution
4.9.8 Time to Failure: Weibull Distribution
4.9.9 Counting Events: Poisson Distribution
Glossary
5 Machine Learning
5.1 Supervised Learning, Unsupervised Learning, and Binary Classifiers
5.1.1 Reality Check: Getting Labeled Data and Assuming Independence
5.1.2 Feature Extraction and the Limitations of Machine Learning
5.1.3 Overfitting
5.1.4 Cross‐Validation Strategies
5.2 Measuring Performance
5.2.1 Confusion Matrices
5.2.2 ROC Curves
5.2.3 Area Under the ROC Curve
5.2.4 Selecting Classification Cutoffs
5.2.5 Other Performance Metrics
5.2.6 Lift Curves
5.3 Advanced Material: Important Classifiers
5.3.1 Decision Trees
5.3.2 Random Forests
5.3.3 Ensemble Classifiers
5.3.4 Support Vector Machines
5.3.5 Logistic Regression
5.3.6 Lasso Regression
5.3.7 Naive Bayes
5.3.8 Neural Nets
5.4 Structure of the Data: Unsupervised Learning
5.4.1 The Curse of Dimensionality
5.4.2 Principal Component Analysis and Factor Analysis
5.4.2.1 Scree Plots and Understanding Dimensionality
5.4.2.2 Factor Analysis
5.4.2.3 Limitations of PCA
5.4.3 Clustering
5.4.3.1 Real‐World Assessment of Clusters
5.4.3.2 k‐means Clustering
5.4.3.3 Advanced Material: Other Clustering Algorithms
Gaussian Mixture Models
Agglomerative Clustering
5.4.3.4 Advanced Material: Evaluating Cluster Quality
Silhouette Score
Rand Index and Adjusted Rand Index
Mutual Information
5.5 Learning as You Go: Reinforcement Learning
5.5.1 Multi‐Armed Bandits and ε‐Greedy Algorithms
5.5.2 Markov Decision Processes and Q‐Learning
Glossary
6 Knowing the Tools
6.1 A Note on Learning to Code
6.2 Cheat Sheet
6.3 Parts of the Data Science Ecosystem
6.3.1 Scripting Languages
6.3.2 Technical Computing Languages
6.3.2.1 Python's Technical Computing Stack
6.3.2.2 R
6.3.2.3 Matlab and Octave
6.3.2.4 Mathematica
6.3.2.5 SAS
6.3.2.6 Julia
6.3.3 Visualization
6.3.3.1 Tableau
6.3.3.2 Excel
6.3.3.3 D3.js
6.3.4 Databases
6.3.5 Big Data
6.3.5.1 Types of Big Data Technologies
6.3.5.2 Spark
6.3.6 Advanced Material: The Map‐Reduce Paradigm
6.4 Advanced Material: Database Query Crash Course
6.4.1 Basic Queries
6.4.2 Groups and Aggregations
6.4.3 Joins
6.4.4 Nesting Queries
Glossary
7 Deep Learning and Artificial Intelligence
7.1 Overview of AI
7.1.1 Don't Fear the Skynet: Strong and Weak AI
7.1.2 System 1 and System 2
7.2 Neural Networks
7.2.1 What Neural Nets Can and Can't Do
7.2.2 Enough Boilerplate: What's a Neural Net?
7.2.3 Convolutional Neural Nets
7.2.4 Advanced Material: Training Neural Networks
7.2.4.1 Manual Versus Automatic Feature Extraction
7.2.4.2 Dataset Sizes and Data Augmentation
7.2.4.3 Batches and Epochs
7.2.4.4 Transfer Learning
7.2.4.5 Feature Extraction
7.2.4.6 Word Embeddings
7.3 Natural Language Processing
7.3.1 The Great Divide: Language Versus Statistics
7.3.2 Save Yourself Some Trouble: Consider Regular Expressions
7.3.3 Software and Datasets
7.3.4 Key Issue: Vectorization
7.3.5 Bag‐of‐Words
7.4 Knowledge Bases and Graphs
Glossary
Postscript
Index
WILEY END USER LICENSE AGREEMENT
Excerpt from the Book
Field Cady
And for my son Cyrus, who entered shortly thereafter.
.....
So where is all of this leading? Cutting out hyperbole and speculation, what does it look like for an organization to make full use of modern data technologies and what are the benefits? The goal that we are pushing toward is what I call “data‐driven development” (DDD). In an organization that uses DDD, all stages in a business process have their data gathered, modeled, and deployed to enable better decision making. Overall business goals and workflows are crafted by human experts, but after that every part of the system can be monitored and optimized, hypotheses can be tested rigorously and retroactively, and large‐scale trends can be identified and capitalized on. Data greases the wheels of all parts of the operation and provides a constant pulse on what's happening on the ground.
I break the benefits of DDD into three major categories:
.....