Читать книгу Official Google Cloud Certified Professional Data Engineer Study Guide - Dan Sullivan - Страница 30

Data Analysis

Оглавление

In the analyze stage, a variety of techniques may be used to extract useful information from data. Statistical techniques are often used with numeric data to do the following:

 Describe characteristics of a dataset, such as a mean and standard deviation of the dataset.

 Generate histograms to understand the distribution of values of an attribute.

 Find correlations between variables, such as customer type and average revenue per sales order.

 Make predictions using regression models, which allow you to estimate one attribute based on the value of another. In statistical terms, regression models generate predictions of a dependent variable based on the value of an independent variable.

 Cluster subsets of a dataset into groups of similar entities. For example, a retail sales dataset may yield groups of customers who purchase similar types of products and spend similar amounts over time.

Text data can be analyzed as well using a variety of techniques. A simple example is counting the occurrences of each word in a text. A more complex example is extracting entities, such as names of persons, businesses, and locations, from a document.

Cloud Dataflow, Cloud Dataproc, BigQuery, and Cloud ML Engine are all useful for data analysis.

Official Google Cloud Certified Professional Data Engineer Study Guide

Подняться наверх