Data Science For Dummies, by Lillian Pierson

Collecting, querying, and consuming data

Data engineers have the job of capturing and collating large volumes of structured, unstructured, and semistructured big data. (Big data is an outdated term for data that exceeds the processing capacity of conventional database systems because it’s too big, it moves too fast, or it lacks the structural requirements of traditional database architectures.) Again, data engineering tasks are separate from the work that’s performed in data science, which focuses more on analysis, prediction, and visualization. Despite this distinction, whenever data scientists collect, query, and consume data during the analysis process, they perform work similar to that of the data engineer (the role I told you about earlier in this chapter).

Although valuable insights can be generated from a single data source, often the combination of several relevant sources delivers the contextual information required to drive better data-informed decisions. A data scientist can work from several datasets that are stored in a single database, or even in several different data storage environments. At other times, source data is stored and processed on a cloud-based platform built by software and data engineers.

No matter how the data is combined or where it’s stored, if you’re a data scientist, you almost always have to query data, which means writing commands that extract relevant datasets from data storage systems. Most of the time, you use Structured Query Language (SQL) to query data. (Chapter 7 is all about SQL, so if the acronym scares you, jump ahead to that chapter now.)
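To give you a feel for what querying looks like, here’s a minimal sketch that uses Python’s built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are all made up for illustration; they stand in for whatever data storage system you actually work with. The query combines the two tables and totals the order amounts for each region.

```python
import sqlite3

# Hypothetical in-memory database; table names, columns, and rows
# are illustrative assumptions, not taken from any real system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "East"), (2, "West")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 30.0), (2, 15.0)],
)

# Join the two tables and total the order amounts per region.
query = """
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
conn.close()

for region, total in rows:
    print(region, total)
```

The same SELECT, JOIN, and GROUP BY pattern carries over to most relational databases, though the way you connect to each one differs.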

Whether you’re using a third-party application or doing custom analyses by using a programming language such as R or Python, you can choose from a number of universally accepted file formats:

Comma-separated values (CSV): Almost every brand of desktop and web-based analysis application accepts this file type, as do commonly used scripting languages such as Python and R.

Script: Most data scientists know how to use either the Python or R programming language to analyze and visualize data. These script files end with the extension .py or .ipynb (Python) or .r (R).

Application: Excel is useful for quick-and-easy, spot-check analyses on small- to medium-size datasets. These application files have the .xls or .xlsx extension.

Web programming: If you’re building custom, web-based data visualizations, you may be working in D3.js (short for Data-Driven Documents), a JavaScript library for data visualization. When you work in D3.js, you use data to manipulate web-based documents using .html, .svg, and .css files.
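As a quick illustration of the first two formats working together, here’s a small sketch of a Python script that consumes CSV data by using the standard csv module. The column names and values are invented for this example, and the data is read from a string so that the script runs on its own; in practice, you’d open a .csv file instead.

```python
import csv
import io

# Hypothetical CSV content; a real script would open a .csv file instead.
raw = "city,temperature_c\nOslo,4.5\nCairo,29.0\n"

# DictReader uses the header row to label the fields in each record.
reader = csv.DictReader(io.StringIO(raw))
records = [
    {"city": row["city"], "temperature_c": float(row["temperature_c"])}
    for row in reader
]

# A simple spot check: find the record with the highest temperature.
warmest = max(records, key=lambda record: record["temperature_c"])
print(warmest["city"])
```

Because CSV stores everything as text, the script converts the temperature column to a number before comparing values.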
