Data Science For Dummies - Lillian Pierson

Dealing with data variety

Big data gets even more complicated when you add unstructured and semistructured data to structured data sources. This high-variety data comes from a multitude of sources, and its defining trait is that it combines datasets with differing underlying structures (structured, unstructured, or semistructured). Heterogeneous, high-variety data is often composed of any combination of graph data, JSON files, XML files, social media data, structured tabular data, weblog data, and click-streams (data generated from user clicks on a web page).

Structured data can be stored, processed, and manipulated in a traditional relational database management system (RDBMS). An example is a PostgreSQL database, which uses a tabular schema of rows and columns that makes it easier to identify specific values within the stored data. Structured data can be generated by humans or machines and is derived from all sorts of sources, from click-streams and web-based forms to point-of-sale transactions and sensors.

Unstructured data has no predefined schema. It’s commonly generated from human activities and doesn’t fit into a structured database format. Such data can be derived from blog posts, emails, and Word documents.

Semistructured data doesn’t fit into a structured database system either, but it is nonetheless organized by tags that create a form of order and hierarchy in the data. It’s commonly found in databases and file systems, and it can be stored as log files, XML files, or JSON data files.
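To make the three categories concrete, here is a minimal Python sketch using only the standard library. The table, field names, and sample values are invented for illustration, and an in-memory SQLite database stands in for a full RDBMS such as PostgreSQL.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET

# Structured: fixed schema of rows and columns.
# (In-memory SQLite is an assumption here, standing in for an RDBMS.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (item, amount) VALUES (?, ?)",
    [("coffee", 3.5), ("bagel", 2.25)],
)
structured_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Semistructured: keys and tags give order and hierarchy, but no table schema.
json_record = json.loads(
    '{"user": "u1", "clicks": [{"page": "/home"}, {"page": "/cart"}]}'
)
pages_from_json = [click["page"] for click in json_record["clicks"]]

xml_record = ET.fromstring(
    '<user id="u1"><click page="/home"/><click page="/cart"/></user>'
)
pages_from_xml = [click.get("page") for click in xml_record.findall("click")]

# Unstructured: free text with no columns or tags to query directly.
email_body = "Hi team, the coffee order arrived late again. Can we switch suppliers?"
word_count = len(email_body.split())

print(structured_total, pages_from_json, pages_from_xml, word_count)
```

Notice that the structured data answers a question with a single query, the semistructured data requires walking its tags or keys, and the unstructured text offers nothing to select on until you process it further.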

Become familiar with the term data lake. Practitioners in the big data industry use it to refer to a nonhierarchical data storage system that holds huge volumes of multistructured, raw data within a flat storage architecture: in other words, a collection of records that are kept in their raw form and are not cross-referenced in any way. The Hadoop Distributed File System (HDFS) can be used as a data lake storage repository, but you can also use Amazon Simple Storage Service (Amazon S3), part of Amazon Web Services (AWS), or a similar cloud storage solution to meet the same requirements on the cloud. (Amazon S3 is one of the more popular cloud services for storing big data.)
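The idea of flat storage can be sketched with a toy, in-memory object store in Python. The `ToyDataLake` class and its `put`, `get`, and `keys` methods are hypothetical names made up for this illustration; they are not the HDFS or Amazon S3 API, and the sample keys and payloads are invented.

```python
import json


class ToyDataLake:
    """Hypothetical in-memory stand-in for a flat object store."""

    def __init__(self):
        # One flat namespace: every object is just key -> raw bytes.
        self._objects = {}

    def put(self, key, raw):
        self._objects[key] = raw

    def get(self, key):
        return self._objects[key]

    def keys(self, prefix=""):
        # Slashes in a key are ordinary characters; a "folder" is only
        # a shared prefix, not a real hierarchy.
        return sorted(k for k in self._objects if k.startswith(prefix))


lake = ToyDataLake()
# Raw data of different structures sits side by side, unconverted.
lake.put("raw/clicks/2024-01-01.json", b'{"user": "u1", "page": "/home"}')
lake.put("raw/logs/web-01.log", b"GET /home 200\nGET /cart 404\n")
lake.put("raw/sales/day1.csv", b"item,amount\ncoffee,3.5\n")

# Structure is applied only when a consumer reads an object.
click = json.loads(lake.get("raw/clicks/2024-01-01.json"))
log_keys = lake.keys("raw/logs/")
print(click["page"], log_keys)
```

The store itself knows nothing about JSON, CSV, or log formats, and no object refers to any other. Interpretation is left to whoever reads the data.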

Although both data lakes and data warehouses are used for storing data, the terms refer to different types of systems. A data lake, as defined above, holds raw data of any structure, whereas a data warehouse is a centralized data repository that you can use to store and access only structured data. A related, more narrowly scoped system commonly employed in business intelligence solutions is the data mart: a storage system (for structured data) that holds one particular focus area of data, belonging to only one line of business in the company.
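The relationship between a warehouse and a data mart can be sketched in a few lines of Python. This is a simplification under stated assumptions: an in-memory SQLite database plays the warehouse, a SQL view plays the data mart (in practice a data mart is often a separate database), and the table, column names, and figures are invented.

```python
import sqlite3

# The "warehouse": one central, structured store covering every line of business.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (line_of_business TEXT, region TEXT, revenue REAL)"
)
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("retail", "east", 100.0),
        ("retail", "west", 150.0),
        ("wholesale", "east", 900.0),
    ],
)

# The "data mart": only the slice that one line of business needs.
warehouse.execute(
    "CREATE VIEW retail_mart AS "
    "SELECT region, revenue FROM sales WHERE line_of_business = 'retail'"
)

mart_rows = warehouse.execute(
    "SELECT region, revenue FROM retail_mart ORDER BY region"
).fetchall()
warehouse_total = warehouse.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
mart_total = warehouse.execute("SELECT SUM(revenue) FROM retail_mart").fetchone()[0]
print(mart_rows, warehouse_total, mart_total)
```

Analysts in the retail group query only `retail_mart` and never see wholesale rows, which is the narrowing of focus that distinguishes a data mart from the warehouse as a whole.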
