
Stage 2. Lake

Once a company must run analyses on multiple sources of data, each of which needs joining, filtering, and manipulation, it must move to a data lake. Blended data sources enable several actors in an organization to query a large subset of the company's complete data. In turn, funneling various sources into a data lake supports database performance at a reasonably large (not necessarily “big data”) scale.

A central motivation for a data lake lies in the need to pipe data to business intelligence tools. For example, when working with data from Salesforce, Hubspot, Jira, and Zendesk, each service has its own in‐app dashboards and unique data application programming interfaces (APIs). Configuring input data streams for each business tool separately is a confusing, time‐consuming, and unsustainable workflow. It cannot really be done, especially at scale. Likewise, performing in‐house analyses across various sources can wildly complicate otherwise simple queries. On the other hand, having a data lake, which holds all relevant data in one place, allows analysts to use straightforward SQL queries to obtain business insights.
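To make that contrast concrete, here is a minimal sketch of the kind of cross‐source question a lake makes answerable with one query. The schema and column names (salesforce.accounts, zendesk.tickets, and so on) are hypothetical illustrations, not definitions from the book; the point is simply that once both sources land in one database, a single SQL join replaces two disconnected in‐app dashboards.

    -- Hypothetical lake schemas: one schema per loaded source.
    -- Question: do accounts that file more support tickets win fewer deals?
    SELECT
        a.account_name,
        COUNT(DISTINCT o.opportunity_id) AS deals_won,
        COUNT(DISTINCT t.ticket_id)      AS support_tickets
    FROM salesforce.accounts a
    LEFT JOIN salesforce.opportunities o
           ON o.account_id = a.account_id
          AND o.stage = 'Closed Won'
    LEFT JOIN zendesk.tickets t
           ON t.external_account_id = a.account_id
    GROUP BY a.account_name
    ORDER BY support_tickets DESC;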

The central challenge for companies in the lake stage is knowing what toolset and methodology will unify and (safely) store their data. Companies looking to combine their data also run into performance issues, to which we offer solutions. And perhaps most important of all, the architecture chosen during lake development determines how easy (or hard) it will be to build the future data warehouse.

This stage is right if:

 There's a need for unique or combined charts/dashboards for cloud application sources like Salesforce.

 A core set of people can learn the ins and outs of the structure of the messy data.

 You're intimidated by data modeling. (Don't be: that's why this book exists.)

 There's no time for even light data modeling.

 Large datasets need performant queries.

It's time for the next stage if:

 More than a few people are going to be working with this dataset.

 A clean source of truth would eliminate integrity issues.

 There's a need to separate the structure of the data from the always‐changing transactional sources.

 There's a need to adopt DRY (Don't Repeat Yourself) principles.

Modeling

Within a database, data often requires transformation so that it is more usable by different people or systems. Modeling refers to this process of applying those transformations to the data.
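As a minimal, hypothetical illustration of what such a transformation can look like in SQL, the view below reshapes a raw source table into a cleaner, analysis‐ready form. The table and column names are assumptions made for this sketch, not definitions from the book.

    -- Hypothetical model: a cleaned-up view defined over a raw source table.
    CREATE VIEW analytics.daily_signups AS
    SELECT
        CAST(created_at AS DATE) AS signup_date,  -- truncate timestamp to a date
        LOWER(TRIM(email))       AS email,        -- normalize messy text input
        COALESCE(plan, 'free')   AS plan          -- fill missing values with a default
    FROM raw_app.users
    WHERE deleted_at IS NULL;                     -- hide soft-deleted rows from analysts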

DRY

An acronym for a software design ethos that avoids repetitive patterns through modular design.
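In SQL terms, DRY usually means defining a piece of logic once and reusing it, rather than copying it into every query. The sketch below is a hypothetical example: the definition of an “active customer” lives in one view, so downstream queries reference it instead of restating its conditions.

    -- Hypothetical DRY example: define "active customer" exactly once.
    CREATE VIEW analytics.active_customers AS
    SELECT customer_id, name, plan
    FROM raw_app.customers
    WHERE cancelled_at IS NULL
      AND plan <> 'free';

    -- Downstream queries reuse the shared definition instead of repeating the WHERE clause.
    SELECT plan, COUNT(*) AS customers
    FROM analytics.active_customers
    GROUP BY plan;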
