Data Lakes For Dummies - Alan R. Simon

Different water, different data

A common misconception is that you store “all your data” in your data lake. Actually, you store all or most of your analytic data in a data lake. Analytic data is, as you may suspect from the name, data that you’re using for analytics. In contrast, you use operational data to run your business.

What’s the difference? From one perspective, operational and analytic data are one and the same. Suppose you work for a large retailer. A customer comes into one of your stores and makes some purchases. Another customer goes onto your company’s website and buys some items there. The records of those sales — which customers made the purchases, which products they bought, how many of each product, the dates of the sales, whether the sales were online or in a store, and so on — are all stored away as official records of those transactions, which are necessary for running your company’s operations.

But you also want to analyze that data, right? You want to understand which products are selling the best and where. You want to understand which customers are spending the most. You have dozens or even hundreds of questions you want to ask about your customers and their purchasing activity.

Here’s the catch: You need to make copies of your operational data for the deep analysis that you need to undertake, and those copies are what go into the data lake (see Figure 1-4).

FIGURE 1-4: Source applications feeding data into your data lake.
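To make the copying idea concrete, here is a minimal sketch in Python. It assumes a tiny in-memory SQLite database standing in for a source application and a local folder standing in for the data lake. The `sales` table, its columns, the sample rows, and the `extract_sales_to_lake` helper are all hypothetical names invented for illustration; a real data lake would typically sit on cloud object storage and be loaded by dedicated ingestion tooling.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def extract_sales_to_lake(conn, lake_dir):
    """Copy every row of the operational `sales` table into a CSV file in the lake folder."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    target = lake_dir / "sales.csv"
    cursor = conn.execute(
        "SELECT customer_id, product, quantity, sale_date, channel FROM sales"
    )
    column_names = [description[0] for description in cursor.description]
    with target.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(column_names)  # header row
        writer.writerows(cursor)       # one CSV line per sales record
    return target


# Hypothetical operational database: the official record of each sale.
operational_db = sqlite3.connect(":memory:")
operational_db.execute(
    "CREATE TABLE sales ("
    "customer_id INTEGER, product TEXT, quantity INTEGER, "
    "sale_date TEXT, channel TEXT)"
)
operational_db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "kettle", 2, "2024-01-05", "store"),
        (2, "toaster", 1, "2024-01-05", "online"),
        (1, "mug", 1, "2024-01-06", "store"),
    ],
)
operational_db.commit()

# A local folder plays the role of the data lake in this sketch.
lake_root = Path(tempfile.mkdtemp()) / "lake" / "raw"
copied_file = extract_sales_to_lake(operational_db, lake_root)

# Analysis runs against the copy, not against the operational database.
with copied_file.open(newline="") as f:
    lake_rows = list(csv.DictReader(f))

units_by_channel = {}
for row in lake_rows:
    channel = row["channel"]
    units_by_channel[channel] = units_by_channel.get(channel, 0) + int(row["quantity"])

print(units_by_channel)
```

The operational table is left untouched. The analysis at the end reads only the copied file, which is the point of the figure: source applications keep running the business while the lake holds the copies you analyze.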

Wait a minute! Why in the world do you need to copy data into your data lake? Why can’t you just analyze the data right where it is, in the source applications and their databases?

Data lakes, at least as you need to build them today and for the foreseeable future, are a continuation of the same model that has been used for data warehousing since the early 1990s. The main reason is performance: deep analysis that scans large data volumes and does significant cross-referencing competes for resources with the transactions your source applications have to process, so running it directly in those applications isn’t a workable solution for the bulk of your analytics.

Consequently, you need to make copies of the operational data that you want for analytical purposes and store that data in your data lake. Think of the data inside your data lake as (in used-car terminology) previously owned data that has been refurbished and is now ready for a brand-new owner.

But if you can’t adequately do complex analytics directly from source applications and their databases, what about this idea: Run your applications off your data lake instead! This way, you can avoid having to copy your data, right? Unfortunately, that idea won’t work, at least with today’s technology.

Operational applications almost always use a relational database, which manages concurrency control among their users and applications. In simple terms, hundreds or even thousands of users can add new data and make changes to a relational database without interfering with each other’s work and corrupting the database. A data lake, however, is built on storage technology that is optimized for retrieving data for analysis and doesn’t support concurrency control for update operations.

Many vendors are working on new technology that will allow you to build a data lake for operational as well as analytical purposes. This technology is still a bit down the road from full operational viability. For the time being, you’ll build a data lake by copying data from many different source applications.
