Refilling the data lake

What exactly does “copying data” look like, and how frequently do you need to copy data into the data lake?

Data lakes mostly use a technique called ELT, which stands for either extract, load, and transform or extraction, loading, and transformation. With ELT, you “blast” your data into a data lake without having to spend a great deal of time profiling and understanding the particulars of your data. You extract data (the E part of ELT) from its original home in a source application, and then, after that data has been transmitted to the data lake, you load the data (the L) into its initial storage location. Eventually, when it’s time for you to use the data for analytical purposes, you’ll need to transform the data (the T) into whatever format is needed for a specific type of analysis.
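
Here’s a minimal sketch of that E-then-L-then-T flow in Python. The database, table, and lake paths are made up for illustration (the book doesn’t prescribe any particular tools); the point is simply the order of the steps:

    import csv
    import sqlite3
    from pathlib import Path

    # Hypothetical source application database (stand-in for a real system)
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    source.execute("INSERT INTO orders VALUES (101, 19.99), (102, 5.00)")

    # --- Extract (E): pull the rows out of the source as-is ---
    cursor = source.execute("SELECT * FROM orders")
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()

    # --- Load (L): land the raw copy in the lake's initial storage location ---
    raw_zone = Path("datalake/raw/orders")  # illustrative lake path
    raw_zone.mkdir(parents=True, exist_ok=True)
    with open(raw_zone / "orders.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)

    # --- Transform (T): deferred until analysis time ---
    # Nothing happens here yet; only when a report or model needs the data
    # do you reshape it (filter, join, aggregate) into an analysis-ready form.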

For data warehousing (the predecessor to data lakes that you’re almost certainly still using), data is copied from source applications to the data warehouse using a technique called ETL rather than ELT. With ETL, you need to thoroughly understand the particulars of your data on its way into the data warehouse, because the transformation (T) must occur before the data is loaded (L) into its usable form.
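
For contrast, here’s the ETL ordering in the same sketch style, again with made-up names and cleanup rules; notice that the transformation logic has to be worked out up front, before anything lands in its usable form:

    # ETL: transform happens *before* the load, so you must already
    # understand the data's structure and business rules in advance.
    def transform(row):
        # Illustrative cleanup rules, decided ahead of time
        return {"order_id": int(row["id"]), "amount_usd": round(float(row["amount"]), 2)}

    raw_rows = [{"id": "101", "amount": "19.991"}]    # extracted (E)
    clean_rows = [transform(r) for r in raw_rows]     # transformed (T)
    warehouse_table = list(clean_rows)                # loaded (L) into its usable form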

With ELT, you can control the latency, or “freshness,” of data that is brought into the data lake. Some data needed for critical, real-time analysis can be streamed into the data lake, which means that a copy is sent to the data lake immediately after data is created or updated within a source application. (This is referred to as a low-latency data feed.) You essentially push data into your data lake piece by piece immediately upon the creation of that data.
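
A sketch of what a low-latency feed looks like, assuming a hypothetical hook that the source application fires the moment a record is created or updated. Real implementations typically use a streaming service rather than plain files, but the push-a-copy-per-record idea is the same:

    import json
    from datetime import datetime, timezone
    from pathlib import Path

    STREAM_ZONE = Path("datalake/streaming/orders")  # illustrative lake path
    STREAM_ZONE.mkdir(parents=True, exist_ok=True)

    def on_record_written(record: dict) -> None:
        """Hypothetical hook: called by the source app the moment data is created."""
        ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
        # Push a copy into the lake immediately -- one record at a time.
        (STREAM_ZONE / f"order_{ts}.json").write_text(json.dumps(record))

    on_record_written({"order_id": 101, "status": "shipped"})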

Other data may be less time-critical and can be “batched up” in a source application and then periodically transmitted in bulk to the data lake.
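
And the batched alternative, in the same illustrative style: records pile up inside the source application and get shipped to the lake in bulk on a schedule, rather than one at a time:

    import json
    from pathlib import Path

    BATCH_ZONE = Path("datalake/batch/orders")  # illustrative lake path
    BATCH_ZONE.mkdir(parents=True, exist_ok=True)

    pending = []  # records "batched up" inside the source application

    def record_event(record: dict) -> None:
        pending.append(record)  # no trip to the lake yet

    def flush_batch() -> None:
        """Hypothetical scheduled job: transmit the whole batch in one shot."""
        if pending:
            (BATCH_ZONE / "orders_batch.json").write_text(json.dumps(pending))
            pending.clear()

    record_event({"order_id": 102})
    record_event({"order_id": 103})
    flush_batch()  # in practice triggered by a scheduler, such as a nightly job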

You can specify the latency requirements for every single data feed from every single source application.
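
One way to picture those per-feed requirements is as a simple configuration with one entry per feed. The feed names and settings below are invented for illustration:

    # Hypothetical latency spec: every feed from every source application
    # declares how fresh its copy in the lake needs to be.
    FEED_LATENCY = {
        "orders":           {"mode": "streaming"},                 # real-time analysis
        "inventory_levels": {"mode": "batch", "every": "15 min"},  # near-real-time
        "hr_records":       {"mode": "batch", "every": "nightly"}, # time-insensitive
    }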

The ELT model also allows you to identify a new source of data for your data lake and then very quickly bring in the data that you need. You don’t need to spend days or weeks dissecting the ins and outs of the new data source to understand its structure and business rules. You “blast” the data into your data lake in its natural form: database tables, MP4 files, or however the data happens to be stored. Then, when it’s time to use that data for analysis, you can dig into the particulars and get the data ready for reports, machine learning, or whatever other analysis you have in mind.
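
A last sketch shows what “blasting” a newly identified source into the lake can look like: files are copied as-is, with no profiling or restructuring up front. The source folder here is hypothetical; in practice the source might be a database export, an API, or a file share:

    import shutil
    from pathlib import Path

    new_source = Path("/mnt/new_source")       # hypothetical new data source
    landing = Path("datalake/raw/new_source")  # illustrative landing zone
    landing.mkdir(parents=True, exist_ok=True)

    # Copy everything in its natural form -- exported tables, MP4s, whatever
    # the source holds -- and defer understanding it until analysis time.
    for item in new_source.rglob("*"):
        if item.is_file():
            dest = landing / item.relative_to(new_source)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(item, dest)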
