Data Lakes For Dummies, by Alan R. Simon

THE THREE (OR FOUR OR FIVE OR MORE) VS OF BIG DATA AND DATA LAKES


Quick quiz: Name all the Vs of big data and data lakes. You can start with the original three: volume, variety, and velocity. But you’ll also find blog posts and online articles that mention value, veracity (a formal term for accuracy), visualization, and many others. In fact, don’t be surprised if one day you read an article or blog post that also includes Valentine’s Day!

The original three Vs of big data came from an analyst named Doug Laney, way back in 2001, when he worked at META Group (which Gartner later acquired). Volume, variety, and velocity were primarily aspirational characteristics of data environments, describing next-generation capabilities beyond what the relational databases of the time could support.

Over the years, other industry analysts, bloggers, consultants, and product vendors added to the list with their own Vs. The difference between the original three Vs and those that followed, though, is that value, veracity, visualization, and others all apply to tried-and-true relational technology just as much as to big data.

Don’t get confused trying to decide how many Vs apply to big data and to data lakes. Just focus on the original three — volume, variety, and velocity — as the must-have characteristics of your data lake.

You’ll find varying perspectives on the relationship between big data and data lakes, which certainly confuses the issue. Some technologists reverse the relationship between big data and data lakes; they consider a data lake to be the core technology and big data to be the overall environment. So, if you run across a blog post or another description that differs from the one I use, don’t worry. As with almost everything about data lakes and much of the technology world, you’ll find all sorts of opinions and perspectives, especially when you don’t have any official standards to govern a discipline.

The Hadoop open source environment, particularly the Hadoop Distributed File System (HDFS), is one of the earliest and most popular examples of big data. Some of the earliest data lakes were built, or at least begun, using HDFS as the foundation.

For purposes of establishing a data lake foundation, Amazon Simple Storage Service (S3) and Microsoft's Azure Data Lake Storage (ADLS) both qualify as big data. Why? Both S3 and ADLS support the three Vs of big data, which are as follows:

 Storing extremely large volumes of data

 Supporting a variety of data, including structured, unstructured, and semi-structured data

 Allowing very high velocity for incoming data into the data lake rather than requiring or at least encouraging periodic batches of data
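
To make the three Vs a little more concrete, here is a minimal Python sketch. It uses an in-memory dictionary as a stand-in for an object store such as S3 or ADLS. The ObjectStore class, its methods, and the object keys are all hypothetical names invented for illustration; they are not any vendor's API.

```python
import json


class ObjectStore:
    """Hypothetical in-memory stand-in for an object store such as S3 or ADLS."""

    def __init__(self):
        self._objects = {}  # key -> raw bytes

    def put(self, key, data):
        # Objects are opaque bytes, so any kind of data can be stored.
        self._objects[key] = data

    def keys(self):
        return sorted(self._objects)

    def total_bytes(self):
        return sum(len(value) for value in self._objects.values())


store = ObjectStore()

# Variety: structured (CSV), semi-structured (JSON), and unstructured (raw bytes).
store.put("sales/2024.csv", b"order_id,amount\n1,19.99\n")
store.put("clicks/event-0001.json", json.dumps({"user": "u1", "page": "/home"}).encode())
store.put("images/logo.bin", bytes([0x89, 0x50, 0x4E, 0x47]))

# Velocity: each event is written as it arrives, with no waiting for a batch window.
for i in range(2, 5):
    event = {"user": f"u{i}", "page": "/cart"}
    store.put(f"clicks/event-{i:04d}.json", json.dumps(event).encode())

# Volume: the store simply keeps growing as objects are added.
print(len(store.keys()), "objects,", store.total_bytes(), "bytes")
```

A real object store adds durability, scale, and access control on top of this idea, but the basic contract is similar: a key, some bytes, and no required schema.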

Think of big data as a core technology foundation that supports the three Vs of next-generation data management. Big data by itself, however, is just a platform. It’s the natural body of water — the lake itself — at a popular lakeside resort. When you divide your big data into multiple zones, add capabilities to transmit data across those zones, and then govern the whole environment, you’ve built a data lake surrounding that big data foundation. You’ve done the analytical data equivalent of building the docks, the restaurants, and the boat slips surrounding the lake itself.
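
The step from big data foundation to data lake can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DataLake class, the zone names (raw, refined, curated), and the rule that data moves forward one zone at a time are all hypothetical choices made for this example, not a prescribed design.

```python
class DataLake:
    """Hypothetical sketch: zones and governance layered on a flat object store."""

    def __init__(self, zones):
        self.zones = list(zones)  # ordered from least to most curated
        self.objects = {}         # "zone/name" -> bytes (the big data foundation)
        self.audit_log = []       # governance: a record of every data movement

    def ingest(self, name, data):
        # New data always lands in the first zone.
        first = self.zones[0]
        self.objects[f"{first}/{name}"] = data
        self.audit_log.append(("ingest", first, name))

    def promote(self, name, source, target, transform=lambda data: data):
        # Governance rule (an assumption of this sketch):
        # data may only move forward to the very next zone.
        if self.zones.index(target) != self.zones.index(source) + 1:
            raise ValueError(f"cannot move data from {source} to {target}")
        data = self.objects[f"{source}/{name}"]
        self.objects[f"{target}/{name}"] = transform(data)
        self.audit_log.append(("promote", target, name))


lake = DataLake(zones=["raw", "refined", "curated"])
lake.ingest("orders.csv", b"order_id,amount\n1, 19.99 \n")

# Moving data across zones is where cleanup happens; here, stray spaces are removed.
lake.promote("orders.csv", "raw", "refined",
             transform=lambda data: data.replace(b" ", b""))
```

The objects dictionary plays the role of the lake itself. The zones, the promote step, and the audit log are the docks, restaurants, and boat slips built around it.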
