1.9 Big Data Technology
With advances in technology, the ways data are generated, captured, processed, and analyzed keep changing, and the efficiency of processing and analysis has improved accordingly. Technology thus plays a central role in the entire process, from gathering the data to analyzing them and extracting key insights.
Apache Hadoop is an open-source platform and one of the most important big data technologies. Hadoop is a framework for storing and processing data. It was originally created by Doug Cutting and Mike Cafarella, then a graduate student at the University of Washington. They worked together with the goal of indexing the entire web, on a project called "Nutch." The concepts of MapReduce and the Google File System (GFS) were integrated into Nutch, which led to the evolution of Hadoop. The name "Hadoop" comes from the toy elephant of Doug Cutting's son. The core components of Hadoop are HDFS, MapReduce, and Hadoop Common, a collection of common utilities that support the other Hadoop modules.
Figure 1.12 Hadoop core components.
Apache Hadoop is an open-source framework for distributed storage and distributed processing of large data sets. Hadoop can store petabytes of structured, semi-structured, or unstructured data at low cost, because it runs on clusters of commodity hardware.
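To make the storage side concrete, the sketch below writes a small file into HDFS through Hadoop's Java FileSystem API. It is a minimal illustration, not code from this book: the NameNode address hdfs://namenode:9000 and the file path are hypothetical placeholders that would be replaced by a real cluster's settings.

    // Minimal sketch: write a file to HDFS via the Java FileSystem API.
    // The NameNode URI and the path below are hypothetical placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the cluster's NameNode (hypothetical address).
            conf.set("fs.defaultFS", "hdfs://namenode:9000");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/data/sample.txt");
                // create() splits the file into blocks and replicates them
                // across DataNodes per the configured replication factor.
                try (FSDataOutputStream out = fs.create(file)) {
                    out.writeBytes("hello hadoop\n");
                }
                System.out.println("Wrote " + fs.getFileStatus(file).getLen() + " bytes");
            }
        }
    }

The same FileSystem API hides whether the data lands on one machine or on thousands of commodity nodes, which is what keeps the cost of scaling low.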
Figure 1.12 shows the core components of Hadoop. A brief overview of Hadoop, MapReduce, and HDFS was given in Section 1.7, "Big Data Infrastructure." Now, let us take a brief look at YARN and Hadoop Common.
YARN – YARN is an acronym for Yet Another Resource Negotiator, an open-source framework for distributed resource management and job scheduling. It is the key new feature of Hadoop 2.0 from the Apache Software Foundation. In Hadoop 1.0, MapReduce was the only component that processed data in the distributed environment, and the limitations of classical MapReduce led to the evolution of YARN. In Hadoop 2.0, YARN took over the cluster resource management that MapReduce had handled in Hadoop 1.0, lightening MapReduce's load and allowing it to focus on data processing. YARN also enables Hadoop to run jobs other than MapReduce jobs.
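As a small illustration of YARN's role as the cluster's resource manager, the sketch below uses the YarnClient API to list the applications the ResourceManager is tracking, whatever their type. It is an assumption-laden example, not code from the book: it presumes a reachable cluster whose yarn-site.xml is on the classpath.

    // Minimal sketch: ask YARN's ResourceManager which applications it
    // is managing. Assumes yarn-site.xml is available on the classpath.
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import java.util.List;

    public class YarnListApps {
        public static void main(String[] args) throws Exception {
            YarnClient client = YarnClient.createYarnClient();
            client.init(new YarnConfiguration()); // reads yarn-site.xml
            client.start();

            // The ResourceManager tracks every application on the cluster,
            // MapReduce or otherwise, which is exactly YARN's contribution.
            List<ApplicationReport> apps = client.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + "  "
                        + app.getApplicationType() + "  "
                        + app.getYarnApplicationState());
            }
            client.stop();
        }
    }

Because resource management now lives in YARN rather than inside MapReduce, the application types printed here can include Spark, Tez, or other engines alongside MapReduce jobs.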
Hadoop Common – Hadoop Common is a collection of common utilities that support the other Hadoop modules. It is considered the core module of Hadoop because it offers essential services. Hadoop Common contains the scripts and Java Archive (JAR) files required to start Hadoop.
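One concrete example of a Hadoop Common utility is the Configuration class, which HDFS, YARN, and MapReduce all use to read cluster settings. The minimal sketch below loads a site file and reads one property; the path /etc/hadoop/conf/core-site.xml is a typical but hypothetical install location.

    // Minimal sketch: the Configuration utility from Hadoop Common,
    // shared by HDFS, YARN, and MapReduce for reading cluster settings.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class CommonConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Explicitly load a site file; by default Hadoop picks up
            // core-default.xml and core-site.xml from the classpath.
            conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));

            // get() falls back to the supplied default when the key is unset.
            String fsUri = conf.get("fs.defaultFS", "file:///");
            System.out.println("Default filesystem: " + fsUri);
        }
    }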