Читать книгу Big Data - Seifedine Kadry - Страница 26
1.7 Big Data Infrastructure
ОглавлениеThe core components of big data technologies are the tools and technologies that provide the capacity to store, process, and analyze the data. The method of storing the data in tables was no longer supportive with the evolution of data with 3 Vs, namely volume, velocity, and variety. The robust RBDMS was no longer cost effective. The scaling of RDBMS to store and process huge amount of data became expensive. This led to the emergence of new technology, which was highly scalable at very low cost.
The key technologies include
Hadoop
HDFS
MapReduce
Hadoop – Apache Hadoop, written in Java, is open‐source framework that supports processing of large data sets. It can store a large volume of structured, semi‐structured, and unstructured data in a distributed file system and process them in parallel. It is a highly scalable and cost‐effective storage platform. Scalability of Hadoop refers to its capability to sustain its performance even under highly increasing loads by adding more nodes. Hadoop files are written once and read many times. The contents of the files cannot be changed. A large number of computers interconnected working together as a single system is called a cluster. Hadoop clusters are designed to store and analyze the massive amount of disparate data in distributed computing environments in a cost effective manner.
Hadoop Distributed File system – HDFS is designed to store large data sets with streaming access pattern running on low‐cost commodity hardware. It does not require highly reliable, expensive hardware. The data set is generated from multiple sources, stored in an HDFS file system in a write‐once, read‐many‐times pattern, and analyses are performed on the data set to extract knowledge from it.
MapReduce – MapReduce is the batch‐processing programming model for the Hadoop framework, which adopts a divide‐and‐conquer principle. It is highly scalable, reliable, and fault tolerant, capable of processing input data with any format in parallel and distributed computing environments supporting only batch workloads. Its performance reduces the processing time significantly compared to the traditional batch‐processing paradigm, as the traditional approach was to move the data from the storage platform to the processing platform, whereas the MapReduce processing paradigm resides in the framework where the data actually resides.