Storing data on the Hadoop Distributed File System (HDFS)
The HDFS uses clusters of commodity hardware for storing data. The hardware in each cluster is interconnected and consists of commodity servers — low-cost, general-purpose machines that individually offer modest performance but provide powerful computing capabilities when run in parallel across a shared cluster. These commodity servers are also called nodes. Commoditized computing dramatically decreases the cost of storing big data.
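To make the cluster idea concrete, here is a minimal sketch of how a client attaches to such a cluster through Hadoop's Java FileSystem API. The NameNode host, port, and directory here are placeholders, not values from this book; substitute your own cluster's address.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnect {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode, the server that
        // coordinates the commodity nodes. Host and port are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // Obtain a handle to the distributed file system and list the
        // contents of its root directory.
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus entry : fs.listStatus(new Path("/"))) {
            System.out.println(entry.getPath());
        }
        fs.close();
    }
}
```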
The HDFS is characterized by these three key features:
HDFS blocks: In data storage, a block is a storage unit that holds some maximum amount of data. HDFS splits each file into large blocks: 64 megabytes by default in Hadoop 1, and 128 megabytes by default in Hadoop 2 and later.
Redundancy: Datasets stored in HDFS are broken up and stored as blocks. These blocks are then replicated (three times, by default), and the copies are spread across several different servers in the cluster, so the data stays available even if a copy is lost. (The sketch after this list shows how to inspect a file's block size and replica placement.)
Fault-tolerance: As mentioned earlier, a system is described as fault-tolerant if it's built to continue operating successfully despite the failure of one or more of its subcomponents. Because the HDFS has built-in redundancy across multiple servers in a cluster, if one server fails, the system simply retrieves the data from a replica on another server.
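All three features can be observed from client code. The following sketch, again using Hadoop's Java FileSystem API with placeholder host and file names, copies a local file into HDFS and then prints its block size, its replication factor, and the nodes that hold a replica of each block, which is the machinery that fault tolerance depends on.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInspector {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address; use your own cluster's.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file (hypothetical name) into HDFS; behind the
        // scenes it is split into blocks and each block is replicated.
        Path file = new Path("/data/example.csv");
        fs.copyFromLocalFile(new Path("example.csv"), file);

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());

        // List the nodes that hold a replica of each block of the file.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " is on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```

Because each block appears on several hosts, losing any single server listed in the output leaves the file fully readable from the surviving replicas.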