Big Data - Seifedine Kadry - Page 57

2.2.1 Sharding

Sharding is the process of partitioning a very large data set into smaller, more manageable chunks called shards. The shards are distributed across multiple machines called nodes. No two shards of the same file are stored on the same node: each shard occupies a separate node, and the shards spread across the nodes collectively constitute the data set.

Figure 2.6a shows a 1 GB data block split into four chunks of 256 MB each. As the data grows, a single node may no longer be sufficient to store it. With sharding, more nodes are added to meet the demands of massive data growth. Sharding reduces the number of transactions each node handles, which increases throughput, and it reduces the amount of data each node needs to store.
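The arithmetic behind the 1 GB example can be sketched in a few lines of Python. This is only an illustration: the function name shard_ranges and the use of binary units (1 GB = 1024 MB) are assumptions made here, not something the text specifies.

```python
MB = 1024 * 1024   # assumption: binary megabytes
GB = 1024 * MB     # assumption: binary gigabytes


def shard_ranges(total_bytes, shard_bytes):
    """Return (offset, length) pairs that cover total_bytes in pieces
    of at most shard_bytes. Hypothetical helper for illustration."""
    if total_bytes < 0 or shard_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and shard_bytes > 0")
    ranges = []
    offset = 0
    while offset < total_bytes:
        # The last shard may be smaller if the sizes do not divide evenly.
        length = min(shard_bytes, total_bytes - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


# A 1 GB block cut into 256 MB shards, as in the example above.
ranges = shard_ranges(1 * GB, 256 * MB)
for index, (offset, length) in enumerate(ranges):
    print(f"shard {index}: offset={offset // MB} MB, size={length // MB} MB")
```

Each (offset, length) pair describes one shard. In a real system each pair would be assigned to a different node, so no node has to hold the whole block.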

Figure 2.5 Distribution model.

Figure 2.6 (a) Sharding. (b) Sharding example.

Figure 2.6b shows an example of how a data block is split into shards across multiple nodes. A data set with employee details is split into four smaller blocks (shard A, shard B, shard C, and shard D) that are stored on four different nodes (node A, node B, node C, and node D). Sharding improves the fault tolerance of the system, because the failure of a node affects only the block of data stored on that particular node.
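The employee example and its fault-tolerance claim can be illustrated with a small in-memory sketch. The text does not say how records are assigned to shards, so the modulo rule on a numeric employee id used here is an assumption, as are the record fields and the helper names node_for, shard_records, and available_records.

```python
NODES = ["node A", "node B", "node C", "node D"]


def node_for(employee_id, nodes=NODES):
    """Pick a node for a record. Assumption: simple modulo partitioning
    on a numeric id; real systems may use ranges or hashing instead."""
    return nodes[employee_id % len(nodes)]


def shard_records(records, nodes=NODES):
    """Group records by the node that stores their shard."""
    placement = {node: [] for node in nodes}
    for record in records:
        placement[node_for(record["id"], nodes)].append(record)
    return placement


def available_records(placement, failed_nodes):
    """Return the records that remain reachable when some nodes are down."""
    return [
        record
        for node, records in placement.items()
        if node not in failed_nodes
        for record in records
    ]


# Hypothetical employee data set with eight records.
employees = [{"id": i, "name": f"employee-{i}"} for i in range(1, 9)]
placement = shard_records(employees)

# Simulate the failure of a single node.
survivors = available_records(placement, failed_nodes={"node B"})
print(f"{len(survivors)} of {len(employees)} records are still available")
```

With this placement, losing node B makes only the records in its shard unreachable, while the other three shards continue to serve their data. The sketch keeps a single copy of each shard, so the lost records stay unavailable until the node recovers.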
