Читать книгу Data Science For Dummies - Lillian Pierson - Страница 44
Incorporating MapReduce, the HDFS, and YARN
ОглавлениеMapReduce is a parallel distributed processing framework that can process tremendous volumes of data in-batch — where data is collected and then processed as one unit with processing completion times on the order of hours or days. MapReduce works by converting raw data down to sets of tuples and then combining and reducing those tuples into smaller sets of tuples. (With respect to MapReduce, tuples refers to key-value pairs by which data is grouped, sorted, and processed.) In layperson terms, MapReduce uses parallel distributed computing to transform big data into data of a manageable size.
In Hadoop, parallel distributed processing refers to a powerful framework in which data is processed quickly via the distribution and parallel processing of tasks across clusters of commodity servers.