Читать книгу Data Science For Dummies - Lillian Pierson - Страница 61

Using Spark to generate real-time big data analytics

Apache Spark is an in-memory distributed computing application that you can use to deploy machine learning algorithms on big data sources in near-real-time to generate analytics from streaming big data sources. Whew!

In-memory refers to processing data within the computer’s memory, without actually reading and writing its computational results onto the disk. In-memory computing provides its results a lot faster but cannot process much data per processing interval.

Because it processes data in microbatches, with 3-second cycle times, you can use it to significantly decrease time-to-insight in cases where time is of the essence. It can be run on data that sits in a wide variety of storage architectures, including Hadoop HDFS, Amazon Redshift, MongoDB, Cassandra, Solr and AWS. Spark is composed of the following submodules:

Spark SQL: You use this module to work with and query structured data using Spark. Within Spark, you can query data using Spark’s built-in SQL package: SparkSQL. You can also query structured data using Hive, but then you’d use the HiveQL language and run the queries using the Spark processing engine.

GraphX: The GraphX library is how you store and process network data from within Spark.

Streaming: The Streaming module is where the big data processing takes place. This module basically breaks a continuously streaming data source into much smaller data streams, called Dstreams — discreet data streams, in other words. Because the Dstreams are small, these batch cycles can be completed within three seconds, which is why it’s called microbatch processing.

MLlib: The MLlib submodule is where you analyze data, generate statistics, and deploy machine learning algorithms from within the Spark environment. MLlib has APIs for Java, Scala, Python, and R. The MLlib module allows data professionals to work within Spark to build machine learning models in Python or R, and those models will then pull data directly from the requisite data storage repository, whether that be on-premise, in a cloud, or even a multicloud environment. This helps reduce the reliance that data scientists sometimes have on data engineers. Furthermore, computations are known to be 100 times faster when processed in-memory using Spark as opposed to the traditional MapReduce framework.

You can deploy Spark on-premise by downloading the open-source framework from the Apache Spark website, at http://spark.apache.org/downloads.html. Another option is to run Spark on the cloud via the Apache Databricks service, at https://databricks.com.

Подняться наверх