Computational Statistics in Data Science

8 Best Practices for Managing Data Streams

Data streams are so dynamic that handling data in motion is not only a design‐time problem but also a run‐time problem whose operations must be managed in real time. Stream computing has emerged as a capability of real‐time applications in smart cities, monitoring systems, manufacturing, and financial markets [15]. Data stream management systems should be able to update the answers to continuous queries as new data arrives. Choosing the right processing model for streaming data is challenging, given the growing number of frameworks that offer varied yet overlapping services [114]. When a high volume of data from disparate sources must be processed within a short time interval, Storm and Flink may be considered. For pure stream processing, Storm is recommended for highly stream‐oriented applications, as it can process millions of events per second. When durability, scalability, high throughput, and low latency are required, Apache Kafka is a good option [115]. Yahoo! S4 offers real‐time response, fault tolerance, and scalability [116]. The Spark framework may be suitable for periodic processing tasks such as fraud detection and web usage mining. For tasks that combine batch and streaming programming models, such as those in IoT and healthcare, Spark and Flink may be good candidates [117]. Frameworks that support iterative processing or machine learning tasks include Flink (FlinkML), Spark (Spark MLlib), GraphX with Spark, and Gelly with Flink. Other graph processing frameworks include BLADYG, GraphLab, and Trinity.
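The requirement that a data stream management system update the answer to a continuous query as new data arrives can be illustrated with a small, framework‐independent sketch. The example below keeps a running mean over a sliding time window and refreshes it on every incoming event. The class name `SlidingWindowMean`, the helper `run_query`, and the sample readings are hypothetical and chosen only for illustration; the sketch assumes a positive window length and events that arrive in timestamp order, and it does not reproduce the API of Storm, Flink, Spark, or any other framework named above.

```python
from collections import deque


class SlidingWindowMean:
    """Continuous query (illustrative): mean of values in the last `window_seconds`."""

    def __init__(self, window_seconds):
        # Assumption: window_seconds > 0 and timestamps are non-decreasing.
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0       # running sum of the values currently in the window

    def update(self, timestamp, value):
        """Ingest one event and return the refreshed answer to the query."""
        self._events.append((timestamp, value))
        self._total += value
        # Evict events that are at least one full window older than the new event.
        cutoff = timestamp - self.window_seconds
        while self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


def run_query(stream, window_seconds):
    """Feed (timestamp, value) events to the query and collect each updated answer."""
    query = SlidingWindowMean(window_seconds)
    return [query.update(ts, v) for ts, v in stream]


# Hypothetical sensor readings: (timestamp in seconds, measured value).
readings = [(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)]
answers = run_query(readings, window_seconds=3)
print(answers)
```

Each call to `update` does only the work caused by the new event (one append and the evictions it triggers), rather than rescanning the history. Incremental maintenance of this kind is what allows the answer to stay current as the stream keeps flowing.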

IBM InfoSphere Streams can handle millions of messages or events per second at high throughput, making it one of the leading proprietary solutions for real‐time applications [61]. Apama Stream Analytics is suitable for real‐time, high‐volume business operations [62]. Azure Stream Analytics is another proprietary solution for streaming analytics and IoT workloads [62]. Other proprietary options include Kinesis, PieSync, TIBCO Spotfire, Google Cloud Pub/Sub, Azure Event Hubs, Kibana, and Amazon Elasticsearch Service.

Ideally, a single streaming data technology that meets all the system requirements, such as the state of the data, the use case, and the kind of results needed, is the best choice, because it avoids interoperability problems between components.
