Читать книгу Official Google Cloud Certified Professional Data Engineer Study Guide - Dan Sullivan - Страница 22
Streaming Data
ОглавлениеStreaming data is a set of data that is typically sent in small messages that are transmitted continuously from the data source. Streaming data may be sensor data, which is data generated at regular intervals, and event data, which is data generated in response to a particular event. Examples of streaming data include the following:
Virtual machine monitoring data, such as CPU utilization rates and memory consumption data
An IoT device that sends temperature, humidity, and pressure data every minute
A customer adding an item to an online shopping cart, which then generates an event with data about the customer and the item
Streaming data often includes a timestamp indicating the time that the data was generated. This is often called the event time. Some applications will also track the time that data arrives at the beginning of the ingestion pipeline. This is known as the process time. Time-series data may require some additional processing early in the ingestion process. If a stream of data needs to be in time order for processing, then late arriving data will need to be inserted in the correct position in the stream. This can require buffering of data for a short period of time in case the data arrives out of order. Of course, there is a maximum amount of time to wait before processing data. These and other issues related to processing streaming data are discussed in Chapter 4, “Designing a Data Processing Solution.”
Streaming data is well suited for Cloud Pub/Sub ingestion, which can buffer data while applications process the data. During spikes in data ingestion in which application instances cannot keep up with the rate data is arriving, the data can be preserved in a Cloud Pub/Sub topic and processed later after application instances have a chance to catch up. Cloud Pub/Sub has global endpoints and uses GCP’s global frontend load balancer to support ingestion. The messaging service scales automatically to meet the demands of the current workload.