Computational Statistics in Data Science - Group of authors - Page 110

7 Strategies for Processing Data Streams

Data stream processing includes techniques, models, and systems for processing data as soon as they arrive, so that trends and patterns can be detected with low latency [109]. It places demands on two resources, storage capacity and computational power, because data are generated without bound, at high velocity, and with a brief life span. To cope with these requirements, approximate computing, which aims at low latency at the expense of an acceptable loss of quality, has been a practical solution [110]. The idea behind approximate computing is to return an approximate answer to a user query instead of the exact answer. This is done by operating on a representative sample of the data rather than on the whole data set [111]. The two main techniques for approximate computing are (i) sampling [4], which constructs data stream summaries by probabilistic selection, and (ii) sketches [112], which compress data using data structures (such as histograms or hash tables), prediction‐based methods (such as Bayesian inference), and transformation‐based methods (such as wavelets).
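As a concrete illustration of sampling by probabilistic selection, the sketch below uses reservoir sampling, a standard technique that is chosen here for illustration and is not named in the text above. It keeps a uniform random sample of fixed size k from a stream of unknown length, using memory proportional to k only. The function name reservoir_sample and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration: each item is seen once and
    only k items are ever held in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by drawing
            # a slot index uniformly from 0..i (inclusive).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Approximate answer: estimate the stream mean from the sample alone.
sample = reservoir_sample(range(10_000), k=100, seed=0)
approx_mean = sum(sample) / len(sample)
```

The estimate approx_mean is computed from 100 retained items instead of all 10,000, which is the trade of accuracy for storage and latency described above.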

The fixed window and the sliding window are two computation models for partitioning a data stream. The fixed window partitions the data stream into nonoverlapping time segments; once a segment has been processed, its data are removed, which resets the window size to zero. The sliding window holds a historical snapshot of the data stream at any point in time. When arriving data differ from the current window elements, the window is updated by discarding the oldest tuples [5]. The sliding window can be further subdivided into the count‐based window and the time‐based window. In the count‐based window, the progressive step is expressed in tuple counts, while in the time‐based window, items with the oldest timestamps are replaced by items with the latest timestamps [113].
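The window models described above can be sketched as follows. This is a minimal illustration, assuming events arrive as (timestamp, value) pairs in nondecreasing timestamp order; the names fixed_windows, count_sliding, and time_sliding, and their parameters, are hypothetical and do not come from the cited works.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional, Tuple

Event = Tuple[float, float]  # (timestamp, value); assumed sorted by timestamp


def fixed_windows(events: Iterable[Event], width: float) -> Iterator[List[float]]:
    """Yield the values of each nonoverlapping segment [k*width, (k+1)*width)."""
    current: List[float] = []
    segment: Optional[int] = None
    for ts, value in events:
        k = int(ts // width)
        if segment is not None and k != segment:
            yield current
            current = []  # window is emptied once its segment is processed
        segment = k
        current.append(value)
    if current:
        yield current  # flush the last, possibly partial, segment


def count_sliding(events: Iterable[Event], size: int) -> Iterator[List[float]]:
    """Count-based sliding window: keep only the latest `size` tuples."""
    window: Deque[float] = deque(maxlen=size)
    for _, value in events:
        window.append(value)  # a full deque discards its oldest element
        yield list(window)


def time_sliding(events: Iterable[Event], span: float) -> Iterator[List[float]]:
    """Time-based sliding window: keep tuples with timestamp in (ts - span, ts]."""
    window: Deque[Event] = deque()
    for ts, value in events:
        window.append((ts, value))
        # Evict the tuples whose timestamps have fallen out of the span.
        while window[0][0] <= ts - span:
            window.popleft()
        yield [v for _, v in window]


events = [(0.0, 1), (1.0, 2), (2.5, 3), (4.0, 4), (4.5, 5)]
fixed = list(fixed_windows(events, width=2.0))
by_count = list(count_sliding(events, size=2))
by_time = list(time_sliding(events, span=3.0))
```

In this sketch, segments that receive no events are skipped rather than emitted as empty windows, and the sliding windows advance by one tuple per arrival; both are simplifying choices, not requirements of the models.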

