Computational Statistics in Data Science, by a group of authors

5 Streaming Data Pre‐Processing: Concept and Implementation

Data stream pre‐processing aims to reduce the inherent complexity of streaming data so that learning becomes faster, more interpretable, and more precise, which makes it an essential technique in knowledge discovery. However, despite the recorded growth in online learning, data stream pre‐processing methods still have a long way to go because of the high level of noise [66]. This noise includes short messages, slang, abbreviations, acronyms, mixed languages, grammatical and spelling mistakes, irregular, informal, and shortened words, and improper sentence structure, all of which make it hard for learning algorithms to perform efficiently and effectively [67]. Additionally, errors in sensor readings caused by low battery, damage, or incorrect calibration, among other factors, can render the data delivered by such sensors unsuitable for analysis [68].
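As a rough illustration of the kind of text noise described above, the following Python sketch normalizes messages one at a time as they arrive. It is a minimal example under stated assumptions, not the method of any framework cited in this section: the `SLANG_MAP` lookup table and the function names `normalize_message` and `preprocess_stream` are hypothetical, and a practical system would load a curated lexicon rather than a hand-written dictionary.

```python
import re
from typing import Dict, Iterable, Iterator

# Hypothetical slang/abbreviation lookup (an assumption for illustration only);
# a real pipeline would load a curated lexicon instead.
SLANG_MAP: Dict[str, str] = {
    "u": "you",
    "gr8": "great",
    "thx": "thanks",
    "idk": "i do not know",
}

_TOKEN_RE = re.compile(r"[a-z0-9']+")   # runs of letters, digits, apostrophes
_REPEAT_RE = re.compile(r"(.)\1{2,}")   # any character repeated 3+ times


def normalize_message(text: str, slang_map: Dict[str, str] = SLANG_MAP) -> str:
    """Lowercase, squeeze elongated character runs, and expand known slang."""
    lowered = text.lower()
    # Collapse runs of three or more identical characters to two ("sooooo" -> "soo").
    squeezed = _REPEAT_RE.sub(r"\1\1", lowered)
    tokens = _TOKEN_RE.findall(squeezed)
    return " ".join(slang_map.get(tok, tok) for tok in tokens)


def preprocess_stream(messages: Iterable[str]) -> Iterator[str]:
    """Yield normalized messages lazily, skipping those left empty."""
    for message in messages:
        cleaned = normalize_message(message)
        if cleaned:
            yield cleaned
```

Because `preprocess_stream` is a generator, each message is processed once and then discarded, which matches the single-pass constraint typical of stream settings; for example, `list(preprocess_stream(["Thx u, that was GR8!!!"]))` gives `["thanks you that was great"]`.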

Data quality is a fundamental determinant in the knowledge discovery pipeline, as low‐quality data yields low‐quality models and decisions [69]. The data stream pre‐processing stage needs to be strengthened to handle the multi‐label [70], imbalance [71], and multi‐instance [72] problems associated with data streams [66]. Data stream pre‐processing techniques with low computational requirements [73] also need to be developed, as this remains an open research area. Moreover, social media posts must be represented in a way that preserves the semantics of their content [74, 75]. To improve the results of data stream analysis, frameworks are needed that can cope with noise, redundancy, heterogeneity, data imbalance, transformation, and feature representation or selection issues in data streams [26]. Some of the newer frameworks developed for pre‐processing and enriching data streams for better results are SlangSD [76], N‐gram and Hidden Markov Model [77], SLANGZY [78], and SMFP [67].
