Читать книгу Computational Statistics in Data Science - Группа авторов - Страница 106
6.1 Unsupervised Learning
ОглавлениеUnsupervised learning is a type of learning that draws inductions from the unlabeled dataset [84]. Data stream source is nonstationary, and for clustering algorithms, there is no information about the data distribution in advance [85]. Due to several iterations required to compute similarity or dissimilarity in the observed dataset, the entirety of the datasets ought to be accessible in memory before running the algorithm in most cases. However, with data stream clustering, the challenge is searching for a new structure in data as it evolves, which involves characterizing the streaming data in the form of clusters to leverage them to report useful and interesting patterns in the data stream [86]. Unsupervised learning algorithms are suitable for analyzing data stream as it does not require a predefined label [87]. Clusters are ordered dependent on scoring function, for example, catchphrase or keyword, hashtags, the semantic relationship of terms, and segment extraction [88].
Data stream clustering can be grouped into five categories, which are partitioning methods, hierarchical methods, model‐based methods, density‐based methods, and grid‐based methods.
Partition‐based techniques try to find out k‐partitions based on some measurement. Partitioning clustering methods are not suitable for streaming scenarios since they require earlier information on cluster number. Examples of partition‐based methods include Incremental K‐Mean, STREAMKM++, Stream LSearch, HPStream, SWClustering, and CluStream.
Hierarchical methods can be further subdivided into divisive and agglomerative. With divisive hierarchical clustering, a cluster is divided into small clusters until it cannot be split further. In contrast, agglomerative hierarchical clustering merges up separate clusters until the distance between two clusters reaches a required threshold. Balanced iterative reducing and clustering using hierarchies (BIRCH), open distributed application construction (ODAC), E‐Stream, clustering using representatives (CURE), and HUE‐ are some hierarchical algorithms for data stream analysis.
In model‐based methods, a hypothesized model is run for each cluster to check which data properly fits a cluster. Some of the algorithms that fit into this category are CluDistream, Similarity Histogram‐based Incremental Clustering, sliding window with expectation maximization (SWEM), COBWEB, and Evolving Fractal‐Based Clustering of Data Streams.
Density‐based methods separate data into density regions (i.e., nonoverlapping cells) of different shapes and sizes. Density‐based algorithms require a single pass and can handle noise. Stating the number of clusters in advance is not also required. Some density‐based algorithms include DGStream, MicroTEDAclus, clustering of evolving data‐streams into arbitrary shapes (CEDAS), Incremental DBSCAN (Density‐Based Spatial Clustering with Noise), DenStream, r‐DenStream, DStream, DBstream, data stream clustring (DSCLU), MR‐Stream, Ordering Points to Identify Clustering Structure (OPTICS), OPClueStream, and MBG‐Stream.