The Smart Cyber Ecosystem for Sustainable Development, Page 44

2.4.3.2 Unsupervised Learning


In this technique, data is submitted to the learning algorithm without predefined knowledge or labels. The machine therefore has to learn the properties of the dataset on its own by studying unlabeled training data, and the algorithm must be able to identify patterns in the input data. Observations are grouped into clusters according to the similarities between them; the clustering algorithm measures the similarity of observations based on their features.

Figure 2.4 Illustration of KNN.

Observations are then grouped so that elements with high mutual similarity end up in the same group. Typically, algorithms use distance functions to measure the similarity of observations. With unsupervised learning, no prior knowledge is required; however, this comes at the cost of reduced accuracy [6].
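As a sketch of such a distance function, the following uses Euclidean distance (an assumption for illustration; the text does not fix a specific metric). The smaller the distance, the more similar the two observations:

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two observations described by two features each.
print(euclidean((1.0, 2.0), (4.0, 6.0)))  # → 5.0
```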

The most common unsupervised learning technique is clustering. Clustering algorithms divide data samples into several categories, called clusters. Clustering algorithms are of four main types [7]:

 Centroid-Based Clustering: Clusters are defined using centroids, data points that represent the proto-element of each group. The number of clusters has to be defined beforehand and is fixed. Initially, the cluster centroids are placed at random and are then shifted iteratively in the feature space until the specified distance function is minimized.

 K-Means Clustering is the simplest and most common centroid-based method. The objective is to partition the data points into K clusters, where each data point belongs to the cluster with the nearest mean. Initially, K points are picked at random as centroids. The algorithm then iterates over each data point, computes its distance to every centroid, and assigns it to the centroid with the minimum Euclidean distance. The method thus minimizes the distance between points and their corresponding centroids. Each centroid is then updated to the mean of its assigned data points, and the process repeats until the centroids no longer change. Figure 2.5 illustrates the concept of K-means clustering.
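The steps above can be sketched in plain Python. This is a toy implementation, not from the book; the data points and the random seed are illustrative:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: random initial centroids
    while True:
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: euclidean(p, centroids[j]))
            clusters[i].append(p)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:           # step 4: stop when stable
            return new_centroids, clusters
        centroids = new_centroids

# Two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On this toy data the centroids converge to the means of the two obvious groups, roughly (0.33, 0.33) and (10.33, 10.33).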

 Hierarchical Clustering: In this type, the number of clusters is not defined a priori; rather, it iteratively increases or decreases. In the beginning, all observations are included in one cluster. The cluster is then split according to the largest distance between data points. Once a sufficient number of clusters is reached, the process stops.
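A minimal sketch of this top-down (divisive) idea on one-dimensional data; splitting a cluster at the largest gap between its sorted values is one simple realization of "the largest distance between data points" (an illustrative choice, not the book's method):

```python
def split_at_largest_gap(cluster):
    """Split a sorted 1-D cluster at the largest gap between neighbours."""
    vals = sorted(cluster)
    gaps = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return vals[:cut], vals[cut:]

def divisive(points, n_clusters):
    """Start with one cluster holding everything, then repeatedly split
    the cluster with the widest internal gap until n_clusters is reached."""
    clusters = [sorted(points)]
    while len(clusters) < n_clusters:
        widest = max(
            (c for c in clusters if len(c) > 1),
            key=lambda c: max(c[i + 1] - c[i] for i in range(len(c) - 1)),
        )
        clusters.remove(widest)
        clusters.extend(split_at_largest_gap(widest))
    return clusters

print(divisive([1, 2, 3, 10, 11, 30], 3))  # → [[30], [1, 2, 3], [10, 11]]
```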

Figure 2.5 Illustration of K-means clustering.

 Density-Based Clustering: In this type of clustering, the algorithm tries to find areas with a high or low density of observations. Data points with enough neighbors within a specified distance become the cores of a cluster. Other data points either belong to a cluster's border or are considered noise.
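This is the idea behind DBSCAN-style algorithms. A minimal sketch, assuming Euclidean distance; the parameters `eps` (neighborhood radius) and `min_pts` (density threshold) and the sample data are illustrative assumptions:

```python
import math

def neighbours(points, i, eps):
    """Indices of points within distance eps of points[i]."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]

def dbscan(points, eps, min_pts):
    """Dense points seed clusters; sparse points end up as noise (-1)."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(points, i, eps)
        if len(seeds) < min_pts:              # not dense enough: noise
            labels[i] = -1
            continue
        labels[i] = cluster                   # i is a core point: new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # reachable noise becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(points, j, eps)
            if len(more) >= min_pts:          # j is also a core point: expand
                queue.extend(m for m in more if labels[m] is None)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(dbscan(pts, eps=1.5, min_pts=3))  # → [0, 0, 0, 0, -1]
```

The four nearby points form one dense cluster, while the isolated point has too few neighbors and is labeled noise.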

