Intelligent Network Management and Control, Badr Benmammar
1.3.5. Clustering techniques
Clustering techniques organize observed data into groups according to a given similarity or distance measure. Similarity can be measured with the cosine formula, the binary weighted cosine formula proposed by Rawat (2005), or other formulas. The most commonly used clustering procedure selects a representative point for each cluster; each new data point is then assigned to the group whose representative point is closest.
There are at least two approaches to clustering-based anomaly detection. In the first, the detection model is built from unlabeled data containing both normal and attack traffic. In the second, the model is built from normal data only, and a profile of normal activity is created. The first approach rests on the assumption that abnormal or attack data represent a small percentage of the total data. If this assumption holds, anomalies and attacks can be detected from cluster size: large clusters correspond to normal data, and the remaining data points correspond to attacks.
Liao and Vemuri (2002) used the K-nearest neighbor (K-NN) approach, based on the Euclidean distance, to determine which cluster a data point belongs to. The Minnesota intrusion detection system is a network-based anomaly detection approach that uses data mining and clustering techniques (Levent et al. 2004).
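The representative-point procedure and the cluster-size rule can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the function names, the `width` and `min_fraction` parameters, and the sample traffic vectors are all assumptions chosen for illustration. Each point joins the cluster whose representative is nearest, provided it lies within `width`; otherwise it starts a new cluster. Clusters holding less than `min_fraction` of the data are then flagged as anomalous.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity, usable as an alternative distance measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)

def cluster_by_representative(points, width, distance=euclidean):
    """Single pass: assign each point to the nearest representative
    within `width`, or make it the representative of a new cluster."""
    representatives, members = [], []
    for idx, point in enumerate(points):
        best, best_d = None, None
        for c, rep in enumerate(representatives):
            d = distance(point, rep)
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= width:
            members[best].append(idx)
        else:
            representatives.append(point)
            members.append([idx])
    return representatives, members

def flag_small_clusters(points, width, min_fraction, distance=euclidean):
    """Flag points in clusters smaller than `min_fraction` of the data
    (assumption: attacks are a small share of the total traffic)."""
    _, members = cluster_by_representative(points, width, distance)
    flags = [False] * len(points)
    for group in members:
        if len(group) / len(points) < min_fraction:
            for idx in group:
                flags[idx] = True
    return flags

# Hypothetical two-feature traffic records; the last one is an outlier.
traffic = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (1.0, 0.8),
           (0.95, 1.05), (1.05, 1.0), (0.9, 0.9), (1.1, 1.1), (8.0, 9.0)]
flags = flag_small_clusters(traffic, width=1.0, min_fraction=0.2)
print(flags)
```

Passing `distance=cosine_distance` swaps in a cosine-based measure without changing the rest of the procedure. The result depends on the order of the records and on `width`, so in practice both would need tuning on real data.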
Leung and Leckie (2005) proposed an unsupervised anomaly detection approach for network intrusion detection. The proposed algorithm, known as “fpMAFIA”, is a density-based and grid-based clustering algorithm for large data sets. Its major advantage is that it can produce clusters of arbitrary shape and, with appropriate parameter values, cover over 95% of the data set. The authors showed that the algorithm scales linearly with the number of records in the data set. They also evaluated its accuracy and showed that it achieves a reasonable detection rate.
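The text gives no algorithmic detail on fpMAFIA, so the following Python sketch is not that algorithm. It only illustrates the general grid-and-density idea under simplifying assumptions: fixed-width cells, a hypothetical `min_points` density threshold, and made-up sample records. Records are binned into cells in a single pass, cells holding at least `min_points` records are treated as dense, and adjacent dense cells are merged into one cluster, which is how such methods can follow arbitrary shapes. Records left in sparse cells receive the label `-1`.

```python
from collections import defaultdict, deque

def grid_density_clusters(points, cell_size, min_points):
    """Label each point with a cluster id, or -1 if its cell is sparse."""
    # One pass over the records: map each point to its grid cell.
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(c // cell_size) for c in point)
        cells[key].append(idx)

    # A cell is dense when it holds at least `min_points` records.
    dense = {key for key, idxs in cells.items() if len(idxs) >= min_points}

    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first merge of dense cells that share a face.
        queue = deque([start])
        seen.add(start)
        while queue:
            cell = queue.popleft()
            for idx in cells[cell]:
                labels[idx] = cluster_id
            for dim in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        cluster_id += 1
    return labels

# Hypothetical records: two adjacent dense cells and one isolated point.
records = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
           (1.1, 0.2), (1.3, 0.4), (1.2, 0.1),
           (5.5, 5.5)]
labels = grid_density_clusters(records, cell_size=1.0, min_points=3)
print(labels)
```

The binning step touches each record once, which is the part of a grid-based method that grows linearly with the number of records; the merge step works on cells rather than on individual records.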