SCADA Security - Xun Yi
1.3 SIGNIFICANT RESEARCH PROBLEMS
In recent years, many researchers and practitioners have turned their attention to SCADA data to build data‐driven methods that can learn the mechanistic behavior of SCADA systems without knowledge of the physical behavior of these systems. Such methods have shown a promising ability to detect anomalies, malfunctions, or faults in SCADA components. Nonetheless, developing unsupervised SCADA data‐driven detection methods remains a relatively open research area. Such methods are time‐ and cost‐efficient because they learn from unlabeled data; however, they often have low detection accuracy. This book focuses on the design of an efficient and accurate unsupervised SCADA data‐driven IDS, and four main research problems are formulated here for this purpose. Three of these pertain to the development of methods used to build a robust unsupervised SCADA data‐driven IDS. The remaining problem relates to the design of a framework for a SCADA security testbed that is intended to be an evaluation and testing environment for SCADA security in general and for the proposed unsupervised IDS in particular.
1 How to design a SCADA‐based testbed that is a realistic alternative to real SCADA systems so that it can be used for proper SCADA security evaluation and testing. Evaluating the security solutions of SCADA systems is important. However, actual SCADA systems cannot be used for this purpose because availability and performance, which are the most important issues, are likely to be affected when analysing vulnerabilities, threats, and the impact of attacks. To address this problem, “real SCADA testbeds” have been set up for evaluation purposes, but they are costly and beyond the reach of most researchers. Small real SCADA testbeds have also been set up; however, they are still proprietary and location‐constrained, so such labs are not available to most researchers and practitioners interested in working on SCADA security. Hence, the design of a SCADA‐based testbed will be very useful for evaluation and testing purposes. Two essential parts are considered here: SCADA system components and a controlled environment. For the former, both high‐level and field‐level components are considered, and a real SCADA protocol is integrated to produce realistic SCADA network traffic. For the latter, it is important to model a controlled environment, such as smart grid power or water distribution systems, so that realistic SCADA data can be produced.
2 How to make an existing suitable data mining method deal with large high‐dimensional data. Because the IDS is unsupervised, it is designed here using SCADA data‐driven methods learned from unlabeled SCADA data, which is highly expected to contain anomalous data; the task is to give an anomaly score to each observation. The k‐Nearest Neighbour (k‐NN) algorithm was identified, through an extensive literature review, as one of the top ten algorithms in data mining in general (Wu et al., 2008), and, in particular, it has demonstrated promising results in anomaly detection (Chandola et al., 2009). This is because an anomalous observation is assumed to have a neighborhood in which it stands out, while a normal observation has a neighborhood in which all its neighbors are similar to it. However, having to examine all observations in a data set in order to find the k‐NN of an observation is the main drawback of this method, especially with a vast amount of high‐dimensional data. Reducing the computation time of the k‐NN search, so that this method can be used efficiently, is the aim of this research problem.
3 How to learn clustering‐based proximity rules from unlabeled SCADA data for SCADA anomaly detection methods. To build efficient SCADA data‐driven detection methods, the efficient k‐NN algorithm proposed for problem 2 is used to assign an anomaly score to each observation in the training data set. However, it is impractical to use all the training data in the anomaly detection phase. This is because a large memory capacity is needed to store all scored observations, and it is computationally infeasible to compute the similarity between these observations and each new observation. Therefore, it would be ideal to efficiently separate the observations that are highly expected to be consistent (normal) from those that are highly expected to be inconsistent (abnormal). Then, a few proximity detection rules for each behavior, whether consistent or inconsistent, are automatically extracted from the observations that belong to that behavior.
4 How to compute a global and efficient anomaly threshold for unsupervised detection methods. Anomaly‐scoring‐based and clustering‐based methods are among the best‐known methods used to identify anomalies in unlabeled data. With anomaly‐scoring‐based methods (Eskin et al., 2002; Angiulli and Pizzuti, 2002; Zhang and Wang, 2006), all observations in a data set are given an anomaly score, and actual anomalies are assumed to have the highest scores. The key problem is how to find the near‐optimal cut‐off threshold that minimizes the false positive rate while maximizing the detection rate. On the other hand, clustering‐based methods (Portnoy et al., 2001; Mahoney and Chan, 2003a; Jianliang et al., 2009; Münz et al., 2007) group similar observations together into a number of clusters. Anomalies are identified on the assumption that anomalous observations are outliers: either they are not assigned to any cluster, or they are grouped in small clusters whose characteristics differ from those of normal clusters. However, the detection of anomalies is controlled through several parameter choices within each detection method. For instance, suppose that the top 50% of the observations with the highest anomaly scores are assumed to be anomalies. In this case, both the detection and false positive rates will be much higher. Similarly, labeling only a low percentage of the largest clusters as normal in clustering‐based intrusion detection methods will result in higher detection and false positive rates. Therefore, the effectiveness of unsupervised intrusion detection methods is sensitive to parameter choices, especially when the boundaries between normal and abnormal behavior are not clearly distinguishable.
Thus, it would be interesting to identify the observations whose anomaly scores are extreme and deviate significantly from the others, and to assume that such observations are “abnormal”. Conversely, the observations whose anomaly scores are significantly distant from those of the “abnormal” ones are assumed to be “normal”. Ensemble‐based supervised learning is then proposed to find a global and efficient anomaly threshold using the information from both the “normal” and “abnormal” behavior.
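The brute‐force k‐NN scoring described in problem 2 and the cut‐off sensitivity described in problem 4 can be illustrated with a minimal Python sketch. This is not the efficient method the book goes on to propose. It is the naive baseline that examines every observation, followed by a "top fraction" cut‐off. The function names (`knn_scores`, `rates_at_cutoff`, `make_demo_data`), the choice of mean distance to the k nearest neighbours as the anomaly score, and the toy data with known labels are all assumptions made for illustration only.

```python
import math
import random


def knn_scores(points, k):
    """Brute-force k-NN anomaly score (assumed here to be the mean
    Euclidean distance to the k nearest neighbours).
    Every other observation is examined, so the cost is O(n^2)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def rates_at_cutoff(scores, labels, top_fraction):
    """Flag the top_fraction highest-scoring observations as anomalies and
    return (detection_rate, false_positive_rate).
    labels: 1 = anomaly, 0 = normal (known only in this toy example)."""
    n_flag = max(1, round(top_fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = order[:n_flag]
    true_pos = sum(labels[i] for i in flagged)
    false_pos = n_flag - true_pos
    n_anomalies = sum(labels)
    n_normals = len(labels) - n_anomalies
    return true_pos / n_anomalies, false_pos / n_normals


def make_demo_data(seed=0):
    """Hypothetical toy data: 95 normal points in a small square around the
    origin and 5 isolated anomalies placed far away from everything else."""
    rng = random.Random(seed)
    normal = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(95)]
    anomalies = [(10, 10), (-10, 10), (10, -10), (-10, -10), (0, 15)]
    points = normal + anomalies
    labels = [0] * len(normal) + [1] * len(anomalies)
    return points, labels


if __name__ == "__main__":
    points, labels = make_demo_data()
    scores = knn_scores(points, k=5)
    for fraction in (0.02, 0.05, 0.50):
        detection, false_pos = rates_at_cutoff(scores, labels, fraction)
        print(f"top {fraction:.0%}: detection={detection:.2f}, "
              f"false positives={false_pos:.2f}")
```

In this toy setting the anomalies are well separated, so a cut‐off that matches their true proportion flags exactly the anomalies, a tighter cut‐off misses some of them, and a much looser one, such as the top 50%, flags many normal observations as well. With real unlabeled SCADA data the true proportion is unknown, which is why the choice of threshold is posed as a research problem.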