Artificial Intelligence and Data Mining Approaches in Security Frameworks

2.2 Data Mining Techniques and Their Role in Classification and Detection


Worms are malware programs that replicate themselves in order to spread from one computer to another. Malware comprises adware, worms, Trojan horses, computer viruses, spyware, keyloggers, HTTP worms, UDP worms, port-scan worms, remote-to-local worms, user-to-root worms, and other malicious code (Herzberg, Gbara, 2004). Attackers write these programs for various reasons, such as:

i) Interrupting computer processes

ii) Gathering sensitive information

iii) Gaining entry to a private system

Detecting a worm on the internet is very important for two reasons:

i) Worms create vulnerable points

ii) They can degrade system performance

Therefore, it is important to notice a worm at the onset and categorize it with the help of data mining classification algorithms, such as Bayesian networks, Random Forest, and Decision Trees (Rathore et al., 2013). An underlying principle is that most worm detection techniques can build on an intrusion detection system (IDS). It is very difficult to predict what form a worm will take next.

Automatic detection of a worm in a system therefore remains a challenge.

Intrusion detection systems can be broadly classified into two types:

i) Network-based: network packets are inspected before they have spread to an end-host

ii) Host-based: network packets are inspected after they have already spread to the end-host

Furthermore, host-based IDSs concentrate on encoded network packets in order to catch internet worms, whereas network-based detection pays attention to traffic behaviour by examining packets without encoding. Numerous machine learning techniques have been applied to worm and intrusion detection, so data mining and machine learning play an essential role in worm detection systems, and many intrusion detection models have been proposed using various data mining schemes. Decision Trees and Genetic Algorithms can learn abnormal and usual patterns from a training set; the generated classifiers then label test data as Normal or Abnormal, and the label "Abnormal" indicates the presence of an intrusion.
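As an illustration of this train-then-label workflow, the sketch below fits a classifier on labelled traffic records and tags an unseen record as Normal or Abnormal. The two features and the nearest-centroid rule are illustrative stand-ins, not taken from the chapter, which uses decision trees and genetic algorithms for this step:

```python
def centroid(rows):
    """Mean of each feature over the training rows."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def label(x, centroids):
    """Label x with the class whose centroid is nearest (squared Euclidean)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda cls: dist(x, centroids[cls]))

# Hypothetical training set: (mean packet size, connections/sec) -> class.
train = {
    "Normal":   [(500, 2), (520, 3), (480, 1)],
    "Abnormal": [(60, 90), (80, 120), (40, 100)],  # worm-like scanning traffic
}
centroids = {cls: centroid(rows) for cls, rows in train.items()}
print(label((70, 110), centroids))  # -> Abnormal: flags a possible intrusion
```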

a) Decision Trees

One of the most popular machine learning techniques is Quinlan's decision tree. The tree is constructed from decision nodes and leaf nodes using a divide-and-conquer technique (Rathore et al., 2013). Each decision node tests a condition on the attributes of the input data, with one branch for each possible outcome of the test; a leaf node represents the result of a decision. Consider a training dataset T and a set of n classes {C1, C2, ..., Cn}. If T contains only cases belonging to a single class, it is treated as a leaf; T is also treated as a leaf if it is empty, with no cases. Otherwise a test with k outcomes partitions T into k subsets {T1, T2, ..., Tk}, and the process is repeated recursively over each Tj, where 1 <= j <= k, until every subset belongs to a single class. While constructing the decision tree, the best attribute is chosen for each decision node. The C4.5 decision tree adopts the Gain Ratio criterion, which selects the attribute that provides the maximum information gain while reducing the bias toward tests with many outcomes. The built tree is then used to classify test data whose features are the same as those of the training data: starting from the root node, the test is applied and the branch corresponding to the outcome is followed to a child; this is repeated until the child is a leaf, and the class of that leaf is assigned to the test case.
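The Gain Ratio criterion adopted by C4.5 can be computed directly. The following sketch scores each attribute of a hypothetical two-attribute traffic dataset and picks the one that would be placed at the decision node:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """Gain ratio of splitting (rows, labels) on the attribute at attr_index."""
    n = len(rows)
    # Partition the labels by the attribute's value (the subsets T1..Tk).
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    # Information gain = entropy(T) minus the weighted entropy of the subsets.
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    # Split info penalises tests with many outcomes (C4.5's bias correction).
    split_info = -sum((len(p) / n) * math.log2(len(p) / n) for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical records: (protocol, port class) -> Normal/Abnormal.
rows = [("tcp", "high"), ("tcp", "low"), ("udp", "high"), ("udp", "low")]
labels = ["Abnormal", "Normal", "Abnormal", "Normal"]
best = max(range(2), key=lambda i: gain_ratio(rows, labels, i))
print(best)  # -> 1: the port-class attribute separates the classes perfectly
```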

b) Genetic Algorithms (GA)

Genetic algorithms are a machine learning approach that solves problems by imitating biological evolution. A genetic algorithm optimises a population of candidate solutions. Its genetic operators, i.e., selection, crossover and mutation, act on data structures modelled on chromosomes (Fu et al., 2006). In the beginning, a population of chromosomes is generated at random; this population represents all possible candidate solutions to the problem. The positions of a chromosome, called "genes", can be encoded as numbers, characters or bits. A fitness function evaluates the goodness of each chromosome with respect to the desired solution. The crossover operator simulates natural reproduction, the mutation operator simulates mutation of the species, and the selection operator chooses the fittest chromosomes (Manek et al., 2016). Figure 2.2 shows the operation of a genetic algorithm. The following three factors must be considered before using a genetic algorithm to solve a problem:

Figure 2.2 Flowchart of genetic algorithm.

1 Fitness function

2 Representation of individuals

3 Genetic algorithm parameters

A genetic algorithm-based method can be used to design an artificial immune system. Using this method, Bin et al. proposed an approach to smartphone malware detection (Wu et al., 2015) in which static and dynamic signatures of malware are extracted to obtain the malicious scores of tested samples.
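The loop of Figure 2.2, random initialisation followed by repeated selection, crossover and mutation, can be sketched as below. The fitness function (counting 1-bits, the classic OneMax toy problem) and the parameter values are illustrative assumptions, not taken from the chapter:

```python
import random

random.seed(42)

GENES = 16            # bits per chromosome
POP_SIZE = 20         # population size (illustrative parameter)
MUTATION_RATE = 0.02  # per-gene mutation probability (illustrative)
GENERATIONS = 60

def fitness(chrom):
    """Toy fitness function: the number of 1-bits in the chromosome."""
    return sum(chrom)

def select(pop):
    """Tournament selection: the fitter of two random chromosomes survives."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    """Single-point crossover, simulating natural reproduction."""
    point = random.randrange(1, GENES)
    return p1[:point] + p2[point:]

def mutate(chrom):
    """Flip each gene with probability MUTATION_RATE."""
    return [g ^ 1 if random.random() < MUTATION_RATE else g for g in chrom]

# Random initial population of candidate solutions.
pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP_SIZE)]

best = max(pop, key=fitness)
print(fitness(best))  # converges toward the all-ones optimum
```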

c) Random Forest

Random Forest is a classification algorithm that uses a collection of tree-structured classifiers. The winning class is chosen on the basis of the votes cast by the individual trees of the forest. Each tree is constructed from a random sample of the training dataset. The dataset is therefore divided into a training set, which comprises the major portion, and a test set, which holds the minor portion. Tree construction requires the following steps:

1 A sample of N cases is selected at random from the original dataset; this sample is the training set for growing the tree.

2 From the M input variables, m variables are selected at random; the value of m is held constant while the forest is grown.

3 Each tree is grown to the maximum possible size; no trimming or pruning is required.

4 All the classification trees are combined to form the random forest.

Random forests mitigate the problem of overfitting on large datasets and train and test quickly on complex datasets, which is why they are sometimes referred to as an operational data mining technique.

Because each classification tree casts a vote for a class, the solution class is the one that receives the maximum number of votes.
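The four steps above can be sketched in miniature as follows. Each "tree" here is a depth-1 threshold stump rather than a full unpruned tree, and the two-feature dataset and parameter choices are illustrative assumptions:

```python
import random
from collections import Counter

random.seed(0)

def train_stump(sample):
    """One 'tree' of the forest: a threshold test on a single feature chosen
    at random (m = 1 of the M = 2 input variables). Real random forests grow
    full unpruned trees; a stump keeps the sketch short."""
    feat = random.randrange(2)
    best_t, best_correct = None, -1
    for x, _ in sample:
        t = x[feat]
        correct = sum((xi[feat] >= t) == (yi == 1) for xi, yi in sample)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return feat, best_t

def predict_forest(forest, x):
    """Step 4: every tree casts a vote; the class with most votes wins."""
    votes = Counter(1 if x[feat] >= t else 0 for feat, t in forest)
    return votes.most_common(1)[0][0]

# Toy dataset: class 1 (worm-like sample) has both features high.
data = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.1, 0.2), 0), ((0.2, 0.1), 0)]

# Steps 1-3: grow each tree on a bootstrap sample of N cases, unpruned.
forest = [train_stump([random.choice(data) for _ in data]) for _ in range(25)]
print(predict_forest(forest, (0.85, 0.85)))  # majority votes for class 1
print(predict_forest(forest, (0.15, 0.15)))  # majority votes for class 0
```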

d) Association-rule mining

Association-rule mining is used to find interesting relationships among a set of attributes in datasets (Dwork et al., 2006). An association rule can be defined as an inter-relationship within a dataset. It is very helpful for building strategic decisions about actions such as shelf management, promotional pricing, and many more (Jackson et al., 2007). Earlier, association rule mining involved a data analyst whose task was to discover patterns or association rules in the dataset given to him (Rathore, 2017). Sophisticated analysis of extremely large datasets can be performed in a cost-effective manner (Tseng et al., 2016), but there may be a data security risk (Beaver et al., 2009) for the data owner, because the data miner can mine sensitive information (Bhargava et al., 2017). Nowadays, association rule mining is extensively used for pattern discovery in knowledge data discovery (KDD). The association rule mining (ARM) problem can be solved by navigating the items in a database with various algorithms, depending on the user's requirements (Patel et al., 2014). ARM algorithms can be broadly classified into DFS (Depth First Search) and BFS (Breadth First Search) approaches, according to how they traverse the search space (Stanley, 2013). Each of these is further divided into intersecting and counting methods, based on how itemsets and their support values are determined. Apriori, Apriori-DIC and Apriori-TID are BFS algorithms with a counting strategy, whereas the Partition algorithm is a BFS algorithm with an intersecting strategy. The Equivalence Class Clustering and bottom-up Lattice Traversal (ECLAT) algorithm works on the intersecting strategy with DFS, while DFS with a counting strategy comprises the FP-Growth algorithm (Yeung, Ding, 2003), (Bloedorn et al., 2003). These algorithms can be specifically optimised for speed (Barrantes et al., 2001), (Reddy et al., 2011).

Breadth First Search (BFS) with counting occurrences: the most eminent algorithm in this group is the Apriori algorithm. It exploits the downward closure property of itemsets by clipping candidates that have infrequent subsets before counting their support. Two important parameters are measured when evaluating association rules: support and confidence. In BFS, the support values of all subsets of a candidate are known in advance, which makes the desired optimisation possible. The main drawback of this approach is the increased computational complexity when rules are extracted from a large database. Fast Distributed Mining (FDM) (Lee et al., 1999) is an improved, distributed and unsecured form of the Apriori algorithm. Advances in data mining techniques enable organisations to use data more competently.
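Apriori's count-then-prune cycle, together with the support and confidence measures, can be sketched as follows; the market-basket transactions are a hypothetical toy example:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent itemsets via Apriori: candidates of cardinality k are counted
    in one scan of the database, and any candidate with an infrequent subset
    is pruned first (the downward closure property)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    k, candidates = 1, [frozenset([i]) for i in items]
    while candidates:
        # One scan of the database counts all candidates of cardinality k.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Generate (k+1)-candidates, pruning those with an infrequent k-subset.
        candidates = []
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k + 1 and union not in candidates \
                    and all(frozenset(s) in level for s in combinations(union, k)):
                candidates.append(union)
        k += 1
    return frequent

transactions = [frozenset(t) for t in
                [{"bread", "milk"}, {"bread", "butter"},
                 {"bread", "milk", "butter"}, {"milk"}]]
freq = apriori(transactions, min_support=0.5)
print(freq[frozenset({"bread", "milk"})])  # support = 2/4 = 0.5
# Confidence of {bread} -> {milk} = support(bread, milk) / support(bread).
conf = freq[frozenset({"bread", "milk"})] / freq[frozenset({"bread"})]
print(conf)
```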

In Apriori, the candidates of cardinality k can be counted in a single scan of a large database. The most important limitation of the Apriori algorithm is looking up the candidates in each transaction; a hash tree structure is used for this purpose (Jacobsan et al., 2014). Apriori-TID, an extension of Apriori, represents each transaction in terms of the current candidates, whereas plain Apriori works on the raw database. Apriori and Apriori-TID combined form Apriori-Hybrid. Apriori-DIC uses a prefix tree to relax the separation between the counting and candidate-generation steps of Apriori.
