3.3.1 Description
The implementation of our work starts with gene expression data, which is mathematically viewed as a set G = {g1, g2, g3, …, ga}. Each member gi of G can be further expressed as gi = {gi1, gi2, gi3, …, gib}. Thus, the entire expression G can be considered a vector of vectors, where each gi is a vector (gene) comprising features. More specifically, the entire dataset is represented in matrix form of dimension a × b, where a is the number of genes, b is the number of samples, and a >> b. Thus, the number of samples is very small compared to the number of genes, which are treated as features. Two separate sets of data, belonging to two different states (normal and carcinogenic), are taken here for generating the result of the proposed method. The whole dataset is represented mathematically as a set G = {GN, GC}, where GN represents the normal or non-cancerous state and GC the cancerous state. Two such datasets, viz., lung and colon, each pertaining to non-cancerous and cancerous states, are studied to obtain the experimental result, i.e., the set of genes whose mutations have been observed.
In the present work, PCA is applied to subgroup the variables while preserving as much of the information present in the complete data as possible, and also to speed up the ML algorithm. The gene expression data G is again denoted as G = {g1, g2, g3, …, ga}. The dataset used here belongs to two states, i.e., normal and cancerous, and is processed to determine the genes associated with cancer. The proposed algorithm works by reducing the number of features using PCA and then applying the LR model to both datasets.
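The following minimal Python sketch illustrates this data representation; the file names, the use of pandas/NumPy, and the label encoding (0 for normal, 1 for cancerous) are assumptions for illustration, not details specified in the chapter.

```python
# Minimal sketch of the a x b gene expression representation described above.
# File names and the label encoding are hypothetical.
import numpy as np
import pandas as pd

# Each state is an a x b matrix: a genes (rows) and b samples (columns), a >> b.
G_N = pd.read_csv("normal_expression.csv", index_col=0)      # hypothetical normal-state file
G_C = pd.read_csv("cancerous_expression.csv", index_col=0)   # hypothetical cancerous-state file

# Combine the two states into one dataset G = {GN, GC}; samples become rows
# so that genes act as features for the ML model.
X = np.hstack([G_N.values, G_C.values]).T                # shape: (samples, genes)
y = np.array([0] * G_N.shape[1] + [1] * G_C.shape[1])    # 0 = normal, 1 = cancerous
```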
Here, LR is applied because the dependent variable (target) is categorical in nature. A threshold value is considered, which helps to predict the class membership of a data point: on the basis of the set threshold, the predicted probability is used for classification. If the calculated predicted value is found to be ≥ the threshold limit, the gene is said to be cancerous in nature; otherwise, it is non-cancerous. Considering x as the independent variable and y as the dependent variable in our LR model, the hypothesis function hθ(x) must range between 0 and 1. Since the model works as a binary classifier, the result of the prediction is either y = 0 or y = 1. An unconstrained linear hypothesis θᵀx can actually take values < 0 or > 1, so the logistic classification used in the method constrains the hypothesis to 0 ≤ hθ(x) ≤ 1.
Since, in our Logistic Regression model, we want 0 ≤ hθ(x) ≤ 1, the hypothesis function is expressed as

hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx)) (3.1)

Replacing θᵀx by t, the equation becomes

h = 1 / (1 + e^(−t)) (3.2)

The above expression is known as the sigmoid (logistic) function

g(t) = 1 / (1 + e^(−t)) (3.3)

If t proceeds toward positive infinity, the predicted variable Y becomes 1; if t moves toward negative infinity, the prediction of Y becomes 0. Mathematically, this can be written as

lim(t→+∞) g(t) = 1, lim(t→−∞) g(t) = 0 (3.4)

The hypothesis hθ(x) is interpreted as the probability that y = 1 for a given x, parameterized by θ:

hθ(x) = P(y = 1 | x; θ) (3.5)

P(y = 1 | x; θ) + P(y = 0 | x; θ) = 1 (3.6)
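As a small illustration of the hypothesis and the threshold rule described earlier, the following Python sketch computes hθ(x) via the sigmoid and assigns a class; the threshold value of 0.5 is only an illustrative default, not a value fixed by the chapter.

```python
import numpy as np

def sigmoid(t):
    """Sigmoid (logistic) function g(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_class(theta, x, threshold=0.5):
    """Return 1 (cancerous) if h_theta(x) >= threshold, else 0 (non-cancerous).

    The 0.5 threshold is an illustrative assumption; the chapter treats the
    threshold as a chosen parameter.
    """
    h = sigmoid(np.dot(theta, x))   # h_theta(x) = g(theta^T x), bounded in (0, 1)
    return int(h >= threshold)
```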
While implementing the proposed method, a subset of r genes is selected at random. This r × n dataset (r denotes the number of genes and n the number of samples) is partitioned into two sets, i.e., train and test: a certain percentage, say p%, of the data is chosen as the training set and the rest is used as the test set.
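A possible sketch of this random gene selection and train/test partition, assuming the X and y arrays from the earlier sketch and scikit-learn's train_test_split; the values of r and p are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

r = 100    # number of genes chosen at random (illustrative)
p = 0.8    # fraction of samples used for training (illustrative p%)

rng = np.random.default_rng()
gene_idx = rng.choice(X.shape[1], size=r, replace=False)   # pick r genes at random
X_r = X[:, gene_idx]                                       # r-gene view of the data

# Split samples into training and test sets; stratifying on y keeps the
# normal/cancerous proportions similar in both sets (a design choice here).
X_train, X_test, y_train, y_test = train_test_split(
    X_r, y, train_size=p, stratify=y, random_state=0)
```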
Features need to be scaled before applying PCA. StandardScaler is used here to standardize the features of the available dataset onto unit scale (mean 0 and variance 1). PCA is then applied with α as the components parameter, which means that scikit-learn chooses the minimum number of principal components such that α% of the variance is retained.
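This step could be sketched as follows with scikit-learn, where passing a float α between 0 and 1 as n_components makes PCA keep the minimum number of components that retain that fraction of the variance; the value of α and the reuse of X_train/X_test from the previous sketch are illustrative assumptions.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

alpha = 0.95    # retain 95% of the variance (illustrative)

scaler = StandardScaler()                 # standardize: mean 0, variance 1 per feature
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)     # reuse the training statistics on the test set

pca = PCA(n_components=alpha)             # float in (0, 1): keep that fraction of variance
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
```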
After applying PCA, the selected genes are fitted to the LR model. The test data and the predicted values are compared, and an accuracy score is calculated. To obtain genes with a good accuracy score, PCA and LR are fitted at each iteration step, and every time r random genes are selected. After completion of the iterative process, the final list is sorted in descending order of the calculated accuracy score and the top genes are selected.
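Combining the previous steps, one plausible sketch of the iterative procedure is given below; the number of iterations M is illustrative, and ranking the examined gene combinations by test accuracy is one reading of the description rather than the authors' exact implementation.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

M = 500        # number of random gene combinations examined (illustrative)
results = []   # (accuracy, gene indices) for each combination

for _ in range(M):
    # Select r random genes and split the samples into train and test sets.
    gene_idx = rng.choice(X.shape[1], size=r, replace=False)
    X_r = X[:, gene_idx]
    X_train, X_test, y_train, y_test = train_test_split(
        X_r, y, train_size=p, stratify=y)

    # Standardize, reduce with PCA, then fit the LR model.
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=alpha),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # Compare predictions against the test data and record the accuracy.
    acc = accuracy_score(y_test, model.predict(X_test))
    results.append((acc, gene_idx))

# Sort the examined combinations in descending order of accuracy and
# keep the genes of the best-scoring one (one possible reading).
results.sort(key=lambda item: item[0], reverse=True)
top_genes = results[0][1]
```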
From the “a” genes, aCr different combinations can be formed by selecting r genes at random. Our algorithm works on M such combinations.
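For reference, the count of possible combinations aCr can be computed directly; the values of a and r below are purely illustrative.

```python
import math

a, r = 2000, 100                       # illustrative counts of genes and selected genes
total_combinations = math.comb(a, r)   # aCr = a! / (r! * (a - r)!)
# The algorithm examines only M of these combinations, chosen at random.
```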