Machine Learning Techniques and Analytics for Cloud Security
3.3.5 Illustration
The available dataset exists in two states, GN and GC, where GN denotes the non-cancerous dataset and GC denotes the cancerous dataset. The designed algorithm is evaluated on both the lung and colon datasets. GN and GC are combined into a single dataset G, which is then transposed so that rows become columns and columns become rows. A target variable Y is chosen, and the dataset is divided into dependent (Y) and independent (X) data.
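The preparation step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy matrices are random, the original layout is assumed to be genes as rows and samples as columns, and Y is assumed to be a class label (0 for non-cancerous, 1 for cancerous). The text fixes none of these details, and the names `G_N`, `G_C`, and `n_genes` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 6  # hypothetical toy size

# Assumed layout: rows are genes, columns are samples.
G_N = rng.normal(size=(n_genes, 4))  # 4 non-cancerous samples
G_C = rng.normal(size=(n_genes, 3))  # 3 cancerous samples

# Combine both states into one dataset G, then transpose it so that
# each row is a sample and each column is a gene.
G = np.hstack([G_N, G_C]).T

# Independent data X: the gene expression values.
X = G

# Dependent data Y: assumed class label per sample (0 = GN, 1 = GC).
Y = np.concatenate([
    np.zeros(G_N.shape[1], dtype=int),
    np.ones(G_C.shape[1], dtype=int),
])

print(X.shape, Y.shape)
```

After the transpose, each gene is a column of X, which is what allows groups of genes to be picked as features in the later steps.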
In each of M iterations, a group of five genes is selected at random from the independent (X) data. These five genes become the new X, i.e., the independent data, while Y, i.e., the dependent data, remains the same as before. The resulting X and Y data are then divided into training and test sets in an 80:20 ratio.
After the split, the features are scaled to unit scale. PCA is then applied to the training and test data of X, retaining 95% of the variance. Next, LR is fitted on the training data of X and Y, and predicted values are computed from the test data of X. Finally, the accuracy score is calculated by comparing the test data of Y with the predicted values. If the accuracy exceeds 85%, the genes in that group are considered cancer mediating genes and are stored in a new list as the result set.
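The iterative procedure described above could look roughly like the following scikit-learn sketch. Several points are assumptions rather than details from the text: LR is taken to mean logistic regression, "unit scale" is read as standardization with `StandardScaler`, and the scaler and PCA are fitted on the training split and then applied to the test split. The function names, the toy data, and the default of 50 iterations are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def score_gene_group(X, y, gene_idx, seed=0):
    """Return the test accuracy obtained with one group of genes."""
    X_sub = X[:, gene_idx]
    # 80:20 split into training and test data.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sub, y, test_size=0.2, random_state=seed
    )
    # Scale features (assumed: zero mean, unit variance).
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    # Keep enough principal components to retain 95% of the variance.
    pca = PCA(n_components=0.95, svd_solver="full").fit(X_tr)
    X_tr, X_te = pca.transform(X_tr), pca.transform(X_te)
    # Assumed: LR stands for logistic regression.
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


def find_candidate_genes(X, y, n_iter=50, group_size=5, threshold=0.85, seed=0):
    """Run n_iter random draws and keep groups whose accuracy exceeds threshold."""
    rng = np.random.default_rng(seed)
    result = []
    for i in range(n_iter):
        gene_idx = rng.choice(X.shape[1], size=group_size, replace=False)
        acc = score_gene_group(X, y, gene_idx, seed=i)
        if acc > threshold:
            result.append((sorted(gene_idx.tolist()), acc))
    return result


# Hypothetical toy data: 80 samples, 30 genes; genes 0-4 carry class signal.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 30))
X[:, :5] += 4.0 * y[:, None]

informative_acc = score_gene_group(X, y, [0, 1, 2, 3, 4])
candidates = find_candidate_genes(X, y)
print(informative_acc, len(candidates))
```

Fitting the scaler and PCA on the training split only is a design choice of this sketch that keeps test information out of the fitted transforms; the text itself does not say which split they are fitted on.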