Machine Learning Techniques and Analytics for Cloud Security - Group of authors - Page 59

3.3.4 Interpretation of the Algorithm

The pictorial representation of the algorithm gives a clear idea of the working model (Figure 3.2). The proposed algorithm works on a gene expression dataset covering both normal and cancerous states, which is available as a matrix whose rows are genes and whose columns are samples. The matrix is transposed so that samples become rows and genes become columns, and this transposed matrix is given as the input. The whole dataset is partitioned into two parts: one for training and the other for testing. The model is trained on the training data, and the test data is then used to measure the correctness of the model. The data is divided with a test ratio of 0.2, meaning that 80% of the data is used to train the model whereas 20% is used to test the same model.
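The data preparation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `transpose_and_split`, the `seed` parameter, and the toy matrix are hypothetical, and the text does not say whether the split is random or stratified, so a plain random shuffle is assumed here.

```python
import numpy as np

def transpose_and_split(expr_genes_by_samples, labels, test_ratio=0.2, seed=0):
    """Transpose a genes x samples matrix and split it into train and test parts."""
    # Rows become samples and columns become genes, as described in the text.
    X = np.asarray(expr_genes_by_samples, dtype=float).T
    y = np.asarray(labels)
    if X.shape[0] != y.shape[0]:
        raise ValueError("need exactly one label per sample")

    # Assumption: a simple random shuffle decides which samples are held out.
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    n_test = int(round(test_ratio * X.shape[0]))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (not from the book): 50 genes measured on 10 samples.
toy_rng = np.random.default_rng(1)
expr = toy_rng.normal(size=(50, 10))
labels = np.array([0, 1] * 5)  # 0 = normal, 1 = cancerous (hypothetical coding)

X_train, X_test, y_train, y_test = transpose_and_split(expr, labels)
print(X_train.shape, X_test.shape)
```

With a test ratio of 0.2 and 10 samples, 8 samples go to training and 2 to testing, each described by all 50 genes.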

The gene expression values in the dataset vary widely in scale. The numerical columns of the dataset need to be brought to a common scale without distorting the differences in their ranges of values; therefore, standardization is used. Standardization is a form of scaling in which the values of each column are centered on the mean and divided by the standard deviation. To start the iterative process, a set of “r” genes is selected at random, and these genes are passed on to train the model. Once selected, these genes are marked so that they will not be selected again in later iterations. In the dataset, the number of features (genes) is very large compared to the number of samples. To reduce the curse of dimensionality, PCA is applied to the selected “r” genes so that a certain percentage, say α%, of the variance is retained. After PCA is applied, the transformed data is used to train the LR model. After training, the test data is used to predict the outcome. These predicted outcomes are then checked against the actual outcomes of the test data, and the accuracy is calculated. If the accuracy is found to be more than 85%, these genes are extracted and stored in a list of candidate genes. The entire process is repeated until all the genes in the dataset have been marked as selected and the accuracy is found to be considerable.
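The iterative procedure above can be sketched with scikit-learn as follows. This is a minimal sketch and not the authors' implementation. It assumes that "LR" stands for logistic regression, that α is passed as a fraction (`alpha=0.95` for 95%), and that the scaler is fitted on the training data only. The function name `select_candidate_genes`, the default values, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def select_candidate_genes(X_train, y_train, X_test, y_test,
                           r=20, alpha=0.95, threshold=0.85, seed=0):
    """Return indices of genes whose random batch scored above `threshold`."""
    rng = np.random.default_rng(seed)

    # Standardize each gene: zero mean and unit standard deviation.
    scaler = StandardScaler().fit(X_train)
    Xs_train, Xs_test = scaler.transform(X_train), scaler.transform(X_test)

    # Shuffling once and walking through chunks of size r is equivalent to
    # drawing r unmarked genes at random and marking them after each round.
    shuffled = rng.permutation(X_train.shape[1])
    candidates = []
    for start in range(0, len(shuffled), r):
        genes = shuffled[start:start + r]

        # Keep as many principal components as needed for a fraction alpha
        # of the variance (a float n_components needs svd_solver="full").
        pca = PCA(n_components=alpha, svd_solver="full").fit(Xs_train[:, genes])

        # Assumption: LR is logistic regression.
        model = LogisticRegression(max_iter=1000)
        model.fit(pca.transform(Xs_train[:, genes]), y_train)
        accuracy = model.score(pca.transform(Xs_test[:, genes]), y_test)

        if accuracy > threshold:
            candidates.extend(genes.tolist())
    return candidates

# Synthetic data (not from the book): 60 samples, 40 genes, of which the
# first 10 genes are shifted upward in class 1.
data_rng = np.random.default_rng(42)
y = np.array([0, 1] * 30)
X = data_rng.normal(size=(60, 40))
X[:, :10] += 3.0 * y[:, None]
X_train, X_test, y_train, y_test = X[:48], X[48:], y[:48], y[48:]

candidates = select_candidate_genes(X_train, y_train, X_test, y_test, r=10)
print(sorted(candidates))
```

Because every gene falls into exactly one batch, no gene is evaluated twice, and the loop ends once all genes have been marked. Which genes end up as candidates depends on how the random batches happen to be composed, so the result can change with `seed` and `r`.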
