Machine Learning Techniques and Analytics for Cloud Security - Group of authors - Page 64
3.4.3 Result Set Validation
The genes in the generated result sets for the lung and colon datasets that are correlated with cancer have been validated biologically against the NCBI database. NCBI provides a gene database (http://www.ncbi.nlm.nih.gov/Database) from which the list of disease-mediating genes for a specific disease can be obtained, ordered by relevance. The algorithm selected 886 genes for lung cancer and 207 genes for colon cancer as mutated genes. For the lung expression data, we compared the selected set with 1,067 genes from NCBI and found 102 genes common to both sets. We call these true positive (TP) genes (Figure 3.4). The remaining 784 (886 − 102) selected genes do not appear in the NCBI list and are denoted false positives (FP), while the 965 (1,067 − 102) NCBI genes that the algorithm did not select are false negatives (FN). Likewise, for the colon data, the NCBI database lists 1,223 genes and our algorithm identified 207. Of these, 85 genes matched and are marked as TP, 1,138 (1,223 − 85) genes are FN, and 122 (207 − 85) genes are FP (Figure 3.3).
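The comparison described above amounts to a set intersection and two set differences. The following is a minimal sketch, not the authors' implementation: the function names and the toy gene symbols `G1` to `G5` are hypothetical, and only the set sizes (886, 1,067, 102, 207, 1,223, 85) come from the text.

```python
def compare_gene_sets(selected, reference):
    """Split two collections of gene symbols into TP, FP and FN sets."""
    selected, reference = set(selected), set(reference)
    tp = selected & reference   # selected and listed in the reference
    fp = selected - reference   # selected but not listed in the reference
    fn = reference - selected   # listed in the reference but not selected
    return tp, fp, fn


def counts_from_sizes(n_selected, n_reference, n_common):
    """Same split expressed with set sizes only: returns (TP, FP, FN)."""
    return n_common, n_selected - n_common, n_reference - n_common


# Toy example with placeholder symbols (hypothetical, not real results).
tp, fp, fn = compare_gene_sets({"G1", "G2", "G3"}, {"G2", "G3", "G4", "G5"})
print(sorted(tp), sorted(fp), sorted(fn))

# Set sizes reported in the text.
lung_counts = counts_from_sizes(886, 1067, 102)
colon_counts = counts_from_sizes(207, 1223, 85)
print("lung  (TP, FP, FN):", lung_counts)
print("colon (TP, FP, FN):", colon_counts)
```

The size-based helper reproduces the subtractions given in the text, so it can serve as a quick check of the reported FP and FN values.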
Choosing the right evaluation metric is very important when developing an efficient algorithm with an ML model on a skewed dataset, and even more so when the task is cancer detection. For a skewed dataset, accuracy alone cannot tell us whether the algorithm is working well. Suppose that 99% of the examples in the dataset show no cancer. In a binary classification problem (predicting 1 for cancer and 0 for no cancer), we can simply predict 0 every time and obtain 99% accuracy. If we deploy that model, we will have a 99% accurate ML model that never detects cancer: anyone who has cancer will go undetected and untreated. In our problem, we want to detect cancer-mediating genes whose expression level changes significantly from the normal state to the cancerous state, so accuracy alone is not sufficient here either. Other evaluation metrics, known as precision-recall metrics, help with this type of dataset. The F-score combines the precision and recall of the model and is defined as their harmonic mean. It is widely used for many kinds of ML models. For a binary classification problem, it is also instructive to compare accuracy with the F-score when evaluating the model. Accuracy is simply the number of correctly categorized examples divided by the total number of examples. It can be useful, but it does not take into account class imbalance or differing costs of FN and FP. The F-score, on the other hand, is an effective measure when FP and FN carry different costs or when there is a large class imbalance.
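The 99% example above can be reproduced in a few lines. This is an illustrative sketch on synthetic labels (10 positives among 1,000 examples, an assumption chosen to match the 99% figure); the helper `evaluate` is hypothetical and, by convention, reports precision, recall, and F-score as 0 when their denominators are zero.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Convention (assumption): a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Synthetic skewed dataset: 1% positives (cancer), 99% negatives.
y_true = [1] * 10 + [0] * 990
# A "model" that always predicts 0 (no cancer).
y_pred = [0] * 1000

scores = evaluate(y_true, y_pred)
print(scores)
```

The always-negative model reaches 99% accuracy while its recall and F-score are both 0, which is exactly the failure that accuracy hides.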
Our proposed method works with gene expression data, where the number of genes is very large but the number of genes whose mutation is correlated with cancer is very small. In this case accuracy would be misleading: a classifier that labels every gene as unrelated to cancer would automatically reach 90% accuracy, yet it would be useless for the proposed work and would contribute little to real-world applications, especially in medical science. For this reason, the F-score, obtained through proper application of precision and recall, has been given importance in evaluating the efficacy of the proposed model.
Figure 3.3 FN, TP, and FP values for colon.
Figure 3.4 FN, TP, and FP values for lung.
Precision is the fraction of TP examples among the examples that the model classified as positive. In other words, it is the number of TP divided by the sum of TP and FP. Recall, also known as sensitivity, is the fraction of the actual positive examples that the model classified as positive, that is, the number of TP divided by the sum of TP and FN. In our model, the resultant set of genes has been validated using the NCBI database for both colon and lung. The number of TP genes is given by the intersection region of the diagram for the colon dataset (Figure 3.3) and for the lung dataset (Figure 3.4). The FP and FN values are read from the same figures in the same way.
Further, we have calculated the precision, recall, and F-score values to check how good our model is. Precision tells us how precise the model is: of the genes predicted as positive, how many are actually positive. Precision is calculated using Equation (3.7) and recall using Equation (3.8).
Recall measures what proportion of the actual positives was identified correctly by our model.
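Equations (3.7) and (3.8) are not reproduced in this text. Based on the verbal definitions of precision and recall given above, they presumably take the standard form below; the equation numbers are assigned here on that assumption.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \tag{3.7} \\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{3.8}
\end{align}
```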
Unfortunately, a trade-off was observed between precision and recall: the higher the recall, the lower the precision, and vice versa. As a result, the result sets for the two datasets give different impressions of the outcome (Figures 3.5 and 3.6). To resolve this trade-off, the F1-score has been calculated to find an optimum point where both precision and recall are high.
Figure 3.5 F-score for lung and colon using precision.
Figure 3.6 F-score for lung and colon dataset using recall.
Figure 3.7 F1 score for lung and colon dataset.
Further, we have computed the F1 score using Equation (3.9) for the two datasets considered here, and we observe that our PC-LR method gives a better result for the colon dataset than for the lung dataset (Figure 3.7).
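As a rough check of this comparison, precision, recall, and the F1 score can be recomputed from the TP, FP, and FN counts reported earlier. The sketch below assumes the standard definitions (F1 as the harmonic mean of precision and recall); the function name is hypothetical and the values plotted in Figures 3.5 to 3.7 are not available here for comparison.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from TP, FP and FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (TP, FP, FN) counts as reported in the text.
counts = {"lung": (102, 784, 965), "colon": (85, 122, 1138)}
scores = {name: precision_recall_f1(*c) for name, c in counts.items()}

for name, (p, r, f1) in scores.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With these counts the F1 score comes out near 0.104 for lung and 0.119 for colon, which agrees with the observation that the colon result is the better of the two.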
It should be noted that the generalized form of the F-score is known as the Fβ-score. Used in a controlled manner, it lets us weight recall and precision more appropriately for our working model. The equation then becomes slightly different.
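The equations themselves are not reproduced in this text. Assuming the standard definitions, the Fβ-score and its special case for β = 1 would read as follows; the equation numbers follow the references in the surrounding paragraphs and are an assumption.

```latex
\begin{align}
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}
            {\text{Precision} + \text{Recall}} \tag{3.9} \\
F_\beta &= \left(1 + \beta^2\right)
           \frac{\text{Precision} \cdot \text{Recall}}
                {\beta^2 \cdot \text{Precision} + \text{Recall}} \tag{3.10}
\end{align}
```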
In the equation of the Fβ-score [Equation (3.10)], the factor β indicates to what extent recall is given more importance than precision. For instance, setting β to 2 means that recall is given twice the importance of precision. The standard practice is to set β to 1, which reduces Equation (3.10) to Equation (3.9). From a comparative study that gave different weights to precision and recall, it can be concluded that the F1 score measures the performance of the working model more accurately.
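The effect of β can be illustrated numerically. The sketch below assumes the standard Fβ definition, (1 + β²)·P·R / (β²·P + R), and applies it to the counts reported in the text for β = 0.5, 1, and 2; the function name `fbeta` and the choice of β values are illustrative, not taken from the source.

```python
def fbeta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (0.0 if undefined)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# (TP, FP, FN) counts as reported in the text.
counts = {"lung": (102, 784, 965), "colon": (85, 122, 1138)}

table = {}
for name, (tp, fp, fn) in counts.items():
    p, r = tp / (tp + fp), tp / (tp + fn)
    # beta < 1 favours precision, beta > 1 favours recall, beta = 1 is F1.
    table[name] = {beta: fbeta(p, r, beta) for beta in (0.5, 1.0, 2.0)}
    print(name, {b: round(v, 3) for b, v in table[name].items()})
```

Because recall is lower than precision for both datasets, increasing β pulls the score down toward recall, while β = 1 weights the two equally.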
The proposed method has identified 102 genes for lung cancer and 85 genes for colon cancer as cancer-mediating genes. This result was obtained by validating against the NCBI database, where some already identified genes are available. Some of the gene symbols generated by our methodology are given in the tables below. Table 3.1 contains some of the significant TP gene symbols for lung cancer and Table 3.2 contains the same for colon cancer.
Table 3.1 Resultant genes (gene symbols) identified by PC-LR method.
Significant true positive genes for lung cancer | | | |
---|---|---|---|
KRAS | CHRNA3 | MIF | GSTP1 |
TP53 | SOX9 | MAP2K1 | VDR |
IGF1R | TNF | RET | SYK |
IGFBP3 | CDH2 | MET | PGR |
STAT3 | CDH1 | TGFB1 | IL10 |