Machine Learning Techniques and Analytics for Cloud Security
3.1 Introduction
All cancers result from gene mutations, which may be caused by several factors. Normal cells turn into cancerous cells largely because of mutations in their genes, and a cell often becomes cancerous only after several mutations have accumulated. These mutations can affect various genes that control cell division and growth. Identifying the genes that correlate with a particular cancer is a challenging task. Gene expression data obtained by high-throughput technologies, viz., DNA sequencing and DNA microarrays, have both proven to have a high impact on cancer research [1]. Gene selection can help in many ways, including cancer treatment, proper diagnosis, and drug discovery [2]. With the invention and advancement of DNA microarray technology, it is possible to monitor the expression levels of thousands of genes, but the key task is to derive information from this vast amount of biological data and to recognize the underlying patterns [3]. Over the past few decades, many tools based on various computational techniques have been developed for cancer classification. These tools advance medical science by improving the ability of biologists and physicians to detect cancer-mediating biomarkers [4].
Cancer classification based on the analysis of microarray gene expression data is now a conventional method. The biological relevance of the genes used substantially influences classification accuracy. Gene selection therefore plays a pivotal role and may be regarded as the main factor in classifying cancer on the basis of microarray data. Gene selection is the task of choosing a few significant genes that best characterize the variations [5]. It is effective to focus on a small number of important genes whose expression levels differ between the non-cancerous and cancerous states. Thus, from the whole genome, only a small number of dominant genes should be identified using an effective gene selection method [6]. However, extracting information from such a vast amount of biological data and understanding its patterns remains the central task. Expression levels of genes are often correlated, and this correlation is more pronounced when the genes lie on the same biological pathway. In this situation, traditional feature selection procedures often overlook the relationships between genes and select only a few genes from a set of strongly linked ones. Irrelevant genes not only lower classification performance but also make it harder to locate the genes that are truly descriptive [7].
Analyzing microarray data and selecting informative genes is always a demanding task, and the diversity and complexity of different cancer types make it more challenging still. With advances in biotechnology, a large amount of data is being generated using high-density oligonucleotide chips and cDNA arrays [8, 9]. Researchers can now measure the expression of thousands of genes simultaneously. However, there is a lack of suitable algorithms for extracting knowledge from this highly significant type of biological data, so there is a persistent demand to explore and design such algorithms. One of the most significant applications of microarray data analysis is classifying tissue samples as normal or cancerous. In such applications, however, a large number of irrelevant genes are typically identified. These genes have no impact on clinical application, and as a result the efficiency of the method is compromised [10, 11]. Moreover, working with and interpreting a huge number of genes is not feasible. Selecting an appropriate number of relevant genes from microarray data has therefore become a promising line of work. Selecting these genes is important for several areas of medical science, including drug discovery, targeted therapy, prognosis, and sometimes early detection [12, 13].
Gene expression data generated through high-throughput technology come in the form of a matrix in which each row holds the expression levels of one gene and each column is a sample. Because the genes are treated as features, the number of features is very large, while the number of experimental samples is very small, which makes the data difficult to work with. This high dimensionality is a real problem from the outset. Many algorithms based on different Artificial Intelligence (AI) techniques have been tried over the years to find a solution. Algorithms based on Machine Learning (ML), a branch of AI, have been used over the years as effective analytical tools for this type of data [14]. An ML model uses past data to predict future outcomes. Learning methods based on statistical and probabilistic models and on optimization techniques can be applied to analyze the data. Methods such as Logistic Regression (LR), artificial neural networks (ANN), K-nearest neighbor (KNN), decision trees (DT), and Naïve Bayes are widely used in different contexts [15, 16]. Two categories of learning are mainly used in ML: supervised and unsupervised. A model that learns from known classes (labeled training data) is said to use supervised learning. Unsupervised learning methods, on the other hand, learn from data whose classes are unknown, often termed unlabeled training data [17]. ML algorithms have been used for purposes such as classifying groups and learning and recognizing key features. The real power of ML algorithms is that they can recognize patterns in datasets that are large, noisy, and difficult to discern. This property is very useful for processing complex genomic data, specifically in cancer-related studies [18, 19].
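The matrix layout described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the data are synthetic, and the sizes (2000 genes, 40 samples), the label coding, and the variable names are hypothetical rather than taken from any dataset used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(42)
n_genes, n_samples = 2000, 40  # hypothetical sizes: many genes, few samples

# Rows are genes and columns are samples, as in the matrix layout described above.
expression = rng.normal(loc=8.0, scale=1.5, size=(n_genes, n_samples))

# Hypothetical binary labels, one per sample: 0 = normal, 1 = cancerous.
labels = np.array([0] * 20 + [1] * 20)

# Most ML code expects one row per sample and one column per feature (gene),
# so the matrix is transposed before modeling.
X = expression.T

# Standardize each gene to zero mean and unit variance across samples.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.shape)                        # (40, 2000)
print("features per sample:", n_genes / n_samples)
```

With 50 features for every sample, a model fitted on all genes has far more parameters than observations, which is why a dimensionality reduction or gene selection step is usually applied first.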
LR is a popular method for building prediction models with binary outcomes and has been extended to disease classification with microarray data. Here, a feature (gene) selection step must be added to the penalized logistic model, because the number of genes is very large compared with the number of samples. Model selection for this procedure therefore needs new statistical methods. This matters because, if the feature selection step is ignored when assessing prediction error, the error estimate can be severely biased downward. Widely used generic methods such as cross-validation and the non-parametric bootstrap may be less effective because of the high variability of the prediction error estimate. Classification of diseases such as cancer using microarray data has been the subject of extensive research aimed at providing more precise diagnostic methods than the conventional pathological approach alone can offer. Gene expression can also be used to predict survival time, disease prognosis, and treatment response. The overall impact is significant, as all of these factors have major clinical consequences. Designing a logistic prediction model from microarray data, however, differs fundamentally from the standard logistic setting: the number of genes observed often runs into the thousands, while the number of arrays (samples) is generally much smaller, often fewer than one hundred. A commonly used approach is to combine penalized likelihood inference with a gene selection step, called feature selection, which chooses a subset of genes for inclusion in the LR model.
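The downward bias mentioned above can be made concrete with a small experiment. The sketch below is an illustration under assumptions that do not come from this chapter: the data are pure noise, genes are ranked by the absolute difference in class means, and a nearest-centroid rule stands in for the classifier. All function names are hypothetical. Since the labels carry no signal, an honest accuracy estimate should be close to chance.

```python
import numpy as np

def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute difference in class means."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class whose training centroid is closer."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(X, y, k, folds, select_inside):
    """Cross-validated accuracy with gene selection inside or outside the folds."""
    genes_all = top_genes(X, y, k)  # uses every sample, including held-out ones
    correct = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        genes = top_genes(X[train], y[train], k) if select_inside else genes_all
        pred = nearest_centroid_predict(
            X[train][:, genes], y[train], X[test][:, genes]
        )
        correct += int((pred == y[test]).sum())
    return correct / len(y)

rng = np.random.default_rng(1)
n_samples, n_genes, k = 40, 1000, 10
X = rng.normal(size=(n_samples, n_genes))  # pure noise: no gene is informative
y = np.array([0, 1] * (n_samples // 2))    # arbitrary balanced labels
folds = np.array_split(rng.permutation(n_samples), 5)

biased = cv_accuracy(X, y, k, folds, select_inside=False)
honest = cv_accuracy(X, y, k, folds, select_inside=True)
print(f"selection outside CV: {biased:.2f}, selection inside CV: {honest:.2f}")
```

When the genes are chosen once from all samples, the held-out folds have already influenced the selection, so the estimated accuracy looks far better than chance even on noise. Repeating the selection inside each training fold removes this leak.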
LR is a tool that ML has borrowed from statistics. It is used for binary classification problems, that is, problems in which the class takes one of two values. LR is widely used in the biological sciences where the dependent variable is categorical, and it has been extended to disease classification using microarray data [20]. In the present work, we have developed an algorithm that uses an LR model to select features (genes) whose mutations correlate with certain cancers. When designing a gene selection algorithm with an ML model, reducing computational complexity is a challenge because the dataset is very large and the total number of genes (features) is very high. In the LR model, too many features can cause overfitting and compromise the performance of the algorithm [21]. Several standard techniques are widely used to reduce dimensionality, such as Kernel PCA, Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA). It has been observed that PCA performs better when the number of samples per class is small, while LDA works better for large datasets with multiple classes. Class separability is an essential consideration when reducing dimensionality. Since our aim is to develop a binary classifier, we address this with a hybrid approach in which the number of features is first reduced using PCA. Although many techniques could serve this purpose, PCA incurs minimal loss of information for this dataset and is therefore appropriate for obtaining a better outcome. The output of PCA is then fed to the LR model to predict genes. A threshold value is calculated and set for this binary classification, and it is applied to test data to decide which genes are selected as candidate, or cancer-mediating, genes.
The statistical and biological validation of the resulting set of genes is carried out at the end.
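A minimal sketch of a PCA-then-LR pipeline with a decision threshold is given below. It is not the authors' implementation. The data are synthetic, and the number of components (5), the learning rate, the iteration count, and the threshold of 0.5 are placeholder assumptions, since the text does not state how the threshold is calculated. The rows of `X` are simply labeled observations, and all function names are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the column means and the top principal axes of X (rows = observations)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project observations onto the principal axes."""
    return (X - mean) @ components.T

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(Z, y, lr=0.01, n_iter=1000):
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(n_iter):
        residual = sigmoid(Z @ w + b) - y
        w -= lr * (Z.T @ residual) / len(y)
        b -= lr * residual.mean()
    return w, b

# Synthetic data: 60 observations, 500 features, two classes (all assumed).
rng = np.random.default_rng(0)
n_obs, n_features = 60, 500
y = np.repeat([0, 1], n_obs // 2)
X = rng.normal(size=(n_obs, n_features))
X[y == 1, :20] += 3.0  # shift 20 hypothetical features in class 1

# Hold out 20 observations for testing.
idx = rng.permutation(n_obs)
train, test = idx[:40], idx[40:]

# Step 1: reduce dimensionality with PCA fitted on the training data only.
mean, components = pca_fit(X[train], n_components=5)
Z_train = pca_transform(X[train], mean, components)
Z_test = pca_transform(X[test], mean, components)

# Step 2: fit LR on the principal component scores.
w, b = fit_logistic(Z_train, y[train])

# Step 3: apply a threshold to the predicted probabilities (0.5 is a placeholder).
threshold = 0.5
prob = sigmoid(Z_test @ w + b)
pred = (prob >= threshold).astype(int)
accuracy = (pred == y[test]).mean()
print("held-out accuracy:", accuracy)
```

Fitting PCA on the training rows only keeps the held-out rows from influencing the projection. In practice the threshold would be tuned on validation data rather than fixed in advance.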