Biomedical Data Mining for Information Retrieval

2.7 Role of Artificial Intelligence in Computer-Aided Drug Design

High-throughput screening (HTS) is a set of techniques capable of identifying biologically active molecules with desired properties from compound databases containing billions of compounds. Predicting and identifying active compounds with high accuracy is crucial for shortening the time needed to discover potent drugs. Medicinal chemistry companies use screening techniques to identify active compounds from drug databases in significantly less time. Reducing the search space, or searching it in a targeted way, lowers the overall cost of the drug discovery process. The critical problem is how to establish a relationship between the 3D structure of a lead molecule and its biological activity. Quantitative structure–activity relationship (QSAR) modelling predicts the activity of a set of compounds using equations derived from a set of known compounds [91]. In quantitative structure–property relationship (QSPR) modelling, by contrast, the response variable is a physicochemical property of the compounds rather than their biological activity. Accurate prediction of the activity of chemical molecules is still a persistent issue in drug discovery. In structural bioinformatics it is generally observed that two proteins with similar structures are likely to have similar functions. This does not always hold for chemical structures, where minute structural differences between a pair of compounds can lead to a large change in their activity against the same target receptor. This is known as the activity cliff problem, and it is a hot topic of debate among computational and medicinal scientists [92, 93].
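The QSAR idea described above, deriving an equation from known compounds and applying it to new ones, can be sketched with a one-descriptor least-squares fit. This is a minimal illustration under stated assumptions, not the method of any cited study: the function names `fit_line` and `predict` are hypothetical, the single descriptor stands in for something like logP, and the activity values are invented for the example.

```python
# Minimal QSAR-style sketch: fit activity = slope * descriptor + intercept
# by ordinary least squares, then predict the activity of a new compound.
# All data values below are invented for illustration only.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the fitted equation to a descriptor value of an untested compound."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training set: one descriptor value and one measured activity
# per known compound.
known_descriptor = [1.0, 2.0, 3.0, 4.0]
known_activity = [5.1, 5.9, 7.1, 7.9]

model = fit_line(known_descriptor, known_activity)
print("fitted (slope, intercept):", model)
print("predicted activity at descriptor 2.5:", predict(model, 2.5))
```

Practical QSAR models use many descriptors and non-linear learners, but the workflow is the same: fit on compounds with known activity, then score compounds whose activity is unknown.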

The lock-and-key hypothesis and the induced-fit hypothesis describe the biochemistry of ligand binding at a receptor. In general, a ligand–receptor complex comprises a smaller ligand attached to the functional cavity of the receptor. The 3D structures of both the ligand and the receptor are essential for understanding their functional roles. When a ligand binds at the active site, the 3D conformation of the receptor protein changes, which in turn changes its functional state. X-ray crystallography, nuclear magnetic resonance (NMR) and electron microscopy are the currently available experimental techniques for determining the 3D structure of proteins. Since there is a considerable gap between the number of available protein sequences and the number of solved 3D structures, bioinformatics techniques, namely molecular modeling, can be used to predict 3D structures in less time and with comparable accuracy. Molecular docking predicts the binding mode of a ligand at the receptor when the 3D structures of both are available. It is most commonly used to predict the pose of a ligand at the active site of the receptor. Identifying lead compounds using the 3D structure of the receptor protein is known as Structure-Based Drug Design (SBDD). Nowadays, the process of identifying, predicting and optimising the activity of small molecules against a biological target falls within the SBDD domain [94–96].

Ligand-based drug design (LBDD) is another approach to drug design, applicable when 3D structural information about the receptor is unavailable. LBDD relies mainly on existing knowledge of compounds that are known to bind to the receptor. The physicochemical properties of known ligands are used to predict their activity and to develop structure–activity relationships (SAR) for screening unknown compounds [97]. Although artificial intelligence can be applied in both SBDD and LBDD to automate the drug discovery process, it is currently more common in LBDD approaches. Some recent methods, such as proteochemometric modeling (PCM), extract descriptor information from the ligands and the receptors individually, together with their combined interaction information [98]. Machine learning classifiers then use both the individual descriptors and the cross-descriptor information to predict bioactivity relations.
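One way to picture the individual and cross-descriptor information used in PCM is as a single feature vector per ligand–receptor pair. The sketch below is an assumption made for illustration: it forms cross-terms as pairwise products of ligand and receptor descriptors, which is one common choice and not necessarily the scheme used in [98]. The function name `pcm_features` and the descriptor values are hypothetical.

```python
# Sketch of a PCM-style feature vector for one ligand-receptor pair.
# Assumption: cross-descriptors are the pairwise products of every ligand
# descriptor with every receptor descriptor.

def pcm_features(ligand_desc, target_desc):
    """Concatenate ligand descriptors, receptor descriptors and cross-terms."""
    cross = [l * t for l in ligand_desc for t in target_desc]
    return list(ligand_desc) + list(target_desc) + cross

# Hypothetical descriptor values, invented for the example.
ligand = [1.0, 2.0]        # two ligand descriptors
target = [3.0, 4.0, 5.0]   # three receptor descriptors

features = pcm_features(ligand, target)
print(len(features), features)
```

A vector built this way could be passed to any standard classifier, with one row per ligand–receptor pair and the measured bioactivity as the label.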

Biological activity is a broad term that relates to the ability of a compound/target to achieve the desired effect [99]. It may be divided into the activity of the receptor (its functionality) and the activity of compounds. In pharmacology, the term is replaced by pharmacological activity, which usually denotes the beneficial or adverse effects of drugs on biological systems. A compound must possess both activity against the target and acceptable physicochemical properties to qualify as an ideal drug candidate. The absorption, distribution, metabolism, excretion and toxicity (ADMET) profile of a compound is required to predict the bioavailability, biodegradability and toxicity of drugs. Initially, simple descriptor-based statistical models were created to predict the bioactivity of drug compounds. Later, the inclusion of machine learning-based models increased the target specificity and selectivity of compounds manyfold [100]. Machine learning classifiers may be built and trained on existing knowledge of either molecular descriptors or substructure mining in order to classify new compounds.

One can train classifiers and classify new compounds using either a single parameter or a combination of parameters: activity (active/non-active), drug-likeness, pharmacodynamics, and pharmacokinetic or toxicity profiles of known compounds [91]. Many open-source and commercial applications are now available for predicting the skin sensitisation, hepatotoxicity or carcinogenicity of compounds [101]. In addition, several expert systems are in use for estimating the toxicity of unknown compounds from knowledge-base information [102, 103]. These artificial intelligence-enabled expert systems use human knowledge (or intelligence) to reason about problems or to make predictions. They can make qualitative judgements based on the qualitative, quantitative, statistical and other evidence provided to them as input. For instance, DEREK and StAR use knowledge-based information to derive new rules that better describe the relationship between chemical structure and toxicity [102]. DEREK uses a data-driven approach to predict the toxicity of a novel set of compounds and compares the predictions with the given biological assay results to refine the prediction rules. Toxtree is an open-source platform for detecting the toxicity potential of chemicals. It uses a decision tree (DT) classification model to estimate toxicity. Toxicological data on chemicals, derived from their structural information, are used as input to the model [104].

Besides expert systems, there are other automated prediction methods, such as Bayesian methods, neural networks and support vector machines. Bayesian inference networks (BINs) are among the crucial methods because they allow a straightforward representation of the uncertainties involved in medical domains such as diagnosis, treatment selection, prediction of prognosis and screening of compounds [105]. Doctors now use BIN models in prognosis and diagnosis. The use of BIN models in ligand-based virtual screening demonstrates their value in drug discovery. A comparative study assessed the efficiency of three models, Tanimoto Coefficient Networks (TAN), conventional BINs and BIN Reweighting Factor (BINRF), for screening billions of drug compounds based on structural similarity information [106]. All three models used the MDL Drug Data Report (MDDR) database for both training and testing. Ligand-based virtual screening with the BINRF model not only significantly improved the search strategy but also identified active molecules with lower structural similarity, compared with the TAN- and BIN-based approaches. This is thus an era of integrative approaches aimed at higher accuracy in drug and drug target prediction.
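The structural similarity information that such screening relies on is commonly expressed as a Tanimoto coefficient between molecular fingerprints. The sketch below shows only that plain similarity ranking, not the BIN or BINRF models of [106]. It assumes each fingerprint is given as a set of "on" bit indices; the compound names, the fingerprints and the helper names `tanimoto` and `rank_by_similarity` are hypothetical.

```python
# Sketch of similarity-based virtual screening with the Tanimoto coefficient.
# Assumption: a fingerprint is a set of integer indices of the bits that are on.

def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by the total number of distinct on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(a & b) / union

def rank_by_similarity(query_fp, library):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(tanimoto(query_fp, fp), name) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical query (a known active) and a tiny compound library.
query = {1, 2, 3}
library = {
    "cmpd_a": {1, 2, 3, 4},
    "cmpd_b": {7, 8},
    "cmpd_c": {1, 2},
}

for score, name in rank_by_similarity(query, library):
    print(f"{name}: {score:.2f}")
```

A ranking of this kind favours close structural analogues of the query, which is why methods that also retrieve actives with lower structural similarity are of interest.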

Bayesian ANalysis to determine Drug Interaction Target (BANDIT) uses a Bayesian approach to integrate varied data types in an unbiased manner. It also provides a platform that allows newly available data types to be integrated [107]. BANDIT has the potential to expedite the drug development process, as it spans the entire drug search space, from new target identification and validation to clinical candidate development and drug repurposing.

Support Vector Machine (SVM) is a supervised machine learning technique that is often used in knowledge-based drug design [108]. Selecting an appropriate kernel function and optimum parameters is the most challenging part of modelling a problem, as both are problem-dependent. A more specific kernel function was later designed that controls the complexity of subtrees through parameter adjustments. An SVM model integrated with this kernel function successfully classified and cross-validated small molecules with anti-cancer properties [109]. Graph kernel-based learning algorithms are widely used in SVMs, and they can use graph information directly to classify compounds. Graph kernel-based SVMs are employed to classify diverse compounds, to predict their biological activity and to rank them in screening assays. Artificial neural networks (ANNs), learning algorithms that mimic the human neural system, also have applications in the drug discovery process. The robustness of SVM and ANN algorithms was compared in terms of their ability to distinguish drug from non-drug compounds [110]. The results favoured SVM, which classified the compounds with higher accuracy and robustness than ANN.

Other machine learning algorithms, such as decision trees, random forests, logistic regression and recursive partitioning, have also been applied successfully to classify compounds using the relationship between their chemical structure and toxicity profiles [111]. A comparative study of ML algorithms shows that non-linear and ensemble-based classification algorithms are more successful at classifying compounds using ADMET properties. Random forest algorithms can also be used for ligand pose prediction, for finding receptor–ligand interactions and for predicting the efficiency of docking simulations [112]. Deep learning (DL) methods are now achieving remarkable success in pharmaceutical research, from biological image analysis, de novo molecule design and ligand–receptor interaction to biological activity prediction [113]. Continuous improvements in machine learning and deep learning algorithms will therefore help to achieve the desired results with higher prediction accuracy in drug design.

Multiple descriptors represent molecular data in terms of structural and physicochemical features. These descriptors account for the diverse bioactivity of compounds [114]. Apart from descriptor-based bioactivity prediction, substructure mining is also an established technique in drug discovery. Substructure mining is a data-driven approach that uses a combination of algorithms to detect the most frequently occurring substructures in a large set of known ligands [115]. It can be used in two ways. The first uses a predefined list of candidate scaffolds: the mining algorithm identifies and extracts all the candidate scaffolds present in the known compounds of a given database. The second learns the substructures adaptively from the compounds. Both are capable of retrieving all the significant 2D substructures from a chemical database [116]. The popularity of substructure mining has helped to establish a consensus among medicinal chemists, who have come to treat chemical compounds as collections of their substructural parts. Applying the approach to establish structure–activity relationships builds further confidence in the view that the biological properties of molecules depend on their structural properties.

Several substructure mining algorithms have since been developed to accommodate the needs of an ever-changing drug discovery process [117]. The subgraph mining approach is unique in that, compared with other approaches, it is free from arbitrary assumptions. In other words, current subgraph mining techniques can retrieve all frequently occurring subgraphs, that is, those meeting a given minimum support, from a database of chemical compounds in significantly less time [118]. Furthermore, as described above, the idea behind these techniques is to find the most significant subgraph out of all possible subgraphs. In the near future, the use of artificial intelligence-based techniques in medicinal chemistry will become more complex, owing to the increasing availability of huge repositories of chemical, biological, genetic and structural data. Running complex algorithms on ever-increasing data volumes to search for new, safer and more effective drug candidates calls for quantum computing and high-performance computing. In summary, we believe that these techniques will become a much more significant part of drug discovery endeavours within a very short time.
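The role of minimum support can be illustrated with a deliberately simplified sketch. It assumes each compound has already been decomposed into a set of fragment labels, a hypothetical preprocessing step, and it only counts the fraction of compounds containing each fragment. Real subgraph miners enumerate candidate subgraphs from the molecular graphs themselves, which is not shown here. The function name `frequent_fragments` and the fragment labels are invented for the example.

```python
# Simplified support counting over pre-enumerated fragments.
# Assumption: each compound is given as a collection of fragment labels;
# enumerating fragments from molecular graphs is outside this sketch.
from collections import Counter

def frequent_fragments(compounds, min_support):
    """Return {fragment: support} for fragments whose support >= min_support.

    Support is the fraction of compounds that contain the fragment.
    """
    counts = Counter()
    for fragments in compounds:
        # set() ensures a fragment is counted once per compound.
        counts.update(set(fragments))
    total = len(compounds)
    return {
        frag: count / total
        for frag, count in counts.items()
        if count / total >= min_support
    }

# Hypothetical database of four compounds.
database = [
    {"benzene", "amide"},
    {"benzene", "hydroxyl"},
    {"benzene", "amide", "nitro"},
    {"pyridine"},
]

print(frequent_fragments(database, min_support=0.5))
```

Raising `min_support` shrinks the result to the most widespread fragments, while lowering it admits rarer ones at the cost of a larger candidate set.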
