Biomedical Data Mining for Information Retrieval (group of authors), page 37

2.6 Application in Protein Folding Prediction

Understanding protein folding is integral to understanding protein function and its heterogeneous nature. Cellular processes such as replication, transcription and translation all depend on proteins, so prediction of the 3D (folded) structure of a protein is important for addressing many questions in molecular biology. Earlier, protein folding was determined with various experimental molecular biology techniques, which was time-consuming. Meanwhile, next-generation sequencing techniques, being rapid and economical, have accelerated the discovery of new protein sequences. What is required today are computational prediction methods that can accurately classify unknown protein sequences into specific fold categories in the shortest possible time. Computational recognition of protein folds is therefore of great importance in bioinformatics and computational biology. A number of efforts have produced a variety of computational prediction methods, and artificial intelligence (AI) and machine learning (ML) have shown great promise. In this chapter, the available AI and ML methods and features are explored, and novel methods based on reinforcement learning are discussed. Prediction of protein structure happens at four levels:

i) 1-D prediction of structural features of the primary structure, that is, the sequence of amino acids linked by peptide bonds
ii) 2-D prediction of the secondary structure, that is, the spatial relationships between amino acids (alpha helix, beta sheet and beta turn) stabilized by hydrogen bonds
iii) 3-D prediction of the tertiary structure of a protein, fibrous or globular, which involves multiple interactions such as hydrogen bonds, van der Waals forces and hydrophobic interactions
iv) 4-D prediction of the quaternary structure of a multiprotein complex, which is made up of more than one peptide chain and can involve the formation of disulfide bridges.

Thus the development of models that allow flexibility in bond formation and help to predict a stable, functional protein structure has been greatly facilitated by AI and ML.

Prediction of protein structure is a complex problem because it spans several levels of organization and is a multi-step process, so smart computational techniques are needed. AI is a powerful tool that, combined with computational biology, facilitates such prediction. The predicted structures are in turn crucial for drug development and for understanding a protein's biochemical effects and, ultimately, its function.

A protein can be broadly described as a polymer in which the individual amino acids are the monomers, or building blocks, arranged in a linear chain and joined together by peptide bonds. The primary structure, as described earlier, is represented by a sequence of letters that stand for the amino acids. In its native environment, the chain of amino acids folds into local secondary structures, including alpha helices, beta strands, and nonregular coils [35, 36]. The secondary structure elements are further packed into a tertiary structure, depending on hydrophobic forces and side-chain interactions, such as hydrogen bonding, between amino acids [37–39]. The tertiary structure is described by the x, y and z coordinates of all the atoms of a protein or, in a coarser description, by the coordinates of the backbone atoms (Figure 2.1). The quaternary structure is formed when more than one protein chain interacts or assembles into a complex structure. These protein complexes interact with each other and with other biological macromolecules such as DNA and RNA, and with certain metabolites in the cell. Such interactions are required to carry out various biological functions, ranging from enzymatic catalysis (where a protein complex can interact with a metal or non-metal cofactor, referred to as a co-enzyme) to gene regulation (interaction of transcription factors with DNA sequences), control of growth and differentiation (protein–protein interactions in which ligand binding to a receptor triggers a signal cascade pathway) and transmission of nerve impulses [40]. A protein's function and its structure depend on each other [37, 38, 41, 42]; therefore, accurate determination or prediction of protein structure holds the key to determining its function.

Since the inception of this field, the most effective methods for determining protein structure have been nuclear magnetic resonance (NMR) and X-ray crystallography, which have the disadvantage of being time-consuming and expensive. A more recent advance is cryo-electron microscopy (cryo-EM), which produces high-resolution structures of large molecular assemblies very efficiently; machine learning and artificial intelligence are used to derive structure predictions from cryo-EM density maps [43–46]. Crystallographic experiments require a protein crystal, which is the most difficult part of the method because many proteins do not crystallize. Artificial intelligence comes to our aid here as a possibly better route to determining the structure of these proteins [47, 48]. AI methods have proved their efficacy and accuracy in different fields, such as business [49] and image recognition, to name a few, and can accurately and efficiently predict thousands of possible structures in a short time by analyzing big data where other methods have failed to deliver accurate and useful information.

Figure 2.1 The different levels of protein organization.

Most predicted models are inaccurate and contain little useful information, so AI programs are trained on many numerically represented atomic features of the models (such as bond lengths, bond angles, residue–residue interactions, physicochemical properties, and potential energy properties). Comparing the output of the prediction models with known crystal structures then helps to assess model quality and to find the most accurate model. Prediction methods and prediction analyses are compared at one main gathering, the Critical Assessment of Structure Prediction (CASP). Every two years, researchers from around the world submit machine learning methods designed for protein structure prediction [50]; the latest advances have been aided by protein contact distance prediction [51] and by the addition of a quality assessment (QA) category in CASP7 (2006) [51, 52].
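The comparison of predicted models with a known structure can be illustrated with a root-mean-square deviation (RMSD) over matched atom coordinates. This is only a minimal sketch, not the scoring used in CASP: it assumes the structures are already superimposed and have the same atoms in the same order, and the coordinates, the model names and the helper names `ca_rmsd` and `rank_models` are invented for illustration.

```python
import math


def ca_rmsd(predicted, reference):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumes both structures are already superimposed and that atoms
    are listed in the same order.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (px, py, pz), (rx, ry, rz) in zip(predicted, reference):
        total += (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
    return math.sqrt(total / len(predicted))


def rank_models(models, reference):
    """Return (name, rmsd) pairs sorted from most to least accurate."""
    scored = [(name, ca_rmsd(coords, reference)) for name, coords in models.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical three-atom backbone fragment and two candidate models.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
models = {
    "model_a": [(0.0, 1.0, 0.0), (3.8, -1.0, 0.0), (7.6, 1.0, 0.0)],
    "model_b": [(0.0, 0.0, 2.0), (3.8, 0.0, 0.0), (7.6, 0.0, -2.0)],
}

for name, rmsd in rank_models(models, reference):
    print(f"{name}: RMSD = {rmsd:.3f}")
```

A lower RMSD means the model lies closer to the reference, so the first entry returned by `rank_models` is the most accurate candidate under this simplified measure.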

AI, which is time- and resource-efficient, allows more accurate prediction and assessment of structures because computers can analyze large amounts of data in detail and calculate consistently. Although the accuracies achieved may be very close to those of traditional approaches, they are still slightly higher, which gives confidence in the results. AI would also help to reduce costs and would not replace researchers but rather work in conjunction with them. Artificial intelligence is an exciting field that offers solutions to the problem of finding protein structures, which is crucial to drug development and to understanding biochemical effects. A protein's function is determined by its structure [53–56], as is evident in many biochemical reactions; elucidating a protein's structure is therefore key to understanding its function, and Table 2.1 summarizes database sources of protein structure classification. Function determination is in turn essential for any related biological, biotechnological, medical, or pharmaceutical application, which is much needed at a time of increased antimicrobial resistance and threats from unknown biological agents.

Table 2.1 Summary of database sources of protein structure classification.

Database sources	Websites	References
PDB	http://www.rcsb.org/pdb/	[57]
UniProt	http://www.uniprot.org/	[58]
DSSP	http://swift.cmbi.ru.nl/gv/dssp/	[59]
SCOP	http://scop.mrc-lmb.cam.ac.uk/	[60]
SCOP2	http://scop2.mrc-lmb.cam.ac.uk/	[61]
CATH	http://www.cathdb.info/	[62]

The main predictive models used for protein structure prediction are hidden Markov models, neural networks, support vector machines, Bayesian methods, and clustering methods.

Hidden Markov Model for Prediction. Hidden Markov models (HMMs) are among the most important techniques for protein fold recognition. In the HMM version of profile–profile methods, the HMM for the query is aligned with the prebuilt HMMs of the template library. This form of profile–profile alignment is also computed using standard dynamic programming methods. Earlier HMM approaches, such as SAM [63] and HMMER [64], built an HMM for a query from its homologous sequences and then used this HMM to score sequences with known structures in the PDB using the Viterbi algorithm, an instance of dynamic programming. This can be viewed as a form of profile–sequence alignment. More recently, profile–profile methods have been shown to significantly improve the sensitivity of fold recognition over profile–sequence or sequence–sequence methods [65].
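The Viterbi algorithm mentioned above can be sketched for a generic HMM as follows. This is an illustrative toy, not the profile HMM architecture of SAM or HMMER (which use match, insert and delete states): the two states, the two-letter alphabet and all probabilities below are assumptions chosen only to make the dynamic programming visible.

```python
import math


def _log(p):
    """Natural log that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, state path) of the most likely state path."""
    if not observations:
        raise ValueError("observation sequence must not be empty")
    # score[t][s]: best log-probability of any path ending in state s at step t
    score = [{s: _log(start_p[s]) + _log(emit_p[s].get(observations[0], 0.0))
              for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: score[t - 1][r] + _log(trans_p[r][s]))
            score[t][s] = (score[t - 1][prev] + _log(trans_p[prev][s])
                           + _log(emit_p[s].get(observations[t], 0.0)))
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][last], path


# Hypothetical two-state model: "H" prefers residue A, "C" prefers residue G.
STATES = ("H", "C")
START = {"H": 0.5, "C": 0.5}
TRANS = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.2, "C": 0.8}}
EMIT = {"H": {"A": 0.9, "G": 0.1}, "C": {"A": 0.1, "G": 0.9}}

log_prob, path = viterbi("AAAGGG", STATES, START, TRANS, EMIT)
print("".join(path), round(log_prob, 3))
```

Working in log space avoids numerical underflow on long sequences, which is why the scores are sums of logarithms rather than products of probabilities.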

Neural Networks (NNs). Determining the structure of a protein from its sequence alone is very challenging, which in turn makes function determination more difficult. Because many molecular interactions and several levels of folding are involved in a functional protein, simply supplying the sequence as input will not produce the desired output. Deep learning, a rapidly evolving field concerned with learning complex relationships between input features and desired outputs, has been put to great use in structure prediction. Various deep neural network architectures, loosely modeled on the neural networks of the human brain, have been proposed, including deep feed-forward neural networks, recurrent neural networks, neural Turing machines and memory networks. Such advances are making the field more competitive and accurate; a comparison can be made with the human brain, which receives a great deal of information as input yet is able to analyze it and reach a logical conclusion.

Pattern recognition and classification are important applications of NNs. Examples of early NN methods that are still widely used today are PHD [66, 67], PSIPRED [68] and JPred [69]. The field has advanced a great deal since then: deep neural network (DNN) models have been shown to have a performance advantage in image- and language-based problems [70], and this advantage has been seen to extend to specific CASP areas such as residue–residue contact prediction and direct generation of accurate tertiary structures [71–75].
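To make the idea of mapping input features to outputs concrete, the sketch below encodes each residue together with its neighbours in a sliding window and passes the result through a small feed-forward network with three output classes (helix, strand, coil). It is a simplified illustration under stated assumptions, not the architecture of PHD, PSIPRED or JPred: the one-hot encoding, the window size of 5, the hidden layer size and the untrained random weights are all arbitrary choices made here.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # padding symbol for positions beyond the chain ends


def one_hot(residue):
    """21-dimensional vector: 20 amino acids plus one padding/unknown slot."""
    vec = [0.0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1.0
    return vec


def window_features(sequence, window=5):
    """One feature row per residue, built from the residue and its neighbours."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    rows = []
    for i in range(len(sequence)):
        row = []
        for residue in padded[i:i + window]:
            row.extend(one_hot(residue))
        rows.append(row)
    return rows


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained placeholder values)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden, output):
    """One hidden tanh layer followed by a softmax over the three classes."""
    h = [math.tanh(v) for v in dense(x, *hidden)]
    return softmax(dense(h, *output))


rng = random.Random(0)
feats = window_features("MKVLA", window=5)
hidden = init_layer(len(feats[0]), 8, rng)
output = init_layer(8, 3, rng)
probs = forward(feats[2], hidden, output)  # class probabilities for residue V
print(len(feats), len(feats[0]), [round(p, 3) for p in probs])
```

Because the weights are random, the printed probabilities carry no biological meaning; in practice the weights would be learned from proteins with known secondary structure.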

Support Vector Machines (SVMs). The support vector machine (SVM) is a supervised machine learning technique that has been used to rank protein models [76] and has been applied to pattern classification problems in biology. The SVM method is applied to a database derived from SCOP, in which protein domains are classified on the basis of:

1 Known protein structures in the data bank
2 Evolutionary relationships of the predicted protein
3 The various principles of bond formation governing the 3-D structure of proteins.

The advantages of SVMs include effective avoidance of over-fitting, which is a drawback of several other methods, the ability to handle large feature spaces, and the condensation of large amounts of information.
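As a rough illustration of SVM-based classification of sequences, the sketch below trains a linear soft-margin SVM by sub-gradient descent on the regularized hinge loss, using amino-acid composition as the feature vector. It is a toy under stated assumptions: the sequences and their two class labels are invented, the bias term is omitted for brevity, and the hyperparameters `lam`, `lr` and `epochs` are arbitrary. Real fold classifiers use far richer features and kernels.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """20-dimensional vector of amino-acid fractions."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on lam/2 * |w|^2 + hinge loss (no bias term)."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * dot(w, x) < 1.0:
                # Margin violated: step along the hinge-loss sub-gradient.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
            else:
                # Margin satisfied: only the regularizer shrinks the weights.
                w = [wi - lr * lam * wi for wi in w]
    return w


def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1


# Invented toy sequences: +1 and -1 stand for two hypothetical fold classes.
train_seqs = ["AAAALLLLEEEE", "ALEALEALKK", "VVVIIIYYYTT", "VIVTVYVIT"]
train_labels = [1, 1, -1, -1]

features = [composition(s) for s in train_seqs]
weights = train_linear_svm(features, train_labels)

for seq in ("LLKKAAEE", "TVTVIY"):
    print(seq, predict(weights, composition(seq)))
```

The regularization strength `lam` controls the trade-off between a wide margin and fitting every training example, which is the mechanism behind the resistance to over-fitting noted above.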

Bayesian Methods. The most successful methods for determining secondary structure from primary structure use machine learning approaches that are quite accurate, but they do not directly incorporate structural information. Higher-order protein structure needs to be determined because it provides a better and deeper understanding of a protein's function in the cell, structure and function being strongly related. Various computational methods have been developed to predict secondary structure when the primary amino acid sequence is available, and one such method is the Bayesian method.

The knob-socket model of protein packing forms the basis of the Bayesian model. Packing can bring residues that are distant in the primary sequence close together in space [77, 78], which several other methods do not take into account. The Bayesian method considers the influence of residue packing on secondary structure determination. It therefore has the advantage over other methods of having constructs for the direct inclusion and prediction of the secondary states of coil and turn, whereas other secondary structure prediction methods are indirect and, beyond alpha helix and beta sheet, do not make a direct prediction of coil structure. Secondary folding depends strongly on the surrounding environment (aqueous or non-aqueous), as a great deal of hydrogen bonding and hydrophobic interaction is involved. This method thus helps in developing an understanding of the environment responsible for secondary structure formation.

Clustering Methods. A protein rarely performs its function in isolation; various kinds of interaction are needed for it to function [79], as discussed earlier in this chapter in the context of quaternary structure. Protein–protein interactions are thus fundamental to almost all biological processes [80], and it is important to understand them. The increasing availability of large-scale protein–protein interaction data has made it possible to understand the basic components and organization of the cell machinery at the network level. Protein–protein interactions can be studied with advanced high-throughput technologies such as yeast two-hybrid, mass spectrometry, and protein chip technologies, which make available huge data sets of such interactions [81] that can be put to great use in structure prediction. In computational analysis, such protein–protein interaction data can be naturally represented in the form of networks. This network representation provides an initial global picture of protein interactions on a genomic scale and also helps to build an understanding of the basic components and organization of the cell machinery. In the clustering method, the protein interaction network is represented as an interaction graph in which proteins are vertices (or nodes) and interactions are edges. This representation has been used to study topological properties of protein interaction networks, including the network diameter, the distribution of vertex degree, and the clustering coefficient, and such studies show that the networks are scale-free [82–85] and exhibit the small-world effect [86, 87]. Clustering protein interaction networks has been shown to be an effective approach for systems biology to understand the relationship between the organization of a network and its function [88].
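The graph representation and the topological measures named above (vertex degree distribution, clustering coefficient, network diameter) can be sketched with a plain adjacency-set structure. The protein names `P1` to `P5` and their interactions are invented for illustration, and the diameter routine assumes a connected graph.

```python
from collections import deque


def build_graph(edges):
    """Undirected interaction graph as a dict of neighbour sets."""
    graph = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_distribution(graph):
    """Map each degree value to the number of vertices having it."""
    dist = {}
    for neighbours in graph.values():
        dist[len(neighbours)] = dist.get(len(neighbours), 0) + 1
    return dist


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbours[j] in graph[neighbours[i]])
    return 2.0 * links / (k * (k - 1))


def diameter(graph):
    """Longest shortest path between any two vertices (connected graph assumed)."""
    best = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


# Hypothetical interaction list.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
ppi = build_graph(edges)

print(degree_distribution(ppi))
print(clustering_coefficient(ppi, "P3"))
print(diameter(ppi))
```

In this toy network, `P1`, `P2` and `P3` form a fully connected triangle, the kind of densely linked group that clustering methods aim to pick out.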

The proteins are grouped into sets (clusters) such that proteins in the same cluster are more similar to each other than to proteins in different clusters. The clusters are of two types: protein complexes and functional modules. Protein complexes are groups of proteins that interact with each other at the same time and place, forming a single multimolecular structure; examples include the RNA splicing and polyadenylation machinery and protein export and transport complexes, to name a few [89]. A functional module, in contrast, consists of proteins that bind each other at different times and places while participating in the same cellular process. Examples of functional modules include the yeast pheromone response pathway and MAP signaling cascades [90], which begin with an extracellular signal and lead through a signal cascade pathway to gene activation and other processes.
