
Chapter 3

Amino Acid Encoding for Protein Sequence

3.1 Motivation and Basic Idea

The digital representation of amino acids is usually called feature extraction, the amino acid encoding scheme, the residue encoding scheme, etc. Here, we use amino acid encoding as the terminology of choice. It should be noted that amino acid encoding is different from protein sequence encoding. Protein sequence encoding represents the entire protein sequence by using an n-dimensional vector, such as the n-gram,1 pseudo amino acid composition,2,3 etc. Since the amino acid-specific information is lost, protein sequence encoding can only be used to predict sequence-level properties (e.g. protein fold recognition). Amino acid encoding represents each amino acid of a protein sequence by using a different n-dimensional vector; thus, a protein sequence is represented by an n × L matrix (L denotes the length of the protein sequence). Combined with different machine learning methods, amino acid encoding can be used in protein property prediction at both the residue level and the sequence level (e.g. protein fold recognition, secondary structure prediction, etc.). In the past decades, various amino acid encoding methods have been proposed from different perspectives.4–6 The most widely used encodings are one-hot encoding, position-specific scoring matrix (PSSM) encoding, and some physicochemical property encodings. In addition to those, some other encodings have also been proposed, such as the encoding estimated from inter-residue contact energies,7 the encoding learned from protein structure alignments8 and the encoding learned from sequence context.9 These methods explore amino acid encoding from new perspectives and can complement the encodings above. Kawashima et al.10 have compiled a database of numerical indices of amino acids and amino acid pairs, which contains information on the physicochemical and biochemical properties of amino acids.

3.2 Related Work

3.2.1 Binary encoding

The binary encoding methods use multidimensional binary digits (0 and 1) to represent amino acids in protein sequences. The most commonly used binary encoding is one-hot encoding, which is also called orthogonal encoding.5 In one-hot encoding, each of the 20 amino acids is represented by a 20-dimensional binary vector. Specifically, the 20 standard amino acids are fixed in a specific order, and the ith amino acid type is then represented by 20 binary bits with the ith bit set to “1” and all others to “0”. Since only one bit equals “1” in each vector, the scheme is called “one-hot”. For example, if the twenty standard amino acids are sorted as [A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y], the one-hot code of “A” is 10000000000000000000, that of “C” is 01000000000000000000, and so on. Since protein sequences may contain some unknown amino acids, it should be noted that one more bit is sometimes needed to represent the unknown amino acid type, in which case the dimension of the binary vector becomes 21.11
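To make the scheme concrete, the following minimal Python sketch (function and variable names are ours, not taken from the original works) builds the one-hot matrix for a short sequence, with an optional 21st bit for unknown residues as noted above.

import numpy as np

# The 20 standard amino acids in the fixed order used above.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence, extra_unknown_bit=True):
    # L x 20 matrix (or L x 21 when a bit is reserved for unknown residues).
    dim = 21 if extra_unknown_bit else 20
    matrix = np.zeros((len(sequence), dim), dtype=np.int8)
    for pos, aa in enumerate(sequence):
        if aa in AA_INDEX:
            matrix[pos, AA_INDEX[aa]] = 1
        elif extra_unknown_bit:
            matrix[pos, 20] = 1  # unknown amino acid type, e.g. 'X'
    return matrix

# "ACDX" -> 4 x 21 matrix; the last row marks an unknown residue.
print(one_hot_encode("ACDX"))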

Because one-hot encoding is a high-dimensional and sparse vector representation, a simplified binary encoding method based on conservative replacements through evolution has been proposed.12 Derived from the point accepted mutation (PAM) matrices,13 the 20 standard amino acids are divided into six groups: [H, R, K], [D, E, N, Q], [C], [S, T, P, A, G], [M, I, L, V] and [F, Y, W]. Six-dimensional binary vectors are then used to represent amino acids according to their groups (see the sketch below). Another low-dimensional binary encoding scheme is the binary 5-bit encoding introduced by White and Seffens.14 Theoretically, a binary 5-bit code can represent 32 (2^5 = 32) possible amino acid types. In order to represent the 20 standard amino acids, the code of all 0s, the code of all 1s, and the codes containing exactly one or exactly four 1s (5 + 5 = 10) are removed, leaving 20 codes (32 − 1 − 1 − 10 = 20). This binary 5-bit encoding uses a 5-dimensional binary vector in place of the 20-dimensional one-hot vector, which may reduce model complexity.5
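A similarly minimal sketch of the six-group reduction (again with hypothetical helper names) maps each residue to a 6-dimensional binary vector according to its exchange group.

# The six PAM-derived exchange groups listed above.
EXCHANGE_GROUPS = ["HRK", "DENQ", "C", "STPAG", "MILV", "FYW"]
GROUP_INDEX = {aa: g for g, members in enumerate(EXCHANGE_GROUPS) for aa in members}

def six_bit_encode(amino_acid):
    # 6-dimensional binary vector with a single 1 at the group position.
    vector = [0] * 6
    vector[GROUP_INDEX[amino_acid]] = 1
    return vector

print(six_bit_encode("L"))  # [0, 0, 0, 0, 1, 0], group [M, I, L, V]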

3.2.2 Physicochemical properties encoding

From the perspective of molecular composition, a typical amino acid generally contains a central carbon atom (C) to which an amino group (NH2), a hydrogen atom (H), a carboxyl group (COOH) and a side chain (R) are attached. The side chains (R) are usually carbon chains or rings (except for proline) to which various functional groups are attached.5 The physicochemical properties of these components play critical roles in the formation of protein structures and functions; thus, they can also be used as features for protein structure and function prediction.15

Among the various physicochemical properties, the hydrophobicity of an amino acid is believed to play a fundamental role in organizing the self-assembly of a protein.16 Based on the propensity of the amino acid side chain to be in contact with a polar solvent such as water, the 20 amino acids can be classified as either hydrophobic or hydrophilic. The free energy of transfer of an amino acid side chain from cyclohexane to water can be used to quantify its hydrophobicity.6 If the free energy is positive, the amino acid is hydrophobic, while negative values indicate hydrophilic amino acids. In protein three-dimensional structures, hydrophobic amino acids are usually buried inside the protein core, while hydrophilic amino acids preferentially cover the protein surface. The hydrophilic amino acids are also called polar amino acids. In a typical biological environment, some polar amino acids carry a charge: Lysine (+), Histidine (+), Arginine (+), Aspartate (−) and Glutamate (−), while the other polar amino acids, Asparagine, Glutamine, Serine, Threonine and Tyrosine, are neutral.17 A detailed classification of the hydrophobic properties of the 20 standard amino acid side chains is shown in Table 3-1. Other than hydrophobicity, the codon diversity and size of amino acids are also used as features. The codon diversity of an amino acid is reflected by the number of codons coding for it, and the size of an amino acid denotes its molecular volume.15

Table 3-1 The hydrophobic properties of the 20 standard amino acid side chains.


Some physicochemical property-based amino acid encodings have been proposed in previous studies. Fauchère et al.18 established 15 physicochemical descriptors of side chains for 20 natural and 26 non-coded amino acids, which reflect the hydrophobic, steric, electronic and other properties of amino acid side chains. Radzicka and Wolfenden19 obtained digitized indications of the tendencies of amino acids to leave water and enter a truly nonpolar condensed phase in their experiments. Lohmann et al.20 represented amino acids using seven physicochemical properties (hydrophobicity, hydrophilicity, polarity, volume, surface area, bulkiness and refractivity) to predict transmembrane protein sequences. Atchley et al.15 used multivariate statistical analyses to produce multi-dimensional patterns of attribute covariation for the 20 standard amino acids, which reflect the polarity, secondary structure, molecular volume, codon diversity and electrostatic charge of amino acids.

3.2.3 Evolution-based encoding

The evolution-based encoding methods extract the evolutionary information of residues from sequence alignments or phylogenetic trees to represent amino acids, mainly by using the amino acid substitution probability. These evolution-based encoding methods can be categorized into two groups based on position relevance: position-independent methods and position-dependent methods.

The position-independent methods encode amino acids by using fixed encodings, regardless of the amino acid's position in the sequence and the amino acid composition of the sequence. The most commonly used position-independent encodings are the PAM matrices and the BLOSUM matrices; a common flowchart is shown in Fig. 3-1. The point accepted mutation (PAM) matrices represent the replacement probabilities for the change from one amino acid to another in homologous protein sequences,13 and thus focus on the evolutionary process of proteins. The PAM matrices are calculated from protein phylogenetic trees and related protein sequence pairs. The assumption behind the PAM matrices is that an accepted mutation is similar in physical and chemical properties to the old residue and that the likelihood of amino acid X replacing Y is the same as that of Y replacing X; thus, the PAM matrices are 20 × 20 symmetric matrices in which each row and column represents one of the 20 standard amino acids. Corresponding to different lengths of evolutionary time, different PAM matrices can be generated. The PAM250 matrix, corresponding to the amino acid replacements found after 250 accepted point mutations per 100 residues, was found by the authors to be an effective scoring matrix for detecting distant relationships,13 and it is now widely used in related research.21,22 The blocks amino acid substitution matrices (BLOSUM)23 are substitution matrices derived from conserved regions constructed by PROTOMAT24 from non-redundant protein groups. The values in the BLOSUM matrices represent the substitution likelihoods between amino acid pairs. To reduce the contributions of the most closely related protein sequences, the sequences within blocks are clustered. Different BLOSUM matrices can be generated by using different identity percentages for clustering; among these, the BLOSUM62 matrix performs best overall.23
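As an illustration of how such a matrix can serve directly as an amino acid encoding, the sketch below represents every residue by its 20-dimensional BLOSUM62 row. It assumes Biopython (the Bio.Align.substitution_matrices module of recent versions) is available; the function names are ours.

import numpy as np
from Bio.Align import substitution_matrices  # assumes Biopython >= 1.75

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def blosum62_encode(sequence):
    # Position-independent encoding: every occurrence of an amino acid
    # receives the same 20-dimensional vector (its BLOSUM62 row).
    blosum62 = substitution_matrices.load("BLOSUM62")
    rows = [[blosum62[aa, other] for other in AMINO_ACIDS] for aa in sequence]
    return np.array(rows)

print(blosum62_encode("ACD").shape)  # (3, 20)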

Figure 3-1 The flowchart of position-independent amino acid encoding methods. First, the target proteins are selected (step 1). Then, the sequence alignments are constructed based on some criteria (step 2). Finally, the mutation matrix is calculated and is regarded as the amino acid encoding (step 3).

In contrast to the position-independent methods, the position-dependent methods encode amino acids at different positions with different encodings, even if the amino acid types are the same. The position-dependent encodings are deduced from the multiple sequence alignments (MSAs) of target sequences; the flowchart for this is shown in Fig. 3-2. The position-specific scoring matrix (PSSM) is the most widely used encoding of this type. The PSSM is also called the position weight matrix (PWM), and it represents the log-likelihoods of the occurrence probabilities of all possible molecule types at each location in a given biological sequence.25 Generally, the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST)26 is used to perform sequence alignment and generate the MSA for the target protein sequence, and the corresponding PSSM is then calculated from the MSA. For a protein sequence of length L, its PSSM is an L × 20 matrix in which each row represents the log-likelihoods of the probabilities of the 20 amino acids occurring at the corresponding position. Besides PSI-BLAST, the HMM-HMM alignment algorithm HHblits is also widely used to generate the probability profile; it is more sensitive than the sequence-profile alignment algorithm PSI-BLAST, as demonstrated by Remmert et al.27

Figure 3-2 The flowchart of position-dependent amino acid encoding methods. First, the target protein sequence is selected (step 1). Then, multiple sequence alignments are constructed by searching the protein sequence database (steps 2 and 3). Finally, the position weight is calculated by columns and is regarded as the corresponding amino acid encodings (step 4).
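The sketch below illustrates step 4 in a highly simplified form: column frequencies of a toy MSA are converted to log-odds scores against a uniform background with a small pseudocount. Real profile tools such as PSI-BLAST additionally apply sequence weighting and more elaborate pseudocount schemes, so this is only meant to convey the idea; all names are ours.

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def simple_pssm(msa, pseudocount=1.0, background=0.05):
    # msa: list of equal-length aligned sequences.
    length = len(msa[0])
    pssm = np.zeros((length, 20))
    for col in range(length):
        column = [seq[col] for seq in msa]
        for j, aa in enumerate(AMINO_ACIDS):
            count = column.count(aa) + pseudocount
            freq = count / (len(column) + 20 * pseudocount)
            pssm[col, j] = np.log2(freq / background)  # log-odds score
    return pssm  # L x 20 matrix, one row per alignment column

toy_msa = ["ACDE", "ACDE", "ACNE", "GCDE"]
print(np.round(simple_pssm(toy_msa), 2))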

3.2.4 Structure-based encoding

The structure-based amino acid encoding methods, which can also be called statistics-based methods, encode amino acids by using structure-related statistical potentials, mainly inter-residue contact energies.28 The basic assumption is that, over a large number of native protein structures, the average potentials of inter-residue contacts can reflect the differences in interaction between residue pairs,29 which play an important role in the formation of protein backbone structures.28 The inter-residue contact energies of the 20 amino acids are usually estimated from amino acid pairing frequencies observed in native protein structures.28 The typical procedure for calculating the contact energies comprises three steps. First, a protein structure set is constructed from known native protein structures. Then, the inter-residue contacts of the 20 amino acids observed in those structures are counted. Finally, the contact energies are deduced from the amino acid contact frequencies by using a predefined energy function; the resulting contact energies reflect the different contact potentials of amino acids in native structures.
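A toy version of this three-step procedure is sketched below: contact energies are taken as the negative log-ratio of observed to expected pairing frequencies, in the spirit of the quasi-chemical approximation. The reference state, the pseudocount and all names are our simplifications; published potentials such as those of Miyazawa and Jernigan are derived far more carefully.

import numpy as np
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def contact_energies(contact_pairs):
    # contact_pairs: (aa1, aa2) residue pairs observed in contact in a set
    # of native structures (step 2 of the procedure described above).
    pair_counts = Counter(tuple(sorted(p)) for p in contact_pairs)
    single_counts = Counter(aa for pair in contact_pairs for aa in pair)
    total_pairs = sum(pair_counts.values())
    total_single = sum(single_counts.values())

    energies = np.zeros((20, 20))
    for i, a in enumerate(AMINO_ACIDS):
        for j, b in enumerate(AMINO_ACIDS):
            # 0.5 acts as a small pseudocount for unobserved residues/pairs.
            f_obs = (pair_counts.get(tuple(sorted((a, b))), 0) + 0.5) / total_pairs
            f_exp = (single_counts[a] + 0.5) / total_single * (single_counts[b] + 0.5) / total_single
            energies[i, j] = -np.log(f_obs / f_exp)
    return energies  # e(i, j): more negative means a more favourable contact

toy_contacts = [("L", "V"), ("L", "F"), ("K", "D"), ("A", "L"), ("V", "I")]
print(np.round(contact_energies(toy_contacts)[:3, :3], 2))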

Many previous studies have focused on structure-based encodings. In order to account for the medium- and long-range interactions which determine protein folding conformations, Tanaka and Scheraga28 evaluated empirical standard free energies for the formation of amino acid contacts from the contact frequencies. By employing a lattice model, Miyazawa and Jernigan29 estimated contact energies using the quasi-chemical approximation with an approximate treatment of the effects of chain connectivity. Later, they re-evaluated the contact energies based on a larger set of protein structures and also estimated an additional repulsive packing energy term to provide an estimate of the overall energies of inter-residue interactions.30 To investigate the validity of the quasi-chemical approximation, Skolnick et al.31 estimated the expected number of contacts by using two reference states, the first of which treats the protein as a Gaussian random coil polymer and the second of which includes the effects of chain connectivity, secondary structure and chain compactness. Their comparison shows that the quasi-chemical approximation is, in general, sufficient for extracting amino acid pair potentials. To recognize native-like protein structures, Simons et al.32 used distance-dependent statistical contact potentials to develop energy functions. Zhang and Kim33 estimated 60 residue contact energies that mainly reflect hydrophobic interactions and show a strong dependence on the three secondary structural states; according to their tests, these energies are effective in threading and three-dimensional contact prediction. Later, Micheletti et al.34 set up an iterative scheme to extract optimal interaction potentials between the amino acids.

3.2.5 Machine-learning encoding

Different from the manually defined encoding methods described earlier, the machine-learning based encoding methods learn amino acid encodings from protein sequence or structure data by using machine learning methods, typically artificial neural networks. In order to reduce the complexity of the model, the neural network for learning amino acid encodings shares its weights across the 20 amino acids. In general, the neural network contains three layers: an input layer, a hidden layer and an output layer. The input layer corresponds to the original encoding of the target amino acid, which can be one-hot encoding, physicochemical encoding, etc. The output layer also corresponds to the original encoding of the related amino acids. The hidden layer, which represents the new encoding of the target amino acid, usually has a reduced dimension compared with the original encoding.
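A highly simplified sketch of this idea is given below: an autoencoder-style network maps one-hot inputs through a low-dimensional hidden layer whose activations serve as the learned encoding. It assumes TensorFlow/Keras and is not the exact setup of the works discussed next, which trained such hidden layers inside task-specific predictors rather than as a standalone reconstruction network.

import numpy as np
import tensorflow as tf

# One-hot identities of the 20 amino acids: inputs and reconstruction targets.
aa_onehot = np.eye(20, dtype="float32")

# 20 -> 3 -> 20 network; the 3-dimensional hidden layer is the new encoding.
inputs = tf.keras.Input(shape=(20,))
hidden = tf.keras.layers.Dense(3, activation="tanh", name="embedding")(inputs)
outputs = tf.keras.layers.Dense(20, activation="softmax")(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(aa_onehot, aa_onehot, epochs=500, verbose=0)

# Extract the learned 3-dimensional code of each amino acid.
encoder = tf.keras.Model(inputs, hidden)
print(encoder.predict(aa_onehot, verbose=0).shape)  # (20, 3)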

To our knowledge, the earliest concept of learning-based amino acid encodings was proposed by Riis and Krogh.35 In order to reduce the redundancy of one-hot encoding, they used a 20 × 3 weight-sharing neural network to learn a 3-dimensional real-valued representation of the 20 amino acids from one-hot encoding. Later, Jagla and Schuchhardt36 also used a weight-sharing artificial neural network to learn a 2-dimensional encoding of amino acids for human signal peptide cleavage site recognition. Meiler et al.37 used a symmetric neural network to learn reduced representations of amino acids from amino acid physicochemical and statistical properties. The parameter representations were reduced from five and seven dimensions, respectively, to 1, 2, 3 or 4 dimensions, and these reduced representations were then used for ab initio prediction of protein secondary structure. Lin et al.8 used an artificial neural network to derive encoding schemes of amino acids from protein three-dimensional structure alignments, with each amino acid described by the values of the hidden units of the neural network.

In recent years, several new machine-learning-based encoding methods9,38,39 have been proposed with reference to distributed word representation in natural language processing. In natural language processing, the distributed representation of words has been proven to be an effective strategy for many tasks.40 The basic assumption is that words sharing similar contexts have similar meanings; therefore, these methods train a neural network model by using the target word to predict its context words, or by predicting the target word from its context words. After training on unlabeled datasets, the weights of the hidden units for each word are used as its distributed representation. In protein-related studies, a similar strategy has been used by assuming that protein sequences are sentences and that amino acids or sub-sequences are words. In previous studies, these distributed representations of amino acids or sub-sequences have shown potential in protein family classification and disordered protein identification,9 protein function site prediction,38 protein functional property prediction,39 etc.
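The sketch below shows the general recipe with the gensim word2vec implementation: sequences are split into overlapping 3-mer "words" and a skip-gram model is trained on them. ProtVec itself uses shifted 3-mer lists per sequence and a Swiss-Prot-scale corpus, so the splitting strategy, parameters and toy corpus here are assumptions for illustration only.

from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    # Split a protein sequence into overlapping k-mer "words".
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy corpus: each protein sequence becomes a "sentence" of 3-mers.
sequences = ["MKTAYIAKQR", "MKTLLLTLVV", "GSHMKTAYIA"]
corpus = [to_kmers(seq) for seq in sequences]

# Skip-gram model (sg=1): predict context 3-mers from the target 3-mer.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)
print(model.wv["MKT"].shape)  # 100-dimensional distributed representation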

3.3 Discussion

In this section, we present a theoretical discussion of amino acid encoding methods. First, we examine the classification criteria of amino acid encoding methods; second, we discuss the theoretical basis of these methods and analyze their advantages and limitations. Finally, we review and discuss the criteria for measuring an amino acid encoding method.

As introduced above, amino acid encoding methods have been divided into five categories according to their information sources and methodologies. However, it should be noted that the methods in one category are not completely distinct from those in others, and there are some similarities between encoding methods belonging to different categories. For example, the 6-bit one-hot encoding method proposed by Wang et al.12 is a dimension-reduced representation of the common one-hot encoding, but it is based on the six amino acid exchange groups derived from the PAM matrices.13 There is another classification criterion based on position relevance. As noted earlier for the evolution-based encoding methods, the methods can be divided into position-independent and position-dependent categories, and this grouping can in fact be applied to all amino acid encoding methods. The position-specific scoring matrix (PSSM) and similar techniques that extract evolutionary features from multiple sequence alignments are position-dependent methods; all other amino acid encoding methods are position-independent. The position-dependent methods can capture homologous information, while the position-independent ones reflect the basic properties of amino acids. To some extent, these two types of methods are complementary. In practice, a combination of position-independent and position-dependent encodings is often used, such as combining one-hot and PSSM,41 or combining physicochemical properties encoding and PSSM.42

Theoretically, the functions of a protein are closely related to its tertiary structure, and its tertiary structure is mostly determined by the physicochemical properties of its amino acid sequence.43 From this perspective, the evolution-based, structure-based and machine-learning encoding methods all extract information based on the physicochemical properties of the amino acids, using different strategies. Specifically, different amino acids may have different mutation tendencies during evolution due to their hydrophobicity, polarity, volume and other properties. These mutation tendencies are reflected in sequence alignments and are detected by the evolution-based encoding methods. Similarly, the physicochemical properties of amino acids affect the inter-residue contact potentials in tertiary protein structures, which form the basis of the structure-based encoding methods. The machine-learning encoding methods also learn amino acid encodings from physicochemical representations or evolutionary information (such as homologous protein structure alignments), which can be seen as another variant of the physicochemical properties. Despite the fact that these encoding methods share a similar theoretical basis, their performance differs due to restrictions in their implementation. For the one-hot encoding method, there is no artificial correlation between amino acids, but it is highly sparse and redundant, which leads to a complex machine learning model. The physicochemical properties of amino acids play fundamental roles in the protein folding process, so physicochemical property encoding methods should in theory be effective. However, as the protein folding-related physicochemical properties and their digital metrics are not fully known, developing an effective physicochemical property encoding method remains an unresolved problem. The evolution-based encoding methods extract evolutionary information using just protein sequences and can thus benefit from the dividends of large-scale protein sequence data. In particular, PSSM has shown significant performance in many studies.44 However, for proteins without homologous sequences, the performance of evolution-based methods is limited. The structure-based encoding methods encode amino acids based on inter-residue contact potentials, which constitute a low-dimensional representation of protein structure. Because of the limited number of known protein structures, their performance scope is limited. Early machine-learning encoding methods also faced the problem of insufficient data samples, but several recently developed methods have overcome this problem by taking advantage of unlabeled sequence data.9,38,39

As discussed, different amino acid encoding methods have specific advantages and limitations; so, what is the most effective encoding method? According to Wang et al.,12 the best encoding method should significantly reduce the uncertainty of the output of the prediction model, or it should capture both the global similarity and the local similarity of protein sequences; here, global similarity refers to the overall similarity among multiple sequences, while local similarity refers to motifs in the sequences. Riis and Krogh35 observed that redundant encodings lead the prediction model to overfit and thus need to be simplified. Meiler et al.37 likewise used reduced representations of amino acids' physicochemical and statistical properties for protein secondary structure prediction. Zamani and Kremer4 stated that an effective encoding must store information associated with the problem at hand while diminishing superfluous data. In summary, an effective amino acid encoding method should be information-rich and non-redundant. "Information-rich" means that the encoding contains enough information that is highly relevant to protein structure and function, such as physicochemical properties, evolutionary information, contact potentials, and so on. "Non-redundant" means that the encoding is compact and does not contain noise or other unrelated information. For example, in neural network-based protein structure and function prediction, a redundant encoding leads to a complicated network with a very large number of weights, which causes overfitting and restricts the generalization ability of the model. Therefore, under the premise of containing sufficient information, a more compact encoding will be more useful and produce better results.

Over the past two decades, several studies have investigated effective amino acid encoding methods.5 David45 examined the effectiveness of various hydrophobicity scales by using a parallel cascade identification algorithm to assess the structural or functional classification of protein sequences. Zhong et al.46 compared orthogonal encoding, hydrophobicity encoding, BLOSUM62 encoding and PSSM encoding by utilizing the Denoeux belief neural network for protein secondary structure prediction. Hu et al.6 combined orthogonal encoding, hydrophobicity encoding and BLOSUM62 encoding to find the optimal encoding scheme by using an SVM with a sliding window training scheme for protein secondary structure prediction. Their test results show that the combination of the orthogonal and BLOSUM62 matrices achieved the highest accuracy of all the encoding schemes compared. Zamani and Kremer4 investigated the efficiency of 15 amino acid encoding schemes, including orthogonal encoding, physicochemical encoding, and secondary structure- and BLOSUM62-related encodings, by training artificial neural networks to approximate the substitution matrices. Their experimental results indicate that the number (dimension) and the types (properties) of amino acid encodings are the two key factors determining encoding performance. Dongardive and Abraham47 compared the orthogonal, hydrophobicity, BLOSUM62, PAM250 and hybrid encoding schemes for protein secondary structure prediction and found that the best performance was achieved with the BLOSUM62 matrix. These studies explored amino acid encoding methods from different perspectives, but each evaluated only a subset of the encoding methods on small datasets. To present a comprehensive and systematic comparison, in this chapter we perform a large-scale comparative assessment of various amino acid encoding methods based on two tasks, protein secondary structure prediction and protein fold recognition, as presented in the following sections. It should be noted that our aim is to assess how much effective information is contained in different encoding methods, rather than to explore the optimal combination of encoding methods.

3.4 The Assessment of Encoding Methods for Protein Secondary Structure Prediction

In computational biology, protein sequence labeling tasks, such as protein secondary structure prediction, solvent accessibility prediction, disorder region prediction and torsion angle prediction, have gained a great deal of attention from researchers. Among those sequence labeling tasks, protein secondary structure prediction is the most representative task,48 and several previous amino acid encoding studies have also paid attention to this topic.6,35,46,47 Therefore, we first assess the various amino acid encoding methods based on the protein secondary structure prediction task.

3.4.1 Encoding methods selection and generation

To perform a comprehensive assessment of different amino acid encoding methods, we select 16 representative encoding methods spanning the five categories for evaluation. A brief introduction of the 16 selected encoding methods is given in Table 3-2. Except for the PSSM and HMM encodings, these are position-independent encodings and can be used directly to encode amino acids. It should be noted that some protein sequences may contain unknown amino acid types; if an original encoding does not handle this situation, such amino acids are represented by the average value of the corresponding columns. For ProtVec,9 which is a 3-gram encoding, we encode each amino acid by adding its left and right adjacent amino acids to form the corresponding 3-gram word. Since the start and end amino acids do not have enough adjacent amino acids to form 3-grams, they are represented by the "<unk>" encoding in ProtVec. Recently, further work on ProtVec (ProtVecX49) has demonstrated that the concatenation of ProtVec and k-mers can achieve better performance; here, we also evaluate the performance of ProtVec concatenated with 3-mers (named ProtVec-3mer). For the position-dependent encoding methods PSSM and HMM, we follow the common practice for generating them. Specifically, for the PSSM encoding of each protein sequence, we ran the PSI-BLAST26 tool with an e-value threshold of 0.001 and three iterations against the UniRef90 sequence database,50 which is filtered at 90% sequence identity. The HMM encoding is extracted from the HMM profile generated by running HHblits27 against the UniProt20 protein database50 with parameters "-n 3 -diff inf -cov 60". Following the HHsuite user guide, we use the first 20 columns of the HMM profile and convert the integers in the profile to amino acid emission frequencies using the formula h_fre = 2^(−0.001 × h), where h is the original integer in the HMM profile and h_fre is the corresponding amino acid emission frequency; h is set to 0 if the entry is an asterisk.
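As a small worked example of the conversion just described (helper names are ours), the sketch below turns one row of HMM profile values into emission frequencies, treating asterisks as h = 0 per the convention stated above.

import numpy as np

def hmm_row_to_frequencies(hmm_row):
    # hmm_row: values from one row of the HHblits HMM profile, each an integer
    # string or "*"; apply h_fre = 2 ** (-0.001 * h), with "*" treated as h = 0
    # following the convention described in the text.
    freqs = []
    for h in hmm_row:
        h = 0 if h == "*" else int(h)
        freqs.append(2.0 ** (-0.001 * h))
    return np.array(freqs)

# Arbitrary illustrative values, not taken from a real profile.
print(np.round(hmm_row_to_frequencies(["3706", "*", "2059", "0"]), 3))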

Table 3-2 A brief introduction of the 16 selected amino acid encoding methods.



3.4.2 Benchmark datasets for protein secondary structure prediction

Following several representative protein secondary structure prediction works11,42,51 published in recent years, we use the CullPDB dataset52 as training data and four widely used test datasets, namely the CB513 dataset,53 the CASP10 dataset,54 the CASP11 dataset55 and the CASP12 dataset,56 as test data to evaluate the performance of different features. The CullPDB dataset is a large non-homologous sequence set produced by the PISCES server,52 which culls subsets of protein sequences from the Protein Data Bank based on sequence identity and structural quality criteria. Here, we retrieved a subset of sequences whose structures have better than 1.8 Å resolution and which share less than 25% sequence identity with each other. We also removed sequences sharing more than 25% identity with sequences from the test datasets to ensure there is no homology between the training and test data; the final CullPDB dataset contains 5748 protein sequences with lengths ranging from 18 to 1455. The CB513 dataset contains 513 proteins with less than 25% sequence similarity. The Critical Assessment of techniques for protein Structure Prediction (CASP) is a highly recognized community experiment to determine state-of-the-art methods for protein structure prediction from amino acid sequences56; the recently released CASP10, CASP11 and CASP12 datasets are adopted as test datasets. It should be noted that the CASP targets used here are defined at the protein domain level. Specifically, the CASP10 dataset contains 123 protein domains with sequence lengths ranging from 24 to 498, the CASP11 dataset contains 105 protein domains with sequence lengths ranging from 34 to 520, and the CASP12 dataset contains 55 protein domains with sequence lengths ranging from 55 to 463.

Protein secondary structure labels are inferred with the DSSP program57 from the corresponding experimentally determined structures. DSSP assigns one of 8 secondary structure states to each residue; here, we adopt 3-state secondary structure prediction as the benchmark task by converting the 8 assigned states into 3 states: G, H, and I to H; B and E to E; and S, T, and C to C.
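The conversion is a simple residue-wise mapping; a minimal sketch follows (treating blank or unassigned DSSP labels as coil is our assumption).

# DSSP 8-state to 3-state conversion as described above:
# G, H, I -> H (helix); B, E -> E (strand); S, T, C -> C (coil).
EIGHT_TO_THREE = {"G": "H", "H": "H", "I": "H",
                  "B": "E", "E": "E",
                  "S": "C", "T": "C", "C": "C"}

def ss8_to_ss3(ss8_labels):
    # Unassigned or blank labels fall back to coil ("C").
    return "".join(EIGHT_TO_THREE.get(s, "C") for s in ss8_labels)

print(ss8_to_ss3("HGITSBEEC"))  # -> "HHHCCEEEC"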

3.4.3 Performance comparison by using the Random Forests method

In order to use the information of neighboring residues, many previous protein secondary structure prediction methods apply the sliding window scheme and have demonstrated considerably good results.48 Referring to those methods, we also used the sliding window scheme to evaluate the different amino acid encoding methods; the corresponding diagram is shown in Fig. 3-3. The evaluation is based on the Random Forests method from the scikit-learn toolbox,58 with a window size of 13 and 100 trees in the forest. The comparison results are shown in Table 3-3.

Figure 3-3 The diagram of the sliding window scheme using the Random Forests classifier for protein secondary structure prediction. The two target residues are Leu (L) and Ile (I), respectively; the input for each target residue is independent.
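A minimal sketch of this setup is given below: per-residue feature vectors are built by concatenating the encodings inside a window of 13 residues (zero-padded at the sequence ends) and fed to a scikit-learn Random Forest with 100 trees. The padding choice and helper names are ours.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(encoded_seq, window=13):
    # encoded_seq: L x d matrix produced by any amino acid encoding.
    half = window // 2
    length, dim = encoded_seq.shape
    padded = np.vstack([np.zeros((half, dim)), encoded_seq, np.zeros((half, dim))])
    return np.array([padded[i:i + window].ravel() for i in range(length)])

def train_random_forest(encoded_seqs, label_seqs, window=13):
    # encoded_seqs: list of L x d matrices; label_seqs: per-residue 3-state labels.
    X = np.vstack([window_features(seq, window) for seq in encoded_seqs])
    y = np.concatenate(label_seqs)
    clf = RandomForestClassifier(n_estimators=100)  # 100 trees, as in the text
    clf.fit(X, y)
    return clf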

First, we analyze and discuss the performance of the different methods within each category. For the binary encoding methods, one-hot encoding is the most widely used; the one-hot (6-bit) encoding and the binary 5-bit encoding are two dimension-reduced representations of it. As can be seen from Table 3-3, the best performance is achieved by the one-hot encoding, which suggests that some effective information is lost in the artificial dimension reduction of the one-hot (6-bit) and binary 5-bit encodings. For the physicochemical property encodings, the hydrophobicity matrix contains only hydrophobicity-related information and performs poorly, while the Meiler parameters and the Atchley factors are constructed from multiple physicochemical information sources and perform better. This shows that integrating multiple physicochemical information sources and parameters is valuable. For the evolution-based encodings, the position-dependent encodings (PSSM and HMM) are clearly much more powerful than the position-independent encodings (PAM250 and BLOSUM62), which shows that homologous information is strongly associated with protein structure. The two structure-based encodings have comparable performances. For the machine-learning encodings, ANN4D performs better than AESNN3 and ProtVec, while the ProtVec-3mer encoding achieves performance similar to that of ProtVec. Second, on the whole, the position-dependent evolution-based encoding methods (PSSM and HMM) achieve the best performance. This result suggests that the evolutionary information extracted from the MSAs is more informative than the global information extracted from other sources. Third, the performances of the different encoding methods show a certain degree of correlation with encoding dimension: the low-dimensional encodings, i.e. the one-hot (6-bit), binary 5-bit, AESNN3 and ANN4D encodings, perform worse than the high-dimensional encodings. This correlation could be due to the sliding window scheme and the Random Forests algorithm; a larger feature dimension is more conducive to recognizing the secondary structure states, but an excessively large dimension leads to poor performance (as with ProtVec and ProtVec-3mer).

Table 3-3 Protein secondary structure prediction accuracy of 16 amino acid encoding methods by using the Random Forests method.


3.4.4 Performance comparison by using the BRNN method

In recent years, deep learning-based methods for protein secondary structure prediction have achieved significant improvements.48 One of the most important advantages of deep learning methods is that they can capture both neighboring and long-range interactions, thereby avoiding the shortcomings of sliding window methods with handcrafted window sizes. For example, Heffernan et al.42 achieved state-of-the-art performance by using long short-term memory (LSTM) bidirectional recurrent neural networks. Therefore, to exclude the potential influence of the handcrafted window size, we also perform an assessment using bidirectional recurrent neural networks (BRNN) with long short-term memory cells. The model used here is similar to the model used in Heffernan's work,42 as shown in Fig. 3-4; it contains two BRNN layers with 256 LSTM cells each and two fully connected (dense) layers with 1024 and 512 nodes, and it is implemented with the open-source deep learning library TensorFlow.59
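A Keras sketch of such a network is shown below, with two bidirectional LSTM layers of 256 cells and dense layers of 1024 and 512 nodes as described above; the activation functions, optimizer and other hyperparameters are assumptions, not details taken from the original work.

import tensorflow as tf

def build_brnn(input_dim, num_classes=3):
    inputs = tf.keras.Input(shape=(None, input_dim))  # variable-length sequences
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(256, return_sequences=True))(inputs)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(256, return_sequences=True))(x)
    x = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(1024, activation="relu"))(x)
    x = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(512, activation="relu"))(x)
    outputs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(num_classes, activation="softmax"))(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_brnn(input_dim=20)  # e.g. one-hot encoded residues
model.summary()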

The corresponding comparison results of the 16 selected encoding methods are shown in Table 3-4. Overall, the BRNN-based method has better performance than the Random Forests-based method, but there are also some specific similarities and differences between them. For the binary encoding methods, one-hot encoding still shows the best performance, which once again confirms the information loss of the one-hot (6-bit) and binary 5-bit encoding methods. For the physicochemical property encodings, the Meiler parameters do not perform as well as the Atchley factors, suggesting that the Atchley factors are more efficient for deep learning methods. For the evolution-based encodings, the PSSM encoding achieves the best accuracy, while the HMM encoding only reaches about the same accuracy as the position-independent encodings (PAM250 and BLOSUM62). The difference could be due to the different levels of homologous sequence identity. The HMM encoding is extracted from the UniProt20 database with 20% sequence identity, while the PSSM encoding is extracted from the UniRef90 database with 90% sequence identity. Therefore, for a given protein sequence, its MSA from the UniProt20 database mainly contains remote homologous sequences, while its MSA from the UniRef90 database usually contains more close homologous sequences. From the results in Table 3-4, the evolutionary information of close homologous sequences is more powerful for distinguishing different protein secondary structures than that of remote homologous sequences. For the structure-based encodings, the Micheletti potentials perform much better with the BRNN method than with the Random Forests method. For the machine-learning encodings, ProtVec and ProtVec-3mer achieve significantly better performance than the corresponding values in Table 3-3, which demonstrates the potential of machine-learning encoding. It is worth noting that ProtVec-3mer performs better than ProtVec with the BRNN algorithm, consistent with the authors' recent work.49 Overall, for the deep learning algorithm BRNN, the position-dependent PSSM encoding still performs best among all encoding methods. Among the position-independent encoding methods, the Micheletti potentials achieve the best performance, which demonstrates that structure-related information has application potential in protein structure and function studies.

Figure 3-4 The architecture of the long short-term memory (LSTM) bidirectional recurrent neural networks for protein secondary structure prediction.

Table 3-4 Protein secondary structure prediction accuracy of 16 amino acid encoding methods by using the BRNN method.


3.5 Assessments of Encoding Methods for Protein Fold Recognition

In addition to the protein sequence labeling tasks, protein sequence classification tasks have also received a lot of attention, such as protein remote homology detection60 and protein fold recognition.61,62 Here, we perform another assessment of the 16 selected amino acid encoding methods based on the protein fold recognition task. Many machine learning methods have been developed to classify protein sequences into different fold categories for protein fold recognition.60 Deep learning methods can automatically extract discriminative patterns from variable-length protein sequences and have achieved significant success.61 Referring to Hou's work,61 we used a one-dimensional deep convolution neural network (DCNN) to assess the usefulness of the 16 selected encoding methods for protein fold recognition. As shown in Fig. 3-5, the deep convolution neural network used here has 10 hidden convolution layers, each with 10 filters of two window sizes (6 and 10), a max pooling layer that keeps the 20 maximum values, and a flatten layer which is fully connected with the output layer to output the probability of each fold type.

Figure 3-5 The architecture of the one-dimensional deep convolution neural network for protein fold recognition.
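A simplified sketch of such a network is given below. It follows the description above (10 convolution layers, 10 filters each for kernel sizes 6 and 10, pooling of the 20 largest activations per channel, and a final softmax over fold types), but the way the two kernel sizes are combined, the activation functions and the optimizer are our assumptions rather than details of the original model.

import tensorflow as tf

def build_dcnn(input_dim, num_folds=184, num_layers=10, filters=10):
    inputs = tf.keras.Input(shape=(None, input_dim))  # variable-length sequences
    x = inputs
    for _ in range(num_layers):
        # Two parallel convolutions with window sizes 6 and 10, concatenated.
        conv6 = tf.keras.layers.Conv1D(filters, 6, padding="same", activation="relu")(x)
        conv10 = tf.keras.layers.Conv1D(filters, 10, padding="same", activation="relu")(x)
        x = tf.keras.layers.Concatenate()([conv6, conv10])

    def k_max_pool(t, k=20):
        # Keep the 20 largest activations of each channel over the sequence axis.
        t = tf.transpose(t, [0, 2, 1])          # (batch, channels, length)
        values = tf.math.top_k(t, k=k).values   # (batch, channels, k)
        return tf.reshape(values, (-1, values.shape[1] * k))

    x = tf.keras.layers.Lambda(k_max_pool)(x)   # flattened, fed to the output layer
    outputs = tf.keras.layers.Dense(num_folds, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_dcnn(input_dim=20)  # sequences must be at least 20 residues long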

3.5.1 Benchmark datasets for protein fold recognition

The most commonly used datasets for evaluating protein fold recognition methods are the SCOP database63 and its extended version, the SCOPe database.64 SCOP is a manual structural classification of proteins whose three-dimensional structures have been determined. All of the proteins in SCOP are classified into four hierarchical levels: class, fold, superfamily and family. Folds represent the main characteristics of protein structures, and the protein fold can reveal the evolutionary relationship between a protein sequence and its corresponding tertiary structure.65 Here we use the F184 dataset, which was constructed by Xia et al.66 based on the SCOPe database. The F184 dataset contains 6451 sequences with less than 25% sequence identity from 184 folds. Each fold contains at least 10 sequences, which ensures that there are enough sequences for training and testing. We randomly selected 20% of the sequences of each fold as test data, leaving the remaining 80% as training data. This yields 5230 sequences for training and 1221 sequences for testing.

3.5.2 Performances of different encodings on protein fold recognition task

The comparison results of the 16 selected encoding methods for protein fold recognition are listed in Table 3-5. It should be noted that the training process for each encoding method was repeated 10 times to eliminate stochastic effects. In contrast to the protein secondary structure prediction results, the performances of most position-independent encoding methods are similar. All of the binary, physicochemical and machine-learning-based encoding methods (except ProtVec) achieve mean accuracies of about 30%, demonstrating that the position-independent encodings offer only limited information for protein fold classification. The two structure-based encodings have better accuracies of nearly 33%, demonstrating that the structure potential is more closely related to the protein fold type. The two evolution-based methods PAM250 and BLOSUM62 perform best among the position-independent encoding methods, which means that the evolutionary information is more strongly coupled with the protein structure. The position-dependent encoding methods PSSM and HMM achieve better performances still, especially PSSM. This again indicates that evolutionary information is tightly coupled with protein structure, and that close homologous information is more useful than remote homologous information. The machine-learning-based AESNN3 and ANN4D encodings achieve performances comparable to the other position-independent encoding methods but with much lower dimensions (3 for AESNN3 and 4 for ANN4D), showing their potential for further application. The performance of the ProtVec encoding is poor, which could be caused by the overlapping strategy, as also mentioned by the authors.9 The ProtVec-3mer encoding has better performance, demonstrating the effectiveness of combining ProtVec and 3-mers.

Table 3-5 The performance differences between the various kinds of encodings.


Notes: Top 1: the accuracy when the first predicted fold type is the actual fold type. Top 5: the accuracy when the top 5 predicted fold types contain the actual fold type. Top 10: the accuracy when the top 10 predicted fold types contain the actual fold type. Mean: the mean of the Top 1, Top 5 and Top 10 accuracies.

It should be noted that the benchmark presented here is based on the DCNN method, and these encodings may achieve different performances with other machine learning methods. The DCNN method can handle variable-length sequences and has achieved significant success on fold recognition tasks, which are the main reasons for its selection here.

3.6 Conclusions

Amino acid encoding is the first step of protein structure and function prediction, and it is one of the foundations for achieving success in these studies. In this chapter, we proposed a systematic classification of various amino acid encoding methods and reviewed the methods in each category. According to their information sources and information extraction methodologies, these methods are grouped into five categories: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding and machine-learning encoding. To benchmark and compare different amino acid encoding methods, we first selected 16 representative methods from these five categories. Then, based on two representative protein-related tasks, protein secondary structure prediction and protein fold recognition, we constructed three machine learning models with reference to state-of-the-art studies. Finally, for each encoding method we encoded the protein sequences and ran the same training and test procedure on the benchmark datasets. The performance of each encoding method is regarded as an indicator of its potential in protein structure and function studies.

The assessment results show that the evolution-based position-dependent encoding method PSSM consistently achieves the best performance on both the protein secondary structure prediction and protein fold recognition tasks, suggesting its important role in protein structure and function prediction. However, another evolution-based position-dependent encoding method, HMM, does not perform as well; the main reason could be that remote homologous sequences provide only limited evolutionary information for the target residue. The one-hot encoding is highly sparse and leads to complex machine learning models, while its two compressed representations, one-hot (6-bit) encoding and binary 5-bit encoding, lose valuable information to varying degrees and have not been widely used in related research. More reasonable strategies to reduce the dimension of one-hot encoding need to be developed. For the physicochemical property encodings, the variety of properties and the extraction methodologies are two important factors in constructing a valuable encoding. Structure-based encodings and machine-learning encodings achieve comparable or even better performances than other widely used encodings, suggesting that more attention should be paid to these two categories.

At a time when the dividends of data and algorithms have largely been realized, exploring more effective encoding schemes for amino acids should be a key factor in further improving the performance of protein structure and function prediction. In the following, we provide some perspectives for future related studies. First, updated position-independent encodings should be constructed from new protein datasets. Except for one-hot encoding, all other position-independent encoding methods construct their encodings from information extracted from native protein sequences or structures. Random errors are unavoidable in those encodings, and larger datasets will help to reduce them. With the continued development of sequencing and structure determination techniques, the number of protein sequences and structures has grown rapidly in recent years. Considering that most of the position-independent encoding methods were proposed a decade or more ago, it would be valuable to reconstruct them using new datasets. Second, structure-based and function-based encoding methods require more attention. It has been demonstrated that structure-based encoding methods are effective in protein secondary structure prediction and protein fold recognition. These encodings reflect the structural potentials of amino acids, which should be highly correlated with protein structure and function. As the number of proteins with known structures grows, structure-based encodings have considerable prospects. Furthermore, encodings reflecting functional potentials may be more useful than others for protein function prediction; thus, exploring function-based encoding methods is a worthwhile topic. Third, the machine-learning encoding methods are promising topics for future studies. As amino acid encoding is an open problem, most encoding methods rest on an artificially defined basis; for example, the physicochemical property encodings are constructed from protein folding-related properties observed by researchers, which inevitably introduces some unknown biases. Machine-learning methods can avoid such artificial biases by learning amino acid encodings from biological data automatically. Protein sequences and natural languages share certain similarities; for instance, protein sequences can be regarded as sentences, and amino acids or short peptides as words. Considering that distributed word representations have achieved broad performance improvements in natural language processing tasks, protein-related prediction should also benefit from distributed representations of amino acids or amino acid n-grams. Some recent studies have demonstrated the potential of amino acid distributed representations in protein family classification, disordered protein identification and protein functional property prediction, but most of these methods focus on n-gram distributed representations, which cannot be used directly to predict residue-level properties. Thus, residue-level distributed representations of amino acids are a topic that needs more attention.

References

[1]Liu B., Wang X., Lin L., Dong Q., Wang X. A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis. BMC Bioinfo, 2008, 9(1): 510.

[2]Liu B., Liu F., Wang X., Chen J., Fang L., Chou K.-C. Pse-in-One: A web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res, 2015, 43(W1): W65–W71.

[3]Liu B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Briefings in Bioinformatics, 2019, 20(4): 1280–1294.

[4]Zamani M., Kremer S.C. Amino acid encoding schemes for machine learning methods. In the 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), 2011, pp. 327–333.

[5]Yoo P.D., Zhou B.B., Zomaya A.Y. Machine learning techniques for protein secondary structure prediction: An overview and evaluation. Curr Bioinfo, 2008, 3(2): 74–86.

[6]Hu H.-J., Pan Y., Harrison R., Tai P.C. Improved protein secondary structure prediction using support vector machine with a new encoding scheme and an advanced tertiary classifier. IEEE Trans NanoBiosci, 2004, 3(4): 265–271.

[7]Miyazawa S., Jernigan R.L. Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins, 1999, 34(1): 49–68.

[8]Lin K., May A.C.W., Taylor W.R. Amino acid encoding schemes from protein structure alignments: Multi-dimensional vectors to describe residue types. J Theor Biol, 2002, 216(3): 361–365.

[9]Asgari E., Mofrad M.R.K. Continuous distributed representation of biological sequences for deep proteomics and genomics. Plos One, 2015, 10(11): e0141287.

[10]Kawashima S., Pokarowski P., Pokarowska M., Kolinski A., Katayama T., Kanehisa M. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res, 2008, 36(suppl 1): D202–D205.

[11]Wang S., Peng J., Ma J., Xu J. Protein secondary structure prediction using deep convolutional neural fields. Sci Rep, 2016, 6.

[12]Wang J.T.L., Ma Q., Shasha D., Wu C.H. New techniques for extracting features from protein sequences. IBM Syst J, 2001, 40(2): 426–441.

[13]Dayhoff M.O. A model of evolutionary change in proteins. Atlas Prot Seq Struct, 1978, 5: 89–99.

[14]White G., Seffens W. Using a neural network to backtranslate amino acid sequences. Electronic J Biotechnol, 1998, 1(3): 17–18.

[15]Atchley W.R., Zhao J., Fernandes A.D., Drüke T. Solving the protein sequence metric problem. Proc Natl Acad Sci USA, 2005, 102(18): 6395–6400.

[16]Rose G., Geselowitz A., Lesser G., Lee R., Zehfus M. Hydrophobicity of amino acid residues in globular proteins. Science, 1985, 229(4716): 834–838.

[17]Betts M.J., Russell R.B. Amino acid properties and consequences of substitutions. Bioinfo Genet, 2003, 317: 289.

[18]Fauchère J.-L., Charton M., Kier L.B., Verloop A., Pliska V. Amino acid side chain parameters for correlation studies in biology and pharmacology. Chem Biol Drug Design, 1988, 32(4): 269–278.

[19]Radzicka A., Wolfenden R. Comparing the polarities of the amino acids: side-chain distribution coefficients between the vapor phase, cyclohexane, 1-octanol, and neutral aqueous solution. Biochemistry, 1988, 27(5): 1664–1670.

[20]Lohmann R., Schneider G., Behrens D., Wrede P. A neural network model for the prediction of membrane spanning amino acid sequences. Prot Sci, 1994, 3(9): 1597–1601.

[21]Elofsson A. A study on protein sequence alignment quality. Proteins, 2002, 46(3): 330–339.

[22]Oren E.E., Tamerler C., Sahin D., Hnilova M., Seker U.O.S., Sarikaya M., Samudrala R. A novel knowledge-based approach to design inorganic-binding peptides. Bioinformatics, 2007, 23(21): 2816–2822.

[23]Henikoff S., Henikoff J.G. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA, 1992, 89(22): 10915–10919.

[24]Henikoff S., Henikoff J.G. Automated assembly of protein blocks for database searching. Nucleic Acids Res, 1991, 19(23): 6565–6572.

[25]Stormo G.D., Schneider T.D., Gold L., Ehrenfeucht A. Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res, 1982, 10(9): 2997–3011.

[26]Altschul S.F., Koonin E.V. Iterated profile searches with PSI-BLAST — A tool for discovery in protein databases. Trends Biochem Sci, 1998, 23(11): 444–447.

[27]Remmert M., Biegert A., Hauser A., Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Meth, 2012, 9(2): 173.

[28]Tanaka S., Scheraga H.A. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules, 1976, 9(6): 945–950.

[29]Miyazawa S., Jernigan R.L. Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules, 1985, 18(3): 534–552.

[30]Miyazawa S., Jernigan R.L. Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol, 1996, 256(3): 623–644.

[31]Skolnick J., Godzik A., Jaroszewski L., Kolinski A. Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Prot Sci, 1997, 6(3): 676–688.

[32]Simons K.T., Ruczinski I., Kooperberg C., Fox B.A., Bystroff C., Baker D. Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins: Structure, Function, and Bioinformatics, 1999, 34(1): 82–95.

[33]Zhang C., Kim S.-H. Environment-dependent residue contact energies for proteins. Proc Natl Acad Sci USA, 2000, 97(6): 2550–2555.

[34]Micheletti C., Seno F., Banavar J.R., Maritan A. Learning effective amino acid interactions through iterative stochastic techniques. Proteins, 2001, 42(3): 422–431.

[35]Riis S.K., Krogh A. Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J Comput Biol, 1996, 3(1): 163–183.

[36]Jagla B., Schuchhardt J. Adaptive encoding neural networks for the recognition of human signal peptide cleavage sites. Bioinformatics, 2000, 16(3): 245–250.

[37]Meiler J., Müller M., Zeidler A., Schmäschke F. Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. Mol Model Annu, 2001, 7(9): 360–369.

[38]Xu Y., Song J., Wilson C., Whisstock J.C. PhosContext2vec: A distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction. Sci Rep, 2018, 8.

[39]Yang K.K., Wu Z., Bedbrook C.N., Arnold F.H. Learned protein embeddings for machine learning. Bioinformatics, 2018, 34(15): 2642–2648.

[40]Mikolov T., Sutskever I., Chen K., Corrado G.S., Dean J. Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, Curran Associates inc., New York, USA, 2013, 3111–3119.

[41]Hou J., Adhikari B., Cheng J. DeepSF: Deep convolutional neural network for mapping protein sequences to folds. Bioinformatics, 2017, 34(8): 1295–1303.

[42]Heffernan R., Yang Y., Paliwal K., Zhou Y. Capturing non-local interactions by long short term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers, and solvent accessibility. Bioinformatics, 2017: btx218.

[43]Anfinsen C.B. Principles that govern the folding of protein chains. Science, 1973, 181(4096): 223–230.

[44]Chen J., Guo M., Wang X., Liu B. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief Bioinfo, 2018, 19(2): 231–244.

[45]David R. Applications of nonlinear system identification to protein structural prediction. Thesis (S.M.) — Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 2000.

[46]Zhong W., Altun G., Tian X., Harrison R., Tai P.C., Pan Y. Parallel protein secondary structure prediction based on neural networks. IEEE, 2004: 2968–2971.

[47]Dongardive J., Abraham S. Reaching optimized parameter set: Protein secondary structure prediction using neural network. Neural Comput Appl, 2017, 28(8): 1947–1974.

[48]Yang Y., Gao J., Wang J., Heffernan R., Hanson J., Paliwal K., Zhou Y. Sixty-five years of the long march in protein secondary structure prediction: The final stretch? Brief Bioinfo, 2016: bbw129.

[49]Asgari E., McHardy A.C., Mofrad M.R. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci Rep, 2019, 9(1): 3577.

[50]Consortium U. UniProt: A hub for protein information. Nucleic Acids Res, 2014, 43(D1): D204–D212.

[51]Li Z., Yu Y. Protein secondary structure prediction using cascaded convolutional and recurrent neural networks. arXiv preprint arXiv:1604.07176, 2016.

[52]Wang G., Dunbrack R.L. PISCES: A protein sequence culling server. Bioinformatics, 2003, 19(12): 1589–1591.

[53]Cuff J.A., Barton G.J. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins, 1999, 34(4): 508–519.

[54]Moult J., Fidelis K., Kryshtafovych A., Schwede T., Tramontano A. Critical assessment of methods of protein structure prediction (CASP) — Round X. Proteins, 2013, 82(S2): 1–6.

[55]Kinch L.N., Li W., Schaeffer R.D., Dunbrack R.L., Monastyrskyy B., Kryshtafovych A., Grishin N.V. CASP 11 target classification. Proteins, 2016, 84(S1): 20–33.

[56]Moult J., Fidelis K., Kryshtafovych A., Schwede T., Tramontano A. Critical assessment of methods of protein structure prediction (CASP) — Round XII. Proteins, 2018, 86: 7–15.

[57]Kabsch W., Sander C. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983, 22(12): 2577–2637.

[58]Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V. Scikit-learn: Machine learning in Python. J Machine Learning Res, 2011, 12(Oct): 2825–2830.

[59]Abadi M.N., Barham P., Chen J., Chen Z., Davis A., Dean J., Devin M., Ghemawat S., Irving G., Isard M. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), 2016, pp. 265–283.

[60]Chen J., Guo M., Wang X., Liu B. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief Bioinfo, 2016, 19(2): 231–244.

[61]Hou J., Adhikari B., Cheng J. DeepSF: Deep convolutional neural network for mapping protein sequences to folds. Bioinformatics, 2017, 34(8): 1295–1303.

[62]Xia J., Peng Z., Qi D., Mu H., Yang J. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier. Bioinformatics, 2016, 33(6): 863–870.
