Biological Language Model, by Qiwen Dong

Chapter 2
Linguistic Feature Analysis of Protein Sequences
2.1 Motivation and Basic Idea
Proteins play an important role in the function of complex biological systems, yet the relationship between the primary sequences, three-dimensional structures and functions of proteins remains one of the most important unanswered questions in biology. With the completion of the Human Genome Project and continuing advances in accurate biological sequencing, a large number of genomic and proteomic sequences are now available for many organisms. The exponential increase of these data provides an opportunity to attack the sequence–structure–function mapping problem with sophisticated data-driven methods, which have been used successfully in the domain of natural language processing. There are analogies between biological sequences and natural language: in linguistics, words and phrases combine to form meaningful sentences, while in biology, ordered nucleotides constitute genes and particular protein sequences determine the structure and function of a protein.1 But is there a “language” in biological sequences?
Mantegna et al.2 analyzed the linguistic features of noncoding DNA and argued that a “language” exists in noncoding DNA. Although that work has shortcomings,3–5 many methods from natural language processing have since been applied to biological sequences. N-grams of DNA6 and protein7 have been extracted. A bio-dictionary has been built and used to annotate proteins.8 Latent semantic analysis has been used to characterize the secondary structure of proteins.9 Probabilistic models from speech recognition have been used to enhance protein domain discovery.10
The n-gram analysis method is one of the most frequently used techniques in computational linguistics. It assumes that only the previous n − 1 words in a sentence affect the probability of the next word.11 It has been used successfully in automatic speech recognition, document classification, information extraction, statistical machine translation and other challenging natural language tasks. In this chapter, the n-grams of whole genome protein sequences are extracted, their conformance to Zipf’s law is analyzed, and some statistical features are derived from them.
2.2 Comparative n-gram Analysis
Amino acids are treated as words, since each amino acid carries a chemical “meaning”. To extract the n-grams from whole genome protein sequences, all the proteins of the same organism were arranged in series and separated by blanks, e.g. protein1 protein2 protein3 etc. Due to the large size of the genomic data, a suffix array12,13 was used to reduce the computational cost. To extract the n-gram statistics, we developed a toolkit that can carry out the following functions:
1.Count protein number and length.
2.Count n-grams and most frequent n-grams.
3.Count n-grams of specified length.
4.Determine relative frequencies of specific n-grams across organisms.
5.Assess the distribution of n-gram frequencies in a specific organism.
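The core of functions 2 and 3 above is an n-gram counter over the concatenated proteome. A minimal sketch in Python (the toy sequences and the function name are illustrative, not the authors' actual toolkit, which uses a suffix array for scale):

```python
from collections import Counter

def count_ngrams(proteins, n):
    """Count length-n substrings (n-grams) within each protein.

    Proteins are treated as blank-separated 'sentences', so an
    n-gram never spans a protein boundary.
    """
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

# Toy proteome (hypothetical sequences, not real proteins)
proteins = ["MKVLAA", "MKVPPP"]
counts = count_ngrams(proteins, 3)
top = counts.most_common(5)  # most frequent 3-grams, function 2 above
```

For whole genomes a suffix array gives the same counts in less memory, but `Counter` suffices to illustrate the statistics being gathered.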
The method was applied to protein sequences derived from the whole genome sequences of 20 organisms. The protein sequence data was downloaded from the Swiss-Prot database.14 The number of proteins varies from 484 (Mycoplasma genitalium) to 25,612 (Human).
We developed a modification of Zipf-like analysis that can reveal differences in word usage between organisms. First, the amino acid n-grams of a given length were sorted in descending order of frequency for the organism of choice. Comparative n-gram plots, comparing the n-grams of one organism to those of the other organisms, were then drawn using the top 20 n-grams. Figure 2-1 shows the comparative n-gram analysis of Human (A) for n = 3 and R_norvegicus (B) for n = 4. The x-axis represents the ranked n-grams of the organism of choice, and the y-axis represents the corresponding frequency. The sorted n-grams of the organism of choice are shown as the bold line; the remaining lines indicate the frequencies of the n-grams of the same rank in the other organisms. Table 2-1 lists the 20 organisms used in this book.
In natural language, some words are used frequently and some rarely; similarly in proteins, the 20 amino acids are used with different frequencies. From the uni-gram plots of the 20 organisms, Leucine was found to be among the most frequent amino acids, ranking in the top three, whereas Tryptophan and Cysteine are the rarest, ranking in the last three. In language, frequent words are usually not closely related to the actual meaning of a sentence, whereas rare words often are. So too with the rare amino acids, which may be important for the structure and function of the protein.
Another statistical feature of the n-grams is that there are organism-specific “phrases” in the protein sequences. Examples are shown in Fig. 2-1. In Human (Fig. 2-1(A)), the phrases “PPP”, “PGP” and “SSP” are among the top 20 most frequently used 3-grams, but they appear in other organisms with very low frequencies. Likewise in R_norvegicus (Fig. 2-1(B)), similar phrases are “HTGE”, “GEKP”, “CGKA”, “GKAF”, “IHTG” and “PYKC”. These highly idiosyncratic n-grams suggest that there are organism-specific usages of “phrases” in protein sequences.
Table 2-1 Organism names used in the plot.
Organism | Organism |
A_thaliana | Human |
Aeropyrum_pernix | Methanopyrus_kandleri |
arabidopsis | Streptomyces_avermitilis |
Archaeoglobus_fulgidus | Mycoplasma_genitalium |
Bacillus_anthracis_Ames | Neisseria_meningitidis_MC58 |
Bifidobacterium_longum | Pasteurella_multocida |
Borrelia_burgdorferi | R_norvegicus |
Buchnera_aphidicola_Sg | s_pombe |
Encephalitozoon_cuniculi | Worm |
Fusobacterium_nucleatum | Yeastpom |
Figure 2-1 Comparative n-gram analysis of Human (A) for n = 3 and R_norvegicus (B) for n = 4.
2.3 Zipf’s Law Analysis
Claiming Zipf’s law for a data set seems simple enough: if n values $x_i$ ($i = 1, 2, \ldots, n$) are ranked so that $x_1 \ge x_2 \ge \ldots \ge x_r \ge \ldots \ge x_n$, Zipf’s law15 states that

$x_r = C\,r^{-\alpha}$,

where $x_r$ is the value in the data set whose rank is r, and C and α are constants that characterize the law. Taking logarithms, it can be rewritten as

$\log x_r = \log C - \alpha \log r$.
This equation implies that the xr versus r plot on a log–log scale will be a straight line.
In natural language, word frequencies and their ranks follow Zipf’s law. In English in particular, Zipf’s law applies to words, parts of speech, sentences and so on.
Zipf’s law for the n-grams has been analyzed using the n-gram statistics. Figure 2-2 shows the log–log plot of n-gram frequency versus rank for A_thaliana (A) and Human (B). When n is larger than 4, the plot approximates a straight line and the value of α is close to 0.5. We can therefore claim that the n-grams of whole genome protein sequences approximately follow Zipf’s law when n is larger than 4.
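The exponent α can be estimated from the sorted counts by an ordinary least-squares fit of log frequency against log rank, since $\log x_r = \log C - \alpha \log r$ is linear on a log–log scale. A minimal sketch (pure Python, not the authors' code; the synthetic counts are for illustration only):

```python
import math

def zipf_alpha(frequencies):
    """Estimate the Zipf exponent alpha from a list of n-gram counts.

    Sorts the counts in descending order, then fits log(frequency)
    against log(rank) by least squares; -slope estimates the alpha
    in x_r = C * r**(-alpha).
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic counts that follow x_r = 1000 * r**(-0.5) exactly,
# so the fit should recover alpha close to 0.5
counts = [1000 * r ** -0.5 for r in range(1, 101)]
alpha = zipf_alpha(counts)
```

Real n-gram counts deviate from the line at the head and tail of the distribution, so in practice the fit is often restricted to the central ranks.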
A statistical measure giving partial information about the degree of complexity of a symbolic sequence is the n-gram entropy of the analyzed text. The Shannon n-gram entropy is defined as

$H_n = -\sum_{i=1}^{\lambda^n} P_i \log_2 P_i$,

where $P_i$ is the frequency of the i-th n-gram and λ is the number of letters in the alphabet.

Figure 2-2 Zipf’s law analysis for A_thaliana (A) and Human (B).
From the n-gram entropy, one can obtain the redundancy R of any text. The redundancy is given as

$R_n = 1 - \dfrac{H_n}{nK}$,

where $K = \log_2 \lambda$. The redundancy is a manifestation of the flexibility of the underlying language.
To test whether the n-gram Zipf law could be explained by chance sampling, random genome protein sequences were generated with the same sequence lengths and amino acid frequencies as the natural genome. The process used to generate these random genome sequences is the same as the one used by Chatzidimitriou-Dreismann et al.3
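One simple way to produce such a control, preserving both the protein lengths and the overall amino acid composition, is to shuffle all residues of the proteome and re-cut them at the original protein boundaries. A sketch under that assumption (this is one way to realize the stated constraints, not necessarily the exact published procedure):

```python
import random

def random_proteome(proteins, seed=0):
    """Build an artificial proteome with the same protein lengths and
    the same overall amino-acid composition as the natural one, by
    pooling all residues, shuffling them, and re-cutting at the
    original lengths."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    pool = list("".join(proteins))
    rng.shuffle(pool)
    out, pos = [], 0
    for seq in proteins:
        out.append("".join(pool[pos:pos + len(seq)]))
        pos += len(seq)
    return out
```

By construction the uni-gram statistics of the artificial proteome match the natural one exactly; only the higher-order (n > 1) structure is destroyed, which is what the comparison probes.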
The n-gram redundancy of natural and artificial genome protein sequences has been calculated for different values of n (see Fig. 2-3); the n-gram redundancy can be approximately expressed as

$R_n \approx 1 - \dfrac{H_n}{n \log_2 20}$,

since the alphabet here consists of the 20 amino acids, so the value of λ is 20.
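Putting the two definitions together, the entropy and redundancy can be computed directly from the n-gram counts. A minimal sketch (illustrative function names; the maximum entropy per residue is $K = \log_2 20$ as above):

```python
import math
from collections import Counter

def ngram_entropy(proteins, n):
    """Shannon n-gram entropy H_n = -sum(P_i * log2 P_i), with P_i the
    relative frequency of each observed n-gram."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

def redundancy(proteins, n, alphabet_size=20):
    """Redundancy R_n = 1 - H_n / (n * K), with K = log2(alphabet size)."""
    return 1.0 - ngram_entropy(proteins, n) / (n * math.log2(alphabet_size))
```

As a sanity check, a sequence using all 20 amino acids with equal frequency has uni-gram entropy $\log_2 20$ and therefore redundancy 0; any bias in usage pushes the redundancy above 0.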
From Fig. 2-3, one can see that the n-gram redundancy of the natural genome is larger than that of the artificial genome. This means that the n-gram entropy of the natural genome is smaller, and that a “language” may exist in the protein sequences.
2.4 Distinguishing the Organisms by a Uni-Gram Model
Here, perplexity is used to distinguish the different organisms. Perplexity measures the predictive ability of a language model on a test text. Let $W = w_1, w_2, \ldots, w_n$ denote the sequence of words in the test text, let $c_k(i)$ be the context the language model chooses for predicting the ith word, and let $p(w_i \mid c_k(i))$ denote the probability the model assigns to $w_i$.
Figure 2-3 The n-gram redundancy comparison of a natural and random genome for A_thaliana (A) and Human (B).
The total probability (TP) of the sequence is

$TP = \prod_{i=1}^{n} p(w_i \mid c_k(i))$.

Then, the perplexity PP is

$PP = TP^{-1/n}$,

where n is the total length of the test sequence.
A simple uni-gram (context-independent amino acid) model was trained on 90 percent of the proteins from Borrelia_burgdorferi. The perplexity of the remaining 10 percent, and of the proteins from the other 19 organisms, was then calculated; Table 2-2 gives detailed results. Different organisms have different perplexities, which indicates that different “dialects” may be embodied in the proteins of different organisms. Another important observation is that the perplexity is independent of the size of the test set. To validate this, a further experiment was carried out: the proteins of A_thaliana were used to train the uni-gram model, and the human proteins were split randomly into 10 shares whose perplexities were calculated separately. The results were 18.2049, 18.2091, 18.153, 18.1905, 18.2698, 18.2101, 18.1556, 18.1495, 18.3173 and 18.1925. These perplexities vary only over a small range, so the perplexity depends on the uni-gram model and on the organism being tested, and not on the amount of test data.
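For a uni-gram model the context $c_k(i)$ is empty, so TP is just the product of per-residue probabilities, and PP is conveniently computed in log space. A sketch under stated assumptions (add-one smoothing and the helper names are our choices for illustration, not necessarily the authors' exact setup):

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_unigram(proteins, alphabet=AMINO_ACIDS):
    """Maximum-likelihood uni-gram model with add-one smoothing so no
    amino acid gets zero probability."""
    counts = Counter("".join(proteins))
    total = sum(counts[a] for a in alphabet) + len(alphabet)
    return {a: (counts[a] + 1) / total for a in alphabet}

def perplexity(model, proteins):
    """PP = TP**(-1/n), accumulated in log space to avoid underflow
    on long sequences."""
    log_tp, n = 0.0, 0
    for seq in proteins:
        for a in seq:
            log_tp += math.log2(model[a])
            n += 1
    return 2 ** (-log_tp / n)
```

A model trained on perfectly uniform data assigns 1/20 to every residue, giving perplexity 20, the worst case for a 20-letter alphabet; the values around 18.2 reported above reflect the biased amino acid usage the model has learned.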
Table 2-2 The perplexities of different organisms.
2.5 Conclusions
In this chapter, the n-gram and linguistic features of whole genome protein sequences have been analyzed. The results show that (1) the n-grams of whole genome protein sequences approximately follow Zipf’s law when n is larger than 4, (2) the Shannon n-gram entropy of natural genome proteins is lower than that of artificial proteins, (3) a simple uni-gram model can distinguish different organisms, and (4) there are organism-specific usages of “phrases” in protein sequences. Further work will aim at detailed identification of these “phrases” and the building of a “biological language”, with its own words, phrases and syntax, to map out the relationship of protein sequence, structure and function.
References
[1]Anfinsen C.B. Principles that govern the folding of protein chains. Science, 1973, 181(4096): 223–230.
[2]Mantegna R.N., Buldyrev S.V., Goldberger A.L., Havlin S., Peng C.K., Simons M., Stanley H.E. Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics, 1995, 52(3): 2939–2950.
[3]Chatzidimitriou-Dreismann C.A., Streffer R.M., Larhammar D. Lack of biological significance in the ‘linguistic features’ of noncoding DNA — A quantitative analysis. Nucleic Acids Res, 1996, 24(9): 1676–1681.
[4]Tsonis A.A., Elsner J.B., Tsonis P.A. Is DNA a language? J Theor Biol, 1997, 184(1): 25–29.
[5]Voss R.F. Comment on “Linguistic features of noncoding DNA sequences”. Phys Rev Lett, 1996, 76(11): 1978.
[6]Burge C., Campbell A.M., Karlin S. Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci USA, 1992, 89(4): 1358–1362.
[7]Ganapathiraju M., Weisser D., Rosenfeld R., Carbonell J., Reddy R., Klein-Seetharaman J. Comparative n-gram analysis of whole-genome protein sequences. In Proceedings of the Human Language Technologies Conference, San Diego 2002, pp. 1367–1375.
[8]Rigoutsos I., Huynh T., Floratos A., Parida L., Platt D. Dictionary-driven protein annotation. Nucleic Acids Res, 2002, 30(17): 3901–3916.
[9]Ganapathiraju M., Klein-Seetharaman J., Balakrishnan N., Reddy R. Characterization of protein secondary structure — Application of latent semantic analysis using different vocabularies. IEEE Signal Processing Magazine, 2004, 21: 78–87.
[10]Coin L., Bateman A., Durbin R. Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc Natl Acad Sci USA, 2003, 100(8): 4516–4520.
[11]Charniak E. Statistical Language Learning. Cambridge, MA: MIT Press, 1996, p. 192.
[12]Manber U., Myers G. Suffix arrays: A new method for on-line string searches. SIAM J Comput, 1993, 22(5): 935–948.
[13]Kasai T., Lee G., Arimura H., Arikawa S., Park K. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching. 2001, Jerusalem, Israel: Springer-Verlag, pp. 181–192.
[14]Boeckmann B., Bairoch A., Apweiler R., Blatter M.C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O’Donovan C., Phan I., Pilbout S., Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, 2003, 31(1): 365–370.
[15]Zipf G.K. Human Behavior and the Principle of Least Effort. Cambridge, MA: Addison-Wesley, 1949.