PAM Matrices
The first useful matrices for protein sequence analysis were developed by Dayhoff et al. (1978). The basis for these matrices was the examination of substitution patterns in a group of proteins that shared more than 85% sequence identity. The analysis yielded 1572 changes in the 71 groups of closely related proteins that were examined. Using these results, tables were constructed that indicated the frequency of a given amino acid substituting for another amino acid at a given position.
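To make the idea of a substitution table concrete, the following is a minimal sketch that tallies replacements between gap-free aligned sequence pairs. It is a simplification and not the procedure used by Dayhoff et al.; the function name `count_substitutions` and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_substitutions(aligned_pairs):
    """Tally unordered amino acid replacements across aligned sequence pairs.

    Each pair must already be aligned to equal length. Positions containing
    a gap character ("-") are skipped, as are identical positions.
    """
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(seq_a, seq_b):
            if a != b and a != "-" and b != "-":
                # Sort the pair so that V->I and I->V share one table entry.
                counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical toy alignments of closely related sequences.
toy_pairs = [("MKVLA", "MKILA"), ("GAVLK", "GAILR")]
toy_counts = count_substitutions(toy_pairs)

for (res_1, res_2), n in sorted(toy_counts.items()):
    print(f"{res_1} <-> {res_2}: {n}")
```

Counting unordered pairs treats a change from one residue to another as equally likely in either direction, which is an assumption of this sketch.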
As the sequences examined shared such a high degree of similarity, the resulting frequencies represent what would be expected over short evolutionary distances. Further, given the close evolutionary relationship between these proteins, one would expect that the observed mutations would not significantly change the function of the protein. This is termed acceptance: changes that can be accommodated through natural selection and result in a protein with the same or similar function as the original. As individual point mutations were considered, the unit of measure resulting from this analysis is the point accepted mutation or PAM unit. One PAM unit corresponds to one amino acid change per 100 residues, or roughly 1% divergence.
Several assumptions went into the construction of the PAM matrices. One of the most important assumptions was that the replacement of an amino acid is independent of previous mutations at the same position. Based on this assumption, the original matrix was extrapolated to come up with predicted substitution frequencies at longer evolutionary distances. For example, the PAM1 matrix could be raised to the 100th power (that is, multiplied by itself repeatedly) to yield the PAM100 matrix, which would represent what one would expect if there were 100 amino acid changes per 100 residues. (This does not imply that each of the 100 residues has changed, only that there were 100 total changes; some positions could conceivably change and then change back to the original residue.) As the matrices representing longer evolutionary distances are an extrapolation of the original matrix derived from the 1572 observed changes described above, it is important to remember that these matrices are, indeed, predictions and are not based on direct observation. Any errors in the original matrix would be carried into the extrapolated matrices, and repeated multiplication would compound them.
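The extrapolation can be illustrated with a minimal sketch. It assumes a hypothetical three-letter alphabet rather than the 20 amino acids, and the values in `toy_pam1` are invented for illustration, not taken from Dayhoff et al. Each row gives the probabilities of one residue being kept or replaced after a single PAM step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    # Start from the identity matrix, then multiply by m `power` times.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM1-like matrix: every residue has a 1% chance of being
# replaced in one step, so each diagonal entry is 0.99 and each row sums to 1.
toy_pam1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.005, 0.005, 0.990],
]

toy_pam100 = mat_pow(toy_pam1, 100)

for row in toy_pam100:
    print([round(p, 3) for p in row])
```

In this toy model the diagonal entries of `toy_pam100` remain well above zero, which reflects the point made in the text: 100 changes per 100 residues does not mean that every position has changed, because some positions change more than once or revert.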
There are additional assumptions that the reader should be aware of regarding the construction of these PAM matrices. All sites have been assumed to be equally mutable, replacement has been assumed to be independent of surrounding residues, and there is no consideration of conserved blocks or motifs. The sequences being compared here are of average composition based on the small number of protein sequences available in 1978, so there is a bias toward small, globular proteins, even though efforts have been made to bring in additional sequence data over time (Gonnet et al. 1992; Jones et al. 1992). Finally, there is an implicit assumption that the forces responsible for sequence evolution over shorter time spans are the same as those for longer evolutionary time spans. Although there are significant drawbacks to the PAM matrices, it is important to remember that, given the information available in 1978, the development of these matrices marked an important advance in our ability to quantify the relationships between sequences. As these matrices are still available for use with numerous bioinformatic tools, the reader should keep these potential drawbacks in mind and use them judiciously.