Which Matrices Should be Used When?
Although most bioinformatics software provides a default scoring matrix, the default is not necessarily the most appropriate choice for the biological question being asked. Table 3.1 offers guidance on selecting a scoring matrix, based on studies that have examined how effectively these matrices detect known biological relationships (Altschul 1991; Henikoff and Henikoff 1993; Wheeler 2003). Note that the numbering schemes of the two matrix families run in opposite directions: more divergent sequences are found using higher-numbered PAM matrices and lower-numbered BLOSUM matrices. The following equivalencies are useful in relating PAM matrices to BLOSUM matrices (Wheeler 2003):
PAM250 is equivalent to BLOSUM45
PAM160 is equivalent to BLOSUM62
PAM120 is equivalent to BLOSUM80.
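As a minimal sketch, the equivalencies above can be held in a small lookup table, which also makes the opposite numbering directions easy to check. The names `PAM_TO_BLOSUM`, `matrix_number`, and `counterpart` are hypothetical helpers introduced here for illustration; they do not come from any particular software package.

```python
# Approximate PAM/BLOSUM equivalencies as quoted in the text (Wheeler 2003).
PAM_TO_BLOSUM = {
    "PAM250": "BLOSUM45",
    "PAM160": "BLOSUM62",
    "PAM120": "BLOSUM80",
}
# Reverse mapping, so that a lookup works in either direction.
BLOSUM_TO_PAM = {blosum: pam for pam, blosum in PAM_TO_BLOSUM.items()}


def matrix_number(name):
    """Extract the integer from a matrix name such as 'PAM250' or 'BLOSUM62'."""
    return int("".join(ch for ch in name if ch.isdigit()))


def counterpart(name):
    """Return the listed equivalent in the other family, or None if not listed."""
    return PAM_TO_BLOSUM.get(name) or BLOSUM_TO_PAM.get(name)


# Order the pairs by increasing PAM number and compare the two number series.
pairs = sorted(PAM_TO_BLOSUM.items(), key=lambda kv: matrix_number(kv[0]))
pam_numbers = [matrix_number(pam) for pam, _ in pairs]
blosum_numbers = [matrix_number(blosum) for _, blosum in pairs]

if __name__ == "__main__":
    print(counterpart("BLOSUM62"))
    print(pam_numbers, blosum_numbers)
```

As the PAM numbers rise, the paired BLOSUM numbers fall, which is the opposite-direction relationship described above. Matrices that are not in the list (for example, PAM40) return `None` rather than a guessed equivalent.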
In addition to the protein matrices discussed here, there are numerous specialized matrices that are specific to a particular species, concentrate on particular classes of proteins (e.g. transmembrane proteins), focus on structural substitutions, or use hydrophobicity measures to assess similarity (see Wheeler 2003). Given this landscape, the most important take-home message is that no single matrix is the right choice for all sequence comparisons. A thorough understanding of what each matrix represents is critical to performing proper sequence-based analyses.
Table 3.1 Selecting an appropriate scoring matrix.
Matrix | Best use | Similarity |
PAM40 | Short alignments that are highly similar | 70–90% |
PAM160 | Detecting members of a protein family | 50–60% |
PAM250 | Longer alignments of more divergent sequences | ∼30% |
BLOSUM90 | Short alignments that are highly similar | 70–90% |
BLOSUM80 | Detecting members of a protein family | 50–60% |
BLOSUM62 | Most effective in finding all potential similarities | 30–40% |
BLOSUM30 | Longer alignments of more divergent sequences | <30% |
The Similarity column gives the range of similarities that each matrix detects best (Wheeler 2003).
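One way to put Table 3.1 to work is a simple lookup that returns the matrices whose similarity range covers an expected percent similarity. The sketch below is illustrative only: the function name `suggest_matrices` and the data layout are hypothetical, and the numeric encoding of the open-ended entries ("∼30%" as exactly 30, "<30%" as 0 to 30) is an assumption made here, not something the table specifies.

```python
# Rows of Table 3.1 as (matrix, best use, low %, high %).
# Assumption: "~30%" is encoded as 30-30 and "<30%" as 0-30.
TABLE_3_1 = [
    ("PAM40", "Short alignments that are highly similar", 70, 90),
    ("PAM160", "Detecting members of a protein family", 50, 60),
    ("PAM250", "Longer alignments of more divergent sequences", 30, 30),
    ("BLOSUM90", "Short alignments that are highly similar", 70, 90),
    ("BLOSUM80", "Detecting members of a protein family", 50, 60),
    ("BLOSUM62", "Most effective in finding all potential similarities", 30, 40),
    ("BLOSUM30", "Longer alignments of more divergent sequences", 0, 30),
]


def suggest_matrices(similarity_pct, family=None):
    """Return matrices whose Table 3.1 range contains similarity_pct.

    family may be "PAM" or "BLOSUM" to restrict the result to one family.
    """
    hits = []
    for name, _best_use, low, high in TABLE_3_1:
        if family is not None and not name.startswith(family):
            continue
        if low <= similarity_pct <= high:
            hits.append(name)
    return hits


if __name__ == "__main__":
    print(suggest_matrices(80))
    print(suggest_matrices(55, family="BLOSUM"))
```

The table leaves gaps (for example, between 60% and 70%), so the function returns an empty list there instead of extrapolating. In line with the text above, the result is a starting point for choosing a matrix, not a substitute for understanding what each matrix represents.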