Bioinformatics

Which Matrices Should Be Used When?

Although most bioinformatics software provides a default scoring matrix, the default is not necessarily the most appropriate choice for the biological question being asked. Table 3.1 is intended to provide some guidance on selecting a scoring matrix, based on studies that have examined how effectively these matrices detect known biological relationships (Altschul 1991; Henikoff and Henikoff 1993; Wheeler 2003). Note that the numbering schemes for the two matrix families move in opposite directions: more divergent sequences are found using higher-numbered PAM matrices and lower-numbered BLOSUM matrices. The following equivalencies are useful in relating PAM matrices to BLOSUM matrices (Wheeler 2003):

 PAM250 is equivalent to BLOSUM45

 PAM160 is equivalent to BLOSUM62

 PAM120 is equivalent to BLOSUM80.
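
The equivalencies listed above can be captured in a small lookup table. The following is a minimal sketch in Python; the names `PAM_TO_BLOSUM`, `equivalent_matrix`, and `matrix_number` are hypothetical and chosen only for illustration. It also checks the point made about numbering: as the PAM number rises, the number of the corresponding BLOSUM matrix falls.

```python
# Approximate PAM-to-BLOSUM equivalencies as listed in the text (Wheeler 2003).
PAM_TO_BLOSUM = {
    "PAM250": "BLOSUM45",
    "PAM160": "BLOSUM62",
    "PAM120": "BLOSUM80",
}
# Reverse lookup, built from the same three pairs.
BLOSUM_TO_PAM = {blosum: pam for pam, blosum in PAM_TO_BLOSUM.items()}


def equivalent_matrix(name):
    """Return the counterpart listed for `name`, or None if none is listed."""
    key = name.upper()
    return PAM_TO_BLOSUM.get(key) or BLOSUM_TO_PAM.get(key)


def matrix_number(name):
    """Extract the numeric part of a matrix name, e.g. 'PAM250' -> 250."""
    return int("".join(ch for ch in name if ch.isdigit()))


# Order the PAM matrices by increasing number and list their BLOSUM
# counterparts in the same order; the BLOSUM numbers run the other way.
pam_sorted = sorted(PAM_TO_BLOSUM, key=matrix_number)
blosum_in_that_order = [PAM_TO_BLOSUM[pam] for pam in pam_sorted]

if __name__ == "__main__":
    print(equivalent_matrix("PAM160"))
    print(pam_sorted, blosum_in_that_order)
```

Only the three pairs given in the text are included; a matrix without a listed counterpart (for example PAM40) returns `None` rather than a guessed equivalent.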

In addition to the protein matrices discussed here, there are numerous specialized matrices that are specific to a particular species, concentrate on particular classes of proteins (e.g. transmembrane proteins), focus on structural substitutions, or use hydrophobicity measures to assess similarity (see Wheeler 2003). Given this landscape, the most important take-home message for the reader is that no single matrix is the complete answer for all sequence comparisons. A thorough understanding of what each matrix represents is critical to performing proper sequence-based analyses.

Table 3.1 Selecting an appropriate scoring matrix.

Matrix      Best use                                                Similarity (%)
PAM40       Short alignments that are highly similar                70–90
PAM160      Detecting members of a protein family                   50–60
PAM250      Longer alignments of more divergent sequences           ~30
BLOSUM90    Short alignments that are highly similar                70–90
BLOSUM80    Detecting members of a protein family                   50–60
BLOSUM62    Most effective in finding all potential similarities    30–40
BLOSUM30    Longer alignments of more divergent sequences           <30

The Similarity column gives the range of similarities that each matrix detects best (Wheeler 2003).
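
As a rough illustration of how Table 3.1 might be applied, the sketch below encodes the table rows and suggests the matrix whose similarity range lies closest to an expected percent similarity. Several things here are assumptions rather than part of the source: the function names, the treatment of "~30" as the single value 30 and of "<30" as the range 0–30, and the nearest-range rule used for values that fall between the tabulated ranges. It is a convenience for reading the table, not a substitute for understanding what each matrix represents.

```python
# Rows of Table 3.1 as (matrix, best use, low %, high %).
# Assumptions for this sketch: "~30" is stored as the single point 30,
# and "<30" is stored as the range 0-30.
TABLE_3_1 = [
    ("PAM40", "Short alignments that are highly similar", 70, 90),
    ("PAM160", "Detecting members of a protein family", 50, 60),
    ("PAM250", "Longer alignments of more divergent sequences", 30, 30),
    ("BLOSUM90", "Short alignments that are highly similar", 70, 90),
    ("BLOSUM80", "Detecting members of a protein family", 50, 60),
    ("BLOSUM62", "Most effective in finding all potential similarities", 30, 40),
    ("BLOSUM30", "Longer alignments of more divergent sequences", 0, 30),
]


def distance_to_range(value, low, high):
    """Return 0 if value lies in [low, high], else the gap to the nearer bound."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0


def suggest_matrices(percent_similarity, family="BLOSUM"):
    """Return the matrices in `family` whose tabulated range is closest.

    More than one name is returned when several rows are equally close.
    """
    if not 0 <= percent_similarity <= 100:
        raise ValueError("percent_similarity must be between 0 and 100")
    rows = [row for row in TABLE_3_1 if row[0].startswith(family.upper())]
    if not rows:
        raise ValueError("unknown matrix family: " + family)
    best = min(distance_to_range(percent_similarity, r[2], r[3]) for r in rows)
    return [
        r[0]
        for r in rows
        if distance_to_range(percent_similarity, r[2], r[3]) == best
    ]


if __name__ == "__main__":
    print(suggest_matrices(35))
    print(suggest_matrices(55, family="PAM"))
```

Returning a list rather than a single name keeps ties visible: a value that falls midway between two tabulated ranges yields both candidates, leaving the final choice to the analyst.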