Box 3.1 Scoring Matrices and the Log Odds Ratio
Protein scoring matrices are derived from the observed frequencies with which amino acids replace one another. From these frequencies, the scoring matrices are generated by applying the following equation:
$$S_{i,j} = \log\left(\frac{q_{i,j}}{p_i\,p_j}\right)$$
where $p_i$ is the probability with which residue i occurs among all proteins and $p_j$ is the probability with which residue j occurs among all proteins. The quantity $q_{i,j}$ represents how often the two amino acids i and j are seen to align with one another in multiple sequence alignments of protein families or in sequences that are known to have a biological relationship. Therefore, the log odds ratio $S_{i,j}$ (or “lod score”) is the logarithm of the ratio of the observed frequency to the random (chance) frequency for the substitution of residue i by residue j. For substitutions observed more often than would be expected by chance, $S_{i,j}$ will be greater than zero. For substitutions that occur less frequently than would be expected by chance, $S_{i,j}$ will be less than zero. If the observed frequency and the random frequency are the same, $S_{i,j}$ will be zero.
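The sign behavior of the log odds ratio can be checked numerically. The following is a minimal sketch, assuming the score is the logarithm of $q_{i,j}$ divided by the product $p_i p_j$. The function name `log_odds_score`, the choice of base 2, and all frequencies are hypothetical and chosen only for illustration; published matrices also scale and round their scores to integers, which this sketch omits.

```python
import math


def log_odds_score(q_ij, p_i, p_j, base=2.0):
    """Return log(q_ij / (p_i * p_j)) in the given base.

    q_ij : observed frequency with which residues i and j align
    p_i, p_j : background frequencies of residues i and j
    base : logarithm base (an assumption; the text does not fix one)
    """
    if q_ij <= 0 or p_i <= 0 or p_j <= 0:
        raise ValueError("all frequencies must be positive")
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical background frequencies; chance expectation is p_i * p_j = 0.002.
p_i, p_j = 0.05, 0.04

# Observed four times more often than chance: score is positive.
enriched = log_odds_score(0.008, p_i, p_j)

# Observed about as often as chance: score is approximately zero.
neutral = log_odds_score(0.002, p_i, p_j)

# Observed four times less often than chance: score is negative.
depleted = log_odds_score(0.0005, p_i, p_j)

print(enriched, neutral, depleted)
```

Changing `base` rescales every score by the same constant factor, so the sign of each score, and therefore its interpretation, does not depend on that choice.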
To explain the meaning of the numbers in the matrix more fully, imagine that two sequences have been aligned with one another, and it is now necessary to assess how well a residue in sequence A matches a residue in sequence B at any given position of the alignment. Using the scoring matrix in Figure 3.1 as the starting point, the following cases illustrate how its values are read.
The values on the diagonal represent the score awarded for an exact match at a given position, and these numbers are always positive. So, if a tryptophan residue (W) in sequence A is aligned with a tryptophan residue in sequence B, this match scores 11 points, the value where the row marked W intersects the column marked W. Notice also that 11 is the highest value on the diagonal: the high score assigned to a W:W alignment reflects not only the exact match but also the fact that tryptophan is the rarest of the amino acids found in proteins. Put otherwise, a W:W alignment is much less likely to arise by chance and, when it is observed, is therefore more likely to reflect a genuine relationship.
Moving off the diagonal, consider the case of a conservative substitution: a tyrosine (Y) for a tryptophan. The intersection of the row marked Y with the column marked W yields a value of 2. The positive value implies that the substitution is observed to occur more often in an alignment than it would by chance, but the replacement is not as good as if the tryptophan residue had been preserved (2 < 11) or if the tyrosine residue had been preserved (2 < 7).
Finally, consider the case of a non-conservative substitution: a valine (V) for a tryptophan. The intersection of the row marked V with the column marked W yields a value of −3. The negative value implies that the substitution is observed less often than would be expected by chance, so an aligned V:W pair counts as evidence against a biological relationship at that position.
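The three cases above can be combined into a small lookup that scores a toy alignment. This is a minimal sketch using only the four scores quoted in the text for the matrix in Figure 3.1. The names `SCORES`, `pair_score`, and `alignment_score` are hypothetical, the two three-residue sequences are invented for illustration, and the matrix is assumed to be symmetric (the score for Y:W is assumed to equal the score for W:Y).

```python
# The four scores quoted in the text; this is an excerpt, not a full matrix.
SCORES = {
    ("W", "W"): 11,  # exact match, rarest residue
    ("Y", "Y"): 7,   # exact match
    ("Y", "W"): 2,   # conservative substitution
    ("V", "W"): -3,  # non-conservative substitution
}


def pair_score(a, b):
    """Look up the score for aligning residues a and b.

    Assumes a symmetric matrix, so (a, b) and (b, a) share one entry.
    Raises KeyError for pairs that are not in the excerpt above.
    """
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def alignment_score(seq_a, seq_b):
    """Sum the per-position scores of an ungapped, equal-length alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


# Hypothetical aligned fragments: W:W, then Y:W, then W:V.
total = alignment_score("WYW", "WWV")
print(total)  # 11 + 2 + (-3) = 10
```

Summing per-position scores in this way is what makes the log odds form convenient: adding logarithms corresponds to multiplying the underlying odds ratios across positions.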
Although the meaning of the numbers and relationships within the scoring matrices seems straightforward enough, some value judgments need to be made as to what actually constitutes a conservative or non-conservative substitution and how to assess the frequency of either of those events in nature. This is the major factor that differentiates scoring matrices from one another. To help the reader make an intelligent choice, a discussion of the approach, advantages, and disadvantages of the various available matrices is in order.