
Scoring Matrices

Whether one uses a global or local alignment method, once the two sequences under consideration are aligned, how does one actually measure how good the alignment is between “sequence A” and “sequence B”? The first step toward answering that question involves numerical methods that consider not just the position-by-position overlap between two sequences but also the nature and characteristics of the residues or nucleotides being aligned.

Much effort has been devoted to the development of constructs called scoring matrices. These matrices are empirical weighting schemes that appear in all analyses involving the comparison of two or more sequences, so it is important to understand how these matrices are constructed and how to choose between matrices. The choice of matrix can (and does) strongly influence the results obtained with most sequence comparison methods.

The most commonly used protein scoring matrices consider the following three major biological factors.

1 Conservation. The matrices need to consider absolute conservation between protein sequences and also need to provide a way to assess conservative amino acid substitutions. The numbers within the scoring matrix provide a way of representing what amino acid residues are capable of substituting for other residues while not adversely affecting the function of the native protein. From a physicochemical standpoint, characteristics such as residue charge, size, and hydrophobicity (among others) need to be similar.
2 Frequency. In the same way that amino acid residues cannot freely substitute for one another, the matrices also need to reflect how often particular residues occur among the entire constellation of proteins. Residues that are rare are given more weight than residues that are more common.
3 Evolution. By design, scoring matrices implicitly represent evolutionary patterns, and matrices can be adjusted to favor the detection of closely related or more distantly related proteins. The choice of matrices for different evolutionary distances is discussed below.

Figure 3.1 The BLOSUM62 scoring matrix (Henikoff and Henikoff 1992). BLOSUM62 is the most widely used scoring matrix for protein analysis and provides best coverage for general-use cases. Standard single-letter codes to the left of each row and at the top of each column specify each of the 20 amino acids. The ambiguity codes B (for asparagine or aspartic acid; Asx) and Z (for glutamine or glutamic acid; Glx) also appear, as well as an X (denoting any amino acid). Note that the matrix is a mirror image of itself with respect to the diagonal. See text for details.
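
To make the conservation and frequency factors concrete, the following minimal sketch looks up a handful of BLOSUM62 entries and scores a short ungapped alignment. It assumes the usual convention that an ungapped alignment's score is the sum of its per-column substitution scores. The names `BLOSUM62_EXCERPT`, `pair_score`, and `ungapped_score` are hypothetical and chosen only for illustration, and the dictionary holds a small excerpt of the matrix, not the full table.

```python
# Illustrative excerpt of BLOSUM62; the real matrix covers all 20 amino acids
# plus the ambiguity codes. Keys are stored in alphabetical order.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,    # identity of a common residue
    ("W", "W"): 11,   # identity of a rare residue scores higher
    ("D", "E"): 2,    # conservative substitution (both acidic)
    ("I", "L"): 2,    # conservative substitution (both hydrophobic)
    ("K", "R"): 2,    # conservative substitution (both basic)
    ("D", "L"): -4,   # non-conservative substitution
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Return the score for aligning residue a with residue b.

    The matrix is symmetric, so the pair is sorted before lookup.
    """
    return matrix[tuple(sorted((a, b)))]


def ungapped_score(seq_a, seq_b, matrix=BLOSUM62_EXCERPT):
    """Sum the per-column scores of two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    print(ungapped_score("AWDI", "AWEL"))  # 4 + 11 + 2 + 2
```

Note how the tryptophan identity (W/W) contributes far more than the alanine identity (A/A), and how the D/L mismatch is penalized while the D/E and I/L substitutions are rewarded.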

There are also subtle nuances that go into constructing a scoring matrix, and these are described in an excellent review by Henikoff and Henikoff (2000).

How these various factors are actually represented within a scoring matrix can be best demonstrated by deconstructing the most commonly used scoring matrix, called BLOSUM62 (Figure 3.1). Each of the 20 amino acids (as well as the standard ambiguity codes) is shown along the top and down the side of a matrix. The scores in the matrix actually represent the logarithm of an odds ratio (Box 3.1) that considers how often a particular residue is observed, in nature, to replace another residue. The odds ratio also considers how often a particular residue would be replaced by another if replacements occurred in a random fashion (purely by chance). Given this, a positive score indicates two residues that are seen to replace each other more often than by chance, and a negative score indicates two residues that are seen to replace each other less frequently than would be expected by chance. Put more simply, frequently observed substitutions have positive scores and infrequently observed substitutions have negative scores.
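
The log-odds idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the exact procedure used to build BLOSUM62: the function name `log_odds_score` is hypothetical, the frequencies in the usage example are made up, and the default `scale=2.0` assumes scores expressed in half-bit units (twice the base-2 logarithm, rounded to the nearest integer).

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Return a rounded, scaled log-odds score for one residue pair.

    observed: how often the pair is seen aligned in real alignments
    expected: how often the pair would be aligned purely by chance
    scale:    multiplier applied to log2(observed / expected);
              2.0 corresponds to half-bit units (an assumption here)
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    return round(scale * math.log2(observed / expected))


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    print(log_odds_score(0.04, 0.01))   # seen more often than chance: positive
    print(log_odds_score(0.005, 0.01))  # seen less often than chance: negative
    print(log_odds_score(0.01, 0.01))   # seen exactly as often as chance: zero
```

A pair observed four times as often as chance predicts gets a positive score, a pair observed half as often gets a negative score, and a pair observed exactly at the chance rate scores zero, which matches the sign convention described in the text.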
