BLOSUM Matrices
In 1992, Steve and Jorja Henikoff took a slightly different approach from the one described above, one that addressed many of the drawbacks of the PAM matrices. The groundwork for the development of new matrices was a study aimed at identifying conserved motifs within families of proteins (Henikoff and Henikoff 1991, 1992). This study led to the creation of the BLOCKS database, which used the concept of a block to identify a family of proteins. The idea of a block is derived from the more familiar notion of a motif, which usually refers to a conserved stretch of amino acids that confers a specific function or structure to a protein. When these individual motifs from proteins in the same family can be aligned without introducing a gap, the result is a block, with the term block referring to the alignment, not the individual sequences themselves. A given protein can contain one or more blocks, one for each of its structural or functional motifs. With these protein blocks in hand, it was then possible to look for substitution patterns only in the most conserved regions of a protein, the regions that (presumably) were least prone to change. Two thousand blocks representing more than 500 groups of related proteins were examined and, based on the substitution patterns in those conserved blocks, blocks substitution matrices (or BLOSUMs, for short) were generated.
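To make the idea of reading substitution patterns from a block concrete, the following minimal Python sketch tallies the amino acid pairs observed in each column of an ungapped block. It is an illustration only, not the Henikoffs' actual procedure: the function name `count_block_pairs` and the toy block are hypothetical, and no sequence weighting or scoring is applied.

```python
from collections import Counter
from itertools import combinations


def count_block_pairs(block):
    """Count unordered amino acid pairs in every column of an ungapped block.

    `block` is a list of equal-length strings, one aligned segment per protein.
    """
    # A block contains no gaps, so every segment must have the same width.
    if len({len(segment) for segment in block}) != 1:
        raise ValueError("all segments in a block must have the same length")

    counts = Counter()
    for column in zip(*block):
        # Every pair of residues in a column is one observed (mis)match.
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical toy block: three aligned segments of width 4.
toy_block = ["ACDE", "ACDE", "GCDQ"]
pair_counts = count_block_pairs(toy_block)
```

Pairs of identical residues, such as `("C", "C")`, record conservation, while mixed pairs, such as `("A", "G")`, record substitutions. Pair frequencies of this kind are the raw material from which a substitution matrix can be derived.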
Given the pace of scientific discovery, many more protein sequences were available in 1992 than in 1978, providing for a more robust base set of data from which to derive these new matrices. However, the most important distinction between the BLOSUM and PAM matrices is that the BLOSUM matrices are directly calculated across varying evolutionary distances and are not extrapolated, providing a more accurate view of substitution patterns (and, in turn, evolutionary forces) at those various distances. The fact that the BLOSUM matrices are calculated directly based only on conserved regions makes these matrices more sensitive to detecting structural or functional substitutions; therefore, the BLOSUM matrices perform demonstrably better than the PAM matrices for local similarity searches (Henikoff and Henikoff 1993).
Returning to the point of directly deriving the various matrices, each BLOSUM matrix is assigned a number (BLOSUMn), and that number represents the conservation level of the sequences that were used to derive that particular matrix. For example, the BLOSUM62 matrix is calculated from sequences sharing no more than 62% identity; sequences with more than 62% identity are clustered, and each cluster is weighted as a single sequence. The clustering reduces the contribution of closely related sequences, meaning that there is less bias toward substitutions that occur (and may be over-represented) in the most closely related members of a family. Reducing the value of n yields a matrix derived from more divergent sequences, and therefore one better suited to comparing more distantly related proteins.
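The clustering step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the published implementation: the names `percent_identity` and `cluster_weights` are hypothetical, clusters are assumed to be formed by single linkage (a segment joins a cluster if it exceeds the threshold against any member), and the strict "more than n%" comparison simply follows the wording above.

```python
def percent_identity(a, b):
    """Percent of aligned positions at which two equal-length segments match."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_weights(block, threshold):
    """Return one weight per segment so that each cluster sums to 1.

    Assumption: single-linkage clustering, with segments joined when their
    identity is strictly greater than `threshold` (e.g. 62 for BLOSUM62).
    """
    n = len(block)
    parent = list(range(n))  # union-find forest, one node per segment

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if percent_identity(block[i], block[j]) > threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1

    # A cluster of k segments contributes as one sequence: 1/k each.
    return [1.0 / sizes[find(i)] for i in range(n)]


# Hypothetical toy block: the first two segments are 90% identical.
toy_block = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
weights = cluster_weights(toy_block, 62)
```

In this toy block the two near-identical segments share a single unit of weight, so together they count as much as the one divergent segment. Raising the threshold above their 90% identity would leave all three segments unclustered, each with full weight.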