Читать книгу Bioinformatics - Группа авторов - Страница 55
The Algorithm
ОглавлениеBLAST is a local alignment method that is capable of detecting not only the best region of local alignment between a query sequence and its target, but also whether there are other plausible alignments between the query and the target. To find these regions of local alignment in a computationally efficient fashion, the method begins by seeding the search with a small subset of letters from the query sequence, known as the query word. Using the example shown in Figure 3.3, consider a search where the query word of default length 3 is RDQ. (In practice, all words of length 3 are considered, so, using the sequence in Figure 3.3, the first query word would be TLS, followed by LSH, and so on across the sequence.) BLAST now needs to find not only the word RDQ in all of the sequences in the target database but also related words where conservative substitutions have been introduced, as those matches may also be biologically informative and relevant. To determine which words are related to RDQ, scoring matrices are used to develop what is called the neighborhood. The center panel of Figure 3.3 shows the collection of words that are related to the original query word, in descending score order; the scores here are calculated using a BLOSUM62 scoring matrix (Figure 3.1). Obviously, some cut-off must be applied so that further consideration is only given to words that are indeed closely related to the original query word. The parameter that controls this cut-off is the neighborhood score threshold (T). The value of T is determined automatically by the BLAST program but can be adjusted by the user. Increasing T would push the search toward more exact matches and would speed up the search, but could lead to overlooking possibly interesting biological relationships. Decreasing T allows for the detection of more distant relationships between sequences. Here, only words with T ≥ 11 move to the next step.
Table 3.2 BLAST algorithms.
Program | Query | Database |
BLASTN | Nucleotide | Nucleotide |
BLASTP | Protein | Protein |
BLASTX | Nucleotide, six-frame translation | Protein |
TBLASTN | Protein | Nucleotide, six-frame translation |
TBLASTX | Nucleotide, six-frame translation | Nucleotide, six-frame translation |
Figure 3.3 The initiation of a BLAST search. The search begins with query words of a given length (here, three amino acids) being compared against a scoring matrix to determine additional three-letter words “in the neighborhood” of the original query word. Any occurrences of these neighborhood words in sequences within the target database are then investigated. See text for details.
Focusing now on the lower panel of Figure 3.3, the original query word (RDQ) has been aligned with another word from the neighborhood whose score is more than the score threshold of T ≥ 11 (REQ). The BLAST algorithm now attempts to extend this alignment in both directions, tallying a cumulative score resulting from matches, mismatches, and gaps, until it constructs a local alignment of maximal length. Determining what the maximal length actually is can be best explained by considering the graph in Figure 3.4. Here, the number of residues that have been aligned is plotted against the cumulative score resulting from the alignment. The left-most point on the graph represents the alignment of the original query word with one of the words from the neighborhood, again having a value of T = 11 or greater. As the extension proceeds, as long as exact matches and conservative substitutions outweigh mismatches and gaps, the cumulative score will increase. As soon as the cumulative score breaks the score threshold S, the alignment is reported in the BLAST output. Simply clearing S does not automatically mean that the alignment is biologically significant, a very important point that will be addressed later in this discussion.
Figure 3.4 BLAST search extension. Length of extension represents the number of characters that have been aligned in a pairwise sequence comparison. Cumulative score represents the sum of the position-by-position scores, as determined by the scoring matrix used for the search. T represents the neighborhood score threshold, S is the minimum score required to return a hit in the BLAST output, and X is the significance decay. See text for details.
As the extension continues, at some point, mismatches and gaps will begin to outweigh the exact matches and conservative substitutions, accruing negative scores from the scoring matrix. As soon as the curve begins to turn downward, BLAST measures whether the drop-off exceeds a threshold called X. If the curve decays more than is allowed by the value of X, the extension is terminated and the alignment is trimmed back to the length corresponding to the preceding maximum in the curve. The resulting alignment is called a high-scoring segment pair, or HSP. Given that the BLAST algorithm systematically marches across the query sequence using all possible query words, it is possible that more than one HSP may be found for any given sequence pair.
After an HSP is identified, it is important to determine whether the resulting alignment is actually significant. Using the cumulative score from the alignment, along with a number of other parameters, a new value called E (for “expect”) is calculated (Box 3.2). For each hit, E gives the number of expected HSPs having a score of S or more that BLAST would find purely by chance. Put another way, the value of E provides a measure of whether the reported HSP is a false positive (see Box 5.4). Lower E values imply greater biological significance.