Читать книгу Bioinformatics - Группа авторов - Страница 67

The Method

The FASTA algorithm can be divided into four major steps. In the first step, FASTA determines all overlapping words of a certain length both in the query sequence and in each of the sequences in the target database, creating two lists in the process. Here, the word length parameter is called ktup, which is the equivalent of W in BLAST. These lists of overlapping words are compared with one another in order to identify any words that are common to the two lists. The method then looks for word matches that are in close proximity to one another and connects them to each other (intervening sequence included), without introducing any gaps. This can be represented using a dotplot format (Figure 3.20a). Once this initial round of connections are made, an initial score (init₁) is calculated for each of the regions of similarity.

In step 2, only the 10 best regions for a given pairwise alignment are considered for further analysis (Figure 3.20b). FASTA now tries to join together regions of similarity that are close to each other in the dotplot but that do not lie on the same diagonal, with the goal of extending the overall length of the alignment (Figure 3.20c). This means that insertions and deletions are now allowed, but there is a joining penalty for each of the diagonals that are connected. The net score for any two diagonals that have been connected is the sum of the score of the original diagonals, less the joining penalty. This new score is referred to as init_n.

In step 3, FASTA ranks all of the resulting diagonals, and then further considers only the “best” diagonals in the list. For each of the best diagonals, FASTA uses a modification of the Smith–Waterman algorithm (1981) to come up with the optimal pairwise alignment between the two sequences being considered. A final, optimal score (opt) is calculated on this pairwise alignment.

Figure 3.20 The FASTA search strategy. (a) Once FASTA determines words of length ktup common to the query sequence and the target sequence, it connects words that are close to each other, and these are represented by the diagonals. (b) After an initial round of scoring, the top 10 diagonals are selected for further analysis. (c) The Smith–Waterman algorithm is applied to yield the optimal pairwise alignment between the two sequences being considered. See text for details.

In the fourth and final step, FASTA assesses the significance of the alignments by estimating what the anticipated distribution of scores would be for randomly generated sequences having the same overall composition (i.e. sequence length and distribution of amino acids or nucleotides). Based on this randomization procedure and on the results from the original query, FASTA calculates an expectation value E (similar to the BLAST E value), which, as before, represents the probability that a reported hit has occurred purely by chance.

Подняться наверх