Comparing FASTA and BLAST
Since both FASTA and BLAST employ rigorous algorithms to find sequences that are statistically (and hopefully biologically) relevant, it is logical to ask which of the two methods is the better choice. There is no single good answer to that question, since both methods bring significant strengths to the table. Summarized below are some of the fine points that distinguish the two methods from one another.
Figure 3.21 Search summary from a protein–protein FASTA search, using the sequence of histone H2B.3 from Hydractinia echinata (KX622131.1; Török et al. 2016) as the query and BLOSUM62 as the scoring matrix. The header indicates that the query is against the Swiss-Prot database. The histogram indicates the distribution of all similarity scores computed for this search. The left-most column provides a normalized similarity score, and the column marked opt gives the number of sequences with that score. The column marked E() gives the number of sequences expected to achieve the score in the first column. In this case, each equals sign in the histogram represents 130 sequences in Swiss-Prot. The asterisks in each row indicate the expected, random distribution of hits. The inset is a magnified version of the histogram in that region.
Figure 3.22 Hit list for the protein–protein FASTA search described in Figure 3.21. Only the first 18 hits are shown. For each hit, the accession number and partial definition line are provided. The column marked opt gives the raw similarity score, the column marked bits gives a normalized bit score (a measure of similarity between the two sequences), and the column marked E gives the expectation value. The percentage columns indicate percent identity and percent similarity, respectively. The alen column gives the total aligned length for each hit. The +- characters shown at the beginning of some lines indicate that more than one alignment was found between the query and subject; in the case of the first hit (Q7Z5P9), four alignments were returned. The align link at the end of each row takes the user to the alignment for that hit (not shown).
FASTA begins the search by looking for exact matches of words, while BLAST allows for conservative substitutions in the first step.
BLAST allows for automatic masking of sequences, while FASTA does not.
FASTA will return one and only one alignment for a sequence in the hit list, while BLAST can return multiple results for the same sequence, each result representing a distinct HSP.
Since FASTA uses a version of the more rigorous Smith–Waterman alignment method, it generally produces better final alignments and is more apt to find distantly related sequences than BLAST. For highly similar sequences, their performance is fairly similar.
When comparing translated DNA sequences with protein sequences or vice versa, FASTA (specifically, FASTX/FASTY for translated DNA → protein and TFASTX/TFASTY for protein → translated DNA) allows for frameshifts.
BLAST runs faster than FASTA, since FASTA is more computationally intensive.
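To make the reference to the Smith–Waterman method more concrete, the following is a minimal sketch of the local alignment recurrence. It is not FASTA's actual implementation: the function name `smith_waterman_score`, the simple match/mismatch scores, and the linear gap penalty are all assumptions chosen for brevity, whereas real searches use a substitution matrix such as BLOSUM62 and typically affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Illustrative only: uses a flat match/mismatch score and a linear
    gap penalty (assumed values), not a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the
            # alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the matrix is filled in, so the cost grows with the product of the two sequence lengths. That is the sense in which the approach is both more rigorous and more computationally intensive than a purely heuristic scan.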
Several studies have attempted to answer the “which method is better” question by performing systematic analyses with test datasets (Pearson 1995; Agarwal and States 1998; Chen 2003). In one such study, Brenner et al. (1998) performed tests using a dataset derived from already known homologies documented in the Structural Classification of Proteins database (SCOP; Chapter 12). They found that FASTA performed better than BLAST in finding relationships between proteins having >30% sequence identity, and that the performance of all methods declined below 30%. Importantly, while the statistical values reported by BLAST slightly underestimated the true extent of errors when looking for known relationships, they found that BLAST and FASTA (with ktup = 2) were both able to detect most known relationships, calling them both “appropriate for rapid initial searches.”
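The ktup value sets the word length FASTA uses in its first step, where it looks for exact word matches, whereas BLAST also accepts words that score well against the query word. The sketch below contrasts the two seeding ideas on a toy scale. The helper names (`toy_score`, `exact_seeds`, `neighborhood_seeds`), the residue groups, the scores, and the threshold are all hypothetical and are not BLOSUM62 or either program's real code. The brute-force word comparison is also a simplification; actual implementations rely on precomputed lookup tables.

```python
# Toy scoring scheme, NOT BLOSUM62: identical residues score 4, residues in
# the same (hypothetical) conservative group score 1, everything else -2.
GROUPS = ["ILVM", "KR", "DE", "ST", "FYW"]


def toy_score(x, y):
    if x == y:
        return 4
    if any(x in g and y in g for g in GROUPS):
        return 1
    return -2


def words(seq, k):
    """All (position, word) pairs of length k in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def exact_seeds(query, subject, k=2):
    """FASTA-like first step: positions where identical k-letter words occur."""
    index = {}
    for j, w in words(subject, k):
        index.setdefault(w, []).append(j)
    return [(i, j) for i, w in words(query, k) for j in index.get(w, [])]


def neighborhood_seeds(query, subject, k=3, threshold=6):
    """BLAST-like first step: word pairs whose summed score meets a threshold,
    so conservative substitutions can still produce a seed."""
    seeds = []
    for i, qw in words(query, k):
        for j, sw in words(subject, k):
            if sum(toy_score(x, y) for x, y in zip(qw, sw)) >= threshold:
                seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    print(exact_seeds("KLV", "RIV"))         # no identical 2-letter words
    print(neighborhood_seeds("KLV", "RIV"))  # K/R and L/I are conservative here
```

With these assumed scores, the pair KLV and RIV shares no identical two-letter word, so exact matching yields no seed, while the neighborhood approach still seeds an alignment at the start of both sequences.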