BLAT
In response to the assembly needs of the Human Genome Project, a new nucleotide sequence alignment program called BLAT (for BLAST-Like Alignment Tool) was introduced (Kent 2002). BLAT is most similar to the MegaBLAST version of BLAST in that it is designed to rapidly align longer nucleotide sequences having more than 95% similarity. However, the BLAT algorithm uses a slightly different strategy than BLAST to achieve faster speeds. Before any searches are performed, the target databases are pre-indexed, keeping track of all non-overlapping 11-mers; this index is then used to find regions similar to the query sequence. BLAT is often used to find the position of a sequence of interest within a genome or to perform cross-species analyses.
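The indexing idea can be illustrated with a minimal sketch. It assumes a plain in-memory dictionary and exact 11-mer lookups only; the function names `build_index` and `find_seeds` are hypothetical, and the later stages of the real program (grouping hits, extending them, and stitching them into alignments) are not shown.

```python
from collections import defaultdict

K = 11  # tile size; the text gives 11-mers for nucleotide searches


def build_index(target, k=K):
    """Map each non-overlapping k-mer of the target to its start positions."""
    index = defaultdict(list)
    # Step by k so that consecutive tiles do not overlap.
    for start in range(0, len(target) - k + 1, k):
        index[target[start:start + k]].append(start)
    return index


def find_seeds(query, index, k=K):
    """Look up every overlapping k-mer of the query in the target index.

    Returns (query_position, target_position) pairs for exact k-mer hits.
    """
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for t_pos in index.get(query[q_pos:q_pos + k], ()):
            seeds.append((q_pos, t_pos))
    return seeds


# Toy target made of three 11-base tiles; the query contains the middle one.
target = "ACGTACGTACG" + "TTTTTGGGGGC" + "CATGCATGCAT"
query = "AA" + "TTTTTGGGGGC" + "AA"
print(find_seeds(query, build_index(target)))  # [(2, 11)]
```

Because only non-overlapping tiles of the target are stored, the index holds roughly one entry per 11 bases of the genome, while every overlapping 11-mer of the (much shorter) query is still checked against it.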
Figure 3.15 Selecting algorithm parameters for a PSI-BLAST search. See text for details.
As an example, consider a case where an investigator wishes to map a cDNA clone from the Cancer Genome Anatomy Project (CGAP) to the rat genome. The BLAT query page is shown in Figure 3.18, with the sequence of the clone of interest pasted into the sequence box. Above the sequence box are several pull-down menus that specify which genome should be searched (organism), which assembly should be used (usually, the most recent), and the query type (DNA, protein, translated DNA, or translated RNA). Once the appropriate choices have been made, the search is started by clicking the “Submit” button. The results of the query are shown in the upper panel of Figure 3.19; the highest-scoring hit, shown at the top of the list, has 98.1% identity with the query sequence.
More details on this hit can be found by clicking the “details” hyperlink to the left of the entry. This returns a long web page providing information on the original query, the genomic sequence, and an alignment of the query against the matching genomic sequence (Figure 3.19, bottom panel). The genomic sequence here is labeled chr5, meaning that the query corresponds to a region of rat chromosome 5. Matching bases in the cDNA and genomic sequences are capitalized and colored dark blue. Lighter blue uppercase bases mark the boundaries of aligned regions and often signify splice sites. Gaps and unaligned regions are shown in lowercase black type. In the Side by Side Alignment, exact matches are indicated by a vertical line between the two sequences. Clicking the “browser” hyperlink in the upper panel of Figure 3.19 takes the user to the UCSC Genome Browser, where detailed information about the genomic assembly in this region of rat chromosome 5 (specifically, at 5q31) can be obtained (cf. Chapter 4).
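To make the side-by-side display and the identity figure concrete, the following sketch derives a match line and a simple percent identity from two aligned rows. It rests on assumptions not taken from the text: gaps are written as `-`, and identity is defined as matches divided by gap-free columns. The identity value BLAT reports is computed from its own match, mismatch, and gap counts and may differ; `match_line` and `percent_identity` are hypothetical helper names.

```python
def match_line(query_row, target_row):
    """Return '|' for each column where the two aligned rows agree, else ' '."""
    if len(query_row) != len(target_row):
        raise ValueError("aligned rows must have equal length")
    return "".join(
        "|" if q == t and q != "-" else " "
        for q, t in zip(query_row.upper(), target_row.upper())
    )


def percent_identity(query_row, target_row):
    """Matches divided by gap-free columns, as a percentage (assumed definition)."""
    matches = match_line(query_row, target_row).count("|")
    aligned = sum(
        1 for q, t in zip(query_row, target_row) if q != "-" and t != "-"
    )
    return 100.0 * matches / aligned if aligned else 0.0


# Toy alignment: one gap in the query and one mismatch.
query_row = "ACGT-ACGTA"
target_row = "ACGTTACCTA"
print(query_row)
print(match_line(query_row, target_row))
print(target_row)
print(round(percent_identity(query_row, target_row), 1))
```

In this toy case eight of the nine gap-free columns match, so the reported value is about 88.9%.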
Figure 3.16 Results of the first round of a PSI-BLAST search. For each sequence found, the user is presented with the definition line from the corresponding UniProtKB/Swiss-Prot entry, the score value for the best high-scoring segment pair (HSP) alignment, the total of all scores across all HSP alignments, the percentage of the query covered by the HSPs, and the E value and percent identity for the best HSP alignment. The hyperlinked accession number allows for direct access to the source database record for that hit. Sequences whose “Select for PSI blast” box is checked will be used to calculate a position-specific scoring matrix (PSSM), and that PSSM then serves as the new “query” for the next round, the results of which are shown in Figure 3.17.
Figure 3.17 Results of the second round of a PSI-BLAST search. New sequences identified through the use of the position-specific scoring matrix (PSSM) calculated based on the results shown in Figure 3.16 are highlighted in yellow. Check marks in the right-most column indicate which sequences were used to build the PSSM producing these results.
Figure 3.18 Submitting a BLAT query. A rat clone from the Cancer Genome Anatomy Project Tumor Gene Index (CB312815) is the query. The pull-down menus at the top of the page can be used to specify which genome should be searched (organism), which assembly should be used (usually, the most recent), and the query type (DNA, protein, translated DNA, or translated RNA). The “I'm feeling lucky” button returns only the highest scoring alignment and provides a direct path to the UCSC Genome Browser.
