Bioinformatics - Group of authors - Page 57
Performing a BLAST Search
While many BLAST servers are available throughout the world, the most widely used portal for these searches is the BLAST home page at the National Center for Biotechnology Information (NCBI; Figure 3.5). The top part of the page provides access to the most frequently performed types of BLAST searches, summarized in Table 3.2, while the lower part of the page is devoted to specialized types of BLAST searches. To illustrate the relative ease with which one can perform a BLAST search, a protein-based search using BLASTP is discussed. Clicking on the Protein BLAST box brings users to the BLASTP search page, a portion of which is shown in Figure 3.6. A query sequence is required as the basis for comparison. Harking back to the Entrez discussion in Chapter 2, the sequence of the netrin receptor from Homo sapiens (NP_005206.2) has been pasted into the query sequence box. Immediately to the right, the user can use the query subrange boxes to specify whether only a portion of this sequence is to be used; if the whole sequence is to be used, these fields should be left blank.
Figure 3.5 The National Center for Biotechnology Information (NCBI) BLAST landing page. Examples of the most commonly used queries that can be performed using the BLAST interface are discussed in the text.
Moving to the Choose Search Set section of the page, the database to be searched can be selected using the Database pull-down menu; clicking on the question mark next to the Database pull-down provides a brief description of each of the available target databases. Here, the search will be performed against the RefSeq database (see Box 1.2). Directly below, the Organism box can be used to limit the search results to sequences from individual organisms or taxa. While not part of this worked example, if the user wanted to limit the returned results to those from just mouse and rat, using the same type of syntax used in issuing Entrez searches (see Table 2.1), the user would type Mus musculus [ORGN] OR Rattus norvegicus [ORGN] in this field (OR rather than AND, because each sequence record belongs to a single organism); if the user wanted all results except those from mouse and rat, they would also need to check the Exclude box. As this search will be performed against RefSeq, one can exclude predicted proteins from the search results by clicking the “Models (XM/XP)” checkbox. Finally, in the Program Selection section, BLASTP is selected by default.
Figure 3.6 The upper portion of the BLASTP query page. The first section in the window is used to specify the sequence of interest, whether only a portion of that sequence should be used in performing the search (query subrange), which database should be searched, and which protein-based BLAST algorithm should be used to execute the query. See text for details.
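The same choices can also be expressed programmatically. The following is a minimal sketch, not part of the original worked example, that assembles the form fields described above as parameters for the NCBI BLAST URL API. The helper name build_blastp_request is hypothetical, and the database name refseq_protein is an assumption about how RefSeq proteins are labeled on the server. The sketch only builds the request; it does not contact NCBI. The organism terms are joined with OR, since a sequence record carries a single organism.

```python
from urllib.parse import urlencode

# Public endpoint of the NCBI BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_request(query, database="refseq_protein",
                         organisms=(), subrange=None):
    """Return the form parameters for submitting a BLASTP search.

    Hypothetical helper: it mirrors the fields of the web form but
    performs no network access.
    """
    params = {
        "CMD": "Put",            # submit a new search
        "PROGRAM": "blastp",     # protein query against protein database
        "DATABASE": database,
        "QUERY": query,          # accession number or raw sequence
    }
    if subrange is not None:
        # Equivalent to filling in the query subrange boxes.
        params["QUERY_FROM"], params["QUERY_TO"] = subrange
    if organisms:
        # Equivalent to restricting the search by organism.
        params["ENTREZ_QUERY"] = " OR ".join(
            f"{name} [ORGN]" for name in organisms
        )
    return params


params = build_blastp_request(
    "NP_005206.2",
    organisms=("Mus musculus", "Rattus norvegicus"),
)
print(params["ENTREZ_QUERY"])
print(f"{BLAST_URL}?{urlencode(params)}")
```

Leaving subrange as None corresponds to leaving the query subrange boxes blank, so the whole sequence is used.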
If the user wishes to use the default settings for all algorithm parameters, the search can be submitted by simply clicking on the blue BLAST button. However, the user can exert finer control over how the search is performed by changing the items found in the Algorithm parameters section. To access these settings, the user must first click on the plus sign next to the words “Algorithm parameters” to expand this section of the web page, producing the view shown in Figure 3.7. This part of the query page is where the theory underlying a BLAST search discussed earlier in this chapter comes into play. In the General Parameters section, the expect threshold limits returned results to those having an E value lower than the specified value, with smaller values providing a more stringent cut-off. The word size setting changes the size of the query word used to initiate the BLAST search, with longer word sizes initiating the search with longer ungapped alignments. A word size of 3 is recommended for protein searches, as shorter words increase sensitivity; however, if searching for near-exact matches, a longer word size can be used, also yielding faster search times.
Figure 3.7 The lower portion of the BLASTP query page, showing algorithm parameters that the user can adjust to fine-tune the search. Values that have been changed for the search discussed in the text are highlighted in yellow and marked with a diamond. See text for details.
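The effect of the word size and expect threshold settings can be illustrated with a short sketch. It is a simplification under stated assumptions: it lists only the exact overlapping words of the query (BLASTP additionally considers similar "neighborhood" words), the peptide is an arbitrary example string, and the hit records with their E values are invented for illustration.

```python
def query_words(sequence, word_size=3):
    """List every overlapping word of length word_size in the query."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def filter_by_expect(hits, expect_threshold):
    """Keep only hits whose E value is lower than the threshold."""
    return [hit for hit in hits if hit["evalue"] < expect_threshold]


# Arbitrary 10-residue peptide used purely as an example.
peptide = "MENSLRCVWV"
print(len(query_words(peptide, word_size=3)))  # 8 words
print(len(query_words(peptide, word_size=5)))  # 6 words

# Invented hits; real E values come from the BLAST report.
hits = [
    {"id": "hit_a", "evalue": 1e-50},
    {"id": "hit_b", "evalue": 0.002},
    {"id": "hit_c", "evalue": 3.0},
]
print([h["id"] for h in filter_by_expect(hits, 0.01)])   # hit_a, hit_b
print([h["id"] for h in filter_by_expect(hits, 1e-10)])  # hit_a only
```

A longer word size yields fewer, more specific seeds, and a smaller expect threshold discards weaker hits; both changes trade sensitivity for stringency.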
In the Scoring Parameters section, the user can select an appropriate scoring matrix (with the default being BLOSUM62). Changing the matrix automatically changes the gap penalties to values appropriate for that scoring matrix. As described in the discussion of affine gap penalties above, the user may change these values manually; increasing the gap costs would result in pairwise alignments with fewer gaps, whereas decreasing the values would make the insertion of gaps more permissive.
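As a small worked illustration of how affine gap costs behave, the sketch below assumes the convention that a gap of length n costs the existence penalty plus n times the extension penalty, and uses existence 11 and extension 1, the values commonly paired with BLOSUM62. The function name gap_cost is hypothetical.

```python
def gap_cost(length, existence=11, extension=1):
    """Affine cost of a single gap spanning `length` residues."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return existence + extension * length


# One gap of four residues versus four separate one-residue gaps.
print(gap_cost(4))      # 15
print(4 * gap_cost(1))  # 48

# Raising the existence cost (13 is an arbitrary choice) penalizes
# every additional gap more heavily.
print(gap_cost(4, existence=13))  # 17
```

Because opening a gap is far more expensive than extending one, the scoring favors a few long gaps over many short ones, and raising either value makes gapped alignments rarer.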
In the Filters and Masking section, one should filter to remove low-complexity regions. Low-complexity regions are defined simply as regions of biased composition (Wootton and Federhen 1993). These may include homopolymeric runs, short-period repeats, or the subtle over-representation of several residues in a sequence. The biological role of these low-complexity regions is not understood; it is thought that they may represent the results of either DNA replication errors or unequal crossing-over events. It is important to determine whether sequences of interest contain low-complexity regions; they tend to prove problematic when performing sequence alignments and can lead to false-positive results, as they are generally similar across unrelated proteins. Finally, before issuing the query, be sure to check the box marked “Show results in a new window.” This leaves the original query window (or tab) in place, making it easier to go back and refine or change search parameters, as needed.
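To make the idea of biased composition concrete, the sketch below flags low-complexity stretches with a sliding-window Shannon entropy. This is a deliberately simplified stand-in, not the algorithm of Wootton and Federhen or the filter BLAST applies; the window length, entropy threshold, and test sequence are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy, in bits, of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace residues in low-entropy windows with 'X'.

    Simplified sketch: every window whose entropy falls below the
    threshold is masked in full.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Invented sequence: a homopolymeric run of Q between diverse flanks.
example = "ACDEFGHIKLMN" + "Q" * 12 + "PRSTVWYACDEF"
print(shannon_entropy("QQQQQQQQQQQQ"))  # no diversity at all
print(mask_low_complexity(example))
```

Windows that only partly overlap the run can also fall below the threshold, so in this sketch the mask may extend a few residues into the flanking sequence.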