Bioinformatics - Group of authors - Page 37

The Entrez Discovery Pathway


The best way to illustrate the integrated nature of the Entrez system and to drive home the power of neighboring is by considering some biological examples. The simplest way to query Entrez is through individual search terms joined by Boolean operators such as AND, OR, or NOT. Consider the case in which one wants to retrieve all available information on a gene named DCC (deleted in colorectal carcinoma), limiting the returned information to publications where an investigator named Guy A. Rouleau is an author. There is a very simple query interface at the top of the NCBI home page, consisting of a pull-down menu from which the user selects the database to search and a text box in which the query terms are entered. In this case, to search for published papers, PubMed would be selected from the pull-down menu and, within the text box to the right, the user would type DCC AND "Rouleau GA" [AU]. The [AU] qualifying the second search term indicates to Entrez that this is an author term, so only the author field in entries should be considered when evaluating this part of the search statement. The result of the query is shown in Figure 2.2. Here, three entries matching the query were found in PubMed. The user can narrow the query by adding further terms, either to focus on a more specific aspect of this gene or simply because the initial query returned too many entries. A list of available field delimiters is given in Table 2.1.
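The same query can also be composed programmatically. The sketch below is a minimal illustration rather than part of the web interface described above: it joins tagged terms with a Boolean operator and percent-encodes the result for the NCBI E-utilities ESearch service. The helper names `tagged` and `esearch_url` are hypothetical, and the code only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service (ESearch).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, tag=None):
    """Return a search term, quoted if it contains spaces and tagged if requested."""
    text = f'"{term}"' if " " in term else term
    return f"{text} [{tag}]" if tag else text


def esearch_url(query, db="pubmed"):
    """Build (but do not fetch) an ESearch URL for the given query string."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': query})}"


# The query from the text: DCC AND "Rouleau GA" [AU]
query = " AND ".join([tagged("DCC"), tagged("Rouleau GA", "AU")])
url = esearch_url(query)
print(query)
print(url)
```

Quoting the author name matters here for the same reason it does in the text box: without the quotation marks, the author term is matched more loosely.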


Figure 2.2 Results of a text-based Entrez query against PubMed using Boolean operators and field delimiters. The initial query (DCC AND "Rouleau GA" [AU]) is shown in the search box near the top of the window, with the three papers identified using this query following below. Each entry gives the title of the manuscript, the names of the authors, and the citation information. The actual record can be retrieved by clicking on the name of the manuscript.

Table 2.1 Entrez Boolean search statements.

General syntax: search term [tag] Boolean operator search term [tag] ...

where [tag] =

[ACCN] Accession
[AD] Affiliation
[ALL] All fields
[AU] Author name. Lentz R [AU] yields all of Lentz RA, Lentz RB, etc.; "Lentz R" [AU] yields only Lentz R
[AUID] Unique author identifier, such as an ORCID ID
[ECNO] Enzyme Commission numbers
[EDAT] Entrez date: YYYY/MM/DD, YYYY/MM, or YYYY; insert a colon for a date range, e.g. 2016:2018
[GENE] Gene name
[ISS] Issue of journal
[JOUR] Journal title, official abbreviation, or ISSN number, e.g. Journal of Biological Chemistry; J Biol Chem; 0021-9258
[LA] Language
[MAJR] MeSH major topic: one of the major topics discussed in the article
[MH] MeSH terms: controlled vocabulary of biomedical terms (subject)
[ORGN] Organism
[PDAT] Publication date: YYYY/MM/DD, YYYY/MM, or YYYY; insert a colon for a date range, e.g. 2016:2018
[PMID] PubMed ID
[PROT] Protein name (for sequence records)
[PT] Publication type; includes Review, Clinical Trial, Lectures, Letter, Technical Report
[SH] MeSH subheading: used to modify MeSH terms, e.g. stenosis [MH] AND pharmacology [SH]
[SUBS] Substance name: name of chemical discussed in article
[SI] Secondary source ID: names of secondary source databanks and/or accession numbers of sequences discussed in article
[TITL] Title word: only words in the definition line (not available in Structure database)
[WORD] Text words: all words and numbers in the title and abstract, MeSH terms, subheadings, chemical substance names, personal name as subject, and MEDLINE secondary sources
[VOL] Volume of journal

and Boolean operator = AND, OR, or NOT
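As a rough illustration of the syntax in Table 2.1, the following sketch formats a date or date range with a date tag and flags any tag that does not appear in the table. The function names `date_term` and `check_tags` are made up for this example, and the check is only against the tags listed here; Entrez itself may accept other spellings or additional fields.

```python
import re

# Field tags listed in Table 2.1.
TAGS = {
    "ACCN", "AD", "ALL", "AU", "AUID", "ECNO", "EDAT", "GENE", "ISS", "JOUR",
    "LA", "MAJR", "MH", "ORGN", "PDAT", "PMID", "PROT", "PT", "SH", "SUBS",
    "SI", "TITL", "WORD", "VOL",
}

# Dates may be given as YYYY, YYYY/MM, or YYYY/MM/DD.
DATE = re.compile(r"\d{4}(/\d{2}(/\d{2})?)?")


def date_term(start, end=None, tag="PDAT"):
    """Format a single date, or a colon-separated date range, with a date tag."""
    if tag not in ("PDAT", "EDAT"):
        raise ValueError(f"{tag} is not a date field")
    for value in (start, end):
        if value is not None and not DATE.fullmatch(value):
            raise ValueError(f"bad date: {value}")
    span = start if end is None else f"{start}:{end}"
    return f"{span} [{tag}]"


def check_tags(query):
    """Return the bracketed tags in a query that are not listed in Table 2.1."""
    found = re.findall(r"\[([A-Za-z]+)\]", query)
    return [tag for tag in found if tag.upper() not in TAGS]


query = f"stenosis [MH] AND pharmacology [SH] AND {date_term('2016', '2018')}"
print(query)
print(check_tags(query))
print(check_tags("DCC [GENE] AND human [ORGANISM]"))
```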

For each of the found papers shown in the Summary view in Figure 2.2, the user is presented with the title of the paper, the authors of that paper, and the citation. To look at any of the papers resulting from the search, the user can simply click on any of the hyperlinked titles. For this example, consider the third reference in the list, by Srour et al. (2010). Clicking on the title takes the user to the Abstract view shown in Figure 2.3. This view presents the name of the paper, the list of authors, their institutional affiliation, and the abstract itself. Below the abstract is a gray bar labeled “MeSH terms, Substances”; clicking on the plus sign at the end of the gray bar reveals cataloging information (MeSH terms, for medical subject headings) and indexed substances related to the manuscript. Several alternative formats are available for displaying this information, and these various formats can be selected using the Format pull-down menu found in the upper left corner of the window. Switching to MEDLINE format produces the MEDLINE layout, with two-letter codes corresponding to the contents of each field going down the left-hand side of the entry (e.g. the author field is again denoted by the code AU). Lists of entries in this format can be saved to the desktop and easily imported into third-party bibliography management programs.
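Because the MEDLINE layout places a short field code at the start of each line, saved records are straightforward to read back in. The sketch below assumes the usual layout, in which the code occupies the first four columns, is followed by a hyphen and a space, and wrapped values continue on indented lines. The function name `parse_medline` and the sample record are invented for illustration; the record's values are placeholders, not real data.

```python
from collections import defaultdict


def parse_medline(text):
    """Parse one MEDLINE-format record into a dict of field code -> list of values."""
    record = defaultdict(list)
    code = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:4].strip() and line[4:6] == "- ":
            # A new field: code in columns 1-4, value after the "- " separator.
            code = line[:4].strip()
            record[code].append(line[6:].strip())
        elif code is not None:
            # An indented continuation line extends the previous value.
            record[code][-1] += " " + line.strip()
    return dict(record)


# Made-up record for illustration only.
SAMPLE = """\
PMID- 00000000
TI  - A placeholder title that is long enough to wrap onto
      a second line.
AU  - Author A
AU  - Author B
PT  - Journal Article
"""

record = parse_medline(SAMPLE)
print(record["AU"])
print(record["TI"][0])
```

Repeated codes such as AU are collected into a list, which preserves the author order given in the record.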

Figure 2.3 An example of a PubMed record in Abstract format, as returned through Entrez. This Abstract view is for the third reference shown in Figure 2.2. This view provides connections to related articles, sequence information, and the full-text journal article through the Discovery Column that runs down the right-hand side of the page. See text for details.

The column on the right-hand side of this window, aptly named the Discovery Column, provides access to the full-text version of the paper and, more importantly, contains many useful links to additional information related to this manuscript. The Similar articles section provides one of the entry points from which the user can take advantage of the neighboring and hard link relationships described earlier and, in the examples that follow, we will return to this page several times to illustrate a selected cross-section of the kinds of information available to the user. To begin this journey, if the user clicks on the See all link at the bottom of that section, Entrez will return a list of 104 references related to the original Srour et al. (2010) paper at the time of this writing; the first six of these papers are shown in Figure 2.4. The first paper in the list is the Srour et al. paper itself because, by definition, it is most related to itself (the “parent” entry). The order in which the related papers follow is based on statistical similarity. Thus, the entry listed nearest to the parent is deemed to be the closest to it in subject matter. By scanning the titles, the user can easily find related information on other studies, as well as quickly amass a bibliography of relevant references. This is a particularly useful and time-saving function when one is writing grants or papers, as abstracts can easily be scanned and papers of real interest can be identified quickly.
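To make the idea of ordering by statistical similarity concrete, here is a toy sketch that ranks entries by the cosine similarity of their word counts. This is not the method Entrez uses to compute neighbors, which the text does not spell out; it is a simplified stand-in, and the titles are invented. It does show why the parent entry always heads its own list: its similarity to itself is the maximum possible value.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words vector: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighbors(parent, corpus):
    """Rank every entry in the corpus by similarity to the parent, best first."""
    parent_vec = vector(corpus[parent])
    scores = {key: cosine(parent_vec, vector(text)) for key, text in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy titles, invented for illustration.
corpus = {
    "parent": "mutations in a netrin receptor gene cause mirror movements",
    "close": "netrin receptor mutations and mirror movements in families",
    "far": "crystal structure of a bacterial enzyme",
}
ranking = neighbors("parent", corpus)
print(ranking)
```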


Figure 2.4 Neighbors to an entry found in PubMed. The original entry from Figure 2.3 (Srour et al. 2010) is at the top of the list, indicating that this is the parent entry. Additional neighbors to each of the papers in this list can be found by clicking the Similar articles link found below each entry. See text for details.


Figure 2.5 The Entrez Gene page for the DCC (deleted in colorectal carcinoma) netrin-1 receptor from human. The entry indicates that this is a protein-coding gene at map location 18q21.2, and information on the genomic context of DCC, as well as alternative gene names and information on the encoded protein, is provided. An extensive collection of links to other National Center for Biotechnology Information (NCBI) and external databases is also provided. See text for details.

Returning to the Abstract view presented in Figure 2.3, at the bottom of the Discovery Column is a series of hard-link connections to other databases within the Entrez system that can take the user directly to an extensive set of information related to the content of the publication of interest. Here, selecting the Gene link takes the user to Entrez Gene, a feature of Entrez that provides a wealth of information about the gene in question (Figure 2.5). The data are gathered from a variety of sources, including RefSeq. Here, we see that DCC is the official symbol of a protein-coding gene for a netrin-1 receptor in humans. The Genomic context section of this page indicates that DCC is located at map position 18q21.2. Immediately below, summary information on the genomic region, transcripts, and products of the DCC gene is presented graphically, with genomic coordinates provided. Additional content not shown in the figure can be found by scrolling down the Gene page, where the user will find relevant functional information (such as gene expression data), associated phenotypes, information on protein–protein interactions, pathway information, Gene Ontology assignments, and homologies to similar sequences in selected organisms. Shortcut links to these sections can be found in the Table of contents at the top of the Discovery Column. Further down the Discovery Column are extensive lists of links to additional resources provided through NCBI and other sources. One link of note is the SNP: Gene View link, taking the user to data derived from dbSNP (Figure 2.6). The information found within dbSNP goes beyond just single-nucleotide polymorphisms (SNPs), including data on short genetic variations such as short insertions and deletions, short tandem repeats, and microsatellites. Here, we will focus on the table shown in Figure 2.6, which is a straightforward way to view information about individual SNPs.
Each SNP entry occupies two or more lines of the table, with one line showing the contig reference (the more common allele) and the other showing the SNP (the less common allele). Consider the first three lines of the table, showing a contig reference G for which there are two documented SNPs, changing the G at that position to either an A or a C. At the protein level, this changes the amino acid at position 2 of the DCC protein from glutamic acid to lysine (for the G-to-A substitution) or to glutamine (for the G-to-C substitution). These rows are colored red since these are “non-synonymous SNPs” – that is, the SNP produces a discrete change at the amino acid level. In contrast, consider the first set of green rows in the table, with the green indicating that this is a “synonymous SNP,” where the codons for the contig reference (G) and the SNP allele (A) ultimately produce the same amino acid (Glu); this is not altogether surprising, with the SNP being in the wobble position of the codon, where there is often redundancy in the genetic code. Additional information on human SNPs can be found in Chapter 15.
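The distinction between synonymous and non-synonymous SNPs can be checked with the standard genetic code. The sketch below is illustrative only: the function name `classify_snp` is hypothetical, and GAG is assumed as the glutamic acid codon, since the text gives the reference base but not the full codon (GAA would lead to the same classifications).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with amino acids listed in TCAG order at each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(codon, position, allele):
    """Return (reference aa, variant aa, class) for a single-base change in a codon.

    position is 0, 1, or 2 within the codon; allele is the substituted base.
    """
    variant = codon[:position] + allele + codon[position + 1:]
    ref_aa, var_aa = CODON_TABLE[codon], CODON_TABLE[variant]
    kind = "synonymous" if ref_aa == var_aa else "non-synonymous"
    return ref_aa, var_aa, kind


# GAG is an assumed codon for glutamic acid (E).
print(classify_snp("GAG", 0, "A"))  # first position, G to A: E becomes K (lysine)
print(classify_snp("GAG", 0, "C"))  # first position, G to C: E becomes Q (glutamine)
print(classify_snp("GAG", 2, "A"))  # wobble position, G to A: E stays E
```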


Figure 2.6 A section of the Database of Single Nucleotide Polymorphisms (dbSNP) GeneView page providing information on each SNP identified within the human DCC gene. See text for details.

Starting again from the Abstract view shown in Figure 2.3, protein sequences from RefSeq that have been linked to this abstract can be found by clicking on the Protein (RefSeq) link found in the Related information section on the right-hand side of the page, producing the view shown in Figure 2.7. Note that all but one of the entries are marked as “predicted”; the final entry in the list has an accession number beginning with NP, indicating that it contains an experimentally determined or verified sequence (see Box 1.2). Clicking on the first line of that entry (number 6) takes the user to the view shown in Figure 2.8, the RefSeq entry for the netrin receptor, the protein product of the DCC gene. The feature table (the section of the GenBank entry listing the location and characteristics of each of the documented biological features found within this protein sequence, such as post-translational modifications, recognizable repeat units, secondary structural regions, and clinically relevant variation) is particularly long in this case. This makes it difficult to determine the relative orientation of the features to one another and may lead the user to miss important interactions or relationships between biological features. Fortunately, a viewer that provides a bird's eye view of the elements found within the feature table is available by clicking on the Graphics link at the top of the entry, producing the more accessible display shown in Figure 2.9. Zoom controls are provided, and hovering over any of the elements in the display produces a pop-up containing the specific information for that feature from the GenBank entry.
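The accession prefix is enough to separate predicted entries from the others in a list like this one. The following sketch assumes the common RefSeq convention in which N-prefixed accessions (NM, NR, NP) denote curated records and X-prefixed accessions (XM, XR, XP) denote models predicted by automated annotation. The function names and the accessions are placeholders chosen for illustration, not real records.

```python
# RefSeq accession prefixes: N-prefixed records are curated, while
# X-prefixed records are models predicted by automated annotation.
PREFIXES = {
    "NM": ("mRNA", "curated"),
    "NR": ("non-coding RNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "predicted"),
    "XR": ("non-coding RNA", "predicted"),
    "XP": ("protein", "predicted"),
}


def describe(accession):
    """Return (molecule type, status) for a RefSeq-style accession."""
    prefix, sep, _ = accession.partition("_")
    key = prefix.upper()
    if not sep or key not in PREFIXES:
        raise ValueError(f"not a recognized RefSeq accession: {accession}")
    return PREFIXES[key]


def curated_proteins(accessions):
    """Keep only the protein accessions that are not marked as predicted."""
    return [a for a in accessions if describe(a) == ("protein", "curated")]


# Placeholder accessions, not real records.
hits = ["XP_000000.1", "XP_000001.1", "NP_000000.1"]
print(curated_proteins(hits))
```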


Figure 2.7 Entries in the RefSeq protein database corresponding to the original Srour et al. (2010) entry in Figure 2.3. Entries can be accessed and examined by clicking on any of the accession numbers. See text for details.


Figure 2.8 The RefSeq entry for the netrin receptor, the protein product of the human DCC gene. The FASTA link at the top of the entry provides quick access to the protein sequence in FASTA format, while the Graphics link provides access to a graphical view of all of the individual elements captured within the entry's feature table (see Figure 2.9). See text for details.


Figure 2.9 The same RefSeq entry for the netrin receptor shown in Figure 2.8, now rendered in graphical format. The user can learn more about individual elements displayed in this view by simply hovering the cursor over any of the elements in the display; one such example is shown in the pop-up box at the bottom right, for the phosphorylation site at position 1267 of the sequence. Zoom and navigational controls are at the top of the view window, allowing the user to understand this gene within its broader genomic context.

From here, the user can also enter the structural realm by examining the protein structures that are available through the Discovery Column. Clicking on the See all 9 structures link takes the user to the view shown in Figure 2.10, listing structural entries related to the netrin receptor. The second entry is for the crystal structure of a fragment of netrin-1 complexed with the DCC receptor (PDB:4URT; Finci et al. 2014), and clicking on the title of that entry takes the user to the structure summary page shown in Figure 2.11. Starting on the right, the Interactions window shows the relationships between the individual elements in this biological unit, here consisting of the netrin-1 protein (circle A), the DCC receptor (circle B), and five different chemical entities (diamonds 1–5). The three-dimensional structure is shown in the left panel, and the structure can be further interrogated by clicking on the square with the diagonal arrow in the bottom left of that panel. This action will launch iCn3D (for “I see in three-D”), a web-based viewer that allows the structure to be rotated, provides coloring and rendering options to enhance visualization, and provides a wide variety of additional options; the reader is referred to the iCn3D online documentation for specifics. In the upper right of the 4URT structure summary page is a link to similar structures, as determined by VAST+. Clicking on the VAST+ link produces the output shown in Figure 2.12, here showing the first 10 of 256 structures deemed to have similar biological units to the query (4URT); the table shown here is sorted by RMSD of all aligned residues (in Å), from smallest to largest.
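For readers unfamiliar with RMSD (root-mean-square deviation), the value is the square root of the mean squared distance between paired atoms in two structures. The sketch below computes it for coordinate sets that are assumed to be already aligned and superimposed, then sorts a set of hits from smallest to largest, as in the table described above. It is not the VAST+ procedure, which also has to find the alignment and superposition; the coordinates and hit names are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of 3D points.

    Assumes the points are already paired (aligned) and superimposed.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Made-up coordinates (in angstroms) for a two-atom query and three hits.
query = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
hits = {
    "shifted_3": [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    "identical": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "shifted_1": [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
}

# Sort hits by RMSD against the query, smallest first.
ranked = sorted(hits, key=lambda name: rmsd(query, hits[name]))
for name in ranked:
    print(name, rmsd(query, hits[name]))
```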


Figure 2.10 Protein structures associated with the RefSeq entry for the human netrin receptor shown in Figures 2.8 and 2.9. The description of each structure is hyperlinked, allowing the user to access the structure summary page for that entry (see Figure 2.11). Individual links below each entry allow quick access to related structures and proteins, information on conserved domains, and the iCn3D viewer.


Figure 2.11 The structure summary page for PDB:4URT, the crystal structure of a fragment of netrin-1 complexed with the DCC receptor (Finci et al. 2014). The entry shows header information from the corresponding Molecular Modeling Database (MMDB) entry, a link to the paper reporting this structure, and the methodology used to determine this structure (here, X-ray diffraction with a resolution of 3.1 Å). See text for details.
