UniProt
Although data repositories are an essential vehicle through which scientists can access sequence data as quickly as possible, the addition of biological information from multiple, highly regarded sources greatly increases the power of the underlying sequence data. The UniProt Consortium was formed to accomplish just that, bringing together Swiss-Prot, TrEMBL, and the Protein Information Resource Protein Sequence Database under a single umbrella, called UniProt (UniProt Consortium 2017). UniProt comprises three main databases: the UniProt Archive (UniParc), a non-redundant set of all publicly available protein sequences compiled from a variety of source databases; the UniProt Knowledgebase (UniProtKB), combining entries from UniProtKB/Swiss-Prot and UniProtKB/TrEMBL; and the UniProt Reference Clusters (UniRef), containing non-redundant views of the data contained in UniParc and UniProtKB that are clustered at three different levels of sequence identity: 100%, 90%, and 50% (Suzek et al. 2015).
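To make the idea of clustering at several identity levels concrete, the following is a toy sketch only. It assumes equal-length, pre-aligned sequences and a simple greedy scheme; it is not the procedure used to build UniRef, and the function names and sequences are invented for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: list[str], threshold: float) -> list[list[str]]:
    """Place each sequence in the first cluster whose representative
    (its first member) it matches at or above the threshold."""
    clusters: list[list[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append([seq])
    return clusters


# Made-up 10-residue sequences, for illustration only.
toy = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTGGGGGQR", "WWWWWWWWWW"]

# Lower thresholds merge more sequences, so the number of clusters shrinks.
for level in (1.0, 0.9, 0.5):
    print(level, len(greedy_cluster(toy, level)))
```

The point of the sketch is only that a stricter threshold yields more, smaller clusters, while a looser one yields fewer, broader clusters, which is why the three UniRef levels offer progressively more compact views of the same data.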
Figure 1.2 Results of a search for the human heterogeneous nuclear ribonucleoprotein A1 record within UniProtKB, using the accession number P09651 as the search term. See text for details.
The wealth of information found within a UniProtKB entry can be best illustrated by an example. Here, we will consider the entry for the human heterogeneous nuclear ribonucleoprotein A1, with accession number P09651. A search of UniProtKB using this accession number as the search term produces the view seen in Figure 1.2. The lower part of the left-hand column shows the various types of information available for this protein, and the user can select or de-select sections based on their interests. The main part of the window provides basic identifying information about this sequence, as well as an indication of whether the entry has been manually reviewed and annotated by UniProtKB curators. Here, we see that the entry has indeed been reviewed and that there is experimental evidence that supports the existence of the protein. The next section in the file conveys functional information, including Gene Ontology (GO) terms that are associated with the entry, as well as links to enzyme and pathway databases such as Reactome (see Chapter 13). Clicking on any of the blue tiles in the left-hand column takes the user directly to the selected section of the entry. For instance, if one clicks on Subcellular location, the view seen in Figure 1.3 is produced, providing a color-coded schematic of the cell indicating the type of annotation (manual or automatic) and links to publications supporting the annotation. The lower part of Figure 1.3 also shows information regarding the protein's involvement in disease, documenting variants that have been implicated in early-onset Paget disease and amyotrophic lateral sclerosis (Kim et al. 2013; Liu et al. 2016).
Figure 1.3 The Subcellular location and Pathology & Biotech sections of the record for the human heterogeneous nuclear ribonucleoprotein A1 within UniProtKB. These sections can be accessed by clicking on the blue tiles in the left-hand column of the window. See text for details.
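The same identifying information can also be read from a machine-readable copy of the entry. The sketch below is a minimal illustration, not an official client: the URL pattern and the JSON key names (primaryAccession, entryType, proteinExistence) are assumptions about the UniProt REST service and should be checked against its current documentation. The sample dictionary is a hand-written stand-in that restates only what the text reports for P09651, so the summary step can be run without network access.

```python
import json
from urllib.request import urlopen

# Assumed endpoint pattern for the UniProt REST service; verify before relying on it.
ENTRY_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def fetch_entry(accession: str) -> dict:
    """Download one UniProtKB entry as JSON (requires network access)."""
    with urlopen(ENTRY_URL.format(accession=accession), timeout=30) as response:
        return json.load(response)


def summarize(entry: dict) -> dict:
    """Extract the identifying fields discussed in the text.

    The key names are assumptions about the JSON layout; .get() avoids a
    crash if a field is absent or named differently.
    """
    entry_type = entry.get("entryType", "").lower()
    return {
        "accession": entry.get("primaryAccession"),
        # "unreviewed" contains "reviewed", so exclude it explicitly.
        "reviewed": "reviewed" in entry_type and "unreviewed" not in entry_type,
        "protein_existence": entry.get("proteinExistence"),
    }


# Hand-written stand-in for a downloaded record (not real API output).
sample = {
    "primaryAccession": "P09651",
    "entryType": "UniProtKB reviewed (Swiss-Prot)",
    "proteinExistence": "1: Evidence at protein level",
}
print(summarize(sample))

# With network access, the live record could be summarized the same way:
# print(summarize(fetch_entry("P09651")))
```

Keeping the download and the summary in separate functions means the parsing logic can be checked against a small local record before any request is made.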
In the upper left corner of the UniProtKB window are display options that are quite useful in visualizing the significant amount of data found in this entry's feature table. By clicking on Feature viewer, one is presented with the view shown in Figure 1.4, neatly summarizing the annotations for this sequence in a coordinate-based fashion. Any of the sections can be expanded by clicking on the labels in the blue boxes to the left of the graphic. Here, the post-translational modification (PTM) section has been expanded, showing the position of modified residues in this protein; clicking on any of the markers in the track will produce a pop-up with additional information on the PTM, along with relevant links to the literature. In Figure 1.5, the Structural features and Variants sections have also been expanded, showing the positions of all alpha helices, beta strands, and beta turns within the protein, as well as the location of putatively clinically relevant point mutations. Here, a variant at position 351 is highlighted: a proline-to-leucine substitution, identified as part of the ClinVar project (Landrum et al. 2016), that has a possible association with relapsing–remitting multiple sclerosis. By examining different sections of this graphical display, the user can start to see how various features overlap with one another, perhaps indicating whether a known or predicted disease-causing variant falls within a structured region of the protein. These annotations and observations can provide important insights with respect to experimental design and the interpretation of experimental data.
Figure 1.4 The Feature viewer rendering of the record for the human heterogeneous nuclear ribonucleoprotein A1 within UniProtKB. Clicking the Display link, found in the upper left portion of the window, provides access to the Feature viewer. Any of the sections can be expanded by clicking on the labels in the blue boxes to the left of the graphic. See text for details.
Figure 1.5 Expanding the PTM, Structural features, and Variants sections within the Feature viewer display shows the position of all post-translational modifications (PTMs), alpha helices, beta strands, and beta turns within the human heterogeneous nuclear ribonucleoprotein A1, as well as the location of putatively clinically relevant point mutations. Clicking on any of the variants produces a pop-up window with additional information; here, the pop-up window provides disease association data for the proline-to-leucine variant at position 351 of the sequence. See text for details.
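The kind of overlap that the Feature viewer reveals visually, such as whether a variant falls within a structured region, amounts to a simple interval check on sequence coordinates. The sketch below illustrates that check under stated assumptions: the Feature class and the features_at function are hypothetical names, and the feature coordinates are placeholders invented for the example, not annotations taken from the P09651 record. Only the queried position 351 comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    kind: str   # e.g. "Helix", "Beta strand", "Modified residue"
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def features_at(position: int, features: list[Feature]) -> list[Feature]:
    """Return every feature whose coordinate range covers the given residue."""
    return [f for f in features if f.start <= position <= f.end]


# Placeholder coordinates for illustration only; NOT real annotations.
toy_features = [
    Feature("Helix", 20, 35),
    Feature("Beta strand", 100, 110),
    Feature("Modified residue", 30, 30),
]

# Position 351 is the variant discussed in the text; 30 is an arbitrary example.
for position in (30, 351):
    hits = [f.kind for f in features_at(position, toy_features)]
    print(position, hits if hits else "no overlapping feature in this toy set")
```

With real coordinates taken from an entry's feature table, the same check would indicate which annotated regions, if any, contain a given variant.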