Bioinformatics - Group of authors - Page 83

Box 4.4 Ensembl Stable IDs


Ensembl assigns accession numbers to many data types in its database. Each identifier begins with the organism prefix; for human, the prefix is ENS; for mouse, it is ENSMUS; and for anole lizard, it is ENSACA. Next comes an abbreviation for the feature type: G for gene, T for transcript, P for protein, R for regulatory, and so forth. This is followed by a series of digits, and an optional version. The version number increments when there is a change in the underlying data. The gene version changes when the underlying transcripts are updated, and the transcript and protein versions increment when the sequence changes.

For example, the human PAH gene has the following identifiers:

 ENSG00000171759.9: the identifier of the human PAH gene

 ENST00000553106.5: the identifier of one transcript of the human PAH gene, transcript PAH-215

 ENSP00000448059.1: the identifier of the protein translation of transcript PAH-215, ENST00000553106.5

 ENSR00000056420: the identifier of a promoter of several PAH transcripts
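The structure described in this box can be checked mechanically. The following is a minimal sketch, not Ensembl's own parser. It assumes that the numeric part is always 11 digits, as in the four PAH examples, and that only the four feature codes named above occur; Ensembl uses other codes as well. The names `parse_stable_id` and `StableId` are made up for this illustration.

```python
import re
from typing import NamedTuple, Optional

# Feature-type codes named in the box; Ensembl defines others ("and so forth").
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "R": "regulatory"}

# "ENS" plus an optional species code (none for human), one feature letter,
# 11 digits (an assumption based on the examples), and an optional ".version".
STABLE_ID = re.compile(r"^(ENS[A-Z]*?)([GTPR])(\d{11})(?:\.(\d+))?$")


class StableId(NamedTuple):
    prefix: str             # organism prefix, e.g. "ENS" or "ENSMUS"
    feature: str            # spelled-out feature type
    number: str             # digits, kept as text to preserve leading zeros
    version: Optional[int]  # None when the identifier carries no version


def parse_stable_id(identifier: str) -> StableId:
    """Split an Ensembl-style stable ID into its parts."""
    match = STABLE_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised stable ID: {identifier!r}")
    prefix, code, number, version = match.groups()
    return StableId(prefix, FEATURE_TYPES[code], number,
                    int(version) if version is not None else None)


if __name__ == "__main__":
    for text in ("ENSG00000171759.9", "ENST00000553106.5",
                 "ENSP00000448059.1", "ENSR00000056420"):
        print(parse_stable_id(text))
```

The lazy `[A-Z]*?` in the pattern lets the species code itself contain the letters G, T, P, or R without being mistaken for the feature code, because the feature letter must be followed directly by the digits.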

Navigation controls between the second and third panels of the Location tab allow the display to be zoomed or moved to the left or right. The blue bar at the top of the Region in detail allows users to toggle between Drag and Select. When the Drag option is highlighted, click on the graphical view window and drag it to the left or right to change the location. When the Select option is highlighted, click on a region of interest in the graphical view, then, holding the mouse button down, scroll to the left or right to highlight the region (Figure 4.17a). The highlight can be left on for visualization purposes or, alternatively, select Jump to region to zoom in to the selected region. Figure 4.17b shows the results of zooming in to the last exon of transcript PAH-203; since the gene is transcribed from right to left, the last exon is on the left. Note the track called All phenotype-associated short variants (SNPs and indels) that contains those variants that have been associated with a phenotype or disease. SNPs are color coded by function, with dark green indicating coding sequence variants. Select the dark green SNP, highlighted with a red box near the left end of the window, and follow the link for additional information. The resulting Variant tab provides links to SNP-related resources. For example, the Phenotype Data for this SNP (rs76296470; Figure 4.18a) shows that this variant is pathogenic and is associated with the disease phenylketonuria. The most severe consequence for this SNP is a stop gained. Further details about the consequences are available under the Genes and regulation link (Figure 4.18b) on the left sidebar. This variant is found in 10 transcripts of the PAH gene. In five of those transcripts, it alters one nucleotide in a codon, changing an arginine to a stop codon, thus truncating the PAH protein. In the other five transcripts, either the variant is downstream of the gene or the transcript is non-coding.
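To make the stop gained consequence concrete, the sketch below classifies a single-nucleotide change within a codon using the standard genetic code. It is an illustration only and is not how Ensembl computes consequences. The codon CGA and the C-to-T change are assumptions chosen because CGA encodes arginine and TGA is a stop codon; the text does not state the actual codon affected by rs76296470. The function name `classify_substitution` is hypothetical, and the returned labels follow the Sequence Ontology terms that Ensembl displays.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, position: int, alt_base: str) -> str:
    """Classify a one-base change at `position` (0, 1, or 2) of a coding-strand codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        # Same amino acid, or a stop codon replaced by another stop codon.
        return "stop_retained_variant" if ref_aa == "*" else "synonymous_variant"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense_variant"


if __name__ == "__main__":
    # Illustrative arginine codon; not taken from the PAH sequence.
    print(classify_substitution("CGA", 0, "T"))
```

The same nucleotide change produces a different label, or none at all, in transcripts where it falls outside the coding sequence, which is why the Genes and regulation view reports the consequence per transcript.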

Ensembl makes available many annotation tracks through the Configure this page link on the left sidebar. There are over 500 tracks available for display on GRCh38, with the majority falling in the categories of Variation, Regulation, and Comparative Genomics. The Ensembl Regulatory Build includes regions that are likely to be involved in gene regulation, including promoters, promoter flanking regions, enhancers, CCCTC-binding factor (CTCF) binding sites, transcription factor binding sites (TFBS), and open chromatin regions (Zerbino et al. 2016). A summary Regulatory Build track is turned on by default in the Location tab, and the display of individual features can be adjusted in the Configure this page menu. In the UCSC Genome Browser, the GTEx track shows that the PAH gene is highly expressed in liver and kidney (Figure 4.10); the epigenetic factors that may be controlling this activity can be viewed in the Ensembl Regulatory Build. To view these factors, navigate to Regulation > Histones & polymerases on the Configure this page menu, mouse over the HepG2 human liver carcinoma line, and select All features for HepG2 (Figure 4.19a). In addition, navigate to Regulation > Open chromatin & TFBS and confirm that the DNase1 track is in its default state for HepG2; the dark blue indicates that the track is shown. Close the Configure this page menu by clicking on the check mark in the upper right corner of the pop-up window. Notice that the Regulatory Build track has now expanded to include the selected gene regulatory marks in the HepG2 cell line. Zoom in on the first exon of transcript PAH-215 to see the promoter region of this gene, being mindful of the orientation of the gene (Figure 4.19b). The solid red rectangle in the Regulatory Build track shows the location of the PAH promoter. The presence of a DNaseI hypersensitive site along with the activating histone marks of H3K27Ac, H3K4me1, H3K4me2, H3K4me3, H3K79me2, and H3K9Ac may help to explain why this gene is highly expressed in liver cells (Box 4.3). Detailed information about features in the Regulatory Build track, such as the source of the data, is available under the Regulation tab. Click on the feature and select its identifier (the letters ENSR, followed by numbers) to open this tab.


Figure 4.17 Zooming in on the bottom section of the Location tab from Figure 4.16. (a) Highlight a region of interest, the final exon of PAH transcript PAH-203, by clicking the mouse and then scrolling to the left or right. In order to highlight the region, the Drag/Select toggle in the blue bar at the top of the section must first be set to Select. (b) To zoom in to the highlighted region, select Jump to region. It may take a few iterations to create the view in this figure. At the bottom of the window is a track labeled All phenotype-associated – short variants (SNPs and indels). In this track, the SNP rs76296470 has been manually highlighted in red.


Figure 4.18 The Ensembl Variant tab. (a) To get more details about SNP rs76296470, click on the dark green SNP that is highlighted in red in the All phenotype-associated – short variants (SNPs and indels) track in Figure 4.17b. On the pop-up menu, click on more about rs76296470. The Phenotype Data section of the Variant tab is available from the link in the blue sidebar. This variant is pathogenic for phenylketonuria. (b) The Genes and regulation section of the Variant tab shows the location and function of the variant in the transcripts that overlap it. Depending on the transcript, the SNP can change a codon to a stop codon (stop gained), map downstream of a gene, or map to a non-coding transcript. The transcripts in this view represent alternatively spliced forms of the gene PAH.


Figure 4.19 The Ensembl Regulatory Build track. (a) Go to Configure this page on the left side of the Location tab and select Regulation > Histones & polymerases. Scroll to the right to find the HepG2 (human liver cancer) cell type. Mouse over the text HepG2 and turn on all features. Clicking on the box under the cell type will change the track style; leave that set to the default of Peaks. Click on the black check mark on the upper right corner of the configuration window to save the settings and exit the setup. To turn on the DNase1 (DNaseI hypersensitive sites) track, select Regulation > Open chromatin & TFBS and ensure that the DNase1 box in the HepG2 column is colored dark blue so that it is in the Shown configuration. Click on the black check mark on the upper right corner of the configuration window to save the settings again. (b) Back on the Region in detail section of the Location tab, zoom in to the first exon of transcript PAH-215. Note that the first exon is on the right end of the transcript, as the gene is transcribed from right to left. The resulting display shows the details of the Regulatory Build track. The figure legend (not shown) explains that the solid red box is a promoter. The DNaseI hypersensitive site and histone marks are also shown as colored boxes.

The left sidebar of the Location tab links to a number of additional useful resources. One of those, Comparative Genomics > Synteny, displays blocks of synteny between the human chromosome featured in the Location tab and chromosomes from about 30 different organisms. In these syntenic blocks, the order of genes and other sequence features is conserved across the genomes being compared. Figure 4.20a shows the synteny between human chromosome 12 and the mouse genome. A cartoon of the human chromosome 12 is shown in the center of the display as a thick white rectangle, and mouse chromosomes are drawn on the sides as thinner white rectangles. Colored rectangles indicate regions of synteny between the human and mouse. For example, the light blue region on human chromosome 12 is syntenic to the light blue region on mouse chromosome 10. The region surrounding the PAH gene is outlined in red on both human chromosome 12 and mouse chromosome 10. Below the cartoon is a list of the human genes and corresponding mouse orthologs in the region of PAH. Selecting Region Comparison next to one of the genes opens a new Location tab that depicts the syntenic human and mouse chromosomes stacked on top of each other so that surrounding features can be compared directly (Figure 4.20b). The upper panel shows the genomic context of the PAH gene on human chromosome 12 (top) and mouse chromosome 10 (bottom). Note that the genes are transcribed in opposite directions, so the order of the surrounding genes is flipped. The bottom panel is zoomed in on the PAH gene itself. The Regulatory Build track on the mouse assembly shows several regulatory features in this region. Further inspection of the regulatory feature that overlaps with the 5′ end of the mouse Pah gene reveals activating histone marks in liver and kidney cells, but not in other cell types (not shown), implying that the mouse Pah gene has similar expression patterns to its human ortholog. To reset the settings back to the default view, go to Configure this page in the left sidebar and select Reset configuration.
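The idea that gene order is conserved within a syntenic block, possibly in flipped orientation, can be expressed as a small check on two ordered gene lists. This is a rough sketch under simplifying assumptions: orthologs share one name across species, each gene occurs once, and the gene names used in the example are placeholders rather than the real neighbors of PAH. It does not reproduce how Ensembl computes synteny.

```python
from typing import List


def block_orientation(reference_order: List[str], target_order: List[str]) -> str:
    """Compare the order of shared genes along two chromosomes.

    Returns "collinear" if shared genes appear in the same order,
    "inverted" if they appear in exactly the reverse order,
    "rearranged" otherwise, and "undetermined" with fewer than two shared genes.
    """
    position = {gene: index for index, gene in enumerate(target_order)}
    # Rank of each shared gene in the target, listed in reference order.
    ranks = [position[gene] for gene in reference_order if gene in position]
    if len(ranks) < 2:
        return "undetermined"
    pairs = list(zip(ranks, ranks[1:]))
    if all(a < b for a, b in pairs):
        return "collinear"
    if all(a > b for a, b in pairs):
        return "inverted"
    return "rearranged"


if __name__ == "__main__":
    # Placeholder gene names, left to right along each chromosome.
    human = ["geneA", "geneB", "geneC", "geneD"]
    mouse = ["geneD", "geneC", "geneB", "geneA"]
    print(block_orientation(human, mouse))
```

An "inverted" result corresponds to the situation in Figure 4.20b, where the human and mouse regions are displayed in opposite orientations yet contain the same genes in the same relative arrangement.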

Figure 4.20 The Synteny view at Ensembl. (a) An overview of the syntenic blocks shared between human chromosome 12 and the mouse genome. The human chromosome is drawn in the middle of the display as a thick white box. The syntenic mouse chromosomes are represented by thinner white boxes along the side. The colored rectangles highlight regions of synteny between the human and mouse. A red outline illustrates the position of the PAH gene on the blue region of human chromosome 12 and on the blue region of mouse chromosome 10. (b) The Location tab for the PAH gene showing both the human and mouse syntenic regions. This is similar to the three-panel location tab shown in Figure 4.16, except that both the human and mouse genomes are depicted. The top panel (not shown) displays the full length human chromosome 12 and mouse chromosome 10. The second panel shows an overview of the genes in the region. The third panel focuses in on the PAH gene. Note that the regions in human and mouse appear to be presented in opposite orientations; in human, the PAH and IGF1 genes are both transcribed from right to left, while in mouse they are transcribed from left to right.


Figure 4.21 Ensembl BLAST output, showing an alignment between the human ADAM18 protein and the lizard genome translated in all six reading frames. On the BLAST/BLAT page at Ensembl, paste the FASTA-formatted sequence of human ADAM18, accession NP_001307242.1, into the Sequence data box. This sequence can be found at www.ncbi.nlm.nih.gov/protein/NP_001307242.1/?report=fasta. Select Genomic sequence from the anole lizard as the DNA database. On the results page, select the Alignment link next to the highest scoring hit in order to view the sequence alignment. The human protein sequence is on top, and the translated lizard genomic sequence is below. Lines indicate identical amino acids.

The Ensembl sequence data can also be queried via a BLAT or BLAST search by following the link at the top of any page. Earlier in this chapter, Figure 4.11 outlined how to use BLAT to look for a lizard homolog of the human ADAM18 gene. Ensembl data can be searched by the more sensitive BLAST algorithm, including the TBLASTN program that is used to compare a protein query with a nucleotide database translated in all six reading frames. Copy and paste the FASTA-formatted protein sequence of NCBI RefSeq NP_001307242.1 into the Sequence data box on the BLAST page and carry out a TBLASTN search against the anole lizard genomic sequence. The sequence alignment of the top hit is shown in Figure 4.21. The human protein query is on the top line, and the translated lizard genomic sequence on the second. The sequences share only 32% sequence identity, but the alignment spans 650 amino acids, and some key sequence features are conserved; note the alignment of almost every cysteine residue. Thus, this lizard genomic sequence is indeed a homolog of human ADAM18. The BLAST algorithm, although about two orders of magnitude slower than BLAT for the same query, is able to find a lizard ortholog of the human protein.
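The two observations used above to judge the alignment, percent identity and conservation of cysteine residues, can be computed from any pair of aligned sequences. The sketch below assumes the alignment is given as two equal-length strings with `-` marking gaps, and it divides by the full alignment length including gap columns, which is the convention BLAST reports. The function name `alignment_stats` and the toy sequences are made up for illustration; they are not the ADAM18 alignment.

```python
from typing import Dict


def alignment_stats(query: str, subject: str) -> Dict[str, float]:
    """Summarise a pairwise alignment given as two gapped, equal-length strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    if not query:
        raise ValueError("alignment is empty")
    identical = 0          # columns where both residues match (gaps never match)
    query_cysteines = 0    # cysteines present in the query
    conserved_cysteines = 0  # query cysteines aligned to a subject cysteine
    for q, s in zip(query.upper(), subject.upper()):
        if q != "-" and q == s:
            identical += 1
        if q == "C":
            query_cysteines += 1
            if s == "C":
                conserved_cysteines += 1
    return {
        "columns": len(query),
        "identical": identical,
        "percent_identity": 100.0 * identical / len(query),
        "query_cysteines": query_cysteines,
        "conserved_cysteines": conserved_cysteines,
    }


if __name__ == "__main__":
    # Toy sequences for illustration only.
    print(alignment_stats("MCKC-AC", "MCRCWGC"))
```

A low overall identity combined with a high fraction of conserved cysteines is the pattern described in the text: residues that form disulfide bonds tend to be retained even when much of the surrounding sequence has diverged.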

