Read the book Bioinformatics - Group of authors - Page 82

ENSEMBL Genome Browser


The Ensembl Genome Browser (Cunningham et al. 2019) got its start in 1999 (Hubbard et al. 2002) with the display of the human genome assembly. Like the UCSC Genome Browser, it has grown significantly over the years. The main Ensembl site focuses on vertebrates and includes assemblies from almost 90 species. Ensembl has also created specialized sibling databases for other groups of organisms, including EnsemblPlants (nearly 50 species), EnsemblMetazoa (nearly 70 species), EnsemblProtist (more than 100 species), EnsemblFungi (more than 800 species), and the very large EnsemblBacteria, with around 44 000 species. The amount of available genome data and annotations varies by organism, but the general browser navigation principles are the same for all. An additional resource is Pre!Ensembl, which displays genomes that are in the process of being annotated. Genomes on this site have an assembly and a BLAST interface but, for the most part, no gene predictions.

Like the UCSC Genome Browser, the Ensembl Browser makes available multiple versions of genome assemblies. Integrated into the assemblies may be gene, genome variation, gene regulation, and comparative genomics annotation. Annotations are organized as sets of tracks. Ensembl incorporates data from a variety of public sources, including NCBI, UCSC, model organism databases, and more, and updates data and software in a formal release process, which can be tracked by release number. Importantly, previous Ensembl releases are archived on the web site and remain available for viewing. Thus, even after a genome assembly or annotation set has been updated, it is possible to view the older data using all the regular functions of the Ensembl web site. This archive process sets Ensembl apart from UCSC, where the genome assembly remains stable, but the annotations may change on a weekly basis. Each Ensembl page has a link at the bottom called View in archive site. The archive site provides links to older versions of that page, including previous annotation sets on the same genome assembly, as well as prior genome assemblies.

The Ensembl Browser provides many of the same types of resources and tools as does the UCSC Genome Browser. Sequences can be aligned to the assembled genomes using either BLAT or BLAST, and data can be returned in various tabular formats using BioMart (Kinsella et al. 2011). Data and software can be retrieved from the Downloads menu, available from most browser pages. In the Tools menu, Ensembl provides a number of additional tools to manipulate data, including the Variant Effect Predictor (VEP) (McLaren et al. 2016), which predicts functional consequences of known and unknown variants, File Chameleon, which reformats files available on the Ensembl FTP site, and Assembly Converter, which is like UCSC's liftOver and is used to convert coordinates between genome assemblies. The Help & Documentation menu provides substantial written and video-based information about how to navigate and interpret the Ensembl site, far beyond the level of detail presented in this chapter.
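The coordinate conversion performed by Assembly Converter can also be requested programmatically. The sketch below is not part of the web tools described above: it assumes the public Ensembl REST server at https://rest.ensembl.org and its map endpoint, and the helper names build_map_url and mapped_regions are hypothetical. The response shape handled by mapped_regions is an assumption about the JSON the service returns, and the coordinates in the usage comment are placeholders rather than a real gene location.

```python
from urllib.parse import quote

REST_SERVER = "https://rest.ensembl.org"  # assumed: public Ensembl REST server


def build_map_url(species, source_assembly, chrom, start, end,
                  target_assembly, strand=1):
    """Build a URL for the REST 'map' endpoint (assembly-to-assembly mapping)."""
    if start < 1 or end < start:
        raise ValueError("expected 1-based coordinates with start <= end")
    # Region syntax used by the endpoint: chromosome:start..end:strand
    region = f"{chrom}:{start}..{end}:{strand}"
    parts = [species, source_assembly, region, target_assembly]
    # Keep ':' unescaped because it is part of the region syntax.
    path = "/".join(quote(str(part), safe=":") for part in parts)
    return f"{REST_SERVER}/map/{path}"


def mapped_regions(payload):
    """Return (chrom, start, end, strand) tuples from a decoded JSON response.

    Assumed response shape:
    {"mappings": [{"original": {...}, "mapped": {...}}, ...]}
    """
    regions = []
    for mapping in payload.get("mappings", []):
        target = mapping["mapped"]
        regions.append((target["seq_region_name"], target["start"],
                        target["end"], target["strand"]))
    return regions


# Usage (placeholder coordinates, not a real gene location):
# url = build_map_url("human", "GRCh37", "12", 1000, 2000, "GRCh38")
# Request the URL with a "Content-Type: application/json" header,
# decode the JSON body, and pass the result to mapped_regions().
```

A single source region can map to several target regions when the assemblies differ, which is why mapped_regions returns a list rather than one tuple.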

Ensembl also provides ways for users to upload their data into the browser. Properly formatted tracks can be added to the display by selecting the Custom tracks option from the left side of any species-specific page. The data can be uploaded to Ensembl from a file on the user's computer or, if it is saved on a web server, the browser can read it from a URL. Users who create an account at Ensembl can save track data to the Ensembl database server and view them later from any computer. To share custom tracks or even a customized view of the Genome Browser with colleagues, click on the Share this Page link on the left sidebar. Ensembl also supports Track Hubs, both public ones that are registered on the EMBL-EBI Track Hub Registry as well as private ones.
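As an illustration of what a "properly formatted" track can look like, the sketch below writes a minimal BED file, one of the common text formats for custom tracks. It is an assumption that BED is the format a given user needs. The function names, the track name, and the feature shown are hypothetical, and the coordinates are placeholders rather than the real position of any gene. BED uses 0-based, half-open intervals, whereas the browser displays 1-based, inclusive coordinates, so the start position is shifted by one.

```python
def bed_line(chrom, start, end, name, strand="+", score=0):
    """Format one BED6 line from 1-based, inclusive browser coordinates."""
    if start < 1 or end < start:
        raise ValueError("expected 1-based coordinates with start <= end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # BED is 0-based and half-open: subtract 1 from the start, keep the end.
    fields = [chrom, start - 1, end, name, score, strand]
    return "\t".join(str(field) for field in fields)


def bed_track(track_name, features):
    """Build the text of a BED file: a track line followed by one line per feature."""
    lines = [f'track name="{track_name}"']
    lines.extend(bed_line(*feature) for feature in features)
    return "\n".join(lines) + "\n"


# Placeholder feature: (chromosome, start, end, name, strand)
example = bed_track("my_regions", [("12", 1001, 2000, "example_feature", "-")])

# To upload, save the text to a file and select it in the Custom tracks dialog:
# with open("my_regions.bed", "w") as handle:
#     handle.write(example)
```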


Figure 4.13 The home page of the Ensembl Genome Browser, showing a query for the human gene PAH. The browser suggests results based on the search term submitted. By default, the search box interfaces with the most recent version of the genome assembly, GRCh38, at the time of this writing. A link to the previous human genome assembly, GRCh37, is provided at the bottom of the page. Older assemblies from other organisms are available in the Ensembl archives.


Figure 4.14 The Gene tab for the human PAH gene. This landing page provides links to many gene-specific resources.

Like the UCSC Genome Browser home page, the home page of Ensembl is a stepping-off point for many Ensembl resources. Links to commonly used tools, such as BLAST and BLAT, are provided on the top and middle sections of the page, and recent data updates are highlighted in the right column. The home page for each genome can be accessed by selecting the organism name in the pull-down menu in the Browse a Genome section in the center of the page. A search box at the top of the page provides access to Ensembl. To search for the human PAH gene, select Human from the pull-down menu and type the term PAH in the search box. Ensembl will provide several suggested hits, including a direct link to the human PAH gene (Figure 4.13).

Ensembl data displays are organized in tabs. The Gene tab (Figure 4.14) has links to a number of gene-specific views and resources. For example, from the index on the left side of the Gene tab view, the Comparative Genomics → Orthologues link lists the computationally predicted orthologs of the selected gene that Ensembl has identified among the available genome assemblies (Herrero et al. 2016; Figure 4.15). The Location tab provides a graphical view of the genomic context of the gene, similar to the view available at UCSC. The link to the Location tab is at the top of the Gene tab view in Figure 4.14. The Location tab view is shown in Figure 4.16 and depicts, at three different zoom levels, the genomic context of the PAH gene on the GRCh38 genome assembly. The PAH gene has been mapped to chromosome 12, and the top panel shows a cartoon of that chromosome, with the region surrounding the PAH gene outlined in a red box. This red box is expanded in the middle panel of the figure, which shows ∼1 Mb of chromosome 12 around the PAH gene. The genes are shown as colored blocks, with their identifiers noted below them. The region outlined in red in this middle section is further expanded in the large bottom panel, which zooms in on the PAH gene itself. Individual tracks are visible in this view. Note the track called Contigs, a blue bar that represents the underlying assembled contigs. By convention, any transcripts shown above this track are transcribed from left to right. Transcripts drawn below the Contigs track, such as the PAH transcripts, are transcribed on the opposite strand, from right to left.
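The strand convention described above can be captured in a few lines. This sketch assumes strand is encoded as 1 (forward) or -1 (reverse), the encoding Ensembl uses in its data. The function name describe_strand is hypothetical.

```python
def describe_strand(strand):
    """Map a strand value to its placement and direction in the Location view."""
    if strand == 1:
        return {"placement": "above the Contigs track",
                "direction": "left to right"}
    if strand == -1:
        return {"placement": "below the Contigs track",
                "direction": "right to left"}
    raise ValueError("strand must be 1 (forward) or -1 (reverse)")


# The PAH transcripts are drawn below the Contigs track, i.e. strand -1.
pah_view = describe_strand(-1)
```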


Figure 4.15 Computationally predicted orthologs of the human PAH gene, from the Comparative Genomics → Orthologues link in Figure 4.14. Ensembl provides a detailed analysis of the orthologs calculated for each gene. Orthologs are grouped by species, such as primates, rodents, and sauropsids. Links to individual orthologs are shown at the bottom of the page.

The default human gene set used by Ensembl is the GENCODE Comprehensive set (Box 4.2). Ensembl displays 18 PAH isoforms, each with a slightly different pattern of exons (Figure 4.16). Coding exons are depicted as solid blocks, non-coding exons as outlined blocks, and introns as the lines that connect them. The transcripts are color coded to indicate their status: gold transcripts are protein coding and have been annotated by both the Ensembl and HAVANA teams at the Wellcome Trust Sanger Institute (WTSI), red transcripts are protein coding and have been annotated by either Ensembl or HAVANA, and blue transcripts are processed transcripts that are non-protein coding. Clicking on a transcript pops up a box with additional information about that feature, including its accession number, transcript type, and gene prediction source (Box 4.4; Figure 4.16).
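The color scheme can be restated as a small lookup, which may help when reading the figure legend. This sketch covers only the three categories named in the text. The function name legend_colour is hypothetical, and the biotype strings "protein_coding" and "processed_transcript" are assumed labels for the two transcript types. The browser's actual legend contains more categories than are modeled here.

```python
def legend_colour(biotype, annotated_by):
    """Return the display colour for a transcript, or None if not covered here.

    biotype: assumed label, "protein_coding" or "processed_transcript"
    annotated_by: iterable of annotation sources, e.g. ["ensembl", "havana"]
    """
    sources = {source.lower() for source in annotated_by}
    if biotype == "protein_coding":
        if {"ensembl", "havana"} <= sources:
            return "gold"   # annotated by both Ensembl and HAVANA
        if sources & {"ensembl", "havana"}:
            return "red"    # annotated by only one of the two
    if biotype == "processed_transcript":
        return "blue"       # non-protein-coding processed transcript
    return None             # category not described in the text
```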


Figure 4.16 The Location tab for the human PAH gene. The Location tab is divided into three sections. The top section shows a cartoon of human chromosome 12, with the region surrounding the PAH gene outlined in a red box. Other red and green lines on the cartoon indicate assembly exceptions, or regions of alternative sequence that differ from the primary assembly because of allelic sequence or incorrect sequence, as determined by the Genome Reference Consortium. The Region in detail shows a zoomed-in view of the region outlined by the red box in the top section of the page. Genes are indicated by rectangles, colored as described in the gene legend below the graphic. The gene identifiers, along with the direction of transcription, are shown below the rectangles. The bottom section shows a zoomed-in view of the region surrounded by the red box in the Region in detail. The blue bar represents the genomic contig in this region. In the Genes track, genes above the bar are transcribed from left to right; those below the contig are transcribed from right to left. A few of the PAH transcripts, which are transcribed from right to left, are visible in this view. Gold transcripts are merged HAVANA/Ensembl transcripts; red are Ensembl protein-coding transcripts; blue transcripts are non-protein-coding processed transcripts. The pop-up display, activated when clicking on a particular transcript, shows the details for the first transcript in the Genes track, PAH-215.
