Bioinformatics - Group of authors - Page 79
Box 4.2 GENCODE
The GENCODE gene set was originally developed by the ENCODE Consortium as a comprehensive source of high-quality human gene annotations (Harrow et al. 2012). It has now been expanded to include the mouse genome (Mudge and Harrow 2015). The goal of the GENCODE project is to include all alternative splice variants of protein-coding loci, as well as non-coding loci and pseudogenes. The GENCODE Consortium uses computational methods, manual curation, and experimental validation to identify these gene features. The first step is carried out by the same Ensembl gene annotation pipeline that is used to annotate all vertebrate genomes displayed at Ensembl (Aken et al. 2016). This pipeline aligns cDNAs, proteins, and RNA-seq data to the human genome in order to create candidate transcript models. All Ensembl transcript models are supported by experimental evidence; no models are created solely from ab initio predictions. The Human and Vertebrate Analysis and Annotation (HAVANA) group produces manually curated gene sets for several vertebrate genomes, including mouse and human. These manually curated genes are merged with the Ensembl transcript models to create the GENCODE gene sets for mouse and human. A subset of the human models has been confirmed by an experimental validation pipeline (Howald et al. 2012).
The consortium makes available two types of GENCODE gene sets. The Comprehensive set encompasses all gene models, and may include many alternatively spliced transcripts (isoforms) for each gene. The Basic set includes a subset of representative transcripts for each gene that prioritizes full-length protein-coding transcripts over partial- or non-protein-coding transcripts. The Ensembl Genome Browser displays the Comprehensive set by default. Although the UCSC Genome Browser displays the Basic set by default, the Comprehensive set can be selected by changing the GENCODE track settings. At the time of this writing, Ensembl is displaying GENCODE v27, released in August 2017. The GENCODE version available by default at the UCSC Genome Browser is v24, from December 2015. More recent versions of GENCODE can be added to the browser by selecting them in the All GENCODE super-track.
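The relationship between the two sets can be illustrated with a short Python sketch. It assumes the GTF files distributed by GENCODE, in which transcripts belonging to the Basic set carry a `tag "basic"` attribute; the sample records, identifiers, and function names below are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical GENCODE-style GTF records (IDs and coordinates are made up).
SAMPLE_GTF = """\
chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "GENEA"; transcript_type "protein_coding"; tag "basic";
chr1\tHAVANA\ttranscript\t100\t700\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_name "GENEA"; transcript_type "retained_intron";
chr1\tENSEMBL\ttranscript\t150\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T3"; gene_name "GENEA"; transcript_type "protein_coding"; tag "basic"; tag "CCDS";
"""

# Matches quoted attributes such as: transcript_id "T1"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Map each attribute key to the list of its values (tags can repeat)."""
    attrs = {}
    for key, value in ATTR_RE.findall(field):
        attrs.setdefault(key, []).append(value)
    return attrs


def transcript_ids(gtf_text, basic_only=False):
    """List transcript IDs, optionally restricted to the Basic set."""
    ids = []
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue  # keep only transcript-level records
        attrs = parse_attributes(cols[8])
        if basic_only and "basic" not in attrs.get("tag", []):
            continue
        ids.append(attrs["transcript_id"][0])
    return ids


comprehensive = transcript_ids(SAMPLE_GTF)
basic = transcript_ids(SAMPLE_GTF, basic_only=True)
print("Comprehensive:", comprehensive)
print("Basic:", basic)
```

In this toy example the retained-intron transcript appears only in the Comprehensive list, so the Basic list is a strict subset of it.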
GENCODE and RefSeq both aim to provide a comprehensive gene set for mouse and human. Frankish et al. (2015) have shown that, in human, the RefSeq gene set is more similar to the GENCODE Basic set, while the GENCODE Comprehensive set contains more alternative splicing and exons, as well as more novel protein-coding sequences, thus covering more of the genome. They also sought to determine which gene set would provide the best reference transcriptome for annotating variants. They found that the GENCODE Comprehensive set, because of its better genomic coverage, was better for discovering new variants with functional potential, while the GENCODE Basic set may be better suited for applications where a less complex set of transcripts is needed. Similarly, Wu et al. (2013) compared the use of different gene sets to quantify RNA-seq reads and determine gene expression levels. Like Frankish et al., they recommend using less complex gene annotations (such as the RefSeq gene set) for gene expression estimates, but more complex gene annotations (such as GENCODE) for exploratory research on novel transcriptional or regulatory mechanisms.
In the GENCODE track, as well as other gene tracks, exons (regions of the transcript that align with the genome) are depicted as blocks, while introns are drawn as the horizontal lines that connect the exons. The direction of transcription is indicated by arrowheads on the introns. Coding regions of exons are depicted as tall blocks, while non-coding exons are shorter. In this example, the GENCODE track depicts five alternatively spliced transcripts, labeled HIF1A on the left, for the HIF1A gene. As shown by the arrowheads, all transcripts are transcribed from left to right. The 5′-most exon of each transcript (on the left side of the display) is shorter on the left, indicating an untranslated region (UTR), and taller on the right, indicating a coding sequence. The reverse is true for the 3′-most exon of each transcript. A very close visual inspection of the Genome Browser shows that the last four HIF1A transcripts have a different pattern of exons from each other; a BLAST search (not shown) reveals that the first two transcripts differ by only three nucleotides in one exon. There is also a transcript labeled HIF1A-AS2, an anti-sense HIF1A transcript that is transcribed from right to left. Another transcript, labeled RP11-618G20.1, is a synthetic construct DNA. Zooming the display out by 3× allows a view of the genes immediately upstream and downstream of HIF1A (Figure 4.3). A second HIF1A antisense transcript, HIF1A-AS1, is also visible.
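The drawing convention described above can be made concrete with a small Python sketch. It is not the Genome Browser's own code. It assumes a transcript is given as a sorted list of exon intervals plus the bounds of its coding region, all as 0-based, half-open genomic coordinates, and the function names and coordinates are hypothetical.

```python
def draw_segments(exons, cds_start, cds_end):
    """Split exons into thin (UTR) and thick (coding) blocks.

    exons: sorted list of (start, end) half-open genomic intervals.
    cds_start, cds_end: half-open bounds of the coding region.
    Returns a list of (start, end, kind), kind being "thin" or "thick".
    """
    segments = []
    for start, end in exons:
        # Portion of the exon that lies before the coding region.
        if start < cds_start:
            segments.append((start, min(end, cds_start), "thin"))
        # Portion of the exon that overlaps the coding region.
        thick_start, thick_end = max(start, cds_start), min(end, cds_end)
        if thick_start < thick_end:
            segments.append((thick_start, thick_end, "thick"))
        # Portion of the exon that lies after the coding region.
        if end > cds_end:
            segments.append((max(start, cds_end), end, "thin"))
    return segments


def introns(exons):
    """Introns are the gaps between consecutive exons (drawn as lines)."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


# Hypothetical three-exon transcript on the forward strand.
exons = [(100, 200), (300, 400), (500, 650)]
segments = draw_segments(exons, cds_start=150, cds_end=600)
gaps = introns(exons)

for start, end, kind in segments:
    print(f"{start}-{end}\t{kind}")
print("introns:", gaps)
```

For a transcript on the forward strand, as in this toy example, the thin block on the left of the first exon corresponds to the 5′ UTR and the thin block on the right of the last exon to the 3′ UTR, mirroring the HIF1A display.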
Below the GENCODE track is the track labeled RefSeq gene predictions from NCBI. This is a composite track showing human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq; Box 1.2). By default, the RefSeq track is shown in dense mode, with the exons of the individual transcripts condensed into a single line (Figure 4.2). Note that, in this dense mode, the exons are displayed as blocks, as in the GENCODE track, but there are no arrowheads on the gene model to show the direction of transcription. To change the display of the RefSeq track to view individual transcripts, open the Track Settings page for the NCBI RefSeq track by clicking on the track name in the first row of the Genes and Gene Predictions section (below the graphical view shown in Figure 4.2). The resulting Track Settings page (Figure 4.4) allows the user to choose which type of RefSeqs to display (e.g. all, curated only, or predicted only). In this example, we change the mode of the RefSeq Curated track from dense to full, and the resulting graphical view (Figure 4.5) displays each curated RefSeq as a separate transcript. In contrast to the GENCODE track, there are only three RefSeq transcripts for the HIF1A gene, and the HIF1A-AS2 RefSeq transcript is much shorter than the GENCODE transcript with the same name. These discrepancies are due to differences in how the RefSeq and GENCODE transcript sets are assembled (Boxes 1.2 and 4.2).
Figure 4.3 The genomic context of the human HIF1A gene, after clicking on zoom out 3×. The genes immediately upstream (FLJ22447) and downstream (SNAPC1) of HIF1A are now visible.
Figure 4.4 The RefSeq Track Settings page. The track settings pages are used to configure the display of annotation tracks. By default, all of the RefSeq tracks are set to display in dense mode, with all features condensed into a single line. In this example, the Curated RefSeqs are being set to display in full mode, in which each RefSeq transcript will be labeled and displayed on a separate line. The remainder of the RefSeqs will be displayed in dense mode. The types of RefSeqs, curated and predicted, are described in Box 1.2. After changing the settings, press the submit button to apply them.
Additional information about each transcript in the GENCODE and RefSeq tracks is available by clicking on the gene symbol (HIF1A, in this case); as the original search was for HIF1A, the gene name is highlighted in inverse type. For GENCODE genes, UCSC has collected information from a variety of public sources and includes a text description, accession numbers, expression data, protein structure, Gene Ontology terms, and more. For RefSeq transcripts, UCSC provides links to NCBI resources. Both GENCODE and RefSeq details pages provide a link to Genomic Sequence in the Sequence and Links section, allowing users to retrieve genomic sequences connected to an individual transcript. From the selection menu (Figure 4.6), users can choose whether to download the sequence upstream or downstream of the gene, as well as the exon or intron sequence. The sequence is returned in FASTA format.
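What "upstream" means depends on the strand of the transcript, which the following Python sketch spells out. It is an illustration under stated assumptions, not the code behind the Get Genomic Sequence page: coordinates are taken to be 0-based and half-open, the chromosome is a short made-up string, and all function names are hypothetical.

```python
def upstream_interval(tx_start, tx_end, strand, length=1000):
    """Return the half-open interval `length` nt upstream of the TSS.

    On the forward strand the TSS is at tx_start, so upstream lies to
    the left; on the reverse strand the TSS is at tx_end, so upstream
    lies to the right on the genome.
    """
    if strand == "+":
        return max(0, tx_start - length), tx_start
    if strand == "-":
        return tx_end, tx_end + length
    raise ValueError("strand must be '+' or '-'")


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def fetch_upstream(chrom_seq, tx_start, tx_end, strand, length=1000):
    """Extract the upstream sequence, oriented 5'->3' relative to the gene."""
    start, end = upstream_interval(tx_start, tx_end, strand, length)
    seq = chrom_seq[start:end]  # slicing clips at the chromosome end
    return reverse_complement(seq) if strand == "-" else seq


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy 16-nt "chromosome" with a transcript occupying positions 8-12.
chrom = "AAAACCCCGGGGTTTT"
plus_upstream = fetch_upstream(chrom, 8, 12, "+", length=4)
minus_upstream = fetch_upstream(chrom, 8, 12, "-", length=4)
print(to_fasta("toy_plus_upstream", plus_upstream), end="")
print(to_fasta("toy_minus_upstream", minus_upstream), end="")
```

Because HIF1A is transcribed from left to right, the 1000 nt upstream of its transcription start site lie immediately to the left of the first exon in the graphical view; for an antisense transcript such as HIF1A-AS2, the upstream region would instead lie to the right and be reported as the reverse complement of the displayed strand.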
Figure 4.5 The genomic context of the human HIF1A gene, after displaying RefSeq Curated genes in full mode. Each RefSeq transcript is now drawn on a separate line, so that individual exons, as well as the direction of transcription, are visible. Compare this rendition with Figure 4.2, where all RefSeq transcripts are condensed on a single line.
Figure 4.6 The Get Genomic Sequence page that provides an interface for users to retrieve the sequence for a feature of interest. Click on an individual transcript in the GENCODE or RefSeq track to open a page with additional details for that transcript. On either of those details pages, click the link for Genomic Sequence to open the page displayed here, which provides choices for retrieving sequences upstream or downstream of the transcript, as well as intron or exon sequences. In this example, retrieve the sequence 1000 nt upstream of the annotated transcription start site. Shown in the inset is the result of retrieving the FASTA-formatted sequence 1000 nt upstream of the HIF1A transcript.
Further down on the graphical view shown in Figure 4.3 are tracks from the ENCODE Regulation super-track: Layered H3K27Ac and DNase Clusters. These data were generated by the Encyclopedia of DNA Elements (ENCODE) Consortium between 2003 and 2012 (ENCODE Project Consortium 2012). The ENCODE Consortium has developed reagents and tools to identify all functional elements in the human genome sequence. The Layered H3K27Ac track indicates regions where there are modified histones that may indicate active enhancers (Box 4.3).