Introduction
The first complete sequence of a eukaryotic genome, that of Saccharomyces cerevisiae, was published in 1996 (Goffeau et al. 1996). The chromosomes of this organism, which range in size from 270 to 1500 kb, presented an immediate challenge in data management, as the upper limit for single database entries in GenBank at the time was 350 kb. To better manage the yeast genome sequence, as well as other chromosome and genome-length sequences being deposited into GenBank around that time, the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) established the Genomes division of Entrez (Benson et al. 1997). Entries in this division were organized around a reference sequence onto which all other sequences from that organism were aligned. As these reference sequences have no size limit, “virtual” reference sequences of large genomes or chromosomes could be assembled from shorter GenBank sequences. For partially sequenced chromosomes, NCBI developed methods to integrate genetic, physical, and cytogenetic maps onto the framework of the whole chromosome. Thus, Entrez Genomes was able to provide the first graphical views of large-scale genomic sequence data.
The working draft of the human genome, completed in February 2001 (Lander et al. 2001), generated virtual reference sequences for each human chromosome, ranging in size from 46 to 246 Mb. NCBI created the first version of its human Map Viewer (Wheeler et al. 2001) shortly thereafter, in order to display these longer sequences. Around the same time, the University of California, Santa Cruz (UCSC) Genome Bioinformatics Group was developing its own human genome browser, based on software originally designed for displaying the much smaller Caenorhabditis elegans genome (Kent and Zahler 2000). Meanwhile, the Ensembl project at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) was building a system to annotate the human genome sequence automatically, and to store and visualize the data (Hubbard et al. 2002). The three genome browsers all came online at about the same time, and researchers began using them to help navigate the human genome (Wolfsberg et al. 2002). Today, each site provides free access not only to human sequence data but also to a myriad of other assembled genomic sequences, from commonly used model organisms such as mouse to more recently released assemblies such as those of the domesticated turkey. Although NCBI's Map Viewer is no longer being developed and will be replaced by its new Genome Data Viewer (Sayers et al. 2019), the UCSC and Ensembl Genome Browsers continue to be popular resources, used by most members of the bioinformatics and genomics communities. This chapter focuses on the latter two genome browsers.
The reference human genome was sequenced using a clone-by-clone shotgun sequencing strategy and was declared complete in April 2003, although sequencing of selected regions is still continuing. This strategy involved constructing a bacterial artificial chromosome (BAC) tiling map for each human chromosome, then sequencing each BAC using a shotgun sequencing approach (reviewed in Green 2001). The sequences of individual BACs were deposited into the High Throughput Genomic (HTG) division of GenBank as they became available. UCSC began assembling these BAC sequences into longer contigs in May 2000 (Kent and Haussler 2001), followed by assembly efforts undertaken at NCBI (Kitts 2003). These contigs, which contained gaps and regions of uncertain order, became the basis for the development of the genome browsers. Over time, as the genome sequence was finished, the human genome assembly was updated every few months. After UCSC stopped producing its own human genome assemblies in August 2001, NCBI built eight reference human genome assemblies for the bioinformatics community, culminating with a final assembly in March 2006. Subsequently, an international collaboration that includes the Wellcome Trust Sanger Institute (WTSI), the Genome Institute at Washington University, EBI, and NCBI formed the Genome Reference Consortium (GRC), which took over responsibility for subsequent assemblies of the human genome. This consortium has produced two human genome assemblies, namely GRCh37 in February 2009 and GRCh38 in December 2013. As one might expect, each new genome assembly leads to changes in the sequence coordinates of annotated features. Between the releases of major assemblies, the GRC creates patches, which either correct errors in the assembly or add alternate loci. These alternate loci are multiple representations of regions that are too variable to be represented by a single reference sequence, such as the killer cell immunoglobulin-like receptor (KIR) gene cluster on chromosome 19 and the major histocompatibility complex (MHC) locus on chromosome 6. Unlike new genome assemblies, patches do not affect the chromosomal coordinates of annotated features. GRCh38.p10 has 282 alternate loci or patches.
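The distinction between major assemblies and patches can be made concrete with a short sketch. It assumes that assembly names follow the pattern seen above (a major release such as GRCh38, optionally followed by a patch suffix such as .p10). The function names parse_assembly and same_coordinates are hypothetical and were chosen for this illustration only. The check applies to the chromosomal coordinates of the primary assembly and not to the separate sequences that represent alternate loci.

```python
import re
from typing import NamedTuple


class AssemblyVersion(NamedTuple):
    major: str  # major release, e.g. "GRCh38"
    patch: int  # patch number; 0 when the name has no ".pN" suffix


def parse_assembly(name: str) -> AssemblyVersion:
    """Split a GRC-style name such as 'GRCh38.p10' into major release and patch.

    The naming pattern is an assumption made for this sketch.
    """
    match = re.fullmatch(r"(GRC[a-z]\d+[a-z]?)(?:\.p(\d+))?", name)
    if match is None:
        raise ValueError(f"not a GRC-style assembly name: {name!r}")
    return AssemblyVersion(match.group(1), int(match.group(2) or 0))


def same_coordinates(a: str, b: str) -> bool:
    """Return True when both names belong to the same major assembly.

    Patches leave chromosomal coordinates unchanged, so only the major
    release is compared.
    """
    return parse_assembly(a).major == parse_assembly(b).major
```

Under these assumptions, same_coordinates("GRCh38", "GRCh38.p10") is true, whereas comparing any GRCh37 release with any GRCh38 release is false, signaling that feature coordinates must be converted before they are compared.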
While the GRC also assembles the mouse, zebrafish, and chicken genomes, other genomes are sequenced and assembled by specialized sequencing consortia. The panda genome sequence, published in 2009, was the first mammalian genome to abandon the clone-based sequencing strategies used for human and mouse, relying entirely on next-generation sequencing methodologies (Li et al. 2010). Subsequent advances in sequencing technologies have led to rapid increases in the number of complete genome sequences. At the time of this writing, both the UCSC Genome Browser and the main Ensembl web site host genome assemblies of more than 100 organisms. The look and feel of each genome browser is the same regardless of the species displayed; however, the types of annotation differ depending on what data are available for each organism.
The backbone of each browser is an assembled genomic sequence. Although the underlying genomic sequence is, with a few exceptions, the same in both genome browsers, each team calculates its annotations independently. Depending on the type of analysis, a user may find that one genome browser has more relevant information than the other. The location of genes, both known and predicted, is a central focus of both genome browsers. For human, at present, both browsers feature the GENCODE gene predictions, an effort that is aimed at providing robust evidence-based reference gene sets (Harrow et al. 2012). Other types of genomic data are also mapped to the genome assembly, including NCBI reference sequences, single-nucleotide polymorphisms (SNPs) and other variants, gene regulatory regions, and gene expression data, as well as homologous sequences from other organisms. Both genome browsers can be accessed through a web interface that allows users to navigate through a graphical view of the genome. However, for those wishing to carry out their own calculations, sequences and annotations can also be retrieved in text format. Each browser also provides a sequence search tool – BLAT (Kent 2002) or BLAST (Camacho et al. 2009) – for interrogating the data via a nucleotide or protein sequence query. (Additional information on both BLAT and BLAST is provided in Chapter 3.)
In order to provide stability and ensure that old analyses can be reproduced, both genome browsers make available not only the current version of the genome assemblies but older ones as well. In addition, annotation tracks, such as the GENCODE gene track and the SNP track, may be based on different versions of the underlying data. Thus, users are encouraged to verify the version of all data (both genome assembly and annotations) when comparing a region of interest between the UCSC and Ensembl Genome Browsers.
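As a minimal sketch of this kind of version check, the following code records which assembly and gene set a region was viewed on at each site and reports any disagreement. It relies on the fact that UCSC labels the human GRCh37 and GRCh38 assemblies hg19 and hg38. The RegionView class, the function names, and the gene set labels in the usage note are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import List

# UCSC short names for the two GRC human assemblies discussed in this chapter.
UCSC_TO_GRC = {"hg19": "GRCh37", "hg38": "GRCh38"}


@dataclass(frozen=True)
class RegionView:
    """Hypothetical record of the data versions behind one browser view."""
    browser: str   # e.g. "UCSC" or "Ensembl"
    assembly: str  # name as shown by the browser, e.g. "hg38" or "GRCh38.p12"
    gene_set: str  # annotation release label noted by the user


def normalize_assembly(name: str) -> str:
    """Drop any patch suffix and translate UCSC names to GRC names."""
    base = name.split(".")[0]
    return UCSC_TO_GRC.get(base, base)


def version_mismatches(a: RegionView, b: RegionView) -> List[str]:
    """List the ways in which two views rest on different underlying data."""
    problems = []
    if normalize_assembly(a.assembly) != normalize_assembly(b.assembly):
        problems.append(f"assembly differs: {a.assembly} vs {b.assembly}")
    if a.gene_set != b.gene_set:
        problems.append(f"gene set differs: {a.gene_set} vs {b.gene_set}")
    return problems
```

For example, a UCSC view recorded as RegionView("UCSC", "hg38", "GENCODE v29") and an Ensembl view recorded as RegionView("Ensembl", "GRCh38.p12", "GENCODE v29") yield an empty list, so the two displays can be compared directly. A non-empty list indicates that differences between the two displays may reflect the data versions and not the biology.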
This chapter presents general guidelines for accessing the genome sequence and annotations using the UCSC and Ensembl Genome Browsers. Although similar analyses could be carried out with either browser, we have chosen to use different examples at the two sites to illustrate different types of questions that a researcher might want to ask. We finish with a short description of JBrowse (Buels et al. 2016), another web-based genome browser that users can set up on their own servers to share custom genome assemblies and annotations. All of the resources discussed in this chapter are freely available.