Bioinformatics, various authors
Introduction
Over the past several decades, there has been a feverish push to understand, at the most elementary of levels, what constitutes the basic “book of life.” Biologists (and scientists in general) are driven to understand how the millions or billions of bases in an organism's genome contain all of the information needed for the cell to conduct the myriad metabolic processes necessary for the organism's survival – information that is propagated from generation to generation. To have a basic understanding of how the collection of individual nucleotide bases drives the engine of life, large amounts of sequence data must be collected and stored in a way that allows these data to be searched and analyzed easily. To this end, much effort has gone into the design and maintenance of biological sequence databases. These databases have had a significant impact on the advancement of our understanding of biology, not just from a computational standpoint but also through their integrated use alongside studies being performed at the bench.
The history of sequence databases began in the early 1960s, when Margaret Dayhoff and colleagues (1965) at the National Biomedical Research Foundation (NBRF) collected all of the protein sequences known at that time – all 65 of them – and published them in a book called the Atlas of Protein Sequence and Structure. It is important to remember that, at this point in the history of biology, the focus was on sequencing proteins through traditional techniques such as the Edman degradation rather than on sequencing DNA, hence the overall small number of available sequences. By the late 1970s, when a significant number of nucleotide sequences became available, those were also included in later editions of the Atlas. As this collection evolved, it included text-based descriptions to accompany the protein sequences, as well as information regarding the evolution of many protein families. This work, in essence, was the first annotated sequence database, even though it was in printed form. Over time, the amount of data contained in the Atlas became unwieldy and the need for it to be available in electronic form became obvious. From the early 1970s to the late 1980s, the contents of the Atlas were distributed electronically by NBRF (and later by the Protein Information Resource, or PIR) on magnetic tape, and the distribution included some basic programs that could be used to search and evaluate distant evolutionary relationships.
The next phase in the history of sequence databases was precipitated by the veritable explosion in the amount of nucleotide sequence data available to researchers by the end of the 1970s. To address the need for more robust public sequence databases, the Los Alamos National Laboratory (LANL) created the Los Alamos DNA Sequence Database in 1979, which became known as GenBank in 1982 (Benson et al. 2018). Meanwhile, the European Molecular Biology Laboratory (EMBL) created the EMBL Nucleotide Sequence Data Library in 1980. Throughout the 1980s, EMBL (then based in Heidelberg, Germany), LANL, and (later) the National Center for Biotechnology Information (NCBI, part of the National Library of Medicine at the National Institutes of Health) jointly contributed DNA sequence data to these databases. This was done by having teams of curators manually transcribe and interpret what was published in print journals into an electronic format more appropriate for computational analyses. The DNA Data Bank of Japan (DDBJ; Kodama et al. 2018) joined this DNA data-collecting collaboration a few years later. By the late 1980s, the quantity of DNA sequence data being produced was so overwhelming that print journals began asking scientists to electronically submit their DNA sequences directly to these databases, rather than publishing them in printed journals or papers. In 1988, after a meeting of these three groups (now referred to as the International Nucleotide Sequence Database Collaboration, or INSDC; Karsch-Mizrachi et al. 2018), there was an agreement to use a common data exchange format and to have each database update only the records that were directly submitted to it. Thanks to this agreement, all three centers (EMBL, DDBJ, and NCBI) now collect direct DNA sequence submissions and distribute them so that each center has copies of all of the sequences, with each center acting as a primary distribution center for these sequences.
DDBJ/EMBL/GenBank records are updated automatically every 24 hours at all three sites, meaning that all sequences can be found within DDBJ, the European Nucleotide Archive (ENA; Silvester et al. 2018), and GenBank in short order. That said, each database within the INSDC has the freedom to display and annotate the sequence data as it sees fit.
In parallel with the early work being done on DNA sequence databases, the foundations for the Swiss-Prot protein sequence database were also being laid in the early 1980s by Amos Bairoch, who recounts its history from an engaging perspective in a first-person review (Bairoch 2000). Bairoch converted PIR's Atlas to a format similar to that used by EMBL for its nucleotide database. In this initial release, called PIR+, additional information about each of the proteins was added, increasing its value as a curated, well-annotated source of information on proteins. In the summer of 1986, Bairoch began distributing PIR+ on the US BIONET (a precursor to the Internet), renaming it Swiss-Prot. At that time, it contained the grand sum of 3900 protein sequences, which was seen as an overwhelming amount of data, in stark contrast to today's standards. As Swiss-Prot and EMBL followed similar formats, a natural collaboration developed between these two groups, and these collaborative efforts strengthened when both EMBL's and Swiss-Prot's operations were moved to EMBL's European Bioinformatics Institute (EBI; Cook et al. 2018) in Hinxton, UK. One of the first collaborative projects undertaken by the Swiss-Prot and EMBL teams was to create a new and much larger protein sequence database supplement to Swiss-Prot. Maintaining the high quality of Swiss-Prot entries was a time-consuming process involving extensive sequence analysis and detailed curation by expert annotators (Apweiler 2001). To allow the quick release of protein data not yet annotated to Swiss-Prot's stringent standards, a new database called TrEMBL (for “translation of EMBL nucleotide sequences”) was created. This supplement to Swiss-Prot initially consisted of computationally annotated sequence entries derived from the translation of all coding sequences (CDSs) found in INSDC databases.
In 2002, a new effort involving the Swiss Institute of Bioinformatics, EMBL-EBI, and PIR was launched, called the UniProt consortium (UniProt Consortium 2017). This effort gave rise to the UniProt Knowledgebase (UniProtKB), consisting of Swiss-Prot, TrEMBL, and PIR. A similar effort also gave rise to the NCBI Protein Database, which brings together data from numerous sources and is described more fully in the text that follows.
The completion of human genome sequencing and the sequencing of numerous model genomes, as well as the existence of a gargantuan number of sequences in general, provides a golden opportunity for biological scientists, owing to the inherent value of these data. At the same time, the sheer magnitude of data also presents a conundrum to the inexperienced user, resulting not just from the size of the “sequence information space” but from the fact that the information space continues to get larger by leaps and bounds. Indeed, the sequencing landscape has changed significantly in recent years with the development of new high-throughput technologies that generate more and more sequence data in a way that is best described as “better, cheaper, faster,” with these advances feeding into the “insatiable appetite” that scientists have for more and more sequence data (Green et al. 2017). Given the inherent value of the data contained within these sequence databases, this chapter will focus on providing the reader with a solid understanding of these major public sequence databases, as a first step toward being able to perform robust and accurate bioinformatic analyses.