Introduction
On April 14, 2003, the biological community celebrated the achievement of the Human Genome Project's major goal: the complete, accurate, and high-quality sequencing of the human genome (International Human Genome Sequencing Consortium 2001; Schmutz et al. 2004). The attainment of this goal, which many have compared to landing a person on the moon, has had a profound effect on how biological and biomedical research is conducted and will undoubtedly continue to have a profound effect on its direction in the future. The availability of not just human genome data, but also human sequence variation data, model organism sequence data, and information on gene structure and function provides fertile ground for biologists to better design and interpret their experiments in the laboratory, fulfilling the promise of bioinformatics in advancing and accelerating biological discovery.
One of the most important databases available to biologists is GenBank, the annotated collection of all publicly available DNA and protein sequences (Benson et al. 2017; see Chapter 1). This database, maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), represents a collaborative effort between NCBI, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ). At the time of this writing, GenBank contained over 200 million sequences and over 300 trillion nucleotide bases. The completion of human genome sequencing and the sequencing of an ever-expanding number of model organism genomes, as well as the existence of a gargantuan number of sequences in general, provide a golden opportunity for biological scientists, owing to the inherent value of these data. However, at the same time, the sheer magnitude of data presents a conundrum to the inexperienced user, resulting not just from the size of the “sequence information space” but from the fact that the information space continues to get larger and larger – by leaps and bounds – at a pace that will continue to accelerate, even though human genome sequencing has long been “completed.”
The effect of the Human Genome Project and other systematic sequencing projects on the continued accumulation of sequence data is illustrated by the growth of GenBank, as shown in Figure 2.1; the exponential growth rate illustrated in the figure is expected to continue for some time to come. The continued expansion of not just the sequence space but of the myriad biological data now available because of it underscores the need for all biologists to learn how to navigate this information effectively in their work. In some cases, the data found within these virtual treasure troves can even allow investigators to avoid performing expensive experiments themselves.
GenBank (or any other biological database, for that matter) serves little purpose unless the data can be easily searched and entries retrieved in a useful, meaningful format. Otherwise, sequencing efforts such as those described above have no useful end: without effective search and retrieval tools, the biological community as a whole cannot make use of the information hidden within these millions of bases and amino acids, much less the structures they form or the mutations they harbor. Much effort has gone into making such data accessible to the biologist, and a selection of the programs and interfaces resulting from these efforts is the focus of this chapter. The discussion will center on querying databases maintained by NCBI, as these more “general” repositories are far and away the ones most often accessed by biologists, but attention will also be given to specialized databases that provide information not necessarily found through the use of Entrez, NCBI's integrated information retrieval system.
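As a small illustration of what querying Entrez can look like outside the web interface, the sketch below builds a search URL for NCBI's public E-utilities service. It is a minimal example under stated assumptions, not the approach this chapter describes: the helper name `build_esearch_url` and the sample query are hypothetical, and the code only constructs the URL without sending any request.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of NCBI's public E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL asking Entrez for record IDs that match `term`.

    `build_esearch_url` is a hypothetical helper written for this sketch.
    """
    params = {
        "db": db,           # Entrez database to search, e.g. "nucleotide"
        "term": term,       # query text, optionally with Entrez field tags
        "retmax": retmax,   # maximum number of record IDs to return
        "retmode": "json",  # request a JSON-formatted response
    }
    # urlencode percent-escapes brackets and encodes spaces for the query string.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query (an assumption, not taken from the text):
# human BRCA1 records in the nucleotide database.
url = build_esearch_url(
    "nucleotide",
    "BRCA1[Gene Name] AND Homo sapiens[Organism]",
    retmax=5,
)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the identifiers of matching records. NCBI asks that automated clients limit their request rate, so anything beyond occasional queries should follow the current E-utilities usage guidelines.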
Figure 2.1 The exponential growth of GenBank in terms of number of nucleotides (squares, in millions) and number of sequences submitted (circles, in thousands). Source data for the figure have been obtained from the National Center for Biotechnology Information (NCBI) web site. Note that the period of accelerated growth after 1997 coincides with the completion of the Human Genome Project's genetic and physical mapping goals, setting the stage for high-accuracy, high-throughput sequencing, as well as the development of new sequencing technologies (Collins et al. 1998, 2003; Green et al. 2011).