Bioinformatics

Box 1.3 Ensuring the Continued Quality of Data in Public Sequence Databases

Given the roles of DDBJ, EMBL, and GenBank in maintaining the archive of all publicly available DNA, RNA, and protein sequences, the continued usefulness of this resource depends heavily on the quality of the data found within it. Despite the high degree of both manual and automated checking that takes place before a record becomes public, errors still find their way into the databases. These errors may be trivial and have no biological consequence (e.g. an incorrect postal code), may be misleading (e.g. an organism having the correct genus but the wrong species name), or may be downright incorrect (e.g. a full-length mRNA not having a CDS annotated on it). Sometimes, records have incorrect reference blocks, preventing researchers from linking to the correct publication describing the sequence. Over time, many users have taken an active role in reporting these errors but, more often than not, the errors are left uncorrected.
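One of the error types mentioned above, an mRNA record with no CDS annotated on it, is simple enough to screen for automatically. The following is a minimal sketch, not a description of the checks the databases themselves run. It assumes a GenBank-style flat file in which the molecule type appears on the LOCUS line and feature keys are indented by five spaces in the FEATURES table. The function names and the example record (accession XX000001) are hypothetical and exist only for illustration.

```python
import re


def feature_keys(flatfile_text):
    """Return the set of feature keys in the FEATURES table of one record."""
    keys = set()
    in_features = False
    for line in flatfile_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if in_features:
            # A non-indented line (e.g. ORIGIN or //) ends the feature table.
            if line and not line.startswith(" "):
                break
            # Feature keys sit after exactly five spaces; qualifier lines
            # are indented further and therefore do not match.
            match = re.match(r"^ {5}(\S+)", line)
            if match:
                keys.add(match.group(1))
    return keys


def mrna_missing_cds(flatfile_text):
    """Return True if the record is labeled mRNA but has no CDS feature."""
    lines = flatfile_text.strip().splitlines()
    if not lines:
        return False
    locus_line = lines[0]
    is_mrna = locus_line.startswith("LOCUS") and "mRNA" in locus_line.split()
    return is_mrna and "CDS" not in feature_keys(flatfile_text)


# Hypothetical record, invented for illustration only.
EXAMPLE = """\
LOCUS       XX000001     120 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical record for illustration only.
FEATURES             Location/Qualifiers
     source          1..120
                     /organism="Homo sapiens"
     gene            1..120
                     /gene="EXAMPLE"
ORIGIN
//
"""

if __name__ == "__main__":
    print(mrna_missing_cds(EXAMPLE))
```

A record flagged in this way is only a candidate for review: partial transcripts and non-coding RNAs can legitimately lack a CDS, so a human should confirm the problem before it is reported.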

While the individual INSDC members have the responsibility for hosting and disseminating the data found within their databases, keep in mind that the ownership of the data rests with the original submitter – and these original submitters (or their designees) are the only ones who can make updates to their database records. To keep these community resources as accurate and up to date as possible, users are actively encouraged to report any errors found when using the databases in the course of their work so that the database administrators can follow up with the original submitters as appropriate.

Given below are the current e-mail addresses for submitting information regarding errors to the three major sequence databases. As all the databases share information with each other nightly, it is only necessary to report the error to one of the three members of the consortium. Authors are actively encouraged to check their own records periodically to ensure that the information they previously submitted is still accurate. Even though this charge to the community is discussed here in the context of the three major sequence databases, all databases provide similar mechanisms through which incorrect information can be brought to the attention of the database administrators.

DDBJ: ddbjupdt@ddbj.nig.ac.jp
EMBL: datasubs@ebi.ac.uk
GenBank: gb-admin@ncbi.nlm.nih.gov

As alluded to above, the range of publicly available data goes well beyond human data, whether sequence based or not. Because the major public sequence databases need to store data in a fairly generalized fashion, they often do not contain the more specialized types of information that would be of interest to specific segments of the biological community. To address this, many smaller, specialized databases have emerged, developed and curated by biologists “in the trenches” to fulfill specific needs. These databases, which contain information ranging from strain crosses to gene expression data, provide a valuable adjunct to the more visible public sequence databases, and users are encouraged to make intelligent use of both types of databases. An annotated list of such databases can be found in the yearly Database issue of Nucleic Acids Research (Rigden and Fernández 2018).

The position of this chapter at the beginning of this book reflects the belief that an understanding of biological databases is the first step toward being able to perform robust and accurate bioinformatic analyses. The reader is very strongly encouraged to take the time to understand the structure of the data found within these databases, as the basis for finding sequence data of interest and performing the more advanced analyses described in the chapters that follow.
