
Box 1.1 Functional Divisions in Nucleotide Databases

The organization of nucleotide sequence records into discrete functional types provides a way for users to query specific subsets of the records within these databases. In addition, knowledge that a particular sequence is from a given technique-oriented database allows users to interpret the data from the proper biological point of view. Several of these divisions are described below. Examples of each functional division (called “data classes” by ENA) can be found by following the example links on the ENA Data Formats page, which is given in the Internet Resources section of this chapter.

CON	Constructed (or “contigged”) records of chromosomes, genomes, and other long DNA sequences resulting from whole-genome sequencing efforts. The records in this division do not contain sequence data; rather, they contain instructions for the assembly of sequence data found within multiple database records.
EST	Expressed Sequence Tags. These records contain short (300–500 bp) single reads from mRNA (cDNA) that are usually produced in large numbers. ESTs represent a snapshot of what is expressed in a given tissue or at a given developmental stage. They represent tags – some coding, some not – of expression for a given cDNA library.
GSS	Genome Survey Sequences. Similar to the EST division, except that the sequences are genomic in origin. The GSS division contains (but is not limited to) single-pass read genome survey sequences, bacterial artificial chromosome (BAC) or yeast artificial chromosome (YAC) ends, exon-trapped genomic sequences, and Alu polymerase chain reaction (PCR) sequences.
HTG	High-Throughput Genome sequences. Unfinished DNA sequences generated by high-throughput sequencing centers, made available in an expedited fashion to the scientific community for homology and similarity searches. Entries in this division contain keywords indicating their phase within the sequencing process. Once finished, HTG sequences are moved into the appropriate database taxonomic division.
STD	A record containing a standard, annotated, and assembled sequence.
STS	Sequence-Tagged Sites. Short (200–500 bp) operationally unique sequences that identify a combination of primer pairs used in a PCR assay, generating a reagent that maps to a single position within the genome. The STS division is intended to facilitate cross-comparison of STSs with sequences in other divisions for the purpose of correlating map positions of anonymous sequences with known genes.
WGS	Whole-Genome Shotgun sequences. Sequence data from projects using shotgun approaches that generate large numbers of short sequence reads that can then be assembled by computer algorithms into sequence contigs, higher-order scaffolds, and sometimes into near-chromosome- or chromosome-length sequences.

Following the ID line are one or more date lines (denoted by DT), indicating when the entry was first created or last updated. For our sequence of interest, the entry was originally created on May 19, 1996 and was last updated in ENA on June 23, 2017:

DT   19-MAY-1996 (Rel. 47, Created)
DT   23-JUN-2017 (Rel. 133, Last updated, Version 5)

The release number in each line indicates the first quarterly release made after the entry was created or last updated. The version number for the entry appears on the second line and allows the user to determine easily whether they are looking at the most up-to-date record for a particular sequence. Please note that this is different from the accession.version format described above – while some element of the record may have changed, the sequence may have remained the same, so these two different types of version numbers may not always correspond to one another.
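As an illustration of how the fields in a date line can be pulled apart, the following is a minimal Python sketch. It assumes that DT lines always follow the two layouts shown above; the function name `parse_dt_line` and the dictionary keys are hypothetical and are not part of any ENA tool.

```python
import re
from datetime import date

# Month abbreviations as they appear in DT lines (upper case).
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}

# Assumed layout: "DT   DD-MON-YYYY (Rel. N, Created)" or
# "DT   DD-MON-YYYY (Rel. N, Last updated, Version V)".
DT_LINE = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) "
    r"\(Rel\. (\d+), (Created|Last updated)(?:, Version (\d+))?\)$"
)


def parse_dt_line(line):
    """Split one DT line into date, release number, event, and entry version."""
    match = DT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognizable DT line: {line!r}")
    day, month, year, release, event, version = match.groups()
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "release": int(release),
        "event": event,
        # Only the "Last updated" line carries an entry version.
        "version": int(version) if version else None,
    }


created = parse_dt_line("DT   19-MAY-1996 (Rel. 47, Created)")
updated = parse_dt_line("DT   23-JUN-2017 (Rel. 133, Last updated, Version 5)")
print(created["date"].isoformat(), updated["date"].isoformat(), updated["version"])
```

The entry version returned here is the one reported on the DT line, which, as noted above, need not match the sequence version in the accession.version identifier.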

The next part of the header contains the definition lines, providing a succinct description of the kinds of biological information contained within the record. The definition line (DE in ENA, DEFINITION in DDBJ/GenBank) takes the following form.

DE   Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene,
DE   complete cds, alternatively spliced.

Much care is taken in the generation of these definition lines and, although many of them can be generated automatically from other parts of the record, they are reviewed to ensure that consistency and richness of information are maintained. Obviously, it is quite impossible to capture all of the biology underlying a sequence in a single line of text, but that wealth of information will follow soon enough in downstream parts of the same record.
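Because a definition can wrap across several DE lines, a reader processing the flatfile usually wants it back as a single string. The sketch below shows one way to do that in Python, assuming every continuation line repeats the DE line code; the helper name `join_de_lines` is hypothetical.

```python
def join_de_lines(lines):
    """Concatenate the text of all DE lines into one description string."""
    parts = []
    for line in lines:
        if line.startswith("DE"):
            # Drop the two-letter line code and the padding that follows it.
            parts.append(line[2:].strip())
    return " ".join(parts)


header = [
    "DE   Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene,",
    "DE   complete cds, alternatively spliced.",
]
definition = join_de_lines(header)
print(definition)
```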

Continuing down the flatfile record, one finds the full taxonomic information on the sequence of interest. The OS line (or SOURCE line in DDBJ/GenBank) provides the preferred scientific name of the organism from which the sequence was derived, followed by the common name of the organism in parentheses. The OC lines (or ORGANISM lines in DDBJ/GenBank) contain the complete taxonomic classification of the source organism. The classification is listed top-down, as nodes in a taxonomic tree, with the most general grouping (Eukaryota) given first.

OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila; Sophophora.
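The top-down ordering of the classification makes it straightforward to turn the OS and OC lines into a scientific name, a common name, and an ordered lineage. The following Python sketch does this under the assumption that the common name, when present, is the final parenthesized part of the OS line and that taxa on the OC lines are separated by semicolons and terminated by a period; the function name `parse_taxonomy` is hypothetical.

```python
import re


def parse_taxonomy(lines):
    """Extract scientific name, common name, and lineage from OS/OC lines."""
    os_text = " ".join(l[2:].strip() for l in lines if l.startswith("OS"))
    oc_text = " ".join(l[2:].strip() for l in lines if l.startswith("OC"))

    # Scientific name, optionally followed by "(common name)".
    match = re.match(r"^(?P<sci>.+?)(?:\s+\((?P<common>[^()]*)\))?$", os_text)
    scientific = match["sci"] if match else None
    common = match["common"] if match else None

    # Taxa are listed from the most general grouping to the most specific.
    lineage = [t.strip() for t in oc_text.rstrip(".").split(";") if t.strip()]
    return {"scientific_name": scientific, "common_name": common, "lineage": lineage}


record = [
    "OS   Drosophila melanogaster (fruit fly)",
    "OC   Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;",
    "OC   Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;",
    "OC   Drosophilidae; Drosophila; Sophophora.",
]
taxonomy = parse_taxonomy(record)
print(taxonomy["scientific_name"], "|", taxonomy["common_name"])
print(" > ".join(taxonomy["lineage"]))
```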

Each record must have at least one reference or citation, noted within what are called reference blocks. These reference blocks offer scientific credit and set a context explaining why this particular sequence was determined. The reference blocks take the following form.

RN   [1]
RP   1-2881
RX   DOI; 10.1074/jbc.271.27.16393.
RX   PUBMED; 8663200.
RA   Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;
RT   "Alternatively spliced transcripts from the Drosophila eIF4E gene produce
RT   two different Cap-binding proteins";
RL   J Biol Chem 271(27):16393-16398(1996).
XX
RN   [2]
RP   1-2881
RA   Lasko P.F.;
RT   ;
RL   Submitted (09-APR-1996) to the INSDC.
RL   Paul F. Lasko, Biology, McGill University, 1205 Avenue Docteur Penfield,
RL   Montreal, QC H3A 1B1, Canada

In this case, two references are shown, one referring to a published paper and the other referring to the submission of the sequence record itself. In the example above, the second block provides information on the senior author of the paper listed in the first block, as well as the author's postal address. While the date shown in the second block indicates when the sequence (and accompanying information) was submitted to the database, it does not indicate when the record was first made public, so this date cannot support any inference or claim about first public release. Additional submitter blocks may be added to the record each time the sequence is updated.
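The distinction between a published reference and a submission block can be made programmatically. The sketch below groups reference lines into blocks in Python and flags submissions, assuming that each block opens with an RN line and that a submission block is recognizable by an RL line beginning with "Submitted"; the function name `parse_references` and the abbreviated sample input are illustrative only.

```python
from collections import defaultdict


def parse_references(lines):
    """Group R* lines into one dictionary per reference block."""
    blocks = []
    current = None
    for line in lines:
        code, text = line[:2], line[2:].strip()
        if code == "RN":
            # An RN line opens a new reference block.
            current = defaultdict(list)
            blocks.append(current)
        if current is not None and code.startswith("R"):
            current[code].append(text)
    # Join wrapped lines that share a line code (for example, a long title).
    return [{code: " ".join(texts) for code, texts in b.items()} for b in blocks]


sample = [
    "RN   [1]",
    "RP   1-2881",
    "RX   PUBMED; 8663200.",
    "RA   Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;",
    'RT   "Alternatively spliced transcripts from the Drosophila eIF4E gene produce',
    'RT   two different Cap-binding proteins";',
    "RL   J Biol Chem 271(27):16393-16398(1996).",
    "XX",
    "RN   [2]",
    "RP   1-2881",
    "RA   Lasko P.F.;",
    "RT   ;",
    "RL   Submitted (09-APR-1996) to the INSDC.",
]
references = parse_references(sample)
for ref in references:
    kind = "submission" if ref["RL"].startswith("Submitted") else "publication"
    print(ref["RN"], kind)
```

Keeping the raw text under each line code, rather than parsing authors or journal details further, avoids guessing at conventions that vary between records.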

Some headers may contain COMMENT (DDBJ/GenBank) or CC (ENA) lines. These lines can include a great variety of notes and comments (descriptors) that refer to the entire record. Often, genome centers will use these lines to provide contact information and to confer acknowledgments. Comments also may include the history of the sequence. If the sequence of a particular record is updated, the comment will contain a pointer to the previous versions of the record. Alternatively, if an earlier version of the record is retrieved, the comment will point forward to the newer version, as well as backwards, if there was a still earlier version. Finally, there are database cross-reference lines (marked DR) that provide links to allied databases containing information related to the sequence of interest. Here, a cross-reference to FlyBase can be seen in the complete header for this record in Appendix 1.1. Note that the corresponding DDBJ/GenBank header in Appendix 1.2 does not contain these cross-references.
