The Feature Table
Early in the collaboration between the INSDC partner organizations, an effort was made to develop a common way to represent the biological information found within a given database record. This common representation is called the feature table. It consists of feature keys (a single word or abbreviation indicating the described biological property), location information denoting where the feature lies within the sequence, and qualifiers providing additional descriptive information about the feature. The online INSDC feature table documentation is extensive and describes in detail which features are allowed and which qualifiers can be used with each feature. Wording within the feature table uses common biological research terminology wherever possible and is consistent between DDBJ, ENA, and GenBank entries.
Here, we will dissect the feature table for the eukaryotic initiation factor 4E (eIF4E) gene from Drosophila melanogaster, shown in its entirety in both Appendices 1.3 (in ENA format) and 1.4 (in DDBJ/GenBank format). This particular sequence is alternatively spliced, producing two distinct gene products, 4E-I and 4E-II. The first block of information in the feature table is always the source feature, which indicates the biological source of the sequence and provides additional information relating to the entire sequence. This feature must be present in all INSDC entries, as all DNA or RNA sequences derive from some specific biological source, including synthetic DNA.
FT   source          1..2881
FT                   /organism="Drosophila melanogaster"
FT                   /chromosome="3"
FT                   /map="67A8-B2"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:7227"
FT   gene            80..2881
FT                   /gene="eIF4E"
In the first line of the source key, notice that the numbering scheme shows the range of positions covered by this feature key as two numbers separated by two dots (1..2881). As the source key pertains to the entire sequence, we can infer that the sequence described in this entry is 2881 nucleotides in length. The various ways in which the location of any given feature can be indicated are shown in Table 1.1, accounting for a wide range of biological scenarios. The qualifiers then follow, each preceded by a slash. The full scientific name of the organism is provided, as are specific mapping coordinates, indicating that this sequence is at map location 67A8-B2 on chromosome 3. Also indicated is the type of molecule that was sequenced (genomic DNA). Finally, the last line indicates a database cross-reference (abbreviated as db_xref) to the NCBI taxonomy database, where taxon 7227 corresponds to D. melanogaster. In general, these cross-references are controlled qualifiers that allow entries to be connected to an external database, using an identifier that is unique to that external database. Following the source block above is the gene feature, indicating that the gene itself is a subset of the entire sequence in this entry, starting at position 80 and ending at position 2881.
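As a rough illustration of how feature keys, locations, and qualifiers relate to one another, the sketch below parses the source and gene lines shown above into plain Python dictionaries. The function name `parse_features` and the dictionary layout are hypothetical choices made for this example. The code also assumes that every location and qualifier fits on a single line, which does not hold for real records, so it should be read as a teaching aid rather than a general-purpose parser.

```python
# Feature table lines copied from the source/gene block above (ENA flat-file style).
FT_LINES = """\
FT   source          1..2881
FT                   /organism="Drosophila melanogaster"
FT                   /chromosome="3"
FT                   /map="67A8-B2"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:7227"
FT   gene            80..2881
FT                   /gene="eIF4E"
""".splitlines()


def parse_features(lines):
    """Group FT lines into dicts holding a key, a location, and qualifiers.

    Assumption: one location or one qualifier per line (no wrapped lines).
    """
    features = []
    for line in lines:
        body = line[2:].strip()          # drop the leading "FT" line tag
        if body.startswith("/"):         # qualifier line: /name="value"
            name, _, value = body[1:].partition("=")
            features[-1]["qualifiers"].setdefault(name, []).append(value.strip('"'))
        else:                            # new feature: key followed by location
            key, location = body.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


features = parse_features(FT_LINES)
source = features[0]

# The source feature covers the whole entry, so its span is the sequence length.
start, end = (int(n) for n in source["location"].split(".."))
sequence_length = end - start + 1
print(source["key"], "covers", sequence_length, "nucleotides")

# A db_xref value has the form database:identifier.
database, identifier = source["qualifiers"]["db_xref"][0].split(":")
print(database, identifier)
```

Storing qualifiers as lists keeps the sketch usable for features that repeat a qualifier, as the CDS feature later in this record does with /db_xref.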
FT   mRNA            join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
FT                   2466..2881)
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-I"
FT   mRNA            join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-II"
Table 1.1 Indicating locations within the feature table.
Location | Description |
345 | Single position within the sequence |
345..500 | A continuous range of positions bounded by and including the indicated positions |
<345..500 | A continuous range of positions, where the exact lower boundary is not known; the feature begins somewhere prior to position 345 but ends at position 500 |
345..>500 | A continuous range of positions, where the exact upper boundary is not known; the feature begins at position 345 but ends somewhere after position 500 |
<1..888 | The feature starts before the first sequenced base and continues to position 888 |
(102.110) | Indicates that the exact location is unknown, but that it is one of the positions between 102 and 110, inclusive |
123^124 | Points to a site between positions 123 and 124 |
123^177 | Points to a site between two adjacent nucleotides or amino acids anywhere between positions 123 and 177 |
join(12..78,134..202) | Regions 12–78 and 134–202 are joined to form one contiguous sequence |
complement(4918..5126) | The sequence complementary to that found from 4918 to 5126 in the sequence record |
J00194:100..202 | Positions 100–202, inclusive, in the entry in this database having accession number J00194 |
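To make the notation in Table 1.1 more concrete, here is a minimal sketch of a parser for a subset of these location strings. The function names `parse_location` and `split_top_level` and the returned dictionary layout are hypothetical and chosen only for illustration. The sketch deliberately ignores the uncertain-position form (102.110) and references to other entries (J00194:100..202), raising an error for them instead.

```python
import re


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    return parts


def parse_location(loc):
    """Parse N, N..M, <N..M, N..>M, N^M, join(...), and complement(...)."""
    loc = loc.strip()
    for operator in ("join", "complement"):
        if loc.startswith(operator + "(") and loc.endswith(")"):
            inner = loc[len(operator) + 1:-1]
            parts = [parse_location(p) for p in split_top_level(inner)]
            return {"type": operator, "parts": parts}
    m = re.fullmatch(r"(<?)(\d+)\.\.(>?)(\d+)", loc)
    if m:
        return {
            "type": "range",
            "start": int(m.group(2)),
            "end": int(m.group(4)),
            "start_open": m.group(1) == "<",   # true lower boundary unknown
            "end_open": m.group(3) == ">",     # true upper boundary unknown
        }
    m = re.fullmatch(r"(\d+)\^(\d+)", loc)
    if m:
        return {"type": "between", "start": int(m.group(1)), "end": int(m.group(2))}
    if loc.isdigit():
        return {"type": "single", "position": int(loc)}
    raise ValueError(f"unsupported location: {loc!r}")


for example in ("345", "<345..500", "123^124",
                "join(12..78,134..202)", "complement(4918..5126)"):
    print(example, "->", parse_location(example))
```

Because `join` and `complement` are handled recursively, nested forms such as complement(join(12..78,134..202)) are also accepted by this sketch.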
The next feature in this example indicates which regions form the two mRNA transcripts for this gene, the first for eukaryotic initiation factor 4E-I and the second for eukaryotic initiation factor 4E-II. In the first case (shown above), the join line indicates that six distinct DNA segments are transcribed to form the mature RNA transcript. In the second case, the second region is missing, with only five distinct DNA segments transcribed into the mature RNA transcript, hence the two splice variants that are ultimately encoded by this molecule.
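The difference between the two transcripts can be checked directly from the two join locations. The following sketch, which uses hypothetical helper names (`join_segments`, `spliced_length`) and handles only simple join(...) strings of plain ranges, counts the segments in each transcript, identifies the segment that is absent from 4E-II, and sums the segment lengths.

```python
MRNA_4E_I = "join(80..224,892..1458,1550..1920,1986..2085,2317..2404,2466..2881)"
MRNA_4E_II = "join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)"


def join_segments(location):
    """Turn 'join(a..b,c..d)' into a list of (start, end) tuples."""
    inner = location[len("join("):-1]
    return [tuple(int(n) for n in seg.split("..")) for seg in inner.split(",")]


def spliced_length(segments):
    """Total number of nucleotides covered by the joined segments."""
    return sum(end - start + 1 for start, end in segments)


exons_i = join_segments(MRNA_4E_I)
exons_ii = join_segments(MRNA_4E_II)

# Segments present in 4E-I but absent from 4E-II.
skipped = [seg for seg in exons_i if seg not in exons_ii]

print(len(exons_i), len(exons_ii))     # 6 5
print(skipped)                         # [(892, 1458)]
print(spliced_length(exons_i), spliced_length(exons_ii))   # 1687 1120
```

The two transcript lengths differ by 567 nucleotides, which is exactly the length of the skipped segment 892..1458.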
FT   CDS             join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
FT                   /codon_start=1
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-II"
FT                   /note="Method: conceptual translation with partial peptide
FT                   sequencing"
FT                   /db_xref="GOA:P48598"
FT                   /db_xref="InterPro:IPR001040"
FT                   /db_xref="InterPro:IPR019770"
FT                   /db_xref="InterPro:IPR023398"
FT                   /db_xref="PDB:4AXG"
FT                   /db_xref="PDB:4UE8"
FT                   /db_xref="PDB:4UE9"
FT                   /db_xref="PDB:4UEA"
FT                   /db_xref="PDB:4UEB"
FT                   /db_xref="PDB:4UEC"
FT                   /db_xref="PDB:5ABU"
FT                   /db_xref="PDB:5ABV"
FT                   /db_xref="PDB:5T47"
FT                   /db_xref="PDB:5T48"
FT                   /db_xref="UniProtKB/Swiss-Prot:P48598"
FT                   /protein_id="AAC03524.1"
FT                   /translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGE
FT                   PAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTVED
FT                   FWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDNLWL
FT                   DVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDALRLGR
FT                   NNSLQYQLHKDTMVKQGSNVKSIYTL"
Following the mRNA feature is the CDS feature shown above, describing the region that ultimately encodes the protein product. Focusing just on eukaryotic initiation factor 4E-II, the CDS feature also shows a join line with coordinates that are slightly different from those shown in the mRNA feature, specifically at the beginning and end positions. The difference lies in the fact that the 5′ and 3′ untranslated regions (UTRs) are included in the mRNA feature but not in the CDS feature. The CDS feature corresponds to the sequence of amino acids found in the translated protein product whose sequence is shown in the /translation qualifier above. The /codon_start qualifier indicates that the amino acid translation of the first codon begins at the first position of this joined region, with no offset.
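The relationship between the mRNA coordinates, the CDS coordinates, and the translated protein can be verified with a little arithmetic. The sketch below uses the 4E-II locations and the /translation value from this record. The helper names are hypothetical, and the comparison assumes that the final codon of the CDS is a stop codon that does not appear in the /translation string.

```python
MRNA = "join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)"
CDS = "join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)"
CODON_START = 1
TRANSLATION = (
    "MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGE"
    "PAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTVED"
    "FWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDNLWL"
    "DVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDALRLGR"
    "NNSLQYQLHKDTMVKQGSNVKSIYTL"
)


def join_segments(location):
    """Turn 'join(a..b,c..d)' into a list of (start, end) tuples."""
    inner = location[len("join("):-1]
    return [tuple(int(n) for n in seg.split("..")) for seg in inner.split(",")]


def spliced_length(segments):
    return sum(end - start + 1 for start, end in segments)


mrna, cds = join_segments(MRNA), join_segments(CDS)

# Untranslated regions: the parts of the mRNA lying outside the CDS.
utr5_length = cds[0][0] - mrna[0][0]        # positions 80..200
utr3_length = mrna[-1][1] - cds[-1][1]      # positions 2630..2881

# /codon_start=1 means translation begins at the first base of the CDS.
frame_offset = CODON_START - 1
coding_length = spliced_length(cds) - frame_offset
codons = coding_length // 3

print("5' UTR:", utr5_length, "nt; 3' UTR:", utr3_length, "nt")
print("CDS:", coding_length, "nt =", codons, "codons")
print("Protein:", len(TRANSLATION), "residues")
```

The CDS length is a multiple of three, and the number of codons exceeds the number of residues by one, which is consistent with the stop-codon assumption stated above.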
The /protein_id qualifier shows the accession number for the corresponding entry in the protein databases (AAC03524.1) and is hyperlinked, enabling the user to go directly to that entry. These unique identifiers use a “3 + 5” format: three letters followed by five digits. The version is indicated by the number after the decimal point; when the protein sequence in the record changes, the version is incremented by one. The assignment of a gene product or protein name (via the /product qualifier) is often subjective, sometimes being based on weak similarities to other (and sometimes poorly annotated) sequences. Given the potential for the transitive propagation of poor annotations (that is, bad data tend to beget more bad data), users are advised to consult curated nucleotide and protein sequence databases for the most up-to-date, accurate information regarding the putative function of a given sequence. Finally, notice the extensive cross-referencing via the /db_xref qualifier to entries in InterPro, the Protein Data Bank (PDB), and UniProtKB/Swiss-Prot, as well as to a Gene Ontology annotation (GOA; Gene Ontology Consortium 2017).
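As a small illustration of how these identifiers are structured, the sketch below splits the /protein_id value into its accession and version and groups the /db_xref values from this record by external database. The regular expression covers only the “3 + 5” format described here and is an assumption of this example, not a complete description of every identifier format in use.

```python
import re
from collections import defaultdict

PROTEIN_ID = "AAC03524.1"
DB_XREFS = [
    "GOA:P48598",
    "InterPro:IPR001040", "InterPro:IPR019770", "InterPro:IPR023398",
    "PDB:4AXG", "PDB:4UE8", "PDB:4UE9", "PDB:4UEA", "PDB:4UEB",
    "PDB:4UEC", "PDB:5ABU", "PDB:5ABV", "PDB:5T47", "PDB:5T48",
    "UniProtKB/Swiss-Prot:P48598",
]

# Three letters, five digits, then ".version" (the "3 + 5" format only).
match = re.fullmatch(r"([A-Z]{3}\d{5})\.(\d+)", PROTEIN_ID)
accession, version = match.group(1), int(match.group(2))
print(accession, "version", version)

# Each cross-reference has the form database:identifier.
by_database = defaultdict(list)
for xref in DB_XREFS:
    database, identifier = xref.split(":", 1)
    by_database[database].append(identifier)

for database, identifiers in by_database.items():
    print(database, len(identifiers), identifiers)
```

Splitting on the first colon only keeps the database name and the identifier intact even if an identifier itself contains a colon.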
Implicit in the source feature and the organism that is assigned to it is the genetic code used to translate the nucleic acid sequence into a protein sequence when a CDS feature is present in the record. Also, the DNA-centric nature of these feature tables means that all features are mapped through a DNA coordinate system, not that of amino acid reference points, as shown in the examples in Appendices 1.3 and 1.4.
SQ   Sequence 2881 BP; 849 A; 699 C; 585 G; 748 T; 0 other;
     cggttgcttg ggttttataa catcagtcag tgacaggcat ttccagagtt gccctgttca        60
     acaatcgata gctgcctttg gccaccaaaa tcccaaactt aattaaagaa ttaaataatt       120
     cgaataataa ttaagcccag taacctacgc agcttgagtg cgtaaccgat atctagtata       180
     .
     .<truncated for brevity>
     .
     aaacggaacc ccctttgtta tcaaaaatcg gcataatata aaatctatcc gctttttgta      2820
     gtcactgtca ataatggatt agacggaaaa gtatattaat aaaaacctac attaaaaccg      2880
     g                                                                      2881
//
Finally, at the end of every nucleotide sequence record, one finds the actual nucleotide sequence, with 60 bases per row. Note that, in the SQ line signaling the beginning of this section of the record, not only is the overall length of the sequence provided, but a count of how many of each individual type of nucleotide base is also provided, making it quite easy to compute the GC content of this sequence.
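For instance, the GC content can be derived from the SQ line alone, without reading the sequence itself. The sketch below is one possible way to do this; the regular expressions are written for the SQ line of this record and are an assumption about its layout rather than a full specification of the format.

```python
import re

SQ_LINE = "SQ   Sequence 2881 BP; 849 A; 699 C; 585 G; 748 T; 0 other;"

# Overall sequence length, as stated on the SQ line.
length = int(re.search(r"Sequence (\d+) BP", SQ_LINE).group(1))

# Per-base counts, e.g. {"A": 849, "C": 699, "G": 585, "T": 748, "other": 0}.
counts = {base: int(n) for n, base in re.findall(r"(\d+) ([ACGT]|other)\b", SQ_LINE)}

# Sanity check: the individual counts should add up to the stated length.
assert sum(counts.values()) == length

gc_content = (counts["G"] + counts["C"]) / length
print(f"GC content: {gc_content:.1%}")   # GC content: 44.6%
```

Here 699 C and 585 G give 1284 G+C bases out of 2881, or roughly 44.6%.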