Informatics and Machine Learning - Stephen Winters-Hilt
3.3.1 Ab initio Learning with smORF’s, Holistic Modeling, and Bootstrap Learning
In work on prokaryotic gene prediction (V. cholerae in what follows), a program (smORF) was developed for an extended ORF characterization (to characterize “some more ORFs” with different trinucleotide delimiters than stops). Using that software with a simple start‐of‐coding heuristic, it was possible to establish good gene prediction for ORFs of length greater than 500 nucleotides. The smORF gene identification was used in a bootstrap gene‐annotation process (where no initial training data was provided). Part of the functionality of smORF is encompassed in the prog2.py program described thus far. The strength of the gene identification was then improved by use of a gap‐interpolating Markov model (gIMM, to be described in Section 3.4). Six gIMMs were applied to the identified coding regions (most of the ORFs longer than 500 nucleotides), one for each codon frame in the forward and backward read senses. Rejecting coding regions with poor gIMM scores improved performance, with results slightly better than those of the early Glimmer gene‐prediction software [125], where an interpolating Markov model was used (but not generalized to permit gaps). More recent versions of Glimmer incorporate start‐codon modeling in order to strengthen predictions. One of the benefits of the gap‐interpolating generalization is that it permits regulatory motifs to be identified, particularly those sharing a common positional alignment with the start‐of‐coding region. Using the bootstrap‐identified genes from the smORF‐based gene prediction (including mis‐calls) as a training set permitted an unsupervised search for upstream regulatory structure. The classic Shine‐Dalgarno sequence (the ribosome binding site) was found to be the strongest signal in the 30‐base window upstream from the start codon. Similar results will be found with the full gene‐finder example in Chapter 4.
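To make the idea of an unsupervised upstream search concrete, the following is a minimal sketch that tallies k‐mers in the 30‐base window upstream of each predicted start. It is not the gIMM method (described in Section 3.4); it is a plain k‐mer count swapped in for illustration. The function name `upstream_kmer_counts`, the default `k=6`, and the choice to count each k‐mer at most once per window are assumptions, not details from the text.

```python
from collections import Counter

def upstream_kmer_counts(genome, starts, window=30, k=6):
    """Count, over all predicted starts, how many upstream windows contain each k-mer.

    genome : DNA string (forward strand only, for simplicity)
    starts : 0-based positions of predicted start codons
    window : number of bases upstream of the start to examine
    k      : k-mer length (hypothetical choice for this sketch)
    """
    counts = Counter()
    for s in starts:
        if s < window:
            continue  # not enough upstream sequence; skip this start
        region = genome[s - window:s].upper()
        # Use a set so that a k-mer counts once per window, not once per occurrence.
        seen = {region[i:i + k] for i in range(len(region) - k + 1)}
        counts.update(seen)
    return counts
```

On real predictions, a k‐mer shared by a large fraction of windows (such as a Shine‐Dalgarno‐like word) would rise to the top of `counts.most_common()`. A positional tally would be needed to recover the alignment with the start codon that the text emphasizes.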
Before moving on to more sophisticated gene structure identification (Chapter 4), let us first consider the multi‐frame and two‐strand aspect of the genomic information and what this might mean for the “topology” or overlap placement of coding regions. To recap, smORF offers information about ORFs, and tallies information about other such codon void regions (an ORF is a void in three codons: TAA, TAG, TGA). This allows for a more informed selection process when sampling from a genome, such that non‐overlapping gene starts can be cleanly and unambiguously sampled. Furthermore, overlapping ORF coding regions can be identified and enumerated (see Figures 3.3 and 3.4).
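The notion of an ORF as a void in the three stop codons can be sketched as follows. This is a minimal illustration and not the smORF or prog2.py code; the names `stop_free_voids` and `six_frame_voids` are hypothetical, and coordinates are assumed to be 0‐based with exclusive ends on whichever strand is being scanned. The `delimiters` parameter reflects the smORF idea of scanning for voids in trinucleotide sets other than the stops.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def stop_free_voids(seq, delimiters=STOPS, min_len=0):
    """Return (frame, start, end) for each maximal in-frame run with no delimiter codon.

    start is inclusive, end is exclusive; both are 0-based on the scanned strand.
    Runs shorter than min_len nucleotides are dropped.
    """
    seq = seq.upper()
    voids = []
    for frame in range(3):
        run_start = frame   # start of the current delimiter-free run
        last = frame        # position just past the last complete codon seen
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in delimiters:
                if i > run_start and i - run_start >= min_len:
                    voids.append((frame, run_start, i))
                run_start = i + 3
            last = i + 3
        # Close a run that reaches the end of the sequence without a delimiter.
        if last > run_start and last - run_start >= min_len:
            voids.append((frame, run_start, last))
    return voids

def six_frame_voids(seq, min_len=0):
    """Scan three frames on each strand; '-' coordinates refer to the reverse complement."""
    return {
        "+": stop_free_voids(seq, min_len=min_len),
        "-": stop_free_voids(reverse_complement(seq), min_len=min_len),
    }
```

Raising `min_len` (for example to 500) corresponds to the high fidelity sampling restriction discussed in this section, and passing a different `delimiters` set gives the nonstandard voids.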
The goal with smORF was, initially, to identify key gene structures (e.g. stop codons) and use only the highest confidence examples to train profilers. Once this was done, Markov models (MMs) were (bootstrap) constructed on the suspected start/stop regions and coding/noncoding regions. The algorithm then iterated again, informed with the MM information, and partly relaxed the high fidelity sampling restrictions (essentially, the minimum allowed ORF length was made smaller). A crude gene‐finder was then constructed on the high fidelity ORFs by use of a very simple heuristic: scan from the start of an ORF and stop at the first in‐frame “atg” (to be implemented in Chapter 4). This analysis was applied to the Vibrio cholerae genome (Chr. I). A total of 1253 high fidelity ORFs were identified out of 2775 known genes. The first‐“atg” heuristic provided a gene prediction accuracy of 1154/1253 (92.1% of predictions of gene regions were exactly correct). If small shifts are allowed in the predicted position of the start codon relative to the first “atg” (within 25 bases on either side), then prediction accuracy improves to 1250/1253 (99.8%). This points to a key piece of information needed to improve such a prokaryotic gene‐finder: information is needed to help identify the correct start codon in a 50‐base window around the first ATG. Such information exists in the form of DNA motifs corresponding to the binding footprint of regulatory biomolecules (that play a role in transcriptional or translational control). Further bootstrap refinements along these lines are done in Chapter 4 to produce an ab initio prokaryotic gene finder with 99.9% or better accuracy.
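The first‐“atg” heuristic and the two accuracy measures quoted above can be sketched as follows. This is an illustrative sketch only (the book's own implementation is deferred to Chapter 4); the function names `first_atg_start` and `start_accuracy` are hypothetical, and ORF coordinates are assumed to be 0‐based, in frame, with an exclusive end.

```python
def first_atg_start(seq, orf_start, orf_end):
    """Return the position of the first in-frame ATG inside [orf_start, orf_end), or None."""
    seq = seq.upper()
    for i in range(orf_start, orf_end - 2, 3):   # step codon by codon, staying in frame
        if seq[i:i + 3] == "ATG":
            return i
    return None

def start_accuracy(predicted, annotated, tolerance=0):
    """Fraction of predicted starts within `tolerance` bases of the annotated start.

    tolerance=0 demands an exact match; tolerance=25 allows a shift of up to
    25 bases on either side, i.e. a 50-base window.
    """
    hits = sum(
        1
        for p, a in zip(predicted, annotated)
        if p is not None and abs(p - a) <= tolerance
    )
    return hits / len(annotated)
```

With the counts reported above, exact matching corresponds to 1154/1253 (92.1%) and the 25‐base tolerance to 1250/1253 (99.8%).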
Figure 3.3 (a) Topology index histograms shown for the V. cholerae Chr. I genome, where the x‐axis is the topology index, and the y‐axis shows the event counts (i.e. occurrence of that particular topology index in the genome). The topology index is computed by the following scheme: (i) Initialize the index for all bases in the sequence to zero. (ii) Each base in a forward sense ORF, with length greater than a specified cutoff, is incremented by +10 000 for each such ORF overlap. Similarly, bases in reverse sense ORFs are incremented by +1000 for each such overlap. Voids larger than the cutoff length in the nonstandard smORFs each give rise to an increment of +1. Panel (a) shows that V. cholerae only has a small portion of its genome involved in multiple gene encodings. Panel (b) shows a “blow‐up” of the 10 000 peak.
Figure 3.4 Topology‐index histograms are shown for the Chlamydia trachomatis genome (a) and the Deinococcus radiodurans genome (b). C. trachomatis, like V. cholerae, shows very little overlapping gene structure. D. radiodurans, on the other hand, is dominated by genes that overlap other genes (note the strong 11 000 peak).
Ab initio gene‐finding can identify the stop codons and, thus, (standard) ORFs. A generalization to codon void regions, with all six frame passes, also leads to recognition of different, overlapping, potential gene regions (and then doubled given the two orientations). A genome‐topology scoring as shown in Figure 3.3 can clearly show differences between bacteria (Figure 3.4) – and is thus a possible “fingerprinting” tool.
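The topology‐index scheme given in the caption of Figure 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions: intervals are (start, end) pairs in forward‐strand coordinates, 0‐based with exclusive ends, and the names `topology_index` and `topology_histogram` are hypothetical. The weights 10 000, 1000, and 1 are the ones stated in the caption.

```python
from collections import Counter

def topology_index(genome_len, forward_orfs, reverse_orfs, other_voids, cutoff):
    """Compute the per-base topology index.

    Each interval longer than `cutoff` adds its weight to every base it covers:
    +10 000 for forward-sense ORFs, +1000 for reverse-sense ORFs, and
    +1 for voids in the nonstandard (non-stop) delimiter sets.
    """
    index = [0] * genome_len            # step (i): all bases start at zero
    weighted = (
        (forward_orfs, 10_000),
        (reverse_orfs, 1_000),
        (other_voids, 1),
    )
    for intervals, weight in weighted:  # step (ii): add one increment per overlap
        for start, end in intervals:
            if end - start > cutoff:
                for pos in range(start, end):
                    index[pos] += weight
    return index

def topology_histogram(index):
    """Count how many bases carry each topology-index value."""
    return Counter(index)
```

In such a histogram, a value of 11 000 marks bases covered by one forward and one reverse ORF, which is the peak noted for D. radiodurans in Figure 3.4.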
Prokaryotic genome analysis is similar to both prokaryotic and eukaryotic transcriptome analysis (eukaryotic transcriptome analysis is similar because the introns have been removed). The analysis tools for prokaryotic genomes, described thus far, are primarily what are needed for either prokaryotic or eukaryotic transcriptome analysis. Surprisingly, the same overlapping void topologies, with reverse overlap orientation (“duals”), are seen at transcriptome level in eukaryotes as in prokaryotes. For eukaryotic transcripts with overlaps that are “dual”, however, this has special significance. Recall that a transcript that encodes overlapping read direction “duality” (with regulatory regions intact and lengthy ORF size, so highly likely functional) is only from a single genome‐level pre‐messenger ribonucleic acid (pre‐mRNA) due to intron splicing in eukaryotes. This is a very odd arrangement (artifact) for eukaryotes unless they evolved from an ancient prokaryote, as hypothesized in a number of theories, where such an overlap topology would already be in place to “imprint thru.” The specific nature of this transcriptome artifact, however, is best explained via the viral eukaryogenesis hypothesis (see [1, 3]).