Читать книгу Informatics and Machine Learning - Stephen Winters-Hilt - Страница 51

3.2 Codon Discovery from Mutual Information Anomaly

As mentioned previously, mutual information allows statistical linkages to be discovered that are not otherwise apparent. Consider the mutual information between nucleotides in genomic data when different gap sizes are considered between the nucleotides as shown in Figure 3.1a. When the MI for different gap sizes is evaluated (see Figure 3.1b), a highly anomalous long‐range statistical linkage is seen, consistent with a three‐element encoding scheme (the codon structure is thereby revealed) [1, 3].

To do the computation for Figure 3.1, the next subroutine (prog2.py addendum 3) is a recycled version of the code for the dinucleotide counter described previously (prog1.py addendum 6), except that now the two bases in the sample window have a fixed gap size between them of the indicated size. Before, with no gap, the gapsize was zero.

------------------- prog2.py addendum 3 --------------------- def gap_dinucleotide( seq, gap ): stats = {} pattern = '[acgtACGT]' result = re.findall(pattern, seq) seqlen = len(seq) probs = np.empty((0)) if (1+gap>seqlen): print "error, gap > seqlen" return probs for index in range(1+gap,seqlen): xmer = result[index-1-gap]+result[index] if xmer in stats: stats[xmer]+=1 else: stats[xmer]=1 counts = np.empty((0)) for i in sorted(stats): counts = np.append(counts,stats[i]+0.0) probs = count_to_freq(counts) return probs # usage: for i in range(0,25): gap_probs = gap_dinucleotide(EC_sequence,i) val=mutual_info(gap_probs,Prob_EC_1mer) print "Gap", i, "has mi=", val ----------------- prog2.py addendum 3 end --------------------

Figure 3.1 Codon structure is revealed in the V. cholera genome by mutual information between nucleotides in the genomic sequence when evaluated for different gap sizes.

In Figure 3.1, at first the mutual information falls off as we look at statistical linkages at greater and greater distance, which makes sense for any information construct that is “local,” e.g. it should have less linkage with structure further away. After a certain point, however, the mutual information no longer falls off, instead cycling back to a certain level of mutual information with a cycle period of three bases. This suggests that a long‐range three‐element encoding scheme might exist (among other things), which can easily be tested. In doing so we ask Nature “the right question” and the answer is the rediscovery of the codon encoding scheme, as will be shown in what follows.

So, to clarify before proceeding, suppose we want to get information on a three‐element encoding scheme for the Escherichia coli genome (Chromosome 1), say, in file EC_Chr1.fasta.txt. We, therefore, want an order = 3 oligo counting, but on 3‐element windows seen “stepping” across the genome, e.g. a “stepping” window for sampling, not a sliding window, resulting in three choices on stepping, or framing, according to how you take your first step:

case 0: agttagcgcgt ‐‐> (agt)(tag)(cgc)gt

case 1: agttagcgcgt ‐‐> a(gtt)(agc)(gcg)t

case 2: agttagcgcgt ‐‐> ag(tta)(gcg)(cgt)

In the code that follows we get codon counts for a particular frame‐pass (prog2.py addendum 4):

------------------- prog2.py addendum 4 --------------------- # so suspect existence of three-element coding scheme, the codon, # so need stats (anomolous) on codons.... # 'frame' specifies the frame pass as case 0, 1, or 2 in the text. # getting codon counts for a specified framing and specified sequence # will now be shown in two ways, one built from re-use of code blocks, # one from re-use of an entire subroutine: def codon_counter ( seq, frame ): codon_counts = {} pattern = '[acgtACGT]' result = re.findall(pattern, seq) seqlen = len(seq) # probs = np.empty((0)) for index in range(frame,seqlen-2): if (index+3-frame)%3!=0: continue codon = result[index]+result[index+1]+result[index+2] if codon in codon_counts: codon_counts[codon]+=1 else: codon_counts[codon]=1 counts = np.empty((0)) for i in sorted(codon_counts): counts = np.append(counts,codon_counts[i]+0.0) print "codon", i, "count =", codon_counts[i] probs = count_to_freq(counts) return probs codon_counter(EC_sequence,0) # could also get codon counts by shannon_order with modification to step # (and have order=3 for codon: def shannon_codon( seq, frame ): order=3 stats = {} pattern = '[acgtACGT]' result = re.findall(pattern, seq) seqlen = len(seq) for index in range(order-1+frame,seqlen): if index%3!=2: continue xmer = "" for xmeri in range(0,order): xmer+=result[index-(order-1)+frame+xmeri] if xmer in stats: stats[xmer]+=1 else: stats[xmer]=1 for i in sorted(stats): print("%d %s" % (stats[i],i)) counts = np.empty((0)) for i in sorted(stats): counts = np.append(counts,stats[i]+0.0) probs = count_to_freq(counts) return probs shannon_codon(EC_sequence,0) ---------------- prog2.py addendum 4 end -------------------

In running prog2.py addendum 4 we find that the codon “tag” has much lower counts, and similarly for the codon “cta”:

frame 0 have tag with 8970 and cta with 8916

frame 1 have tag with 9407 and cta with 8821

frame 2 have tag with 8877 and cta with 9033

The tag and cta trinucleotides happen to be related – they are reverse compliments of each other (the first hint of information encoding via duplex deoxyribonucleic acid (DNA) with Watson–Crick base‐pairing). There are two other notably rare codons: taa and tga (and their reverse compliment in this all‐frame genome‐wide study as well).

Now that we have identified an interesting feature, such as “tag,” it is reasonable to ask about this feature’s placement across the genome. Having done that, the follow‐up is to identify any anomalously recurring feature proximate to the feature of interest. Such an analysis would need a generic subroutine for getting counts on sub‐strings of indicated order on an indicated reference, to genome sequence data, and that is provided next as an addendum #5 to prog2.py.

-------------------- prog2.py addendum 5 -------------------- # see that 'tag' is anomolous, want to get sense of the distribution on # gaps between 'tag' (still satepping 'in-frame'). def codon_gap_counter ( seq, frame, delimiter ): counts = {} pattern = '[acgtACGT]' result = re.findall(pattern, seq) seqlen = len(seq) # probs = np.empty((0)) oldindex=0 for index in range(frame,seqlen-2): if (index+3-frame)%3!=0: continue codon = result[index]+result[index+1]+result[index+2] if codon!=delimiter: continue else: gap = index - oldindex quant = 100 bin = gap/quant if oldindex!=0: if bin in counts: counts[bin]+=1 else: counts[bin]=1 oldindex=index npcounts = np.empty((0)) for i in sorted(counts): npcounts = np.append(npcounts,counts[i]+0.0) print "gapbin", i, "count =", counts[i] probs = count_to_freq(npcounts) return probs # usage: delimiters = ("AAA","AAC","AAG","AAT","ACA","ACC","ACG","ACT", "AGA","AGC","AGG","AGT","ATA","ATC","ATG","ATT", "CAA","CAC","CAG","CAT","CCA","CCC","CCG","CCT", "CGA","CGC","CGG","CGT","CTA","CTC","CTG","CTT", "GAA","GAC","GAG","GAT","GCA","GCC","GCG","GCT", "GGA","GGC","GGG","GGT","GTA","GTC","GTG","GTT", "TAA","TAC","TAG","TAT","TCA","TCC","TCG","TCT", "TGA","TGC","TGG","TGT","TTA","TTC","TTG","TTT") for delimiter in delimiters: print "\n\ndelimiter is", delimiter codon_gap_counter(EC_sequence,0,delimiter) ----------------- prog2.py addendum 5 end ------------------

Upon running the above code with codon delimiter set to “tag,” we arrive at Table 3.1, which shows the distribution on (tag) gap sizes. Bin size is 100. So gap bin 0 has the count on all gaps seen sized anywhere from 1 to 99. Bin 1 has counts on occurrences of gaps in the domain 100–199, etc.

Table 3.1 (tag) Gap sizes, with bin size 100.

Gap bin	Count
0	2115
1	1428
2	1066
3	829
4	696
5	484
6	399
7	293
8	241
9	222

Table 3.2 (aaa) Gap sizes, with bin size 100.

Gap bin	Count
0	21 256
1	7843
2	3375
3	1665
4	827
5	480
6	287
7	163
8	86
9	70

In order to see how strongly the (tag) distribution is skewed, we consider some other codon to evaluate, such as for the “aaa” gap, where aaa is most common. The aaa gaps, shown in Table 3.2, tend to be much smaller, with a standard exponential distribution fall‐off indicative of no long‐range encoding linkages:

Thus the codon tag is clearly very different from aaa, it is as if tag roughly marks the boundaries of regions, and aaa is just scattered throughout. Are any other codons similar to tag? The frequency analysis blurs counts so more subtle differences not as obvious have to run gap counter for each to directly see, and how to easily “see”? Notice how the tag distribution has a long tail. The gap bins only go to 9 in the figures, but for the full dataset the last nonzero gap bin for “tag” is at a remarkable bin 70. For “aaa” the last nonzero bin is much earlier, at bin number 23 (even though there are 10 times as many (aaa) codons as (tag) codons). For “taa” the last nonzero bin is at 60, while for tga it is at 53. The codons taa, tga, and tag are known as the stop codons and the gaps between them are known as ORFs, or ORFs. A subtlety in the statistical analysis is that the stop codons do not have to match to define such anomalously large regions (according to observation). Thus, a biochemical encoding scheme must exist that works with any of the three stop codons seen as equivalent, thus the naming for this group as “stop” codons (and their grouping as such in Figure 3.2). For more nuances of the naming convention “stop” codon when delving into the encoding biochemistry see [1, 3].

Figure 3.2 ORF encoding structure is revealed in the V. cholera genome by gaps between stop codons in the genomic sequence. X‐axis shows the size of the gap in codon count between reference codons (stops for conventional ORFs, or 3com set for comparisons in table), Y‐axis shows the counts.

Подняться наверх