Читать книгу Informatics and Machine Learning - Stephen Winters-Hilt - Страница 50
3.1.5 Information Measures Recap
ОглавлениеThe fundamental information measures are, thus, Shannon entropy, mutual information, and relative entropy (also known as the Kullback–Leibler divergence, especially when in symmetrized form). Shannon entropy, σ = –∑xp(x)log(p(x)), is a measure of the information in distribution p(x). Relative entropy (Kullback–Leibler divergence): ρ = ∑xp(x) log(p(x)/q(x)), is a measure of distance between two probability distributions. MI(X, Y), μ = ∑x∑yp(xy) log(p(xy)/p(x)p(y)), is a measure of information one random variable has about another random variable. As shown above, Mutual Information is a special case of relative entropy. Let us now write code to implement these measures, and then apply them to analysis of genomic data.
The next program, cleverly named prog2.py, will build off the code devised previously, with the file i/o operation now “lifted” into a subroutine for safer encapsulation (to avoid scope errors, etc.) and to avoid the confusing clutter of copying and pasting such a large block of code repeatedly that would be required otherwise. By now, this has hopefully made a convincing case for why subroutines are a big deal in the evolution of software engineering constructs (and the computer languages that implement them). Further discussion is given in the comments in the code.
------------------------ prog2.py --------------------------- #!/usr/bin/python import numpy as np import math import re # from prior code we carry over the subroutines: # shannon, count_to_freq, Shannon_order; # with prototypes: # def shannon( probs ) with usage: # value = shannon(probs) # print(value) # # def count_to_freq( counts ) with usage # probs = count_to_freq(rolls) # print(probs) # # def shannon_order( seq, order ) with usage: # order = 8 # maxcounts = 4**order # print "max counts at order", order, "is =", maxcounts # val = math.log(maxcounts) # shannon_order(sequence,order) # shannon_order prints entropy # print "The max entropy would be log(4**order) = ", val # New code is now created to have subroutines for text handling. # There are two types of text-read, one for genome data in "fasta" # format (gen_fasta_read) and one for generic format (gen_txt_read): def gen_txt_read( text ): if (text == ""): text = "Norwalk_Virus.txt" fo = open(text, "r+") str = fo.read() fo.close() pattern = '[acgt]' result = re.findall(pattern, str) # seqlen = len(result) return result #usage null="" gen_array = gen_txt_read(null) # defaults, uses Norwalk_Virus.txt genome sequence = "" sequence = sequence.join(gen_array) Norwalk_sequence = sequence; seqlen = len(gen_array) print "The sequence length of the Norwalk genome is:", seqlen def gen_fasta_read( text ): if (text == ""): text = "EC_Chr1.fasta.txt" slurp = open(text, 'r') lines = slurp.readlines() sequence = "" for line in lines: pattern = '>' test = re.findall(pattern,line) testlen = len(test) if testlen>0: print "fasta comment:", line.strip() else: sequence = sequence + line.strip() slurp.close() pattern = '[acgtACGT]' result = re.findall(pattern, sequence) return result #usage null="" fasta_array = gen_fasta_read(null) # defalus, uses e.coli genome sequence = "" sequence = sequence.join(fasta_array) EC_sequence = sequence seqlen = len(fasta_array) print "The sequence length of the e. coli Chr1 genome is:", seqlen print "Doing order 8 shannon analysis on e-coli:" order = 6 shannon_order(EC_sequence,order) --------------------- prog2.py end --------------------------
As mentioned previously, when comparing two probability distributions on the same set of outcomes, it is natural to ask if they can be compared in terms of the difference in their scalar‐valued Shannon entropies. Similarly, there is the standard manner of comparing multicomponent features by treating them as points in a manifold and performing the usual Euclidean distance calculation generalized to whatever dimensionality of the feature data. Both of these approaches are wrong, especially the latter, when comparing discrete probability distributions (of the same dimensionality). The reason being, when comparing two discrete probability distributions there are the additional constraint on the probabilities (sum to 1, etc.), and the provably optimal difference measure under these circumstances, as described previously, is relative entropy. This will be explored in Exercise 3.5, so some related subroutines are included in the first addendum to prog2.py:
--------------------- prog2.py addendum 1-------------------- # We can use the Shannon_order subroutine to return a probability array # for a given sequence. Here are the probability arrays on 3-mers # (ordered alphabetically): Prob_Norwalk_3mer = shannon_order(Norwalk_sequence,3) Prob_EC_3mer = shannon_order(EC_sequence,3) # the standard Euclidean distance and relative entropy are given next def eucl_dist_sq ( P , Q ): Pnum = len(P) Qnum = len(Q) if Pnum != Qnum: print "error: Pnum != Qnum" return -1 euclidean_distance:squared = 0 for index in range(0, Pnum): euclidean_distance:squared += (P[index]-Q[index])**2 return euclidean_distance:squared # usage value = eucl_dist_sq(Prob_Norwalk_3mer,Prob_EC_3mer) print "The euclidean distance squared between EC and Nor 3mer probs is", value # P and Q are probability arrays, meaning components are positive definite # and sum to one. If P and Q are proability arrays, can compare them in # terms of relative entropy (not Euclidean distance): def relative_entropy ( P , Q ): Pnum = len(P) Qnum = len(Q) if Pnum != Qnum: print "error: Pnum != Qnum" return -1 rel_entropy = 0 for index in range(0, Pnum): rel_entropy += P[index]*math.log(P[index]/Q[index]) return rel_entropy # usage value1 = relative_entropy(Prob_Norwalk_3mer,Prob_EC_3mer) print "The relative entropy between Nor and EC 3mer probs is", value1 value2 = relative_entropy(Prob_EC_3mer,Prob_Norwalk_3mer) print "The relative entropy between EC and Nor 3mer probs is", value2 sym = (value1+value2)/2 print "The symmetrized relative entropy between EC and Nor 3mer probs is", sym -------------------- prog2.py addendum 1end ------------------
Recall that the definition of mutual information between two random variables, {X, Y} is simply the relative entropy between P(X, Y) “P” and P(X)P(Y) “p”, and this is implemented in addendum #2 to prog2.py:
------------------- prog2.py addendum 2 ---------------------- # the mutual information between P(X,Y) and p(X)p(Y) requires passing two # prob arrays: P and p, where |P| = |p|^2 is relation between # of terms: def mutual_info ( P , p ): Pnum = len(P) pnum = len(p) if Pnum != pnum*pnum: print "error: Pnum != pnum*pnum" return -1 mi = 0 for index in range(0, Pnum): row = index/pnum column = index%pnum mi += P[index]*math.log(P[index]/(p[row]*p[column])) return mi #usage Prob_EC_2mer = shannon_order(EC_sequence,2) Prob_EC_1mer = shannon_order(EC_sequence,1) mutual_info(Prob_EC_2mer,Prob_EC_1mer) ----------------- prog2.py addendum 2 end -------------------