3.5 Exercises
3.1 In Section 3.1, the Maximum Entropy Principle is introduced. Using the Lagrangian formalism, find the solution that maximizes the Shannon entropy subject to the constraint that the probabilities sum to one.
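A hedged sketch of the setup for Exercise 3.1, in LaTeX notation (the term‐by‐term maximization over the p_i is left to the reader):

\mathcal{L}(\{p_i\},\lambda) \;=\; -\sum_i p_i \ln p_i \;+\; \lambda\Bigl(\sum_i p_i - 1\Bigr),
\qquad
\frac{\partial \mathcal{L}}{\partial p_i} \;=\; -\ln p_i - 1 + \lambda \;=\; 0
\;\;\Longrightarrow\;\; p_i = e^{\lambda-1},

the same value for every i, i.e. the uniform distribution once λ is fixed by the normalization constraint.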
3.2 Repeat the Lagrangian optimization of (Exercise 3.1) subject to the added constraint that there is a mean value, E(X) = μ.
3.3 Repeat the Lagrangian optimization of (Exercise 3.2) subject to the added constraint that there is a variance value, Var(X) = E(X²) − (E(X))² = σ².
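For Exercises 3.2 and 3.3, each added constraint brings its own multiplier; a sketch of the setup in LaTeX notation, writing the variance constraint as E[(X − μ)²] = σ² (equivalent to the form above once the mean constraint holds):

\mathcal{L} \;=\; -\sum_i p_i \ln p_i
+ \lambda_0\Bigl(\sum_i p_i - 1\Bigr)
+ \lambda_1\Bigl(\sum_i x_i\,p_i - \mu\Bigr)
+ \lambda_2\Bigl(\sum_i (x_i-\mu)^2\,p_i - \sigma^2\Bigr),
\qquad
\frac{\partial\mathcal{L}}{\partial p_i} = 0
\;\Longrightarrow\;
p_i \;\propto\; e^{\lambda_1 x_i + \lambda_2 (x_i-\mu)^2}.

Dropping the λ₂ term recovers the mean‐only case of Exercise 3.2 (an exponential form); keeping it, with λ₂ < 0, gives the Gaussian form for Exercise 3.3.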
3.4 Using the two‐die roll probabilities from (Exercise 2.3), compute the mutual information between the two dice using the relative entropy form of the definition. Compare to the pure Shannon definition: MI(X, Y) = H(X) + H(Y) – H(X, Y).
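For Exercise 3.4, a minimal Python sketch of both forms of the mutual information; the joint table below is an independent‐uniform placeholder and should be replaced by the two‐die roll probabilities of (Exercise 2.3).

import math

# joint[(x, y)] = P(X = x, Y = y); placeholder values only (must sum to 1).
# Substitute the Exercise 2.3 two-die probabilities here.
joint = {(x, y): 1.0 / 36 for x in range(1, 7) for y in range(1, 7)}

# Marginal distributions P(X) and P(Y)
px = {x: sum(p for (i, _), p in joint.items() if i == x) for x in range(1, 7)}
py = {y: sum(p for (_, j), p in joint.items() if j == y) for y in range(1, 7)}

def H(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Relative entropy form: MI = D( P(X,Y) || P(X)P(Y) )
mi_rel = sum(p * math.log2(p / (px[x] * py[y]))
             for (x, y), p in joint.items() if p > 0)

# Pure Shannon form: MI = H(X) + H(Y) - H(X,Y)
mi_shannon = H(px) + H(py) - H(joint)

print(mi_rel, mi_shannon)  # the two forms should agree

For the placeholder (independent, uniform dice) both forms give zero, as expected.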
3.5 Go to GenBank (https://www.ncbi.nlm.nih.gov/genbank) and select the genomes of three medium‐sized bacteria (~1 Mb), where two of the bacteria are closely related. Using the Python code shown in Section 2.1, determine their hexamer frequencies (as in Exercise 2.5 with virus genomes). What is the Shannon entropy of the hexamer frequencies for each of the three bacterial genomes? Consider the following three ways to evaluate distances between the genome hexamer‐frequency profiles (denoted Freq(genome1), etc.), try each, and evaluate their performance at revealing the “known” (that two of the bacteria are closely related):
(i) distance = Shannon difference = |H(Freq(genome1)) − H(Freq(genome2))|.
(ii) distance = Euclidean distance = d(Freq(genome1), Freq(genome2)).
(iii) distance = symmetrized relative entropy = [D(Freq(genome1)||Freq(genome2)) + D(Freq(genome2)||Freq(genome1))]/2.
Which distance measure provides the clearest identification of phylogenetic relationship? Typically it should be (iii).
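For Exercise 3.5, a minimal Python sketch, assuming the three genomes have been downloaded as local single‐record FASTA files (the filenames bacterium1.fasta, etc., are hypothetical placeholders); the hexamer counting stands in for the Section 2.1 code.

import math
from collections import Counter

def read_fasta(path):
    """Concatenate the sequence lines of a single-record FASTA file."""
    with open(path) as f:
        return ''.join(line.strip() for line in f if not line.startswith('>')).upper()

def hexamer_freqs(seq):
    """Sliding-window hexamer counts, normalized to a probability distribution."""
    counts = Counter(seq[i:i+6] for i in range(len(seq) - 5))
    counts = {k: v for k, v in counts.items() if set(k) <= set('ACGT')}
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def shannon(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def euclidean(p, q):
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0))**2 for k in keys))

def rel_entropy(p, q, eps=1e-12):
    """D(p||q); eps guards against hexamers absent from q."""
    return sum(v * math.log2(v / max(q.get(k, 0.0), eps)) for k, v in p.items() if v > 0)

def sym_rel_entropy(p, q):
    return 0.5 * (rel_entropy(p, q) + rel_entropy(q, p))

# Hypothetical local filenames for the three downloaded genomes
genomes = {name: hexamer_freqs(read_fasta(name + '.fasta'))
           for name in ('bacterium1', 'bacterium2', 'bacterium3')}

for name, freqs in genomes.items():
    print(name, 'H =', shannon(freqs))

names = list(genomes)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        p, q = genomes[names[i]], genomes[names[j]]
        print(names[i], names[j],
              '(i) dH =', abs(shannon(p) - shannon(q)),
              '(ii) Euclid =', euclidean(p, q),
              '(iii) symD =', sym_rel_entropy(p, q))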
3.6 Exercise 3.5, if done repeatedly, will eventually reveal that the best distance measure (between distributions) is the symmetrized relative entropy (case (iii)). Notice that this means that when comparing two distributions we quantify their difference not by a difference of their Shannon entropies (case (i)). In other words, we choose
Difference(X, Y) = MI(X, Y) = H(X) + H(Y) – H(X, Y),
not
Difference = ∣H(X) – H(Y)∣.
The latter satisfies the metric properties, including the triangle inequality, required of a “distance” measure; is this true for the mutual information difference as well?
3.7 Go to GenBank (https://www.ncbi.nlm.nih.gov/genbank) and select the genome of the K‐12 strain of E. coli. (The K‐12 strain was obtained from the stool sample of a diphtheria patient in Palo Alto, CA, in 1922, so that seems like a good one.) Reproduce the MI codon discovery described in Figure 3.1.
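For Exercise 3.7, a hedged sketch: one common way to expose codon structure is to compute the mutual information between bases separated by a gap g and look for a period‐3 signal. This is an assumed reading of the Figure 3.1 analysis, not necessarily its exact method, and the filename ecoli_k12.fasta is a placeholder.

import math
from collections import Counter

def read_fasta(path):
    with open(path) as f:
        return ''.join(line.strip() for line in f if not line.startswith('>')).upper()

def gapped_mi(seq, gap):
    """MI (in bits) between the base at position i and the base at position i + gap."""
    pairs = [(seq[i], seq[i + gap]) for i in range(len(seq) - gap)
             if seq[i] in 'ACGT' and seq[i + gap] in 'ACGT']
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

seq = read_fasta('ecoli_k12.fasta')  # hypothetical local filename
for gap in range(1, 25):
    print(gap, gapped_mi(seq, gap))  # look for peaks repeating with period 3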
3.8 Using the E. coli genome (the one described above) and the codon counter code, get the frequency of occurrence of the 64 different codons genome‐wide (without even restricting to coding regions or to a particular “framing,” since these are still unknown, initially, in an ab initio analysis). This should reveal oddly low counts for what will turn out to be the “stop” codons.
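For Exercise 3.8, a stand‐in sketch for the codon counter code the exercise refers to: it counts all overlapping 3‐mers, which is equivalent to counting all three framings at once. The filename is again a placeholder.

from collections import Counter

def read_fasta(path):
    with open(path) as f:
        return ''.join(line.strip() for line in f if not line.startswith('>')).upper()

seq = read_fasta('ecoli_k12.fasta')  # hypothetical local filename

# Genome-wide counts of all 64 codons over every offset (all three framings)
codons = Counter(seq[i:i+3] for i in range(len(seq) - 2))
codons = {c: n for c, n in codons.items() if set(c) <= set('ACGT')}

total = sum(codons.values())
for codon, n in sorted(codons.items(), key=lambda kv: kv[1]):
    print(codon, n, n / total)
# Per the exercise, the low end of this sorted list is where the stop codons
# (TAA, TAG, TGA) should show their oddly low counts.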
3.9 Using the code examples in which stop codons are used to mark boundaries, identify long ORF regions in the E. coli genome (the one described in Exercise 3.7). Produce an ORF length histogram like that shown in Figure 3.2.
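For Exercise 3.9, a minimal sketch that uses stop codons to mark ORF boundaries on one strand in each of the three frames; this is a stand‐in for the book's code examples, and the filename is a placeholder.

from collections import Counter

STOPS = {'TAA', 'TAG', 'TGA'}

def read_fasta(path):
    with open(path) as f:
        return ''.join(line.strip() for line in f if not line.startswith('>')).upper()

def orf_lengths(seq):
    """Lengths (in codons) of stop-to-stop open reading frames, over all 3 frames."""
    lengths = []
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i+3] in STOPS:
                lengths.append(run)
                run = 0
            else:
                run += 1
        lengths.append(run)  # trailing partial ORF at the end of the frame
    return lengths

seq = read_fasta('ecoli_k12.fasta')  # hypothetical local filename
lengths = orf_lengths(seq)

# Crude text histogram of ORF length (in codons), binned by 50
hist = Counter(l // 50 for l in lengths)
for b in sorted(hist):
    print(f'{b*50:5d}-{b*50+49:5d} {hist[b]}')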
3.10 Create an overlap‐encoding topology scoring method (like that described for Figure 3.3), and use it to obtain topology histograms like those shown in Figure 3.3a and Figure 3.4. Do this for the following genomes: E. coli (K‐12 strain); V. cholerae; and Deinococcus radiodurans.
3.11 In highly trusted coding regions, such as the longest 10% of the ORFs found in the E. coli genome analysis, perform a gap IMM analysis, which should approximately reproduce the result shown in Figure 3.5.
3.12 Prove that relative entropy is always non‐negative.
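One standard route for Exercise 3.12, sketched in LaTeX notation, uses the inequality ln x ≤ x − 1 (with equality only at x = 1), applied over the support of p:

-D(p\|q) \;=\; \sum_i p_i \ln\frac{q_i}{p_i}
\;\le\; \sum_i p_i\Bigl(\frac{q_i}{p_i} - 1\Bigr)
\;=\; \sum_i q_i \;-\; \sum_i p_i \;\le\; 1 - 1 \;=\; 0,

so D(p||q) ≥ 0, with equality if and only if p = q.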