Читать книгу Informatics and Machine Learning - Stephen Winters-Hilt - Страница 43
2.7 Exercises
Оглавление1 2.1 Evaluate the Shannon Entropy, by hand, for the fair die probability distribution: (1/6,1/6,1/6,1/6,1/6,1/6), for the probability of rolling a 1 thru a 6 (all are the same, 1/6, for uniform prob. Dist). Also evaluate for loaded die: (1/10,1/10,1/10,1/10,1/10,1/2).
2 2.2 Evaluate the Shannon Entropy for the fair and loaded probability distribution in Exercise 2.1 computationally, by running the program described in Section 2.1.
3 2.3 Now consider you have two dice, where each separately rolls “fair,” but together they do not roll “fair,” i.e. each specific pair of die rolls does not have probability 1/36, but instead has probability:Die 1 rollDie 2 rollProbability11(1/6)*(0.001)12(1/6)*(0.125)13(1/6)*(0.125)14(1/6)*(0.125)15(1/6)*(0.124)16(1/6)*(0.5)2Any(1/6)*(1/6)3Any(1/6)*(1/6)4Any(1/6)*(1/6)5Any(1/6)*(1/6)61(1/6)*(0.5)62(1/6)*(0.125)63(1/6)*(0.125)64(1/6)*(0.125)65(1/6)*(0.124)66(1/6)*(0.001)What is Shannon Entropy for the Die 1 outcomes? (call H(1)) What is the Shannon entropy of the Die 2 outcomes (refer to as H(2))? What is the Shannon entropy on the two‐dice outcomes with probabilities shown in the table above (denote (H(1,2))?Compute the function MI(Die 1,Die 2) = H(1) + H(2) − H(1,2). Is it positive?
4 2.4 Go to genbank (https://www.ncbi.nlm.nih.gov/genbank/) and select the genome of a small virus (~10 kb). Using the Python code shown in Section 2.1, determine the base frequencies for {a,c,g,t}. What is the shannon entropy (if those frequencies are taken to be the probabilities on the associated outcomes)?
5 2.5 Go to genbank (https://www.ncbi.nlm.nih.gov/genbank/) and select the genome of three medium‐sized viruses (~100 kb). Using the Python code shown in Section 2.1, determine the trinucleotide frequencies. What is the Shannon entropy of the trinucleotide frequencies for each of the three virus genomes? Using this as a distance measure phylogenetically speaking, which two viruses are most closely related?
6 2.6 Repeat (Exercise 2.5) but now use symmetrized relative entropy between the trinucleotide probability distributions as a distance measure instead (reevaluate pairwise between the three viruses). Using this as a distance measure phylogenetically speaking, which two viruses are most closely related?
7 2.7 Prove that relative entropy is always positive (hint: use Jensen's Inequality from Section 2.4).
8 2.8 What is the Expectation for the two‐dice roll with pair outcome probabilities listed in (Exercise 2.3)?
9 2.9 What is the Expectation for the two‐dice roll with fair dice? Is this expectation an actual outcome possibility? What does it mean if it is not?
10 2.10 Survey the literature and write a report on common occurrences of distributions of the type: uniform, geometric, exponential, Gaussian, log‐normal, heavy‐tail.
11 2.11 Survey the literature and write a report on common occurrences of series of the type: Martingale.
12 2.12 Consider the loaded die example, where the probability of rolling a 1,2,3,4, or 5, is 0.1, and the probability of rolling a 6 is 0.5.What is the expectation for the loaded die?What is its variance?What is its mode?What is its median?The LLN for the loaded die above indicates that a sequence of rolls could be done and if its average tends toward 4.5, you know it is loaded, and if it goes toward 3.5, you know it is fair. So it comes down to how soon you can resolve that its converging on these two possible expectations differing by 1.0. Suppose someone is rolling a die that is either fair or loaded as described above, how many rolls do you think you will need to see before it will be obvious how the average is trending? Is the better way to spot the anomaly? Like frequency of seeing three sixes in a row is notably skewed?
13 2.13 You have a genomic sequence of length L. (For DNA genomes you have approximately 10**4 for viruses, 10**6 for bacteria, and 10**9 for mammals.) A typical analysis is to get counts on subsequences of length N within the full sequence L, where there are L − N + 1 subsequences of length N (by sliding a window of width N across the length L sequence and taking the window samples accordingly). The number of possible subsequences of length N grows exponentially with increase in that length. For DNA subsequences of length six bases, the 6mers, with four base possibilities, {a,c,g,t}, there are thus 4**6 = 4096 possible 6mers. If the 6mers are equally probable, then in the approximate 10 000 length of a virus each 6mer might be seen a couple times (10 000/4096 to be precise), while a particular 6mer can be seen millions of times in mammalian genomes. Sounds fine so far, but now consider an analysis of 25mers…. The possible 25mers number 4**25 = 2**50 = (2**10)**5 = 1024**5 = 10**15. So, a million billion possibilities…. It turns out that DNA information does not have subsequences with approximately equal statistical counts (equal probabilities), but, instead, is highly structured with a variety of overlapping encoding schemes, so has subsequences with very unequal statistics. The vast majority of the 25mer subsequences, in fact, will have zero counts such that enumeration of the possibilities ahead of time in an array data‐structure is not useful or even possible in some cases, which then leads to associative arrays in this context as shown in the sample code. Do a 25mer analysis on bacterial genome (get from genbank, like E. coli). What is the highest count 25mer subsequence?
14 2.14 A rare genetic disease has probability P(disease) = 0.0000001. Suppose you have a test for this condition with sensitivity that is perfect (SN = 1.0) and with specificity that is 99.99% correct (SP = 0.9999) (i.e. false positives occur with probability 0.0001). Is this test useful in practice?
15 2.15 Prove P(X,Y|Z) = P(X|Z) P(Y|X,Z).
16 2.16 Prove the Bonferroni inequality: P(X,Y) ≥ P(X) + P(Y) − 1.
17 2.17 Suppose you are on a game show (with Monty Hall), and you are given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what is behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to change your pick door to No. 2?” Is it to your advantage to switch your choice? Prove by tabulation of possibilities, and then prove using Bayes' Rule with appropriate choice of variables.
18 2.18 Prove E(Y/X) ≥ E(Y)/E(X).