2.1 Python Shell Scripting
A “fair” die has equal probability of rolling a 1, 2, 3, 4, 5, or 6, i.e. a probability of 1/6 for each of the outcomes. Notice how the discrete probabilities for the different outcomes all add up to 1; this is always the case for probabilities describing a complete set of outcomes.
A “loaded” die has a non‐uniform distribution. For example, with probability 0.5 of rolling a “6” and the remaining probability spread uniformly over the other five faces, the die‐roll probability distribution is (1/10, 1/10, 1/10, 1/10, 1/10, 1/2).
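As a quick sanity check at the Python prompt (a sketch, not a program from the book), one can confirm that these probabilities sum to 1:

probs = (1.0/10, 1.0/10, 1.0/10, 1.0/10, 1.0/10, 1.0/2)
print(sum(probs))   # 1.0 (up to floating point rounding): a complete set of outcomes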
The first program to be discussed is named prog1.py and will introduce the notion of discrete probability distributions in the context of rolling the familiar six‐sided die. Comments in Python are the portion of a line to the right of any “#” symbol (except for the first line of code with “#!.....”, which is explained later).
The Shannon entropy of a discrete probability distribution is the measure of its amount of randomness, with the uniform probability distribution having the greatest randomness (i.e. it is most lacking in any statistical “structure” or “information”). Shannon entropy is the sum of each outcome probability times its log probability, with an overall minus sign placed in front so that the result is a positive value. Further details on the mathematical formalism will be given in Chapter 3, but for now we can implement this in our first Python program:
-------------------------- prog1.py -------------------------
#!/usr/bin/python
import numpy as np
import math
import re

arr = np.array([1.0/10,1.0/10,1.0/10,1.0/10,1.0/10,1.0/2])
# print(arr[0])
shannon_entropy = 0
numterms = len(arr)
print(numterms)
for index in range(0, numterms):
    shannon_entropy += arr[index]*math.log(arr[index])
shannon_entropy = -shannon_entropy
print(shannon_entropy)
----------------------- end prog1.py ------------------------
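As a worked check (added here; math.log uses the natural log by default), the loaded‐die entropy computed above is −[5 × (1/10)ln(1/10) + (1/2)ln(1/2)] = (1/2)ln 10 + (1/2)ln 2 ≈ 1.498, which prog1.py should print, versus ln 6 ≈ 1.792 for the fair die.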
The maximum Shannon entropy on a system with six outcomes, uniformly distributed (a fair die), is log(6). In the prog1.py program above we evaluate the Shannon entropy for a loaded die: (1/10, 1/10, 1/10, 1/10, 1/10, 1/2). Notice in the code, however, that I use “1.0” not “1”. This is because an expression involving only integers may be evaluated with integer arithmetic: in Python 2 the “/” operator truncates integer division (1/10 returns 0), while in Python 3 “/” performs true division and “//” is the truncating form. An expression that is mixed, some integer terms, some floating point (with a decimal), will be evaluated as a floating point number. So, to force recognition of the numbers as floating point, the “1” value in the terms is entered as “1.0”. Further tests are left to the Exercises (Section 2.7).
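A quick illustration one can try at the Python prompt (a sketch, not part of prog1.py; the commented outputs assume Python 3):

print(1/10)     # 0.1 in Python 3 (true division); 0 in Python 2
print(1//10)    # 0 (truncating integer division, in both versions)
print(1.0/10)   # 0.1 (floating point, in both versions)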
A basic review of getting a Linux system running, with its standard Python installed, is described in the Appendix, along with a discussion of how to install added Python modules (added code libraries with very useful, pre‐built data structures and subroutines), particularly “numpy,” which is indicated as a module to be imported (accessed) by the program by the first Python command: “import numpy as np.” (We will see in the Appendix that the first line is not a Python command but a shell directive as to what program to use to process the commands that follow, and this is the mechanism whereby a system‐level call on the Python script can be done.)
Let us now move on to some basic statistical concepts. How do we know the probabilities for the outcomes of the die roll? In practice, you would observe numerous die rolls and get counts of how many times the various outcomes were observed. Once you have counts, you can divide by the total count to get the frequency of occurrence of each outcome. If you have enough observational data, the frequencies become better and better estimates of the true underlying probabilities for those outcomes for the system observed (a result due to the law of large numbers (LLN), which is rederived in Section 2.6.1). Let us proceed with adding more code to prog1.py that begins with counts on the different die rolls:
------------------ prog1.py addendum 1 -----------------------
rolls = np.array([3435.0,3566,3245,3600,3544,3427])
numterms = len(rolls)
total_count = 0
for index in range(0,numterms):
    total_count += rolls[index]
print(total_count)

probs = np.array([0.0,0,0,0,0,0])
for index in range(0,numterms):
    probs[index] = rolls[index]/total_count
print(probs)
-------------------- end prog1.py addendum 1 -----------------
Some notes on syntax: “len” is a built‐in Python function that returns the length of (number of items in) an array (here, a numpy array). Notice how the probs array initialization has one entry as 0.0 and the others just 0. This is an instance where the data structure must have components of the same type, and if presented with mixed types it will promote to a default type that (typically) represents the least loss of information; in this instance, the “0.0” forces the array to be an array of floating point (decimal) numbers, with floating point arithmetic (for the division in the frequency evaluation used as the estimate of the probability in the “for loop”).
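As an aside, the LLN claim above is easy to check by simulation; here is a minimal sketch (not part of prog1.py; the seed value is an arbitrary choice for reproducibility):

--------------------- sketch: LLN check ----------------------
import numpy as np

np.random.seed(0)                            # arbitrary seed
for n in (100, 10000, 1000000):
    rolls = np.random.randint(1, 7, size=n)  # n fair-die rolls, faces 1..6
    freqs = np.bincount(rolls, minlength=7)[1:] / n
    print(n, freqs)                          # estimates approach 1/6 as n grows
------------------- end sketch: LLN check --------------------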
At this point we can estimate a new probability distribution based on the rolls observed, for which we are interested in evaluating the Shannon entropy. To avoid repeatedly copying and pasting the above entropy code, let us create a subroutine, called “shannon,” that performs this standard computation. This is a core software engineering practice: tasks that are done repeatedly become recognized as such, get written as subroutines, and then need never be rewritten. Subroutines also avoid clashes in variable usage by compartmentalizing their variables (whose scope is limited to the subroutine), and they more clearly delineate what information is “fed in” and what information is returned (i.e. the application programming interface, or API).
----------------------- prog1.py addendum 2 ------------------
def shannon( probs ):
    shannon_entropy = 0
    numterms = len(probs)
    print(numterms)
    for index in range(0, numterms):
        print(probs[index])
        shannon_entropy += probs[index]*math.log(probs[index])
    shannon_entropy = -shannon_entropy
    print(shannon_entropy)
    return shannon_entropy

value = shannon(probs)
print(value)
-------------------- end prog1.py addendum 2 -----------------
If we do another set of observations, getting counts on the different rolls, we then need to repeat the process of converting those counts to frequencies… so it is time to elevate the count‐to‐frequency computation to subroutine status as well, as is done next. The standard syntactical structure for defining a subroutine in Python is hopefully starting to become apparent (more detailed Python notes are in Appendix A).
------------------- prog1.py addendum 3 ----------------------
def count_to_freq( counts ):
    numterms = len(counts)
    total_count = 0
    for index in range(0,numterms):
        total_count += counts[index]
    probs = counts.copy() # a true copy (not an alias), so the caller's counts are not modified
    for index in range(0,numterms):
        probs[index] = counts[index]/total_count
    return probs

probs = count_to_freq(rolls)
print(probs)
----------------- end prog1.py addendum 3 --------------------
Is genomic DNA random? Let us read through a DNA file, consisting of a sequence of a, c, g, and t's, and get their counts… then compute the Shannon entropy vs. random (uniform distribution, e.g. p = 1/4 for each of the four possibilities). In order to do this we must learn file input/output (i/o) to “read” the data file:
------------------ prog1.py addendum 4 -----------------------
fo = open("Norwalk_Virus.txt", "r")
str = fo.read()
# print(str)
fo.close()
---------------- end prog1.py addendum 4 ---------------------
Notes on syntax: the example above shows the standard template for reading a data file, where the datafile's name is Norwalk_Virus.txt. The “open” routine is a built‐in Python function that handles file i/o. As its name suggests, it “opens” a datafile (here in mode “r”, for reading).
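As an aside, the now‐preferred idiom for the same read (a sketch, not the book's listing; the variable name genome_text is arbitrary) uses Python's “with” statement, which closes the file automatically even if an error occurs:

with open("Norwalk_Virus.txt", "r") as fo:   # file is closed on block exit
    genome_text = fo.read()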
Figure 2.1 The Norwalk virus genome (the “cruise ship virus”).
The Norwalk virus file has nonstandard format and is shown in its entirety in Figure 2.1 (split into two columns). The Escherichia coli genome (Figure 2.2), on the other hand, has standard FASTA format. (FASTA is the name of a sequence‐comparison program (~1985) whose file format convention was adopted and has been used in similar form ever since, allowing “flat‐file” record access.)
Only the first part of the E. coli genome file is shown in Figure 2.2 (the full file is 4.6 Mb). The key feature of the FASTA format is apparent on line 1, where a “>” symbol should be present, indicating a label (or comment) – information that will almost always be present.
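When working with a FASTA file such as the E. coli genome, the “>” label line must be skipped before any base counting; a minimal sketch, assuming a local file named E_coli_genome.fasta:

--------------- sketch: reading a FASTA file -----------------
seq = ""
with open("E_coli_genome.fasta", "r") as fo:   # assumed filename
    for line in fo:
        if not line.startswith(">"):           # skip the FASTA label/comment line(s)
            seq += line.strip()
------------- end sketch: reading a FASTA file ---------------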
Python has a powerful regular expression module (named “re” and imported in the first code sample of prog1.py). Regular expression processing of strings of characters is a mini programming language in its own right, so a complete list of the functionalities of the re module will not be given here. Focusing on the “findall” function: it does as its name suggests, finding all entries matching the search pattern and returning them in an array in the order they are encountered in the string. We begin with the string comprising the data file read in the previous example. We now traverse the string searching for characters matching those specified in the pattern field. The array of a, c, g, t's that results has conveniently stripped any numbers, spaces, or line returns in this process, and a straightforward count can be done:
Figure 2.2 The start of the E. coli genome file, FASTA format.
---------------------- prog1.py addendum 5 -------------------
pattern = '[acgt]'
result = re.findall(pattern, str)
seqlen = len(result)
# sequence = ""
# sequence = sequence.join(result)
# print(sequence)
print("The sequence length of the Norwalk genome is: ")
print(seqlen)

a_count=0
c_count=0
g_count=0
t_count=0
for index in range(0,seqlen):
    if result[index] == 'a':
        a_count+=1.0
    elif result[index] == 'c':
        c_count+=1.0
    elif result[index] == 'g':
        g_count+=1.0
    elif result[index] == 't':
        t_count+=1.0
    else:
        print("bad char\n")

norwalk_counts = np.array([a_count, c_count, g_count, t_count])
print(norwalk_counts)
norwalk_probs = count_to_freq(norwalk_counts)
value = shannon(norwalk_probs)
print(value)
-------------------- end prog1.py addendum 5 -----------------
We now traverse the array of single acgt's extracted from the raw genome data file, and increment the counters associated with the acgt's as appropriate. At the end we have the needed counts, and can then use our subroutines to see what Shannon entropy results.
Note on the informatics result: notice how the Shannon entropy for the frequencies of {a,c,g,t} in the genomic data differs only slightly from the Shannon entropy that would be found for a perfectly random, uniform, probability distribution (i.e. log(4) ≈ 1.386 with natural log). This shows that although the genomic data is not random, i.e. it holds “information” of some sort, we only weakly see this at the level of single‐base usage.
In order to see clearer signs of nonrandomness, let us try evaluating frequencies at the two‐base, or dinucleotide, level. There are 16 (4 × 4) dinucleotides that we must now get counts on:
------------------ prog1.py addendum 6 -----------------------
di_uniform = [1.0/16]*16

stats = {}
for i in result:
    if i in stats:
        stats[i]+=1
    else:
        stats[i]=1
#for i in sorted(stats, key=stats.get):
for i in sorted(stats):
    print("%dx'%s'" % (stats[i],i))

stats = {}
for index in range(1,seqlen):
    dinucleotide = result[index-1] + result[index]
    if dinucleotide in stats:
        stats[dinucleotide]+=1
    else:
        stats[dinucleotide]=1
for i in sorted(stats):
    print("%dx'%s'" % (stats[i],i))
-------------------- end prog1.py addendum 6 -----------------
In the above example we see our first use of hash variables (“stats”; in Python, a dictionary) for keeping tabs on counts of occurrences of various outcomes. This is a fundamental way to perform such counts without enumerating all of the outcomes beforehand (which results in what is known as the “enumeration problem,” which is not really a problem, just a poor algorithmic approach). Further discussion of the enumeration “problem” and how it can be circumvented with use of hash variables will be given in Section 2.2.
The sequence information is traversed in a manner such that each of the dinucleotides is counted in the order seen, where a “window” of width two bases is slid across the genomic sequence and the dinucleotide is extracted at each position. Each dinucleotide is entered into the count hash variable as a “key” entry, with the associated “value” being an increment on the count already seen and held as the old “value.” These counts are then transferred to an array to make use of our prior subroutines count_to_freq and shannon. A compact alternative is sketched below.
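For comparison, here is a compact alternative sketch (not the book's code) using the standard library's collections.Counter, itself a hash‐based count container, with zip sliding the two‐base window over the result array from addendum 5:

--------------- sketch: Counter alternative ------------------
from collections import Counter

# pair each base with its successor to form the dinucleotide windows
dinucs = Counter(a + b for a, b in zip(result, result[1:]))
for dinuc in sorted(dinucs):
    print("%dx'%s'" % (dinucs[dinuc], dinuc))
------------- end sketch: Counter alternative ----------------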
In the results for Shannon entropy on dinucleotides, we still do not see clear signs of nonrandomness. So let us similarly try the trinucleotide level. There are 64 (4 × 4 × 4) trinucleotides that we must now get counts on:
-------------------- prog1.py addendum 7 ---------------------
stats = {}
order = 3
for index in range(order-1,seqlen):
    xmer = ""
    for xmeri in range(0,order):
        xmer+=result[index-(order-1)+xmeri]
    if xmer in stats:
        stats[xmer]+=1
    else:
        stats[xmer]=1
for i in sorted(stats):
    print("%dx'%s'" % (stats[i],i))
---------------- end prog1.py addendum 7 ---------------------
We still do not see really clear signs of nonrandomness at the trinucleotide level! So let us try the 6‐nucleotide (hexamer) level. There are 4096 (4^6) hexamers that we must now get counts on:
----------------- prog1.py addendum 8 ------------------------
def shannon_order( seq, order ):
    stats = {}
    seqlen = len(seq)
    for index in range(order-1,seqlen):
        xmer = ""
        for xmeri in range(0,order):
            xmer += seq[index-(order-1)+xmeri]   # use the seq argument, not the global result
        if xmer in stats:
            stats[xmer]+=1
        else:
            stats[xmer]=1
    nonzerocounts = len(stats)
    print("nonzerocounts=")
    print(nonzerocounts)
    counts = np.empty((0))
    for i in sorted(stats):
        counts = np.append(counts, stats[i]+0.0)
    probs = count_to_freq(counts)
    value = shannon(probs)
    print("The Shannon entropy at order", order, "is:", value, ".")

shannon_order(result,6)
------------------- end prog1.py addendum 8 ------------------
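A natural follow‐up (a sketch, not part of the book's listing) is to sweep the window width and compare each entropy against its uniform‐distribution ceiling, which for 4^order equally likely outcomes is order × log(4):

----------------- sketch: order sweep ------------------------
for order in range(1, 7):
    print("uniform maximum at order", order, "is:", order*math.log(4))
    shannon_order(result, order)
--------------- end sketch: order sweep ----------------------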