Informatics and Machine Learning - Stephen Winters-Hilt
3 Information Entropy and Statistical Measures
In this chapter, we start with a description of information entropy and statistical measures (Section 3.1). Using these measures, we then examine “raw” genomic data. No biology or biochemistry knowledge is needed for this analysis, and yet we almost trivially rediscover a three‐element encoding scheme that is famous in biology, known as the codon. Analysis of information encoding in the four‐element {a, c, g, t} genomic sequence alphabet is about as simple as you can get (without working with binary data), so it provides some of the introductory examples that are implemented. A few simple statistical queries are then enough to get the details of the codon encoding scheme (Section 3.2). Once the encoding scheme is known to exist, further structure is revealed by the anomalous placement of “stop” codons, e.g. anomalously large open reading frames (ORFs) are discovered. A few more simple statistical queries from there reveal the relation of ORFs to gene structure (Section 3.3). Once you have a clear structure in the sequential data that can be referenced positionally, it is possible to gather statistical information for a Markov model. One example of this is to look at the base statistics at various positions “upstream” from the start codon. We thereby identify binding sites for critical molecular interactions in both transcription and translation. Since the Markov model is needed for the analysis of sequential processes in general, as discussed in later chapters (Chapters 6 and 7 in particular), a review of Markov models and some of their specializations is given in Section 3.4 (Chapters 6 and 7 cover hidden Markov models, or HMMs).
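To make the kind of "simple statistical query" described above concrete, here is a minimal Python sketch. It is an illustration only, not the author's code. The function names `shannon_entropy` and `longest_orf` are hypothetical. The sketch assumes the standard stop codons (taa, tag, tga), scans the forward strand only, and treats an ORF as any stop‐free run of codons in one of the three reading frames. The chapter's own definitions may differ.

```python
import math
from collections import Counter

# Standard stop codons, written in the lowercase {a, c, g, t} alphabet.
STOP_CODONS = {"taa", "tag", "tga"}


def shannon_entropy(seq, k=1):
    """Shannon entropy, in bits, of the overlapping k-mer frequencies in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_orf(seq):
    """Longest stop-free run of codons over the three forward frames.

    Returns (frame, start_index, length_in_bases). Assumption: an ORF here
    is simply a maximal run of codons containing no stop codon.
    """
    best = (0, 0, 0)
    for frame in range(3):
        run_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - run_start > best[2]:
                    best = (frame, run_start, i - run_start)
                run_start = i + 3  # next run begins after the stop codon
        # Close the trailing run, which ends at the last complete codon.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - run_start > best[2]:
            best = (frame, run_start, end - run_start)
    return best


if __name__ == "__main__":
    toy = "atgcccgggaaataa"  # made-up toy sequence, not real genomic data
    print(shannon_entropy(toy, k=1))
    print(shannon_entropy(toy, k=3))
    print(longest_orf(toy))
```

On real genomic data, one would compare the k‐mer entropies against their maximum of 2k bits for a four‐letter alphabet, and compare the observed ORF lengths against those expected if stop codons were placed at random. This sketch computes only the raw quantities.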
Numerous prior book, journal, and patent publications by the author are drawn upon throughout the text [1–68]. Almost all of the journal publications are open access. These publications can typically be found online at either the author’s personal website (www.meta‐logos.com) or with one of the following online publishers: www.m‐hikari.com or bmcbioinformatics.biomedcentral.com.