Читать книгу Informatics and Machine Learning - Stephen Winters-Hilt - Страница 29
2.1.1 Sample Size Complications
ОглавлениеThe 6‐nucleotide statistics analyzed in prog1.py in the preceding is typically called a hexamer statistical analysis. Where the window‐size for extracting the substrings has “‐mer” appended, thus six‐mer or hexamer. The term “‐mer” comes from oligomer, a polymer containing a small number of monomers in its specification. In the case of the hexamers we saw that there were 4096 possible hexamers, or length six substrings, when the “alphabet” of monomer types consists of four elements: a,c,g, and t. In other words, there are 46 = 4096 such substrings. In the Norwalk virus analysis this large number of different things to count, versus sample size overall, raises sampling questions. The Norwalk virus has a genome that is only 7654 nucleotides long. As we sweep the six‐base window over that string to extract all of the hexamer counts we then obtain only 7654 − 5 = 7649 hexamer samples! Even with uniform distribution we will be getting barely two counts for most of the different hexamer types! Limitations due to sample size play a critical role in these types of analysis.
The Norwalk virus genome is actually smaller than the typical viral genome, which ranges between 10 000 and 100 000 bases in length. Prokaryotic genomes typically range between 1 and 10 million bases in length. While the human genome is approximately three billion bases in length (3.23 Gb per haploid genome, 6.46 Gb total diploid). To go forward with a “strong” statistical analysis in the current discussion, the key as with any statistical analysis, is sample size, which is obviously dictated in this analysis by genome size. So to have “good statistics” meaning to have sufficient samples that frequencies of outcomes provide a good estimation of the underlying probabilities for those outcomes, we will apply the methods developed thus far to a bacterial genome in Chapter 3 (the classic model organism, E. coli.). In this instance the genome size will be approximately four and a half million bases in length, so much better counts should result than with the 7654 base Norwalk virus genome.