Informatics and Machine Learning - Stephen Winters-Hilt

2.6.2.3 Significant Distributions That Are Not Gaussian or Geometric

Nongeometric duration distributions occur in many familiar settings, such as the length of spoken words in phone conversations and elsewhere in voice recognition. Although the Gaussian distribution occurs in many scientific fields (an observed embodiment of the LLN, among other things), a great many significant observed distributions are skewed or otherwise non‐Gaussian, such as heavy‐tailed (or long‐tailed) distributions and multimodal distributions.

Heavy‐tailed distributions are widespread in describing phenomena across the sciences. The log‐normal and Pareto distributions are heavy‐tailed distributions that are almost as common as the normal and geometric distributions in descriptions of physical or man‐made phenomena. The Pareto distribution was originally used to describe the allocation of wealth in society, known as the famous 80–20 rule: about 80% of the wealth is owned by a small fraction of the population, while “the tail,” the large majority of people, holds only the remaining 20%. The Pareto distribution has since been extended to many other areas. For example, internet file‐size traffic follows a long‐tailed distribution: there are a few large files and many small files to be transferred. This distributional assumption is an important factor in designing a robust and reliable network, and the Pareto distribution could be a suitable choice to model such traffic. (Internet applications exhibit many other heavy‐tailed phenomena.) Pareto distributions are also found in many other fields, such as economics.
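The 80–20 rule can be made concrete with a small numerical sketch. It assumes a Pareto (Type I) distribution with shape parameter alpha > 1, for which the share of the total held by the top fraction p of the population is p^(1 − 1/alpha); the shape value log 5 / log 4 is the one that yields exactly 80–20. The function names, sample size, and seed are illustrative choices, not taken from the text.

```python
import math
import random


def top_share(p: float, alpha: float) -> float:
    """Share of the total held by the top fraction p under Pareto(alpha).

    Uses the closed form p ** (1 - 1/alpha), valid for alpha > 1
    (for alpha <= 1 the mean is infinite and the share is undefined).
    """
    if not (0.0 < p <= 1.0) or alpha <= 1.0:
        raise ValueError("need 0 < p <= 1 and alpha > 1")
    return p ** (1.0 - 1.0 / alpha)


# Shape parameter for which the top 20% hold exactly 80% of the total.
ALPHA_80_20 = math.log(5) / math.log(4)  # about 1.161


def empirical_top_share(p: float, alpha: float, n: int = 100_000, seed: int = 0) -> float:
    """Estimate the same share from n simulated Pareto draws (illustrative only)."""
    rng = random.Random(seed)
    xs = sorted((rng.paretovariate(alpha) for _ in range(n)), reverse=True)
    k = round(p * n)  # number of samples in the top fraction
    return sum(xs[:k]) / sum(xs)


if __name__ == "__main__":
    print("analytic top-20% share:", top_share(0.2, ALPHA_80_20))
    print("simulated top-20% share:", empirical_top_share(0.2, ALPHA_80_20))
```

Because the variance is infinite at this shape value, the simulated share fluctuates noticeably from seed to seed; the analytic value is the stable reference.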


Figure 2.4 The Gaussian distribution, aka Normal, shown with mean zero and variance equal to one: $N_x(\mu,\sigma^2) = N_x(0,1)$.
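The curve described in the caption can be tabulated with a few lines of code. This is a minimal sketch assuming the standard Gaussian density formula; the function name `normal_pdf` and the grid spacing are illustrative choices.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma2: float = 1.0) -> float:
    """Gaussian density N_x(mu, sigma2) with mean mu and variance sigma2."""
    if sigma2 <= 0.0:
        raise ValueError("variance must be positive")
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


# Tabulate the standard normal N_x(0, 1) on a coarse grid from -3 to 3.
grid = [i / 2.0 for i in range(-6, 7)]
curve = [(x, normal_pdf(x)) for x in grid]

if __name__ == "__main__":
    for x, density in curve:
        print(f"{x:5.1f}  {density:.4f}")
```

The density peaks at x = 0 and is symmetric about the mean, in contrast to the skewed distributions discussed in this section.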

Log‐normal distributions are used in geology and mining, medicine, environmental science, atmospheric science, and so on, where skewed distributions are very common. In geology, the concentrations of elements and their radioactivity in the Earth's crust are often found to be log‐normally distributed. The infection latent period, the time from infection to the onset of disease symptoms, is often modeled with a log‐normal distribution. In the environment, the distributions of particles, chemicals, and organisms are often log‐normal. Many atmospheric physical and chemical properties obey the log‐normal distribution. The density of bacterial populations often follows the log‐normal distribution law. In linguistics, the number of letters per word and the number of words per sentence fit the log‐normal distribution. The length distribution for introns, in particular, has very strong support in an extended heavy‐tail region, as does the length distribution of exons or open reading frames (ORFs) in genomic deoxyribonucleic acid (DNA). The anomalously long‐tailed aspect of the ORF‐length distribution is its key distinguishing feature, and has been the key attribute used by biologists using ORF finders to identify likely protein‐coding regions in genomic DNA since the early days of (manual) gene structure identification.
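The skew of a log‐normal variable can be seen in a short simulation. The sketch below assumes only the defining property that the logarithm of a log‐normal variable is Gaussian with parameters mu and sigma; the function name, default parameters, sample size, and seed are illustrative and do not come from the text.

```python
import math
import random
import statistics


def lognormal_summary(mu: float = 0.0, sigma: float = 1.0,
                      n: int = 100_000, seed: int = 0) -> dict:
    """Draw n log-normal samples and summarize them on raw and log scales."""
    rng = random.Random(seed)
    xs = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    logs = [math.log(x) for x in xs]  # Gaussian(mu, sigma) by construction
    return {
        "mean": statistics.fmean(xs),          # theory: exp(mu + sigma**2 / 2)
        "median": statistics.median(xs),       # theory: exp(mu)
        "log_mean": statistics.fmean(logs),    # theory: mu
        "log_stdev": statistics.pstdev(logs),  # theory: sigma
    }


if __name__ == "__main__":
    for name, value in lognormal_summary().items():
        print(f"{name:10s} {value:.4f}")
```

The mean exceeds the median because the long right tail pulls the average upward, while on the log scale the sample is symmetric about mu.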
