Experimental Design and Statistical Analysis for Pharmacology and the Biomedical Sciences - Paul J. Mitchell
Classification of data distributions
If we collect a large number of observations in an experiment and from these data produce a histogram (where the frequency of the observations is plotted against their magnitude), then the resulting figure summarises the distribution of the data. With a sufficient number of observations, the frequency of occurrence of each observation closely approximates the probability that future observations will have a particular value. Furthermore, the distributions created by our data often map to distributions that are generated mathematically. Each such distribution is defined by an equation, which allows the probability of a given score to be calculated. Probability distributions depend on the form of the data obtained (see form of measurement data, above) and consequently are either discrete or continuous; in both cases, discrete and continuous probability distributions are statistical functions that provide a way of mapping out the likelihood that an observation will have a given value.
The different types of theoretical mathematical distributions are summarised in Table 4.1. Some of these probability distributions are outside the remit of this book and are only included here for completeness.
Table 4.1 Classification of probability distributions.
| Discrete distributions | Continuous distributions |
|---|---|
| Uniform distribution | Uniform distribution |
| Bernoulli distribution | Exponential distribution |
| Binomial distribution | Gamma distribution |
| Poisson distribution | Normal distribution |
| Geometric distribution | Chi-square distribution |
| Negative binomial distribution | Student-t distribution |
| Hypergeometric distribution | F distribution |
| | Beta distribution |
| | Weibull distribution |
| | Gumbel distribution |
In contrast, the discrete and continuous probability distributions described briefly below are either of theoretical interest or are distributions which may inform our decisions regarding which statistical methods we use to describe and analyse data from our experiments. Care must be taken, however, to ensure that we map our experimental data onto the correct mathematical distribution: if we make an incorrect choice, then we risk arriving at erroneous conclusions.
1 Discrete uniform distribution (Figure 4.2)This is a very simple distribution in which each of the equally spaced possible integer values has the same probability. An example would be rolling a fair six-sided die; here each face is equally likely to occur, i.e. there are six equally likely outcomes. In standard statistical nomenclature, p is the variable used to denote probability. So here p = 1/6 = 0.167.
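For readers who like to verify such values numerically, the die example can be sketched in a few lines of Python (the code and the function name `discrete_uniform_pmf` are illustrative additions, not part of this book):

```python
from fractions import Fraction

def discrete_uniform_pmf(n: int) -> Fraction:
    """Discrete uniform distribution: each of n equally likely outcomes has probability 1/n."""
    return Fraction(1, n)

# A fair six-sided die: p = 1/6 ≈ 0.167 for each face.
p = discrete_uniform_pmf(6)
print(float(p))  # 0.16666666666666666
```

Note that the six probabilities sum to exactly 1, as they must for any probability distribution.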
2 Bernoulli distribution (Figure 4.3)Whenever you toss a coin, the only outcomes are that the coin lands heads or tails uppermost; the question being asked here is ‘will this single trial succeed?’. This is an example of the most basic of random events, where a single event has only two possible outcomes, with a fixed probability of each occurring. The Bernoulli distribution has only one controlling parameter, which is the probability of success according to whether you call heads or tails; in both cases the probabilities of success and failure in a single trial are equal, with a probability of 0.5 (i.e. p = 0.5).Figure 4.2 The discrete uniform distribution. X-axis values indicate the resulting number shown on the throw of a six-sided die. Y-axis values indicate the relative probability density.Figure 4.3 The Bernoulli distribution. X-axis values indicate the resulting outcome from only two possibilities, e.g. success or failure to throw heads on the toss of a coin. Y-axis values indicate the relative probability density.
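A single Bernoulli trial is equally easy to sketch; the following fragment (illustrative only, with p not restricted to the fair-coin case) encodes the one controlling parameter described above:

```python
def bernoulli_pmf(k: int, p: float) -> float:
    """Probability of outcome k in a single trial: k = 1 (success) or k = 0 (failure)."""
    return p if k == 1 else 1 - p

# A fair coin: success and failure each have probability 0.5.
print(bernoulli_pmf(1, 0.5), bernoulli_pmf(0, 0.5))
```

With an unfair coin, say p = 0.3, the failure probability is simply 1 - p = 0.7.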
3 Binomial distribution (Figure 4.4)The binomial distribution is an extension of the Bernoulli distribution to include multiple success or failure trials, each with a fixed probability of success. Consequently, the binomial distribution addresses the question ‘out of a given number of trials, how many will be successful?’. So, if you tossed a coin 10 times, in how many of these trials would the coin land heads? With a fair coin you would expect five heads and five tails as the outcomes of the 10 trials. But what is the probability of only two heads, or nine heads, or no heads at all? For those who are interested (!), the probability of obtaining exactly k successes in n trials is given by the binomial probability mass function and is discussed in detail in Appendix A.1.Figure 4.4 The binomial distribution. 250 undergraduate students were asked to toss a coin 10 times and count the number of times the coin landed heads uppermost. The X-axis indicates the number of times the coin was successfully tossed to land heads from 10 trials. The Y-axis indicates a) the predicted number of students for each level of success according to the probability mass function for the binomial distribution (thin solid line) and b) the observed number of students for each level of outcome (open bars). For further discussion and calculation of the binomial probability mass function, see Appendix A.1.
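The binomial probability mass function itself is derived in Appendix A.1; as an illustrative aside, it can be evaluated directly with Python's standard library (the function name `binomial_pmf` is ours, not the book's):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Tossing a fair coin 10 times (p = 0.5): probability of 0, 2, 5, or 9 heads.
for k in (0, 2, 5, 9):
    print(k, round(binomial_pmf(k, 10, 0.5), 4))
```

The most likely outcome, five heads, has probability 252/1024 ≈ 0.246, so even the expected result occurs in only about a quarter of all 10-toss experiments.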
4 Poisson distributionThe Poisson distribution (which is closely related to the binomial distribution) describes how many times a discrete event will occur within a given period of time, given a constant mean rate of occurrence.
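As an illustrative sketch (not from the book), the Poisson probability mass function can be written down from the standard formula λᵏe^(−λ)/k!:

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k events in an interval, given a constant mean rate lam)."""
    return lam**k * exp(-lam) / factorial(k)

# Mean rate of 2 events per interval: probability of observing 0, 2, or 5 events.
for k in (0, 2, 5):
    print(k, round(poisson_pmf(k, 2.0), 4))
```

Unlike the binomial distribution, there is no fixed number of trials here: any non-negative count k is possible, though large counts are vanishingly unlikely.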
5 Continuous uniform distributionThis is a very simple distribution where (as with the discrete uniform distribution, see Figure 4.2) the probability densities for each value are equal. In this situation, however, the measured values are not limited to integers.
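A minimal sketch of the density (our illustrative code, assuming the conventional support [a, b]):

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Continuous uniform distribution: constant density 1/(b - a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Any value between a = 0 and b = 2 is equally likely; the density is 0.5 throughout.
print(uniform_pdf(0.5, 0.0, 2.0), uniform_pdf(3.0, 0.0, 2.0))
```

Note that because the variable is continuous, the value 1/(b - a) is a density, not a probability; probabilities come from areas under the flat curve.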
6 Exponential distribution (Figure 4.5)The exponential distribution is used to map the time between independent events that happen at a constant rate. Examples of this include the rate at which radioactive particles decay and the rate at which drugs are eliminated from the body according to first-order pharmacokinetic principles (Figure 4.6). For calculation of the exponential probability density function, see Appendix A.2.Figure 4.5 The exponential distribution. The probability density function of the exponential distribution for events that happen at a constant rate, λ. Curves shown are for rate values of λ = 0.5 (bold line), 1.0 (thin line), 1.5 (dashed line), and 2.0 (dotted line). X-axis values indicate the stochastic variable, x. Y-axis values indicate probability (see also Appendix A.2).
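The link to first-order pharmacokinetics can be sketched numerically (an illustrative Python fragment; the rate-constant relationship λ = ln 2 / t½ is standard, but the code itself is not from the book):

```python
from math import exp, log

def exponential_pdf(x: float, lam: float) -> float:
    """Probability density at x >= 0 for events occurring at a constant rate lam."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

# First-order drug elimination with a half-life of 1 hour (as for drug X in Figure 4.6):
# the elimination rate constant is lam = ln(2) / t_half.
lam = log(2) / 1.0
fraction_remaining = exp(-lam * 1.0)  # fraction of drug left after one half-life
print(round(fraction_remaining, 4))
```

After each further half-life the remaining fraction halves again (0.25 after 2 hours, 0.125 after 3 hours), which is why the Log10 plot in Figure 4.6 is a straight line.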
7 Normal Distribution (Figure 4.7)The Normal Distribution (also known as the Gaussian distribution) is the most widely used distribution, particularly in the biological sciences, where most (but not all) biologically derived data follow a Normal Distribution. A Normal Distribution assumes that all measurements/observations are tightly clustered around the population mean, μ, and that the frequency of the observations decays rapidly and equally the further the observations are above and below the mean, thereby producing a characteristic bell shape (see Figure 4.7; for calculation of the probability density of the Normal Distribution, see Appendix A.3). The spread of the data either side of the population mean is quantified by the variance, σ², where the square root of the variance is the Standard Deviation, σ (see Chapter 5). The normal distribution with these parameters is usually denoted as N with the values of the population mean and standard deviation immediately following in parentheses, thus: N(μ, σ). Every normal distribution is a version of the simplest case, where the mean is set to zero and the standard deviation equals 1; this is denoted as N(0, 1) and is known as the Standard Normal Distribution (see Chapter 7). Furthermore, the area under the curve of the Standard Normal Distribution is equal to 1, so that sections defined by multiples of the standard deviation either side of the mean equate to specific proportions of the total area under the curve. Because Normal Distribution curves are parameterised by their corresponding Mean and Standard Deviation values, such data are known as parametric. Consequently, parametric statistics (both Descriptive and Inferential Statistics) assume that sample data sets come from data populations that follow a fixed set of parameters.
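The fixed proportions of the area under the curve can be checked directly with Python's standard library (an illustrative sketch using the N(30, 2) curve of Figure 4.7):

```python
from statistics import NormalDist

# N(30, 2): population mean 30, standard deviation 2 (as in Figure 4.7).
dist = NormalDist(mu=30, sigma=2)

# Proportion of the total area within ±1, ±2, and ±3 standard deviations of the mean.
for k in (1, 2, 3):
    area = dist.cdf(30 + k * 2) - dist.cdf(30 - k * 2)
    print(k, round(area, 4))
```

The familiar proportions emerge: roughly 68.3%, 95.4%, and 99.7% of observations fall within one, two, and three standard deviations of the mean, whatever the particular values of μ and σ.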
In contrast, non-parametric data sets, and their corresponding Descriptive and Inferential Statistics, are also called distribution-free because there is no assumption that the data sets follow a specific distribution. Appreciation of the qualities of the Normal Distribution (and its Standard form), and of the differences from non-parametric data, is fundamental in informing our strategy to analyse experimental pharmacological data. As we shall see later, there are numerous statistical tests available to analyse data that are Normally Distributed, and these provide very powerful, robust procedures, the results of which allow us to derive conclusions from our experimental data.Figure 4.6 Plasma concentration of drug X following intravenous administration. Upper panel: X-axis values indicate time post-administration. Y-axis values indicate plasma concentration (ng/ml) plotted on a linear scale. Lower panel: X-axis values indicate time post-administration. Y-axis values indicate plasma concentration (ng/ml) plotted on a Log10 scale. Half-life (t½) of drug X equals 1 hour.Figure 4.7 The Normal Distribution curve, N(30,2). The Normal Distribution curve has a Mean of 30 and a Standard Deviation of 2. X-axis values indicate the magnitude of the observations, while the Y-axis indicates the probability density function (see also Appendix A.3).
8 Chi-square distribution (Figure 4.8)The Chi-squared distribution is used primarily in hypothesis testing (see the appropriate sections in Inferential Analysis) due to its close relationship to the normal distribution, and it is also a component of the definitions of the t-distribution and the F-distribution (see below). In the simplest terms, a Chi-squared variable with one degree of freedom is the square of a standard normal variable; more generally, it is the sum of the squares of independent standard normal variables. The Chi-squared distribution is used in Chi-squared tests of independence in contingency tables for categorical data (see Pearson's Chi-squared test, Chapter 21), to determine how well an observed distribution of data fits the expected theoretical distribution if the variables are independent, and in Chi-squared tests for the variance of a population that follows a normal distribution.Figure 4.8 The Chi-square distribution. The probability density function for the Chi-squared distribution with 1 (bold solid line), 2 (thin solid line), 5 (dashed line), and 10 (dotted line) degrees of freedom. X-axis values indicate Chi-squared (χ²) and the Y-axis indicates probability (see also Appendix A.4).
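The relationship to the standard normal distribution can be demonstrated by simulation (an illustrative sketch, not from the book): squaring draws from N(0, 1) yields Chi-squared variates with one degree of freedom, whose theoretical mean equals the degrees of freedom.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the simulation is reproducible

# Square 100,000 standard normal draws to obtain chi-square(1) variates.
samples = [random.gauss(0, 1) ** 2 for _ in range(100_000)]

# The sample mean should be close to 1, the degrees of freedom.
print(round(mean(samples), 3))
```

Summing k such squared draws per sample would instead give the Chi-squared distribution with k degrees of freedom, whose mean is k.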
9 Student-t distribution (Figure 4.9)The Student t-distribution is derived from the Chi-square and normal distributions. The distribution is symmetrical and bell-shaped, very much like the Normal Distribution (see Figure 4.7), but with greater area under the curve in the tails of the distribution. The t-distribution arises when the mean of a set of data that follows a normal distribution is estimated from a small sample and the population standard deviation (σ) is unknown. As the sample size increases, the t-distribution approximates ever more closely to the standard normal distribution. The t-distribution plays an important role in assessing the probability that two sample means arise from the same population, in determining the confidence intervals for the difference between two population means (see Chapters 11 and 12), and in linear regression analysis (see Chapter 20).
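Both properties (heavier tails for small samples, convergence to the standard normal for large ones) can be checked from the standard t-density formula; this is an illustrative sketch using the standard library only:

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(x: float, df: int) -> float:
    """Probability density of Student's t-distribution with df degrees of freedom."""
    # Work in log space via lgamma to avoid overflow for large df.
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - ((df + 1) / 2) * log(1 + x * x / df))

def std_normal_pdf(x: float) -> float:
    """Probability density of the standard normal distribution, N(0, 1)."""
    return exp(-x * x / 2) / sqrt(2 * pi)

# At x = 3 the t-density with 2 df is much larger than the normal density (heavier tails),
# while with 1000 df the two densities are almost indistinguishable.
print(t_pdf(3.0, 2), t_pdf(3.0, 1000), std_normal_pdf(3.0))
```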
10 F distribution (Figure 4.10)The F distribution (named after Sir Ronald Fisher, who developed the F distribution for use in determining the critical values for the Analysis of Variance (ANOVA) models; see Chapters 15, 16 and 17) is the distribution of the ratio of two independent random variables (each of which has a Chi-square distribution), each divided by its respective number of Degrees of Freedom. It is used in several applications, including assessing the equality of two or more population variances and the validity of equations following multiple regression analysis. The F-distribution has two very important properties: first, it is defined for positive values only (this makes sense, since all variance values are positive!), and second, unlike the t-distribution, it is not symmetrical about its mean but instead is positively skewed.
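The definition as a ratio of scaled Chi-square variables, and both properties, can be verified by simulation (an illustrative sketch under the definitions above; the function names are ours):

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

def chi2_draw(df: int) -> float:
    """One chi-square variate: the sum of df squared standard normal draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

def f_draw(d1: int, d2: int) -> float:
    """One F variate: the ratio of two independent chi-squares, each divided by its df."""
    return (chi2_draw(d1) / d1) / (chi2_draw(d2) / d2)

# Simulate F(5, 10), e.g. a variance ratio with 5 and 10 degrees of freedom.
samples = [f_draw(5, 10) for _ in range(20_000)]

sample_mean = sum(samples) / len(samples)
sample_median = sorted(samples)[len(samples) // 2]
# Defined for positive values only, and positively skewed (mean exceeds median).
print(min(samples) > 0, round(sample_mean, 2), round(sample_median, 2))
```

The sample mean lands near the theoretical value d2/(d2 − 2) = 10/8 = 1.25, while the median is well below it, confirming the positive skew.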