Applied Biostatistics for the Health Sciences - Richard J. Rossi
2.2.1 Distributions
A statistical analysis of a population is centered on how the values of a variable are distributed. The distribution of a variable or population is an explicit description of how the values of the variable are distributed, often expressed in terms of percentages. The distribution of a variable is also called a probability distribution because it describes the probability that each possible value of the variable will occur. Moreover, the distribution of a variable is often presented in a table or chart, or modeled with a mathematical equation that explicitly determines the percentage of the population taking on each possible value of the variable. The total percentage in a probability distribution is 100%. The distribution of a qualitative or discrete variable is generally displayed in a bar chart or a table, and the distribution of a continuous variable is generally displayed in a graph or represented by a mathematical function.
Example 2.9
The four basic classifications of blood type are O, A, B, and AB. The distribution of blood type, according to the American Red Cross, is given in Table 2.1, and a bar chart representing this distribution is shown in Figure 2.3. Based on the information in Table 2.1, 45% of Americans have type O blood, 40% have type A, 11% have type B, and 4% have type AB blood.
Figure 2.3 A bar chart of the distribution of blood types in the United States.
Table 2.1 The Distribution of Blood Type According to the American Red Cross
Blood Type | Percentage |
---|---|
O | 45% |
A | 40% |
B | 11% |
AB | 4% |
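The requirement that a distribution totals 100% can be checked directly against Table 2.1. The following is a minimal sketch in Python; the dictionary layout, the function name `is_valid_distribution`, and the tolerance `tol` are illustrative assumptions, not part of the text.

```python
# Percentages copied from Table 2.1 (American Red Cross figures quoted in the text).
blood_type_pct = {"O": 45, "A": 40, "B": 11, "AB": 4}


def is_valid_distribution(pct, tol=0.5):
    """Return True if no percentage is negative and the total is 100% within tol.

    The tolerance is an assumed allowance for rounding in published tables.
    """
    no_negatives = all(p >= 0 for p in pct.values())
    return no_negatives and abs(sum(pct.values()) - 100) <= tol


total = sum(blood_type_pct.values())
print(total)                                  # 100
print(is_valid_distribution(blood_type_pct))  # True
```

A small tolerance is used rather than exact equality because percentages reported to a fixed number of digits do not always sum to exactly 100.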
Another method of classifying blood types is to represent blood type by type and Rh factor. A bivariate distribution of blood type for the variables type and Rh factor is given in Table 2.2 and shown as a bar chart in Figure 2.4.
Figure 2.4 A bar chart of the distribution of blood types and Rh factor in the United States.
Table 2.2 The Distribution of Blood Types with Rh Factor
Type | Rh Factor + | Rh Factor − |
---|---|---|
O | 38% | 7% |
A | 34% | 6% |
B | 9% | 2% |
AB | 3% | 1% |
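The bivariate distribution in Table 2.2 contains the distribution of each variable on its own: summing the percentages across Rh factor recovers the blood type percentages in Table 2.1, and summing across blood type gives the distribution of Rh factor. The sketch below illustrates this in Python; the helper name `marginal` and the use of an ASCII `-` for the negative Rh factor are assumptions made for the example.

```python
# Joint percentages copied from Table 2.2: (blood type, Rh factor) -> percent.
joint_pct = {
    ("O", "+"): 38, ("O", "-"): 7,
    ("A", "+"): 34, ("A", "-"): 6,
    ("B", "+"): 9, ("B", "-"): 2,
    ("AB", "+"): 3, ("AB", "-"): 1,
}


def marginal(joint, axis):
    """Sum joint percentages over the other variable.

    axis=0 keeps blood type; axis=1 keeps Rh factor.
    """
    totals = {}
    for key, pct in joint.items():
        totals[key[axis]] = totals.get(key[axis], 0) + pct
    return totals


type_pct = marginal(joint_pct, 0)
rh_pct = marginal(joint_pct, 1)
print(type_pct)  # {'O': 45, 'A': 40, 'B': 11, 'AB': 4}
print(rh_pct)    # {'+': 84, '-': 16}
```

The blood type totals agree with Table 2.1, and according to these figures 84% of the population is Rh positive and 16% is Rh negative.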
Example 2.10
One of the goals of the 1989 Wisconsin Behavioral Risk Factor Surveillance System (BRFS) was to estimate the distribution of adults who count calories. The distribution of male and female adults in Wisconsin who count calories is given in Table 2.3. Based on the information in Table 2.3, the percentage of females who do not count calories is 69.6% and the percentage of males who do not count calories is 84.8%. Note that there are actually two distributions given in Table 2.3.
Table 2.3 The Distribution of Adults who Count Calories Based on the 1989 Wisconsin BRFS by Sex
Sex | % 1200 Calories or Less Per Day | % > 1200 Calories Per Day | % Do Not Count |
---|---|---|---|
Male | 4.6 | 10.6 | 84.8 |
Female | 19.0 | 11.5 | 69.6 |
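Each row of Table 2.3 is a distribution in its own right, one for males and one for females, so each row should total approximately 100%. A minimal Python sketch of that check follows; the dictionary keys and variable names are illustrative assumptions.

```python
# Row percentages copied from Table 2.3 (1989 Wisconsin BRFS).
calorie_pct = {
    "Male": {"1200 or less": 4.6, "> 1200": 10.6, "do not count": 84.8},
    "Female": {"1200 or less": 19.0, "> 1200": 11.5, "do not count": 69.6},
}

# Total of each row, and the percentage of each sex that counts calories at all.
row_totals = {sex: round(sum(row.values()), 1) for sex, row in calorie_pct.items()}
pct_counting = {
    sex: round(100 - row["do not count"], 1) for sex, row in calorie_pct.items()
}

print(row_totals)    # {'Male': 100.0, 'Female': 100.1}
print(pct_counting)  # {'Male': 15.2, 'Female': 30.4}
```

The female row totals 100.1% rather than exactly 100%, presumably because the published percentages are rounded to one decimal place.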
The distribution of a continuous quantitative variable is often modeled with a mathematical function called the probability density function. The probability density function explicitly describes the distribution of the values of the variable. A plot of the probability density function provides a graphical representation of the distribution of a variable, and the area under the curve between any two values of the variable corresponds to the percentage of the population falling between those two values. The height of the curve at a particular value of the variable measures the percentage per unit in the distribution at this point and is called the density of the population at this point. Regions where the values of the variable are more densely grouped appear in the graph of a probability density function as the places where the curve is tallest. Examples of the most common shapes of the distribution of a continuous variable are given in Figures 2.5–2.8.
Figure 2.5 An example of a mound-shaped distribution.
Figure 2.6 An example of a distribution with a long tail to the right.
Figure 2.7 An example of a distribution with a long tail to the left.
Figure 2.8 An example of a bimodal distribution.
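The link between area under a probability density function and percentage of the population can be illustrated numerically. The sketch below assumes a normal density with mean 0 and standard deviation 1 as a stand-in for a mound-shaped distribution; that choice, the function names, and the trapezoid-rule approximation are all assumptions made for illustration and do not come from the text.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density (height of the curve) of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def area_between(pdf, a, b, steps=10_000):
    """Approximate the area under pdf between a and b with the trapezoid rule."""
    width = (b - a) / steps
    inner = sum(pdf(a + i * width) for i in range(1, steps))
    return width * (0.5 * pdf(a) + inner + 0.5 * pdf(b))


# Hypothetical example: mean 0 and standard deviation 1 are illustrative values.
within_one_sd = area_between(normal_pdf, -1.0, 1.0)
nearly_all = area_between(normal_pdf, -8.0, 8.0)
print(f"{100 * within_one_sd:.1f}%")  # 68.3%
print(f"{100 * nearly_all:.1f}%")     # 100.0%
```

Under this assumed density, about 68.3% of the population falls within one standard deviation of the mean, and the area over a very wide interval is essentially 100%, matching the rule that the total percentage in a distribution is 100%.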
The value of the variable under a peak of a probability density graph is called a mode. A distribution can have more than one mode, and a distribution with more than one mode is called a multimodal distribution. When a distribution has two or more modes, this usually indicates that there are distinct subpopulations clustering around each mode. In this case, it is often more informative to have separate graphs of the probability distributions for analyzing each of the subpopulations.
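One simple way to locate modes from a tabulated distribution is to look for values whose height exceeds that of both neighbors. The Python sketch below shows this idea; the function name `find_modes` and the heights are hypothetical, invented purely for illustration, and are not data from the text.

```python
def find_modes(values, heights):
    """Return the values at which heights has a strict local peak."""
    modes = []
    for i in range(len(heights)):
        # Treat positions beyond either end as infinitely low.
        left = heights[i - 1] if i > 0 else float("-inf")
        right = heights[i + 1] if i < len(heights) - 1 else float("-inf")
        if heights[i] > left and heights[i] > right:
            modes.append(values[i])
    return modes


# Hypothetical heights, invented for illustration only (not data from the text).
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
heights = [2, 6, 11, 7, 4, 8, 13, 9, 3]
print(find_modes(values, heights))  # [3, 7]
```

This strict comparison ignores flat-topped peaks and would flag small bumps caused by sampling noise, so with real data the heights would usually need smoothing before peaks are interpreted as modes.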
Example 2.11
In studying obsessive compulsive disorder (OCD), the age at onset is an important variable that is believed to be related to the neurobiological features of OCD; OCD is classified as being either Child Onset OCD or Adult Onset OCD. In the article “Is age at symptom onset associated with severity of memory impairment in adults with obsessive-compulsive disorder?” published in the American Journal of Psychiatry (Henin et al., 2001), the authors reported the distribution of the age at onset of OCD given in Figure 2.9. Because there are two modes (peaks) in Figure 2.9, the distribution suggests that there might be two different distributions for the age at onset of OCD, one for children and one for adults. Because the clinical diagnoses are Child Onset OCD and Adult Onset OCD, it is more informative to study each of these subpopulations separately. Thus, the distribution of age at onset of OCD has been separated into distributions for the two classifications, Child Onset OCD and Adult Onset OCD, which are given in Figure 2.10.
Figure 2.9 Distribution of age at which OCD is diagnosed.
Figure 2.10 Distribution of the age at which OCD is diagnosed for Child Onset OCD and Adult Onset OCD.
The shape of the distribution of a discrete variable can also be described as long-tail right, mound shaped, long-tail left, or multimodal. For example, the 2005 National Health Interview Survey (NHIS) reports the distribution of family size, a discrete variable, and this distribution is given in Figure 2.11. Note that the distribution of family size according to the 2005 NHIS data is a long-tail right discrete distribution.
Figure 2.11 Distribution of family size according to the 2005 NHIS.