1.2 Population Distributions
While the methods of the preceding section are useful for describing and displaying sample data, the real power of statistics is revealed when we use samples to give us information about populations. In this context, a population is the entire collection of objects of interest, for example, the sale prices for all single‐family homes in the housing market represented by our dataset. We would like to know more about this population to help us make a decision about which home to buy, but the only data we have is a random sample of 30 sale prices.
Nevertheless, we can employ “statistical thinking” to draw inferences about the population of interest by analyzing the sample data. In particular, we use the notion of a model—a mathematical abstraction of the real world—which we fit to the sample data. If this model provides a reasonable fit to the data, that is, if it can approximate the manner in which the data vary, then we assume it can also approximate the behavior of the population. The model then provides the basis for making decisions about the population, by, for example, identifying patterns, explaining variation, and predicting future values. Of course, this process can work only if the sample data can be considered representative of the population. One way to address this is to randomly select the sample from the population. There are other more complex sampling methods that are used to select representative samples, and there are also ways to make adjustments to models to account for known nonrandom sampling. However, we do not consider these here—any good sampling textbook should cover these issues.
Sometimes, even when we know that a sample has not been selected randomly, we can still model it. Then, we may not be able to formally infer about a population from the sample, but we can still model the underlying structure of the sample. One example would be a convenience sample—a sample selected more for reasons of convenience than for its statistical properties. When modeling such samples, any results should be reported with a caution about restricting any conclusions to objects similar to those in the sample. Another kind of example is when the sample comprises the whole population. For example, we could model data for all 50 states of the United States of America to better understand any patterns or systematic associations among the states.
Since the real world can be extremely complicated (in the way that data values vary or interact together), models are useful because they simplify problems so that we can better understand them (and then make more effective decisions). On the one hand, we therefore need models to be simple enough that we can easily use them to make decisions, but on the other hand, we need models that are flexible enough to provide good approximations to complex situations. Fortunately, many statistical models have been developed over the years that provide an effective balance between these two criteria. One such model, which provides a good starting point for the more complicated models we consider later, is the normal distribution.
From a statistical perspective, a distribution (strictly speaking, a probability distribution) is a theoretical model that describes how a random variable varies. For our purposes, a random variable represents the data values of interest in the population, for example, the sale prices of all single‐family homes in our housing market. One way to represent the population distribution of data values is in a histogram, as described in Section 1.1. The difference now is that the histogram displays the whole population rather than just the sample. Since the population is so much larger than the sample, the bins of the histogram (the consecutive ranges of the data that comprise the horizontal intervals for the bars) can be much smaller than in Figure 1.1. For example, Figure 1.2 shows a histogram for a simulated population of sale prices. The scale of the vertical axis now represents proportions (density) rather than the counts (frequency) of Figure 1.1.
Figure 1.2 Histogram for a simulated population of sale prices, together with a normal density curve.
As the population size gets larger, we can imagine the histogram bars getting thinner and more numerous, until the histogram resembles a smooth curve rather than a series of steps. This smooth curve is called a density curve and can be thought of as the theoretical version of the population histogram. Density curves also provide a way to visualize probability distributions such as the normal distribution. A normal density curve is superimposed on Figure 1.2. The simulated population histogram follows the curve quite closely, which suggests that this simulated population distribution is quite close to normal.
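As a rough illustration of the kind of simulation behind Figure 1.2, the following Python sketch generates a large normal "population" of sale prices and overlays the corresponding normal density curve on a density-scaled histogram. The mean and standard deviation used here (280 and 50, in $ thousands) are illustrative assumptions, not values taken from the book's dataset, and the code itself is not from the book.

```python
# A minimal sketch: simulate a normal "population" of sale prices and overlay
# the corresponding normal density curve, in the spirit of Figure 1.2.
# The mean (280) and standard deviation (50), in $ thousands, are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(1)
population = rng.normal(loc=280, scale=50, size=100_000)  # simulated sale prices

# Histogram on the density scale (proportions), so the normal pdf is comparable.
plt.hist(population, bins=100, density=True, color="lightgray", edgecolor="gray")

x = np.linspace(population.min(), population.max(), 400)
plt.plot(x, norm.pdf(x, loc=280, scale=50), lw=2)  # theoretical density curve

plt.xlabel("Sale price ($ thousands)")
plt.ylabel("Density")
plt.show()
```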
To see how a theoretical distribution can prove useful for making statistical inferences about populations such as that in our home prices example, we need to look more closely at the normal distribution. To begin, we consider a particular version of the normal distribution, the standard normal, as represented by the density curve in Figure 1.3. Random variables that follow a standard normal distribution have a mean of 0 (represented in Figure 1.3 by the curve being symmetric about 0, which is under the highest point of the curve) and a standard deviation of 1 (represented in Figure 1.3 by the curve having a point of inflection—where the curve bends first one way and then the other—at −1 and +1). The normal density curve is sometimes called the “bell curve” since its shape resembles that of a bell. It is a slightly odd bell, however, since its sides never quite reach the ground (although the ends of the curve in Figure 1.3 are quite close to zero on the vertical axis, they would never actually quite reach there, even if the graph were extended a very long way on either side).
Figure 1.3 Standard normal density curve together with a shaded area of 0.475 between 0 and 1.96, which represents the probability that a standard normal random variable lies between 0 and 1.96.
The key feature of the normal density curve that allows us to make statistical inferences is that areas under the curve represent probabilities. The entire area under the curve is one, while the area under the curve between one point on the horizontal axis (a, say) and another point (b, say) represents the probability that a random variable that follows a standard normal distribution is between a and b. So, for example, Figure 1.3 shows there is a probability of 0.475 that a standard normal random variable lies between 0 and 1.96, since the area under the curve between 0 and 1.96 is 0.475.
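The shaded-area calculation in Figure 1.3 can be checked directly with statistical software. For example, the short Python sketch below uses scipy (an assumption about the reader's toolkit, not software referenced by the book) to compute the area between 0 and 1.96:

```python
# Probability that a standard normal random variable lies between 0 and 1.96,
# computed as the difference of two cumulative probabilities.
from scipy.stats import norm

prob = norm.cdf(1.96) - norm.cdf(0)
print(round(prob, 3))  # 0.475
```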
We can obtain values for these areas or probabilities from a variety of sources: tables of numbers, calculators, spreadsheet or statistical software, Internet websites, and so on. In this book, we print only a few select values since most of the later calculations use a generalization of the normal distribution called the “t‐distribution.” Also, rather than areas such as that shaded in Figure 1.3, it will become more useful to consider “tail areas” (e.g., the area to the right of a given point), and so for consistency with later tables of numbers, the following table allows calculation of such tail areas:
Normal distribution probabilities (tail areas) and percentiles (horizontal axis values)
Upper‐tail area | 0.1 | 0.05 | 0.025 | 0.01 | 0.005 | 0.001 |
Horizontal axis value | 1.282 | 1.645 | 1.960 | 2.326 | 2.576 | 3.090 |
Two‐tail area | 0.2 | 0.1 | 0.05 | 0.02 | 0.01 | 0.002 |
In particular, the upper‐tail area to the right of 1.960 is 0.025; this is equivalent to saying that the area between 0 and 1.960 is 0.475 (since the entire area under the curve is 1 and the area to the right of 0 is 0.5). Similarly, the two‐tail area, which is the sum of the areas to the right of 1.960 and to the left of −1.960, is two times 0.025, or 0.05.
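The table values can be reproduced with the inverse survival function of the standard normal distribution. The Python sketch below (again assuming scipy, which the book does not itself use) prints each horizontal axis value alongside its upper‐tail and two‐tail areas:

```python
# Reproducing the tail-area table: norm.isf gives the horizontal axis value
# with a given area to its right, and the two-tail area is twice the upper-tail area.
from scipy.stats import norm

for upper_tail in [0.1, 0.05, 0.025, 0.01, 0.005, 0.001]:
    z = norm.isf(upper_tail)  # value with upper_tail area to its right
    print(f"upper tail {upper_tail}:  z = {z:.3f},  two-tail area = {2 * upper_tail}")
```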
How does all this help us to make statistical inferences about populations such as that in our home prices example? The essential idea is that we fit a normal distribution model to our sample data and then use this model to make inferences about the corresponding population. For example, we can use probability calculations for a normal distribution (as shown in Figure 1.3) to make probability statements about a population modeled using that normal distribution—we will show exactly how to do this in Section 1.3. Before we do that, however, we pause to consider an aspect of this inferential sequence that can make or break the process. Does the model provide a close enough approximation to the pattern of sample values that we can be confident the model adequately represents the population values? The better the approximation, the more reliable our inferential statements will be.
We saw in Figure 1.2 how a density curve can be thought of as a histogram with a very large sample size. So one way to assess whether our population follows a normal distribution model is to construct a histogram from our sample data and visually determine whether it “looks normal,” that is, approximately symmetric and bell‐shaped. This is a somewhat subjective decision, but with experience you should find that it becomes easier to discern clearly nonnormal histograms from those that are reasonably normal. For example, while the histogram in Figure 1.2 clearly looks like a normal density curve, the normality of the histogram of 30 sample sale prices in Figure 1.1 is less certain. A reasonable conclusion in this case would be that while this sample histogram is not perfectly symmetric and bell‐shaped, it is close enough that the corresponding (hypothetical) population histogram could well be normal.
An alternative way to assess normality is to construct a QQ‐plot (quantile–quantile plot), also known as a normal probability plot, as shown in Figure 1.4 (see computer help #22 in the software information files available from the book website). If the points in the QQ‐plot lie close to the diagonal line, then the corresponding population values could well be normal. If the points generally lie far from the line, then normality is in question. Again, this is a somewhat subjective decision that becomes easier to make with experience. In this case, given the fairly small sample size, the points are probably close enough to the line that it is reasonable to conclude that the population values could be normal.
Figure 1.4 QQ‐plot for the home prices example.
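Most statistical packages can draw a QQ‐plot directly; the computer help referenced above gives the book's own instructions. As a rough illustration only, the Python sketch below uses scipy's probplot with placeholder data standing in for the 30 observed sale prices (note that scipy fits its reference line by least squares rather than through the quartiles):

```python
# A sketch of a QQ-plot similar to Figure 1.4, using placeholder data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder data: replace with the 30 observed sale prices from the dataset.
rng = np.random.default_rng(2)
price = rng.normal(loc=280, scale=50, size=30)

# probplot plots the ordered data against theoretical normal quantiles and adds
# a reference line; points close to the line suggest approximate normality.
stats.probplot(price, dist="norm", plot=plt)
plt.show()
```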
There are also a variety of quantitative methods for assessing normality—brief details and references are provided in Section 3.4.2.
Optional—technical details of QQ‐plots
For the purposes of this book, the technical details of QQ‐plots are not too important. For those who are curious, however, a brief description follows. First, calculate a set of equally spaced percentiles (quantiles) from a standard normal distribution. For example, if the sample size, n, is 9, then the calculated percentiles would be the 10th, 20th, …, 90th. Then construct a scatterplot with the observed data values ordered from low to high on the vertical axis and the calculated percentiles on the horizontal axis. If the two sets of values are similar (i.e., if the sample values closely follow a normal distribution), then the points will lie roughly along a straight line. To facilitate this assessment, a diagonal line that passes through the first and third quartiles is often added to the plot. The exact details of how a QQ‐plot is drawn can differ depending on the statistical software used (e.g., sometimes the axes are switched or the diagonal line is constructed differently).
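For readers who want to see these steps spelled out, the following Python sketch builds a QQ‐plot by hand as just described: equally spaced standard normal percentiles on the horizontal axis, the ordered data on the vertical axis, and a diagonal line through the first and third quartiles. The data are simulated placeholders with a sample size of 9; the code is an illustration, not the book's own.

```python
# Hand-rolled QQ-plot construction following the description above.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(3)
sample = rng.normal(loc=280, scale=50, size=9)  # placeholder data, n = 9

n = len(sample)
probs = np.arange(1, n + 1) / (n + 1)   # 10th, 20th, ..., 90th percentiles for n = 9
theoretical = norm.ppf(probs)           # standard normal percentiles
observed = np.sort(sample)              # data ordered from low to high

# Diagonal reference line through the first and third quartiles of each set.
q1_t, q3_t = np.quantile(theoretical, [0.25, 0.75])
q1_o, q3_o = np.quantile(observed, [0.25, 0.75])
slope = (q3_o - q1_o) / (q3_t - q1_t)
intercept = q1_o - slope * q1_t

plt.scatter(theoretical, observed)
plt.plot(theoretical, intercept + slope * theoretical)
plt.xlabel("Standard normal percentiles")
plt.ylabel("Ordered sample values")
plt.show()
```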