Applied Regression Modeling - Iain Pardoe

1.4.1 Central limit theorem—normal version

Suppose that a random sample of $n$ data values, represented by $Y_1, Y_2, \ldots, Y_n$, comes from a population that has a mean of $\mathrm{E}(Y)$ and a standard deviation of $\mathrm{SD}(Y)$. The sample mean, $m_Y$, is a pretty good estimate of the population mean, $\mathrm{E}(Y)$. This textbook uses $m_Y$ for the sample mean of $Y$ rather than the traditional $\bar{y}$ ("$y$-bar"), which, in the author's experience, is unfamiliar and awkward for many students. The very famous sampling distribution of this statistic derives from the central limit theorem. This theorem states that under very general conditions, the sample mean has an approximate normal distribution with mean $\mathrm{E}(Y)$ and standard deviation $\mathrm{SD}(Y)/\sqrt{n}$ (under repeated sampling). In other words, if we were to take a large number of random samples of $n$ data values and calculate the mean for each sample, the distribution of these sample means would be approximately normal with mean $\mathrm{E}(Y)$ and standard deviation $\mathrm{SD}(Y)/\sqrt{n}$. Since the mean of this sampling distribution is $\mathrm{E}(Y)$, $m_Y$ is an unbiased estimate of $\mathrm{E}(Y)$.

An amazing fact about the central limit theorem is that there is no need for the population itself to be normal (remember that we had to assume this for the calculations in Section 1.3). However, the more symmetric the distribution of the population, the better the normal approximation for the sampling distribution of the sample mean. Also, the approximation tends to be better the larger the sample size $n$.
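The effect of sample size can be checked with a small simulation. The sketch below is illustrative only: the exponential population (with mean 1 and standard deviation 1), the helper names, and the number of repetitions are assumptions chosen for the example and do not come from the text. It uses only the Python standard library.

```python
import math
import random
import statistics


def simulate_sample_means(draw, n, reps, seed=0):
    """Return `reps` sample means, each computed from a random sample of size n.

    `draw` takes a random.Random instance and returns one population value.
    """
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


def skewness(values):
    """Sample skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


def draw_exponential(rng):
    """Hypothetical skewed population: exponential with E(Y) = 1 and SD(Y) = 1."""
    return rng.expovariate(1.0)


for n in (5, 30, 100):
    means = simulate_sample_means(draw_exponential, n, reps=5000)
    print(
        f"n={n:3d}  mean={statistics.fmean(means):.3f}  "
        f"sd={statistics.stdev(means):.3f}  "
        f"theory sd={1 / math.sqrt(n):.3f}  "
        f"skewness={skewness(means):.3f}"
    )
```

In runs of this kind, the mean of the simulated sample means should stay close to 1, their standard deviation should track $1/\sqrt{n}$, and their skewness should shrink toward 0 (the value for a normal distribution) as $n$ grows.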

So, how can we use this information? The central limit theorem by itself will not help us to draw statistical inferences about the population without still having to make some restrictive assumptions. However, it is certainly a step in the right direction, so let us see what kind of calculations we can now make for the home prices example. As in Section 1.3, we will assume that $\mathrm{E}(Y) = 280$ and $\mathrm{SD}(Y) = 50$, but now we no longer need to assume that the population is normal. Imagine taking a large number of random samples of size 30 from this population and calculating the mean sale price for each sample. To get a better handle on the sampling distribution of these mean sale prices, we will find the 90th percentile of this sampling distribution. Let us do the calculation first, and then see why this might be a useful number to know.

First, we need to get some notation straight. In this section, we are not thinking about the specific sample mean, $m_Y$, that we got for our actual sample of 30 sale prices. Rather, we are imagining a list of potential sample means from a population distribution with mean 280 and standard deviation 50; we will call a potential sample mean in this list $M_Y$. From the central limit theorem, the sampling distribution of $M_Y$ is approximately normal with mean 280 and standard deviation $50/\sqrt{30} = 9.129$. Then the standardized $Z$-value from $M_Y$,

$$
Z = \frac{M_Y - \mathrm{E}(Y)}{\mathrm{SD}(Y)/\sqrt{n}} = \frac{M_Y - 280}{9.129},
$$

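is standard normal with mean 0 and standard deviation 1. From the normal table in Section 1.2, the 90th percentile of a standard normal random variable is 1.282 (since the horizontal axis value of 1.282 corresponds to an upper-tail area of 0.1). Then

$$
\begin{aligned}
\Pr(Z > 1.282) &= 0.10 \\
\Pr\!\left(\frac{M_Y - 280}{9.129} > 1.282\right) &= 0.10 \\
\Pr(M_Y > 1.282 \times 9.129 + 280) &= 0.10 \\
\Pr(M_Y > 291.703) &= 0.10.
\end{aligned}
$$
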
Thus, the 90th percentile of the sampling distribution of $M_Y$ is \$292,000 (to the nearest \$1,000). In other words, under repeated sampling, $M_Y$ has a distribution with an area of 0.90 to the left of \$292,000 (and an area of 0.10 to the right of \$292,000). This illustrates a crucial distinction between the distribution of population $Y$-values and the sampling distribution of $M_Y$: the latter is much less spread out. For example, suppose for the sake of argument that the population distribution of $Y$ is normal (although this is not actually required for the central limit theorem to work). Then we can do a similar calculation to the one above to find the 90th percentile of this distribution (normal with mean 280 and standard deviation 50). In particular,

$$
\begin{aligned}
\Pr(Z > 1.282) &= 0.10 \\
\Pr\!\left(\frac{Y - 280}{50} > 1.282\right) &= 0.10 \\
\Pr(Y > 1.282 \times 50 + 280) &= 0.10 \\
\Pr(Y > 344.100) &= 0.10.
\end{aligned}
$$

Thus, the 90th percentile of the population distribution of $Y$ is \$344,000 (to the nearest \$1,000). This is much larger than the value we got above for the 90th percentile of the sampling distribution of $M_Y$ (\$292,000). This is because the sampling distribution of $M_Y$ is less spread out than the population distribution of $Y$: the standard deviations for our example are 9.129 for the former and 50 for the latter. Figure 1.5 illustrates this point.
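Both percentile calculations can be reproduced with `statistics.NormalDist` from the Python standard library. This is a sketch rather than the book's own method; the variable names are illustrative, and the population is treated as normal only for the sake of the comparison, as in the text.

```python
from math import sqrt
from statistics import NormalDist

pop_mean, pop_sd, n = 280.0, 50.0, 30  # values assumed in the text

# Population distribution of Y (normal only for illustration).
population = NormalDist(mu=pop_mean, sigma=pop_sd)
# Sampling distribution of M_Y from the central limit theorem.
sampling = NormalDist(mu=pop_mean, sigma=pop_sd / sqrt(n))

z90 = NormalDist().inv_cdf(0.90)             # 90th percentile of Z
pct90_population = population.inv_cdf(0.90)  # 90th percentile of Y
pct90_sampling = sampling.inv_cdf(0.90)      # 90th percentile of M_Y

print(f"z (90th percentile)       = {z90:.4f}")
print(f"SD of sampling distr.     = {sampling.stdev:.3f}")
print(f"90th percentile of Y      = {pct90_population:.3f}")
print(f"90th percentile of M_Y    = {pct90_sampling:.3f}")
```

The results differ slightly in the later decimal places from the 344.100 and 291.703 obtained above, because the hand calculation rounds the standard normal percentile to 1.282 and the standard deviation to 9.129.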


Figure 1.5 The central limit theorem in action. The upper density curve (a) shows a normal population distribution for $Y$ with mean 280 and standard deviation 50: the shaded area is 0.10, which lies to the right of the 90th percentile, 344.100. The lower density curve (b) shows a normal sampling distribution for $M_Y$ with mean 280 and standard deviation 9.129: the shaded area is also 0.10, which lies to the right of the 90th percentile, 291.703. It is not necessary for the population distribution of $Y$ to be normal for the central limit theorem to work; we have used a normal population distribution here just for the sake of illustration.
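To see that a normal population is not required, one can simulate repeated samples of size 30 from a non-normal population with the same mean and standard deviation. The sketch below assumes a hypothetical, mildly right-skewed gamma population; that choice, the helper name, and the number of repetitions are illustrative and are not taken from the text.

```python
import random
import statistics

POP_MEAN, POP_SD, N = 280.0, 50.0, 30  # values assumed in the text

# Hypothetical right-skewed population: a gamma distribution whose shape and
# scale are chosen so that its mean is 280 and its standard deviation is 50.
SHAPE = (POP_MEAN / POP_SD) ** 2  # 31.36
SCALE = POP_SD ** 2 / POP_MEAN    # about 8.93


def simulate_sample_means(reps, seed=0):
    """Draw `reps` random samples of size N and return their sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gammavariate(SHAPE, SCALE) for _ in range(N))
        for _ in range(reps)
    ]


means = simulate_sample_means(reps=10_000)

# quantiles(..., n=10) returns the nine deciles; the last is the 90th percentile.
pct90 = statistics.quantiles(means, n=10)[-1]

print(f"mean of sample means        = {statistics.fmean(means):.2f}")
print(f"SD of sample means          = {statistics.stdev(means):.3f}")
print(f"90th percentile (simulated) = {pct90:.2f}")
```

If the central limit theorem approximation is working well here, the simulated mean, standard deviation, and 90th percentile should land near 280, 9.129, and 291.7, respectively, up to simulation noise.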

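We can again turn these calculations around. For example, what is the probability that $M_Y$ is greater than 291.703? To answer this, consider the following calculation:

$$
\begin{aligned}
\Pr(M_Y > 291.703) &= \Pr\!\left(\frac{M_Y - 280}{9.129} > \frac{291.703 - 280}{9.129}\right) \\
&= \Pr(Z > 1.282) \\
&= 0.10.
\end{aligned}
$$

So, the probability that $M_Y$ is greater than 291.703 is 0.10.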
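The same upper-tail probability can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library, with illustrative variable names; it standardizes the threshold and then reads the tail area from the normal cumulative distribution function instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist

# Sampling distribution of M_Y: mean 280, standard deviation 50 / sqrt(30).
sampling = NormalDist(mu=280.0, sigma=50.0 / sqrt(30))

threshold = 291.703
z = (threshold - sampling.mean) / sampling.stdev  # standardized Z-value
upper_tail = 1.0 - sampling.cdf(threshold)        # Pr(M_Y > threshold)

print(f"Z = {z:.3f}")
print(f"Pr(M_Y > {threshold}) = {upper_tail:.4f}")
```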
