Applied Univariate, Bivariate, and Multivariate Statistics Using Python, by Daniel J. Denis
1.4 Estimation of Parameters
As has undoubtedly become clear at this point, statistical inference is about estimating parameters. If we regularly dealt with population data, then we would have no need for estimation. We would already know the parameters of the population and could simply describe its features using, aptly named, descriptive statistics. Referring again to our COVID-19 example, if researchers actually knew the true proportion of those suffering from the virus in the population, then we would know the parameter and could simply describe it via the population proportion. However, as discussed, we rarely if ever know the true parameters, because our populations are usually quite large and in some cases, as with the coin, may actually be infinite in size. Hence, since we typically do not know the actual population parameters, we have to resort to inferential statistics to estimate them. That is, we compute something on a sample and use that as an estimate of the population parameter. It should be remarked that the distinction between descriptive and inferential statistics does not mean the two are mutually exclusive. When we compute a statistic on a sample, we can call it both a descriptive statistic and an inferential one, so long as we are using it for both purposes. Hence, we may compute the proportion of cases suffering from COVID-19 to be 0.01 in our sample and refer to that as a descriptive statistic because we are “describing the sample,” yet when we use that statistic to infer the true proportion in the population, refer to it as an inferential statistic. Hence, it is best not to get too stuck on the meaning of “descriptive” in this case. An inferential statistic, however, always implies that we are making an educated guess, or inference, toward the population.
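As a minimal sketch of the idea, the same number can serve both roles in Python; the sample size and case count below are hypothetical, chosen only so the proportion comes out to 0.01:

```python
# A sample proportion: descriptive when it describes the sample,
# inferential when used to estimate the population proportion.
# The counts here are hypothetical.
n = 5000            # sample size
cases = 50          # observed cases in the sample
p_hat = cases / n   # sample proportion
print(p_hat)        # 0.01
```

Nothing about the computation itself changes between the two uses; only the purpose to which the number is put.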
Estimation in statistics usually comes in one of two types. Point estimation involves estimating a precise value of the parameter. In the case of our COVID-19 example, the parameter we wish to estimate is the proportion of cases in the population suffering from the disease. If we obtain the value of 0.01 in the sample, for instance, we use this to estimate the true population proportion. You might think, at first glance, that point estimation is a great thing. However, we alluded earlier to why it can be problematic, and this is best shown by an example. Suppose you would like to catch a fish and resort to very primitive means of doing so. You obtain a long stick, sharpen the end of it into a spear, and attempt to throw the spear at a fish wallowing in shallow waters. The spear is relatively sharp, so if it “hits” the fish, you will secure the catch. However, even intuitively, without any formality, you know there is a problem with this approach. The problem is that catching the fish will be extremely difficult, because even with a good throw, the probability of the spear hitting the fish is extremely small. It might even be close to zero, and this is true even if you are a skilled fisherperson!
So, what is the solution? Build a net, of course! Instead of the spear, you cast a net at the fish, and intuitively, you widen your probability of catching it. This idea of widening the net is referred to in statistics as interval estimation. Instead of estimating with merely a point (the sharp spear), you widen the net in order to increase your chances of catching the fish. Though interval estimation is a fairly broad concept, in practice one of the most common interval estimators is the confidence interval. Hence, when we compute a confidence interval, we are estimating the value of the parameter, but with a wider margin than with a point estimator. Theoretically at least, the margin of error for a point estimator is equal to zero, because it allows for no “wiggle room” regarding the location of the parameter. So, what is a good margin of error? Just as 0.05 is often used as a default significance level, 95% confidence intervals are often used by default. A 95% confidence interval has a 0.95 probability of capturing (or “covering”) the true parameter. That is, if you took many samples and computed a confidence interval on each, 95% of those intervals would capture the parameter. If you computed 99% intervals instead, then 99% of them would capture it.
The following is a 95% confidence interval for the mean based on the z-distribution:
ȳ – 1.96σM < μ < ȳ + 1.96σM
where ȳ is the sample mean and σM is the standard error of the mean, which, when unpacked, is equal to σ/√n (we will discuss this later). Notice that it is the population mean, μ, that is at the center of the interval. However, μ is not the random variable here. Rather, the sample on which ȳ was computed is what is random. The population parameter μ in this case is assumed to be fixed. What the above confidence interval is saying, probabilistically, is the following:
Over all possible samples, the probability is 0.95 that the range between ȳ – 1.96σM and ȳ + 1.96σM will include the true mean, μ.
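As a sketch, the interval above is straightforward to compute directly. The sample mean, population standard deviation, and sample size below are hypothetical values chosen only for illustration:

```python
import math

# Hypothetical values: sample mean 100, known population sd 15, n = 36
y_bar, sigma, n = 100.0, 15.0, 36

se = sigma / math.sqrt(n)   # standard error of the mean, sigma / sqrt(n)
z = 1.96                    # critical z-value for 95% confidence

lower, upper = y_bar - z * se, y_bar + z * se
print(lower, upper)         # roughly 95.1 and 104.9
```

The width of the interval is driven entirely by the critical value and the standard error; with a larger n, the standard error shrinks and the interval narrows.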
Now, it may appear at first glance that increasing the confidence level will lead to a better estimate of the parameter. That is, it might seem that increasing the confidence level from 95% to 99%, for instance, might provide a more precise estimate. A 99% interval looks as follows:
ȳ – 2.58σM < μ < ȳ + 2.58σM
Notice that the critical values for z are more extreme (i.e. they are larger in absolute value) for the 99% interval than for the 95% one. But shouldn’t increasing the confidence from 95% to 99% help us “narrow in” on the parameter more sharply? At first, it seems like it should. However, this interpretation is misguided and is a prime example of how intuition can sometimes lead us astray. Increasing the level of confidence, all else equal, actually widens the interval, not narrows it. What if we wanted full confidence, 100%? The interval, in theory, would look as follows:
ȳ – ∞σM < μ < ȳ + ∞σM
That is, we are quite sure the true mean will fall between negative and positive infinity! A truly meaningless statement. The moral of the story is this: if you want more confidence, you are going to have to pay for it with a wider interval. Scientists usually use 95% intervals in most of their work, but there is definitely no mathematical principle that says this is the level one should use. For some problems, even a 90% interval might be appropriate. This again highlights the importance of understanding research principles, so that you can appreciate why a research paper features this or that level of confidence. Be critical (in a good way). Ask questions of the research paper.
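Both points, that roughly 95% of intervals computed over repeated samples cover the true mean, and that more confidence buys a wider interval, can be checked with a small simulation. The population mean, standard deviation, and sample size below are hypothetical:

```python
import math
import random

random.seed(1)
mu, sigma, n = 50.0, 10.0, 30    # hypothetical population and sample size
se = sigma / math.sqrt(n)        # standard error of the mean

# Draw many samples; count how often the 95% interval covers mu
trials, covered = 2000, 0
for _ in range(trials):
    y_bar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if y_bar - 1.96 * se <= mu <= y_bar + 1.96 * se:
        covered += 1

print(covered / trials)          # close to 0.95, as the coverage interpretation promises

# More confidence costs a wider interval: 99% (z = 2.58) vs. 95% (z = 1.96)
width_95, width_99 = 2 * 1.96 * se, 2 * 2.58 * se
print(width_99 > width_95)       # True
```

The coverage proportion will fluctuate around 0.95 from run to run, but it will not equal 1; only an infinitely wide interval achieves that, which is exactly the meaningless 100% interval described above.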