Applied Univariate, Bivariate, and Multivariate Statistics - Daniel J. Denis - Page 73
2.28.1 Null Hypothesis Significance Testing (NHST): A Legacy of Criticism
Criticisms targeted against null hypothesis significance testing have inundated the literature since at least 1938, when Berkson brought to light how statistical significance can easily be achieved by simple manipulations of sample size:
I believe that an observant statistician who has had any considerable experience with applying the chi‐square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P's tend to come out small. (p. 526)
Since Berkson, the best and most renowned methodologists have remarked that the significance test is subject to gross misunderstanding and misinterpretation (e.g., see Bakan, 1966; Carver, 1993; Cohen, 1990; Estes, 1997; Loftus, 1991; Meehl, 1978; Oakes, 1986; Shrout, 1997; Wilson, Miller, and Lower, 1967). And though it can be difficult to assess or evaluate whether the situation has improved, there is evidence to suggest that it has not. Few describe the problem better than Gigerenzer in his article Mindless statistics (Gigerenzer, 2004), in which he discusses both the roots and truths of hypothesis testing, as well as how its "statistical rituals" and practices have become far more a sociological phenomenon than anything related to good science and statistics.
Other researchers have found that misinterpretations and misunderstandings about the significance test are widespread not only among students but also among their instructors (Haller and Krauss, 2002). What determines statistical significance, and what is it a function of? This is an extremely important question. An unawareness of the determinants of statistical significance leaves the door open to misunderstanding and misinterpretation of the test, and the danger of drawing false conclusions based on its results. Too often and for too many, the finding "p < 0.05" simply denotes a "good thing" of sorts, without the reader ever being able to pinpoint what is so "good" about it.
Recall the familiar one‐sample z‐test for a mean discussed earlier:

zM = (x̄ − μ0) / (σ/√n)

where the purpose of the test is to compare an obtained sample mean x̄ to a population mean μ0 under the null hypothesis that μ = μ0. Recall that σ is the standard deviation of the population from which the sample was presumably drawn. In practice, this value is rarely if ever known for certain, which is why in most cases an estimate of it is obtained in the form of a sample standard deviation s. What determines the size of zM, and therefore the smallness of p? There are three inputs that determine the size of p, which we have already featured in our earlier discussion of statistical power. These three factors are x̄ − μ0, σ, and n. We consider each of these once more, then provide simple arithmetic demonstrations to emphasize how changing any one of these necessarily results in an arithmetical change in zM, and consequently, a change in the observed p‐value.
As a first case, consider the distance x̄ − μ0. Given constant values of σ and n, the greater the distance between x̄ and μ0, the larger zM will be. That is, as the numerator grows larger, the resulting zM also gets larger in size, which as a consequence decreases p in size. As a simple example, assume for a given research problem that σ is equal to 20 and n is equal to 100. This means that the standard error is equal to 20/√100, which is equal to 20/10 = 2. Suppose the obtained sample mean were equal to 20, and the mean under the null hypothesis, μ0, were equal to 18. The numerator of zM would thus be 20 – 18 = 2. When 2 is divided by the standard error of 2, we obtain a value for zM of 1.0, which is not statistically significant at p < 0.05.
Now, consider the scenario where the standard error of the mean remains the same at 2, but that instead of the sample mean being equal to 20, it is equal to 30. The difference between the sample mean and the population mean is thus 30 – 18 = 12. This difference represents a greater distance between means, and presumably, would be indicative of a more “successful” experiment or study. Dividing 12 by the standard error of 2 yields a zM value of 6.0, which is highly statistically significant at p < 0.05 (whether for a one‐ or two‐tailed test).
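The two scenarios above can be checked in a few lines of Python. This is an illustrative sketch, not taken from the text; the function name z_test is our own, and the two‐tailed p‐value is computed from the standard normal CDF via the error function in the standard library.

```python
from math import erf, sqrt

def z_test(xbar, mu0, sigma, n):
    """One-sample z statistic and its two-tailed p-value."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    # Standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# Same sigma (20) and n (100); only the sample mean changes.
z1, p1 = z_test(20, 18, 20, 100)  # z = 1.0, p ≈ 0.317 (not significant)
z2, p2 = z_test(30, 18, 20, 100)  # z = 6.0, p ≈ 2e-9 (highly significant)
```

Note that only the numerator x̄ − μ0 changed between the two calls, yet the p‐value moves from roughly 0.32 to well below any conventional significance level.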
Having the value of zM increase as a result of the distance between x̄ and μ0 increasing is of course what we would expect from a test statistic if that test statistic is to be used in any sense to evaluate the strength of the scientific evidence against the null. That is, if our obtained sample mean turns out to be very different from the population mean under the null hypothesis, μ0, we would hope that our test statistic would measure this effect, and allow us to reject the null hypothesis at some preset significance level (in our example, 0.05). If interpreting test statistics were always as easy as this, there would be no misunderstandings about the meaning of statistical significance and no misguided decisions to automatically attribute "worth" to the statement "p < 0.05." However, as we discuss in the following cases, there are other ways to make zM big or small that do not depend so intimately on the distance between x̄ and μ0, and this is where interpretations of the significance test usually go awry.
Consider the case now for which the distance between means, x̄ − μ0, is, as before, equal to 2.0 (i.e., 20 – 18 = 2.0). As noted, with a standard error also equal to 2.0, our computed value of zM came out to be 1.0, which was not statistically significant. However, is it possible to increase the size of zM without changing the observed distance between means? Absolutely. Consider what happens to the size of zM as we change the magnitude of either σ or n, or both. First, we consider how zM is defined in part as a function of σ. For convenience, we assume a sample size still of n = 100. Consider now three hypothetical values for σ: 2, 10, and 20. Performing the relevant computations, observe what happens to the size of zM in the case where σ = 2:

zM = (20 − 18) / (2/√100) = 2/0.2 = 10.0
The resulting value for zM is quite large at 10. Consider now what happens if we increase σ from 2 to 10:

zM = (20 − 18) / (10/√100) = 2/1.0 = 2.0
Notice that the value of zM has decreased from 10 to 2. Consider now what happens if we increase σ even more to a value of 20 as we had originally:

zM = (20 − 18) / (20/√100) = 2/2.0 = 1.0
When σ = 20, the value of zM is now equal to 1, which is no longer statistically significant at p < 0.05. Be sure to note that the distance between means has remained constant. In other words, and this is important, zM did not decrease in magnitude by altering the actual distance between the sample mean and the population mean, but rather decreased in magnitude only by a change in σ.
What this means is that given a constant distance between means x̄ − μ0, whether zM turns out statistically significant can be manipulated by changing the value of σ. Of course, a researcher would never arbitrarily manipulate σ directly. The way to decrease σ would be to sample from a population with less variability. The point is that decisions regarding whether a "positive" result occurred in an experiment or study should not be solely a function of whether one is sampling from a population with small or large variance!
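The effect of σ on zM can be demonstrated directly. The following is a minimal sketch (our own, not from the text), holding x̄ − μ0 = 2 and n = 100 fixed while σ varies over the three values considered above:

```python
from math import sqrt

xbar, mu0, n = 20, 18, 100

# Fixed distance between means (2.0) and fixed n; only sigma varies.
z_by_sigma = {sigma: (xbar - mu0) / (sigma / sqrt(n))
              for sigma in (2, 10, 20)}
# sigma = 2  -> zM = 10.0
# sigma = 10 -> zM = 2.0
# sigma = 20 -> zM = 1.0
```

The identical mean difference of 2.0 yields zM values ranging from clearly significant to nonsignificant, purely as a function of population variability.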
Suppose now we again assume the distance between means to be equal to 2. We again set the value of σ at 2. With these values set and assumed constant, consider what happens to zM as we increase the sample size n from 16 to 49 to 100. We first compute zM assuming a sample size of 16:

zM = (20 − 18) / (2/√16) = 2/0.5 = 4.0
With a sample size of 16, the computed value for zM is equal to 4. When we increase the sample size to 49, again keeping the distance between means constant, as well as the population standard deviation constant, we obtain:

zM = (20 − 18) / (2/√49) = 2/0.29 = 6.9
We see that the value of zM has increased from 4 to 6.9 as a result of the larger sample size. If we increase the sample size further, to 100, we get

zM = (20 − 18) / (2/√100) = 2/0.2 = 10.0
and see that as a result of the even larger sample size, the value of zM has increased once again, this time to 10. Again, we need to emphasize that the observed increase in zM is occurring not as a result of changing values for x̄ − μ0 or σ, as these values remained constant in our above computations. Rather, the magnitude of zM increased as a direct result of an increase in sample size, n, alone. In many research studies, the achievement of a statistically significant result may simply be indicative that the researcher gathered a minimally sufficient sample size that resulted in zM falling in the tail of the z distribution. In other cases, the failure to reject the null may in reality simply indicate that the investigator had insufficient sample size. The point is that unless one knows how n can directly increase or decrease the size of a p‐value, one cannot be in a position to understand, in a scientific sense, what the p‐value actually means, or intelligently evaluate the statistical evidence before them.
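The sample-size demonstration can likewise be sketched in Python (an illustration of our own, not from the text). Note one small rounding detail: carrying the unrounded standard error for n = 49 gives zM = 7.0, whereas rounding the standard error to 0.29 first, as in the worked arithmetic, gives 6.9.

```python
from math import sqrt

xbar, mu0, sigma = 20, 18, 2

# Fixed distance between means (2.0) and fixed sigma; only n varies.
z_by_n = {n: (xbar - mu0) / (sigma / sqrt(n))
          for n in (16, 49, 100)}
# n = 16  -> zM = 4.0
# n = 49  -> zM ≈ 7.0 (6.9 when the standard error is rounded to 0.29)
# n = 100 -> zM = 10.0
```

With nothing about the effect itself changing, the same two-point mean difference goes from marginal to overwhelming evidence against the null simply by collecting more observations.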