Applied Univariate, Bivariate, and Multivariate Statistics, by Daniel J. Denis

2.2 CHI‐SQUARE DISTRIBUTIONS AND GOODNESS‐OF‐FIT TEST


The chi‐square distribution is given by:

$$f(x) = \frac{1}{2^{v/2}\,\Gamma(v/2)}\, x^{v/2 - 1} e^{-x/2}$$

for x > 0, where v are the degrees of freedom and Γ is the gamma function. A chi‐square variable on n degrees of freedom also arises as the sum of squares of n independent, normally distributed z‐scores (Fisher, 1922b). That is,

$$\chi^2_n = z_1^2 + z_2^2 + \cdots + z_n^2 = \sum_{i=1}^{n} z_i^2$$

The chi‐square distribution plays an important role in mathematical statistics and is associated with a number of tests on model coefficients in a variety of statistical methods. The multivariate analog to the chi‐square distribution is the Wishart distribution (see Rencher, 1998, p. 53, for details).
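Both characterizations of the chi‐square distribution can be checked numerically in R. The following is a minimal sketch, assuming the usual form of the density with v degrees of freedom; the function name chisq_density, the evaluation points, the seed, and the simulation size are illustrative choices, not taken from the text.

```r
# Chi-square density written out from the formula, with v degrees of freedom.
# (chisq_density is a hypothetical helper name used only for this check.)
chisq_density <- function(x, v) {
  x^(v / 2 - 1) * exp(-x / 2) / (2^(v / 2) * gamma(v / 2))
}

x <- c(0.5, 1, 2, 5, 10)   # arbitrary evaluation points (x > 0)
v <- 3                     # arbitrary degrees of freedom

# Compare the hand-written formula with R's built-in dchisq().
formula_vs_builtin <- all.equal(chisq_density(x, v), dchisq(x, df = v))
formula_vs_builtin

# Simulate sums of n squared, independent standard normal z-scores.
set.seed(1)                # arbitrary seed, for reproducibility only
n <- 3
z <- matrix(rnorm(10000 * n), ncol = n)
sum_sq <- rowSums(z^2)

# A chi-square variable on n degrees of freedom has mean n and variance 2n,
# so the simulated values should land near 3 and 6.
c(mean = mean(sum_sq), variance = var(sum_sq))
```

The simulated mean and variance will vary slightly with the seed; the comparison against dchisq() does not depend on it.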

The chi‐square goodness‐of‐fit test is one such statistical method: it uses the chi‐square test statistic to evaluate the tenability of a null hypothesis. Recall that such a test is suitable for categorical data in which counts (i.e., instead of means, medians, etc.) are computed within each cell of the design. The goodness‐of‐fit test is given by

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ and $E_{ij}$ represent observed and expected frequencies, respectively, in the cell in row i and column j, summed across r rows and c columns.

Table 2.1 Contingency Table for 2 × 2 Design

                    Condition Present (1)   Condition Absent (0)   Total
Exposure yes (1)             20                      10              30
Exposure no (2)               5                      15              20
Total                        25                      25              50

As a simple example, consider the hypothetical data (Table 2.1), where the frequencies of those exposed to something adverse are related to whether a condition is present or absent. If you are a clinical psychologist, then you might define exposure as, perhaps, a variable such as combat exposure, and condition as posttraumatic stress disorder (if you are not a psychologist, see if you can come up with another example).

The null hypothesis is that the 50 counts making up the entire table are more or less randomly distributed across the cells. That is, there is no association between condition and exposure. We can easily test this hypothesis in SPSS by weighting the cases by the cell frequencies (freq):

exposure condition freq
1.00 0.00 10.00
1.00 1.00 20.00
2.00 0.00 15.00
2.00 1.00 5.00

WEIGHT BY freq.
CROSSTABS
  /TABLES=condition BY exposure
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT
  /COUNT ROUND CELL.

The output follows in which it is first confirmed that we set up our data file correctly:

Exposure * Condition Crosstabulation
Count
                      Condition
                    1.00    0.00    Total
Exposure   1.00       20      10       30
           2.00        5      15       20
Total                 25      25       50

We focus on the Pearson chi‐square test value of 8.3 on a single degree of freedom. It is statistically significant (p = 0.004), and hence we can reject the null hypothesis of no association between condition and exposure group.

Chi‐square Tests
                                Value      df   Asymp. Sig.    Exact Sig.     Exact Sig.
                                                (two‐sided)    (two‐sided)    (one‐sided)
Pearson chi‐square              8.333 (a)   1      0.004
Continuity correction (b)       6.750       1      0.009
Likelihood ratio                8.630       1      0.003
Fisher's exact test                                               0.009          0.004
Linear‐by‐linear association    8.167       1      0.004
No. of valid cases              50

a 0 cells (0.0%) have expected count less than 5. The minimum expected count is 10.00.

b Computed only for a 2 × 2 table.
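As a check on this output, the Pearson and continuity‐corrected values can be reproduced by hand from Table 2.1. This sketch assumes the usual expected counts under independence, (row total × column total)/n, a formula the text does not spell out, and takes the continuity correction to reduce each absolute difference |O − E| by 0.5 before squaring.

$$E_{11} = E_{12} = \frac{30 \times 25}{50} = 15, \qquad E_{21} = E_{22} = \frac{20 \times 25}{50} = 10$$

$$\chi^2 = \frac{(20 - 15)^2}{15} + \frac{(10 - 15)^2}{15} + \frac{(5 - 10)^2}{10} + \frac{(15 - 10)^2}{10} = \frac{25}{15} + \frac{25}{15} + \frac{25}{10} + \frac{25}{10} \approx 8.333$$

$$\chi^2_{\text{corrected}} = 2 \times \frac{(5 - 0.5)^2}{15} + 2 \times \frac{(5 - 0.5)^2}{10} = 2.70 + 4.05 = 6.75$$

The smallest expected count is 10, in agreement with footnote a, and the degrees of freedom are (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1.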

In R, we can easily perform the chi‐square test on these data. We first build the matrix of cell counts, calling it diag.table:

> diag.table <- matrix(c(20, 5, 10, 15), nrow = 2)
> diag.table
     [,1] [,2]
[1,]   20   10
[2,]    5   15
> chisq.test(diag.table, correct = F)

        Pearson's Chi-squared test

data:  diag.table
X-squared = 8.3333, df = 1, p-value = 0.003892
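The value reported by chisq.test() can also be rebuilt from the observed and expected counts. The following is a minimal sketch in base R, assuming expected counts of (row total × column total)/n under independence; the object names (observed, expected, pearson_stat, pearson_p) are illustrative and do not appear in the text.

```r
# Observed counts from Table 2.1
# (rows: exposure yes/no; columns: condition present/absent).
observed <- matrix(c(20, 5, 10, 15), nrow = 2)

# Expected counts under independence: (row total * column total) / n.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Pearson chi-square: sum over cells of (O - E)^2 / E.
pearson_stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom for an r x c table: (r - 1)(c - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability from the chi-square distribution.
pearson_p <- pchisq(pearson_stat, df = df, lower.tail = FALSE)

c(statistic = pearson_stat, df = df, p_value = pearson_p)
```

Printing expected also confirms that no cell has an expected count below five.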

We see that the result in R agrees with what we obtained in SPSS. Note that specifying correct = F (i.e., correct = FALSE) turns off what is known as Yates' correction for continuity, which involves subtracting 0.5 from positive differences O − E and adding 0.5 to negative differences O − E in an attempt to make the continuous chi‐square distribution better approximate the discrete (multinomial) distribution of the counts (i.e., in a crude sense, to help make discrete probabilities more continuous). To apply Yates' correction, we can either specify correct = T or simply call chisq.test(diag.table), which incorporates the correction by default. With the correction implemented, our p‐value increases from 0.004 to 0.009 (not shown). We notice that this adjustment parallels the continuity correction reported by SPSS. When expected counts per cell are relatively small (a working rule is that they should be at least five in each cell), one can also request Fisher's exact test (see Fisher, 1922a), which we note also mirrors the output generated by SPSS:

> fisher.test(diag.table)

        Fisher's Exact Test for Count Data

data:  diag.table
p-value = 0.008579
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  1.466377 26.597383
sample estimates:
odds ratio
  5.764989
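Both adjustments discussed above can be reproduced in base R. The sketch below is illustrative rather than definitive: it assumes the continuity correction shrinks each |O − E| by 0.5, and it assumes the two‐sided exact p‐value is the total hypergeometric probability of all tables no more likely than the observed one, with row and column totals held fixed. All object names are hypothetical.

```r
observed <- matrix(c(20, 5, 10, 15), nrow = 2)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# --- Yates' continuity correction by hand ---
# Every |O - E| here is 5, so shrinking by 0.5 is unambiguous.
yates_stat <- sum((abs(observed - expected) - 0.5)^2 / expected)
yates_p <- pchisq(yates_stat, df = 1, lower.tail = FALSE)
builtin_yates <- chisq.test(observed, correct = TRUE)

# --- Fisher's exact test by hand ---
row1 <- sum(observed[1, ])   # 30 exposed
row2 <- sum(observed[2, ])   # 20 not exposed
col1 <- sum(observed[, 1])   # 25 with the condition present

# With all margins fixed, the top-left count is hypergeometric.
support <- max(0, col1 - row2):min(col1, row1)
probs <- dhyper(support, m = row1, n = row2, k = col1)
p_obs <- dhyper(observed[1, 1], m = row1, n = row2, k = col1)

# Sum the probabilities of tables no more likely than the observed one;
# the small tolerance guards against floating-point ties.
fisher_p <- sum(probs[probs <= p_obs * (1 + 1e-7)])

# Sample (cross-product) odds ratio; fisher.test() instead reports a
# conditional maximum likelihood estimate, so the two values differ.
sample_or <- (observed[1, 1] * observed[2, 2]) /
  (observed[1, 2] * observed[2, 1])

c(yates_stat = yates_stat, yates_p = yates_p,
  fisher_p = fisher_p, sample_or = sample_or)
```

The sample odds ratio computed this way is 6, whereas fisher.test() reports 5.764989 because it estimates the odds ratio conditionally on the margins.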

Other useful statistics for contingency tables include the phi coefficient and Cramer's V. Phi, ϕ, is a measure of association for 2 × 2 contingency tables, computed as

$$\phi = \sqrt{\frac{\chi^2}{n}}$$

where χ² is the chi‐square statistic calculated on the 2 × 2 table, and n is the total sample size. The maximum ϕ can attain is 1.0, indicating maximal association. ϕ can be computed in SPSS by /statistics = phi and is available in R in the psych package (Revelle, 2015). Cramer's ϕc (Cramer's V) extends ϕ in that it allows for contingency tables larger than 2 × 2. It is included in the /statistics = phi command and also available in R's psych package. It is given by:

$$\phi_c = \sqrt{\frac{\chi^2}{n(k - 1)}}$$

where k is the smaller of the number of rows and the number of columns. The relationship between ϕc and ϕ is easily shown for k = 2:

$$\phi_c = \sqrt{\frac{\chi^2}{n(2 - 1)}} = \sqrt{\frac{\chi^2}{n}} = \phi$$

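For the data in Table 2.1, both coefficients can be computed directly from the uncorrected chi‐square statistic. The following is a minimal sketch using base R only rather than the psych package, assuming the definitions ϕ = sqrt(χ²/n) and ϕc = sqrt(χ²/(n(k − 1))); the object names are illustrative.

```r
observed <- matrix(c(20, 5, 10, 15), nrow = 2)

# Uncorrected Pearson chi-square, as used in the text.
chi_sq <- unname(chisq.test(observed, correct = FALSE)$statistic)
n <- sum(observed)

# Phi for a 2 x 2 table.
phi <- sqrt(chi_sq / n)

# Cramer's coefficient, with k the smaller of the number of rows and columns.
k <- min(dim(observed))
cramers_v <- sqrt(chi_sq / (n * (k - 1)))

# For a 2 x 2 table (k = 2) the two coefficients coincide.
c(phi = phi, cramers_v = cramers_v)
```

Here both values equal sqrt(8.3333/50), roughly 0.41, which illustrates the k = 2 relationship.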