Design and Analysis of Experiments by Douglas Montgomery - Heath Rushing
Contents
Simple Comparative Experiments
Section 2.2 Basic Statistical Concepts
Section 2.4.1 Hypothesis Testing
Section 2.4.3 Choice of Sample Size
Section 2.5.1 The Paired Comparison Problem
Section 2.5.2 Advantages of the Paired Comparison Design
The problem of testing the effect of a single experimental factor with only two levels provides a useful introduction to the statistical techniques that will later be generalized for the analysis of more complex experimental designs. In this chapter, we develop techniques that will allow us to determine the level of statistical significance associated with the difference in the mean responses of two treatment levels. Rather than only considering the difference between the mean responses across the treatments, we also consider the variation in the responses and the number of runs performed in the experiment. Using a t-test, we quantify (via a p-value) how consistent the observed treatment effect is with chance variation alone. A “small” p-value (typically taken to be one smaller than α = 0.05) suggests that the observed data are not likely to have occurred if the null hypothesis (of no treatment effect) were true.
A related question involves the likelihood that the null hypothesis is rejected given that it is false (the power of the test). Given a fixed significance level, α (our definition of what constitutes a “small” p-value), theorized values for the pooled standard deviation, and a minimum threshold difference in treatment means, it is possible to solve for the minimum sample size that is necessary to achieve a desired power. This procedure is useful for determining the number of runs that must be included in a designed experiment.
In the first example presented in this chapter, a scientist has developed a modified cement mortar formulation that has a shorter cure time than the unmodified formulation. The scientist would like to test if the modification has affected the bond strength of the mortar. To study whether the two formulations, on average, produce bonds of different strengths, a two-sided t-test is used to analyze the observations from a randomized experiment with 10 measurements from each formulation. The null hypothesis of this test is that the mean bond strengths produced by the two formulations are equal; the alternative hypothesis is that mean bond strengths are not equal.
We also consider the advantages of a paired t-test, which provides an introduction to the notion of blocking. This test is demonstrated using data from an experiment to test for similar performance of two different tips that are placed on a rod in a machine and pressed into metal test coupons. A fixed pressure is applied to the tip, and the depth of the resulting depression is measured. A completely randomized design would apply the tips in a random order to the test coupons (making only one measurement on each coupon). While this design would produce valid results, the power of the test could be increased by removing noise from the coupon-to-coupon variation. This may be achieved by applying both tips to each coupon (in a random order) and measuring the difference in the depth of the depressions. A one-sample t-test is then used for the null hypothesis that the mean difference across the coupons is equal to 0. This procedure reduces experimental error by eliminating a noise factor.
This chapter also includes an example of procedures for testing the equality of treatment variances, and a demonstration of the t-test in the presence of potentially unequal group variances. This final test is still valid when the group variances are equal, but it is not as powerful as the pooled t-test in such situations.
Section 2.2 Basic Statistical Concepts
1. Open Tension-Bond.jmp.
2. Select Analyze > Distribution.
3. Select Strength for Y, Columns.
4. Select Mortar for By. As we will see in later chapters, these fields will be automatically populated for data tables that were created in JMP.
5. Click OK.
6. Click the red triangle next to Distributions Mortar=Modified and select Uniform Scaling.
7. Repeat step 6 for Distributions Mortar=Unmodified.
8. Click the red triangle next to Distributions Mortar=Modified and select Stack.
9. Repeat step 8 for Distributions Mortar=Unmodified.
10. Hold down the Ctrl key and click the red triangle next to Strength. Select Histogram Options > Show Counts. Holding down Ctrl applies the command to all of the histograms created by the Distribution platform; it essentially “broadcasts” the command.
It appears from the overlapped histograms that the unmodified mortar tends to produce stronger bonds than the modified mortar. The unmodified mortar has a mean strength of 17.04 kgf/cm2 with a standard deviation of 0.25 kgf/cm2. The modified mortar has a mean strength of 16.76 kgf/cm2 with a standard deviation of 0.32 kgf/cm2. A naïve comparison of mean strength indicates that the unmodified mortar outperforms the modified mortar. However, the difference in means could simply be a result of sampling fluctuation. Using statistical theory, our goal is to incorporate the sample standard deviations (and sample sizes) to quantify how likely it is that the difference in mean strengths is due only to sampling error. If it turns out to be unlikely, we will conclude that a true difference exists between the mortar strengths.
11. Select Analyze > Fit Y by X.
12. Select Strength for Y, Response and Mortar for X, Grouping.
The Fit Y by X platform recognizes this as a one-way ANOVA since the response, Strength, is continuous and the factor, Mortar, is nominal. When JMP is used to create experimental designs, it assigns the appropriate variable type to each column. For imported data, JMP assigns a modeling type (continuous, ordinal, or nominal) to each variable based on attributes of that variable. A different modeling type may be specified by right-clicking the modeling type icon next to a column name and selecting the new type.
13. Click OK.
14. To create box plots, click the red triangle next to One-way Analysis of Strength by Mortar and select Quantiles.
The median modified mortar strength (represented by the line in the middle of the box) is lower than the median unmodified mortar strength. The similar length of the two boxes (representing the interquartile ranges) indicates that the two mortar formulations result in approximately the same variability in strength.
15. Keep the Fit Y by X platform open for the next exercise.
Section 2.4.1 Hypothesis Testing
1. Return to the Fit Y by X platform from the previous exercise.
2. Click the red triangle next to One-way Analysis of Strength by Mortar and select Means/Anova/Pooled t.
The t-test report shows the two-sample t-test assuming equal variances. Since we have a two-sided alternative hypothesis, we are concerned with the p-value labeled Prob > |t|= 0.0422. Since we have set α=0.05, we reject the null hypothesis that the mean strengths produced by the two formulations of mortar are equal and conclude that the mean strength of the modified mortar and the mean strength of the unmodified mortar are (statistically) significantly different. In practice, our next step would be to decide from a subject-matter perspective if the difference is practically significant.
Before accepting the conclusion of the t-test, we should use diagnostics to check the validity of assumptions made by the model. Although this step is not shown for every example in the text, it is an essential part of every analysis. For example, a quantile plot may be used to check the assumptions of normality and identical population variances. Though not shown here, a plot of the residuals against run order could help identify potential violations of the assumed independence across runs (the most important of the three assumptions).
3. Click the red triangle next to One-way Analysis of Strength by Mortar and select Normal Quantile Plot > Plot Quantile by Actual.
The points fall reasonably close to straight lines in the plot, suggesting that the assumption of normality is reasonable. The slopes of the lines are proportional to the standard deviations in each comparison group. These slopes appear to be similar, supporting the decision to assume equal population variances.
4. Select Window > Close All.
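For readers who want to verify the numbers outside of JMP, the pooled two-sample t statistic can be reproduced directly from the summary statistics reported in Section 2.2. The following is a sketch in Python; JMP computes the same quantity from the raw Tension-Bond data.

```python
# Pooled two-sample t statistic computed from summary statistics.
# The means and standard deviations below are those reported in Section 2.2
# for the modified and unmodified mortar (n = 10 per group).
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return the pooled t statistic and its degrees of freedom."""
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)  # pooled variance
    t = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t, df = pooled_t(16.76, 0.32, 10, 17.04, 0.25, 10)  # modified vs. unmodified
print(round(t, 2), df)  # about -2.18 on 18 df, consistent with Prob > |t| = 0.0422
```

Working from summary statistics rounded to two decimal places introduces a small discrepancy relative to the value JMP computes from the raw data.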
Section 2.4.3 Choice of Sample Size
1. To determine the necessary sample size for a proposed experiment, select DOE > Sample Size and Power.
2. Click Two Sample Means.
3. Enter 0.25 for Std Dev, 0.5 for Difference to detect, and 0.95 in Power. Notice that the Difference to detect requested here is the actual difference between group means, not the scaled difference, δ, described in the textbook.
4. Click Continue. A value of 16 then appears in Sample Size. Thus, we should allocate 8 observations to each treatment (n1 = n2 = 8).
5. Suppose we use a sample size of n1 = n2 = 10. What is the power for detecting a difference of 0.25 kgf/cm2? Delete the value 0.95 from the Power field, change Difference to detect to 0.25, and set Sample Size to 20.
6. Click Continue.
The power has dropped to 0.56. That is, if the model assumptions hold and the true pooled standard deviation is 0.25, only 56% of the experiments (assuming that we repeat this experiment several times) with 10 measurements from each group would successfully detect the difference of 0.25 kgf/cm2. What sample size would be necessary to achieve a power of 0.9 for this specific difference to detect?
7. Clear the Sample Size field and enter 0.9 for Power.
8. Click Continue.
The required total sample size is 45. This means that we need at least 22.5 observations per group. Rounding up, we see that we need at least 23 observations from each group to achieve a power of at least 0.9. We could have left the Power field blank, specifying only that the Difference to detect is 0.25. The Sample Size and Power platform would then have produced a power curve, displaying Power as a function of Sample Size.
9. Select Window > Close All.
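The calculations behind the Sample Size and Power platform can be approximated with nothing more than the normal distribution. The sketch below uses a standard normal approximation rather than the exact noncentral-t computation that JMP performs, so its answers are slightly optimistic: 7 rather than 8 observations per group, and a power of about 0.61 rather than 0.56.

```python
# Normal-approximation versions of JMP's Sample Size and Power calculations
# for a two-sided, two-sample comparison of means. JMP uses the exact
# noncentral-t distribution, so these stdlib-only figures run a bit high.
import math
from statistics import NormalDist

def approx_power(diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test."""
    z = NormalDist()
    ncp = diff / (sd * (2 / n_per_group) ** 0.5)  # noncentrality parameter
    z_crit = z.inv_cdf(1 - alpha / 2)
    return 1 - z.cdf(z_crit - ncp)  # neglects the tiny opposite-tail probability

def approx_n_per_group(diff, sd, power, alpha=0.05):
    """Approximate per-group sample size needed for the desired power."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil(2 * (z_sum * sd / diff) ** 2)

print(approx_n_per_group(0.5, 0.25, 0.95))     # 7 per group; JMP's exact answer is 8
print(round(approx_power(0.25, 0.25, 10), 2))  # about 0.61; JMP's exact answer is 0.56
```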
Example 2.1 Hypothesis Testing
1. Open Fluorescence.jmp.
2. Click Analyze > Fit Y by X.
3. Select Fluorescence for Y, Response and Tissue for X, Factor.
4. Click OK.
5. Click the red triangle next to One-way Analysis of Fluorescence by Tissue and select Normal Quantile Plot > Plot Quantile by Actual.
Since the slopes of the lines in the normal quantile plots are proportional to the sample standard deviations of the treatments, the difference between the slopes of the lines for Muscle and Nerve indicates that the variances may be different between the groups. As a result, we will use a form of the t-test that does not assume that the population variances are equal. Formal testing for the equality of the treatment variances is illustrated in Example 2.3 at the end of this chapter.
6. Click the red triangle next to One-way Analysis of Fluorescence by Tissue and select t-Test.
The p-value for the one-sided hypothesis test is 0.0073, which is less than the set α of 0.05. We therefore reject the null hypothesis and conclude that the mean normalized fluorescence for nerve tissue is greater than the mean normalized fluorescence for muscle tissue. Subject matter knowledge would need to determine if there is a practical difference; confidence intervals for the differences (reported in JMP) can be beneficial for this assessment.
7. Select Window > Close All.
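The unequal-variance t statistic selected by the t-Test option is Welch's statistic, paired with Satterthwaite's approximation for the degrees of freedom. It can be sketched as follows; the summary statistics in the example call are hypothetical, not the Fluorescence data, and serve only to show the form of the computation.

```python
# Welch's two-sample t statistic, which does not assume equal population
# variances, with Satterthwaite's approximate degrees of freedom.
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    v1, v2 = sd1**2 / n1, sd2**2 / n2          # per-group variance of the mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Hypothetical summary statistics (NOT the Fluorescence data):
t_w, df_w = welch_t(7.5, 2.0, 12, 5.0, 0.8, 12)
print(round(t_w, 2), round(df_w, 1))  # roughly t = 4.02 on 14.4 df
```

Note that the degrees of freedom are generally fractional and smaller than the pooled value n1 + n2 - 2, which is the price paid for dropping the equal-variance assumption.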
Section 2.5.1 The Paired Comparison Problem
1. Open Hardness-Testing.jmp.
2. Select Analyze > Matched Pairs.
3. Select Tip 1 and Tip 2 for Y, Paired Response.
4. Click OK.
The p-value, Prob > |t| = 0.7976, is much larger than the standard significance level of α = 0.05, indicating that there is no evidence of a difference in the performance of the two tips.
5. Leave Hardness-Testing.jmp open for the next exercise.
Section 2.5.2 Advantages of the Paired Comparison Design
1. Return to the Hardness-Testing table opened in the previous example.
2. Select Tables > Stack. This will create a file in long format with one observation per row. Most JMP platforms expect data to appear in long format.
3. Select Tip 1 and Tip 2 for Stack Columns.
4. Type “Depth” in the Stacked Data Column field.
5. Type “Tip” in the Source Label Column field.
6. Type “Hardness-Stacked” in the Output table name field.
7. Click OK.
8. Hardness-Stacked is now the current data table. Select Analyze > Fit Y by X.
9. Select Depth for Y, Response and Tip for X, Grouping.
10. Click OK.
11. Click the red triangle next to One-way Analysis of Depth by Tip and select Means/Anova/Pooled t.
The root mean square error of 2.315407 is the pooled standard deviation estimate from the t-test. Compared to the standard deviation estimate of 1.20 from the paired difference test, we see that blocking has reduced the estimate of variability considerably. Though we do not work through the details here, it would be possible to perform this same comparison for the Fluorescence data from Example 2.1.
12. Leave Hardness-Stacked.jmp and the Fit Y by X output window open for the next exercise.
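A small numerical sketch (with made-up numbers, not the Hardness-Testing data) shows why pairing reduces the error estimate: each coupon contributes its own offset to both measurements, and that offset cancels when the two depths on the same coupon are differenced.

```python
# Why blocking on coupons helps: hypothetical data in which coupon-to-coupon
# variation dominates the small measurement error on each tip.
from statistics import stdev

coupon_effect = [50, 42, 57, 45, 61, 48, 53, 39]  # hypothetical coupon offsets
tip1 = [c + 7.0 + e for c, e in zip(coupon_effect, [0.4, -0.2, 0.1, -0.5, 0.3, 0.0, -0.1, 0.2])]
tip2 = [c + 7.1 + e for c, e in zip(coupon_effect, [-0.3, 0.1, 0.2, -0.1, -0.4, 0.3, 0.0, -0.2])]

diffs = [a - b for a, b in zip(tip1, tip2)]  # paired differences, coupon effects cancel
print(stdev(tip1), stdev(tip2))  # large: dominated by coupon variation
print(stdev(diffs))              # small: only measurement error remains
```

The unpaired analysis must treat the coupon variation as noise, while the paired analysis works only with the within-coupon differences, mirroring the drop from 2.315407 to 1.20 seen above.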
Example 2.3 Testing for the Equality of Variances
This example demonstrates how to test for the equality of two population variances. Section 2.6 of the textbook also discusses hypothesis testing for whether the variance of a single population is equal to a given constant. Though not shown here, the test for a single variance may be performed in the Distribution platform.
1. Return to the Fit Y by X platform from the previous example.
2. Click the red triangle next to One-way Analysis of Depth by Tip and select Unequal Variances.
3. Save Hardness-Stacked.jmp.
The p-value for the F test (described in the textbook) for the null hypothesis of equal variances (with a two-sided alternative hypothesis) is 0.8393. The data do not indicate a difference between the variances of the depths produced by Tip 1 and Tip 2. Due to the use of a slightly different data set, the F Ratio of 1.1492 reported here differs from the ratio of 1.34 that appears in the book. Furthermore, the textbook uses a one-sided test, with the alternative hypothesis that the variance of the depth produced by Tip 1 is greater than that produced by Tip 2. Since the sample standard deviation from Tip 1 is greater than that from Tip 2, the F Ratios for the one- and two-sided tests are both equal to 1.1492, but the p-value for the one-sided test would be 0.4197.
It is important to remember that the F test is extremely sensitive to the assumption of normality. If the population has heavier tails than a normal distribution, this test will reject the null hypothesis (that the population variances are equal) more often than it should. By contrast, the Levene test is robust to departures from normality.
4. Select Window > Close All.
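For reference, the F statistic behind the two-sided test is simply the ratio of the two sample variances, conventionally with the larger variance in the numerator. The standard deviations in the example call below are illustrative values, not those from Hardness-Stacked.jmp; computing the p-value additionally requires the F distribution with n1 - 1 and n2 - 1 degrees of freedom, which JMP supplies.

```python
# F statistic for testing H0: the two population variances are equal.
def f_ratio(sd1, sd2):
    v1, v2 = sd1**2, sd2**2
    return max(v1, v2) / min(v1, v2)  # two-sided convention: larger variance on top

print(round(f_ratio(1.3, 1.2), 4))  # 1.1736 for these illustrative values
```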