SAS Statistics by Example - Ron Cody EdD
Chapter 2 Descriptive Statistics – Continuous Variables
Contents
Computing Descriptive Statistics Using PROC MEANS
Descriptive Statistics Broken Down by a Classification Variable
Computing a 95% Confidence Interval and the Standard Error
Producing Descriptive Statistics, Histograms, and Probability Plots
Changing the Midpoint Values on the Histogram
Generating a Variety of Graphical Displays of Your Data
Displaying Multiple Box Plots for Each Value of a Categorical Variable
Introduction
One of the first steps in any statistical analysis is to calculate some basic descriptive statistics on the variables of interest. SAS has a number of procedures that provide tabular as well as graphical displays of your data.
To demonstrate some of the ways that SAS can produce descriptive statistics, use a data set called Blood_Pressure. This data set contains the variables Subj (ID value for each subject), Drug (with values of Placebo, Drug A, or Drug B), SBP (systolic blood pressure), DBP (diastolic blood pressure), and Gender (with values of M or F). Here is a listing of the first 25 observations from this data set:
Notice that some of the observations contain missing values, represented by periods for numeric values and blanks for character values.
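A listing like the one described above can be produced with PROC PRINT and the OBS= data set option. This is only a sketch (it is not necessarily how the listing in the book was generated), and it assumes that the data set is stored in the folder named in the LIBNAME statement:

```sas
/* Assumption: Blood_Pressure.sas7bdat is stored in this folder */
libname example 'c:\books\statistics by example';

title "First 25 Observations of Blood_Pressure";
/* OBS=25 stops reading after the 25th observation */
proc print data=example.Blood_Pressure(obs=25);
run;
```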
Computing Descriptive Statistics Using PROC MEANS
One way to compute means and standard deviations is to use PROC MEANS. Here is a program to compute some basic descriptive statistics on the two variables SBP and DBP:
Program 2.1: Generating Descriptive Statistics with PROC MEANS
libname example 'c:\books\statistics by example';
title "Descriptive Statistics for SBP and DBP";
proc means data=example.Blood_Pressure n nmiss mean std median maxdec=3;
   var SBP DBP;
run;
Because the Blood_Pressure data set is a permanent SAS data set (when it was created, it was placed in a folder on a disk drive instead of in a temporary SAS folder that disappears when you end your SAS session), you need a LIBNAME statement to tell SAS where to find the data set. In this example, the data set is located in the c:\books\statistics by example folder. Remember that SAS data set names contain two parts: the part before the period is a library reference (libref for short) that tells SAS where to find the data set, and the part after the period is the actual data set name (in this case, Blood_Pressure). If you were to use your operating system to list the contents of the c:\books\statistics by example folder, you would see a file called:
Blood_Pressure.sas7bdat
This file is the actual SAS data set and contains both the descriptor portion and the individual observations. The extension sas7bdat indicates that the data set is compatible with SAS 7 and later. This file is not a text file, and you cannot view it using a word processor or other Windows programs.
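Although you cannot open the file in a word processor, you can ask SAS to display the descriptor portion. One way to do this, shown here as a minimal sketch, is to run PROC CONTENTS on the two-part data set name:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Descriptor Portion of Blood_Pressure";
/* Lists variable names, types, lengths, and other data set attributes */
proc contents data=example.Blood_Pressure;
run;
```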
The TITLE statement causes SAS to print the title at the top of every page of output until you change the title or turn off all titles. In this program, the title is placed in double quotes. You can also use single quotation marks (as long as there are no single quotation marks in the title) or, for that matter, no quotation marks at all (SAS is smart enough to realize that the text following a TITLE statement is the title text).
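The following statements sketch the quoting choices just described. The title text in the third statement is made up for illustration. Each TITLE statement replaces the previous one, so you would normally use only one of them:

```sas
/* Double quotation marks */
title "Descriptive Statistics for SBP and DBP";

/* Single quotation marks (the text itself contains none) */
title 'Descriptive Statistics for SBP and DBP';

/* An apostrophe in the text is fine inside double quotation marks */
title "Cody's Blood Pressure Data";

/* A TITLE statement with no text turns off all titles */
title;
```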
PROC MEANS is a popular SAS procedure that produces a number of useful statistics. In this program, the keyword DATA= tells the procedure that you want to produce descriptive statistics on the Blood_Pressure data set.
You can control what statistics this procedure produces by using procedure options. These options are placed between the procedure name and the semicolon ending the statement, and you can place them in any order. If you omit these options, PROC MEANS will, by default, print the number of nonmissing observations, the mean, standard deviation, the minimum value, and the maximum value.
The first two options in this program, N and NMISS, cause the number of nonmissing and missing values for each variable to be reported. The next three options, MEAN, STD, and MEDIAN, request the mean, standard deviation, and the median to be computed. The last option, MAXDEC=n, specifies how many digits to the right of the decimal point you want in your report. In this program, you are requesting that all the statistics be reported to three decimal places.
The following list describes some of the more useful options:
Option | Description
n | Number of nonmissing observations
nmiss | Number of observations with missing values
mean | Arithmetic mean
std | Standard deviation
stderr | Standard error
min | Minimum value
max | Maximum value
median | Median
maxdec= | Maximum number of decimal places to display
clm | 95% confidence limit on the mean
cv | Coefficient of variation
The VAR statement tells the procedure which variables you want to analyze. If you omit a VAR statement, PROC MEANS produces statistics on all of the numeric variables in the specified data set (usually not a good idea).
Finally, the PROC step ends with a RUN statement. Here is the output:
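You can mix and match the options in the list. As an illustration only, this variation of Program 2.1 requests the minimum, maximum, standard error, and coefficient of variation instead of the median:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Additional Statistics for SBP and DBP";
/* Options can be listed in any order */
proc means data=example.Blood_Pressure n min max stderr cv maxdec=2;
   var SBP DBP;
run;
```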
Descriptive Statistics Broken Down by a Classification Variable
The data set Blood_Pressure also contains a variable called Drug. You might want to see the same statistics, but this time compute them for each level of Drug. One way to do this is to add a CLASS statement to PROC MEANS like this:
Program 2.2: Statistics Broken Down by a Classification Variable
title "Descriptive Statistics Broken Down by Drug";
proc means data=example.Blood_Pressure n nmiss mean std median maxdec=3;
   class Drug;
   var SBP DBP;
run;
The CLASS statement tells the procedure to produce the selected statistics for each unique value of Drug. This is a good time to tell you that when you have more than one statement in a PROC step (in this case, the CLASS and VAR statements), the order of these statements does not usually matter. The exceptions are certain statistical procedures in which you must specify your model before you ask for certain statistics.
Here is the output:
You should always request both the N and NMISS options when you run PROC MEANS, because missing values are a possible source of bias.
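To convince yourself that the order of the CLASS and VAR statements does not matter in PROC MEANS, you can rerun Program 2.2 with the two statements swapped. This sketch should produce the same table as Program 2.2:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Descriptive Statistics Broken Down by Drug";
proc means data=example.Blood_Pressure n nmiss mean std median maxdec=3;
   var SBP DBP;   /* VAR now comes first */
   class Drug;    /* CLASS now comes second */
run;
```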
What if you want to see the grand mean, as well as the means broken down by Drug, all in one listing? The PROC MEANS option PRINTALLTYPES does this for you when you include a CLASS statement. Here is the modified program:
Program 2.3: Demonstrating the PRINTALLTYPES Option with PROC MEANS
title "Descriptive Statistics Broken Down by Drug";
proc means data=example.Blood_Pressure n nmiss mean std median printalltypes maxdec=3;
   class Drug;
   var SBP DBP;
run;
Here is the corresponding output:
Now you see statistics for each value of Drug and for all subjects, in the same listing.
Computing a 95% Confidence Interval and the Standard Error
A 95% confidence interval for the mean (often abbreviated as 95% CI) is useful in helping you decide how well your sample mean estimates the mean of the population from which you took your sample. Another measure, the standard error, is also useful for the same reason. This program shows how to compute both:
Program 2.4: Computing a 95% Confidence Interval
title "Computing a 95% Confidence Interval and the Standard Error";
proc means data=example.Blood_Pressure n mean clm stderr maxdec=3;
   class Drug;
   var SBP DBP;
run;
In this example, some of the options that were used previously have been omitted to reduce the size of the output. This program also uses the option CLM (confidence limit for the mean) to request the interval. SAS uses this option because the upper and lower bounds on a confidence interval are also referred to as confidence limits. The option STDERR requests that the standard error also be listed in the output, which follows:
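The limits that CLM reports are based on the t distribution: the mean plus or minus a t value times the standard error. If you want a confidence level other than 95%, PROC MEANS accepts the ALPHA= option. The following sketch asks for 90% limits; the choice of 0.10 is only an illustration:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Computing a 90% Confidence Interval and the Standard Error";
/* ALPHA=0.10 requests 100*(1 - 0.10) = 90% confidence limits */
proc means data=example.Blood_Pressure n mean clm stderr alpha=0.10 maxdec=3;
   class Drug;
   var SBP DBP;
run;
```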
Producing Descriptive Statistics, Histograms, and Probability Plots
Another SAS procedure, PROC UNIVARIATE, produces output that is similar to the output from PROC MEANS. However, PROC UNIVARIATE provides additional statements that produce histograms and probability plots.
The following program demonstrates these features of PROC UNIVARIATE:
Program 2.5: Producing Histograms and Probability Plots Using PROC UNIVARIATE
title "Demonstrating PROC UNIVARIATE";
proc univariate data=example.Blood_Pressure;
   id Subj;
   var SBP DBP;
   histogram;
   probplot / normal(mu=est sigma=est);
run;
Program 2.5 demonstrates a typical use of PROC UNIVARIATE—to produce descriptive statistics and some graphical output. Note that in order to generate the histogram and probability plots, you need to have SAS/GRAPH installed.
The ID statement is not necessary, but it is particularly useful with PROC UNIVARIATE. With this statement, you can specify a variable that identifies each observation. In this example, Subj is the ID variable.
The VAR statement works with PROC UNIVARIATE in the same way that it works with PROC MEANS—it enables you to list the variables that you want to analyze.
The HISTOGRAM statement requests histograms. You can follow the HISTOGRAM statement with a list of variables. If you omit this list of variables, the procedure produces a histogram for every variable that you listed on the VAR statement.
Finally, the PROBPLOT statement requests a probability plot. This plot shows percentiles from a theoretical distribution on the x-axis and data values on the y-axis. This example program selects the normal distribution using the NORMAL option after the forward slash. If your data values are normally distributed, the points on this plot will form a straight line. To make it easier to see deviations from normality, the option NORMAL also produces a reference line where your data values would fall if they came from a normal distribution. When you use the NORMAL option, you also need to specify a mean and standard deviation. Specify these by using the keyword MU= to specify the mean and the keyword SIGMA= to specify a standard deviation. The keyword EST tells the procedure to use the data values to estimate the mean and standard deviation, instead of some theoretical value.
Notice the slash between the word PROBPLOT and NORMAL. Using a slash here follows standard SAS syntax: if you want to specify options for a statement in a PROC step, place a slash after the statement keyword (and after any variables listed with it), and then list the options. (Note: It took the author several years to figure this out for himself.)
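When a statement lists variables as well as options, the slash comes after the variable list. As a sketch, this variation of Program 2.5 requests the histogram and the probability plot for SBP only, even though both SBP and DBP are named on the VAR statement:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Plots for SBP Only";
proc univariate data=example.Blood_Pressure;
   id Subj;
   var SBP DBP;
   histogram SBP;                             /* variable list, no options */
   probplot SBP / normal(mu=est sigma=est);   /* variable list, slash, options */
run;
```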
To save space, the following output shows only the results for the variable SBP. Each section is presented separately, with a discussion following each section.
The first section of the output contains some useful and some not-so-useful values. For example, you see the number of nonmissing values that were used to compute the statistics (N), mean, and standard deviation.
Also in this section, you see skewness and kurtosis, measures that show deviations from normality. A skewness value of 0 indicates a symmetric distribution about the mean; positive skewness values indicate a right-skewed distribution, and negative values indicate a left-skewed distribution. Left and right refer to the direction in which the elongated tail points. The value -.145 in this listing is very close to 0 and shows that there are no pronounced tails in the distribution of SBP. Kurtosis values indicate whether the distribution is more peaked than or flatter than a normal distribution. The value that SAS computes for kurtosis is scaled so that you get the value 0 for a normal distribution (also known as relative kurtosis). Positive values for kurtosis indicate both that the distribution is too peaked (leptokurtic) and that the tails are too heavy. Negative values for kurtosis indicate that the distribution is too flat (platykurtic) and that the tails are too light. The kurtosis value for SBP (-.535) indicates that the distribution of SBP is reasonably consistent with a normal distribution.
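If you want skewness and kurtosis without the rest of the PROC UNIVARIATE output, PROC MEANS can compute them as well. The following is a minimal sketch using the SKEWNESS and KURTOSIS options:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Skewness and Kurtosis for SBP and DBP";
proc means data=example.Blood_Pressure n mean std skewness kurtosis maxdec=3;
   var SBP DBP;
run;
```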
The coefficient of variation (often abbreviated CV) expresses the standard deviation as a percent of the mean. This output shows that the standard deviation is about 8.38% of the mean. Finally, the value at the bottom right of this section is the standard error of the mean (1.46), which gives you an estimate of how accurately this sample has estimated the population mean.
The remaining values in the section are less useful. This author believes that they were originally included so that you could use them in hand calculations of other statistics that were not computed by SAS. The sum of weights is useful only if you use a WEIGHT statement with PROC UNIVARIATE; with a WEIGHT statement you select a variable that weights the SBP values. In this example, because you did not specify any weights, the sum of weights is equal to the number of observations (all the weights are equal to 1). The uncorrected SS is the sum of squares of all the data values. To compute the corrected SS, you subtract the mean from each value before you square them, and then add them up. This value is the same as the numerator of the sample variance.
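For reference, the quantities discussed in this section can be written out as follows. These formulas assume that no WEIGHT statement is used and that the variance uses the usual n - 1 divisor, where n is the number of nonmissing values, x-bar is the sample mean, and s is the sample standard deviation:

```latex
\begin{align*}
\text{Uncorrected SS} &= \sum_{i=1}^{n} x_i^2 \\
\text{Corrected SS}   &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
s^2                   &= \frac{\text{Corrected SS}}{n - 1} \\
\text{CV}             &= 100 \times \frac{s}{\bar{x}} \\
\text{Std Error Mean} &= \frac{s}{\sqrt{n}}
\end{align*}
```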
The values listed in this section are somewhat redundant. They are grouped here for convenience as measures of location (mean, median, and mode) and measures of variability (standard deviation, variance, range, and interquartile range).
This section displays a number of statistical tests that determine whether various measures of central location are significantly different from a theoretical value (mu). The default value for mu is mu=0. You can change the default value to another value by using the procedure option MU0=n, where n is the nonzero value of your choice.
The tests listed in this section are a one-sample Student’s t-test, a sign test, and a signed-rank test (also known as the Wilcoxon signed-rank test). These statistics are discussed in Chapter 5 (one-sample t-test) and Chapter 12 (the sign and Wilcoxon tests).
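In PROC UNIVARIATE, the option that sets the hypothesized value for these tests is spelled MU0= and goes on the PROC statement. The value 120 in this sketch is an arbitrary choice for illustration, not a value taken from the book:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Testing Whether the Location of SBP Differs from 120";
/* MU0=120 replaces the default null value of 0 */
proc univariate data=example.Blood_Pressure mu0=120;
   var SBP;
run;
```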
Continuing the examination of the PROC UNIVARIATE output, you see a list of commonly used quantiles. The most useful values are the lowest value (0% Min), first quartile (25% Q1), median (50% Median), third quartile (75% Q3), and the maximum value (100% Max). If you supply PROC UNIVARIATE with some options, it can compute quantiles for any values you want, and write these values to a SAS data set.
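One way to request other quantiles and save them is the OUTPUT statement with the PCTLPTS= and PCTLPRE= options. In this sketch, the data set name SBP_Pctls, the prefix P_, and the choice of the 5th and 95th percentiles are all assumptions made for illustration:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

/* NOPRINT suppresses the usual printed output */
proc univariate data=example.Blood_Pressure noprint;
   var SBP;
   /* Creates the variables P_5 and P_95 in Work.SBP_Pctls */
   output out=SBP_Pctls pctlpts=5 95 pctlpre=P_;
run;

title "5th and 95th Percentiles of SBP";
proc print data=SBP_Pctls;
run;
```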
This section displays the five lowest and five highest values in your data set. You can quickly check the listed values to ensure that no values are dramatically different from what you expected (perhaps a data entry error occurred).
Because you used an ID statement, this portion of the output includes the Subj variable. The column labeled Obs is the observation number (which is not very useful because adding observations or sorting the data set will change the observation number). If you want to see more than the five lowest and five highest values, you can supply a procedure option NEXTROBS=n (number of extreme observations) to ask PROC UNIVARIATE to list any number of extreme observations.
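For example, to list the 10 lowest and 10 highest values of SBP (10 is an arbitrary choice here), you could write something like this:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Ten Lowest and Ten Highest Values of SBP";
proc univariate data=example.Blood_Pressure nextrobs=10;
   id Subj;   /* Subj is printed next to each extreme value */
   var SBP;
run;
```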
This section tells you how many observations had a missing value for the variable of interest. It also expresses this number as the percent of all your observations.
The HISTOGRAM and PROBPLOT statements both produce high-quality SAS/GRAPH output. Depending on your system, these plots are either displayed immediately in your output window, or you need to click on the task bar at the bottom of your screen to see them. The following graph is the result of the HISTOGRAM statement:
The x-axis shows ranges of SBP. The numbers that are displayed are the midpoints of the SBP ranges. The y-axis displays the percentage of values that fall within these ranges. In the next section, you will learn how to change these data ranges, but the values that SAS chooses for you are usually fine for a quick idea of what your distribution looks like. In this example, the SBP values look similar to those from a normal distribution.
The PROBPLOT statement produced the next graph:
If your values came from a normal distribution, they would fall close to the diagonal line on the plot. In this example, the actual data points do not deviate much from this theoretical line, showing that the values of SBP come from a distribution that is close to normal. This outcome is also consistent with the values for skewness and kurtosis that you saw earlier.
Changing the Midpoint Values on the Histogram
If you want to change the midpoint values displayed on the histogram, you can supply a MIDPOINTS option on the HISTOGRAM statement. For example, if you want midpoints to go from 100 to 170 with each bin representing 5 points, you would write:
histogram / midpoints=100 to 170 by 5;
The following histogram used the MIDPOINTS option set to 100 to 170 by 5:
Finally, you could also see a theoretical normal curve superimposed on your histogram by including the NORMAL option on the HISTOGRAM statement like this:
histogram / midpoints=100 to 170 by 5 normal;
The output now shows a normal curve superimposed on your histogram:
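Put together, a complete PROC step that uses this HISTOGRAM statement might look like the following sketch, which analyzes SBP only:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Histogram of SBP with a Normal Curve";
proc univariate data=example.Blood_Pressure;
   var SBP;
   /* Bins centered at 100, 105, ..., 170 with a fitted normal curve */
   histogram / midpoints=100 to 170 by 5 normal;
run;
```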
Generating a Variety of Graphical Displays of Your Data
SAS 9.2 introduced several important and useful statistical graphics procedures. Among the more useful of these are SGPLOT and SGSCATTER. You can use SGPLOT to produce histograms, box plots, scatter plots, and much more. SGSCATTER displays several plots on a single page (including a scatter plot matrix that is particularly useful). The SG procedures come with a number of built-in styles. You can select different styles for your output without having to do any programming.
Let’s see how to produce a histogram and a box plot using SGPLOT.
Program 2.6: Using PROC SGPLOT to Produce a Histogram
title "Using SGPLOT to Produce a Histogram";
proc sgplot data=example.Blood_Pressure;
   histogram SBP;
run;
This HISTOGRAM statement produces a histogram, similar in appearance to the histogram you obtained with the HISTOGRAM statement on PROC UNIVARIATE. As you will learn later, you can change the appearance of the output when you select alternate output destinations such as HTML, PDF, and RTF (rich text format), and one of the built-in styles.
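As a preview, here is a rough sketch of sending the same histogram to a PDF file with one of the built-in styles. The file name and the choice of the JOURNAL style are assumptions made for illustration:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

/* Hypothetical output file; change the path to suit your system */
ods pdf file='c:\books\statistics by example\sbp_histogram.pdf' style=journal;

title "Using SGPLOT to Produce a Histogram";
proc sgplot data=example.Blood_Pressure;
   histogram SBP;
run;

/* Close the destination so that the PDF file is written */
ods pdf close;
```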
First, let’s see how to display the plot. Then you will learn a few of the more popular options that control the appearance of the output.
Output from the SG procedures does not usually open automatically after you run the procedure. One way to examine the output is to go to the Results window in SAS Display Manager:
You see the output from SGPLOT with a plus sign (+) to the left of it. Click the plus sign to expand the list:
Now double click on the SGPlot Procedure icon to display the histogram. You can use this sequence of steps to display any of the graphs produced by the SG procedures or to display the plots produced by ODS Statistical Graphics that you will see later in this book.
Finally, after all this clicking, you will see your histogram:
To produce a box plot of the same data, use the HBOX statement (horizontal box plot) instead of the request for a histogram:
Program 2.7: Using SGPLOT to Produce a Horizontal Box Plot
title "Using SGPLOT to Produce a Box Plot";
proc sgplot data=example.Blood_Pressure;
   hbox SBP;
run;
Click your way through the Results window to see the following display:
The left and right sides of the box represent the 1st and 3rd quartiles (sometimes abbreviated Q1 and Q3). The vertical bar inside the box is the median, and the diamond represents the mean. The lines extending from the left and right sides of the box (called whiskers) reach to the most extreme data values that lie within 1.5 times the interquartile range of Q1 and Q3. If you prefer to see a vertical box plot, use the keyword VBOX instead of HBOX.
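The following sketch draws the vertical version of the plot and then uses PROC MEANS to print the quartiles and the interquartile range, so that you can compare the numbers with the box:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Using SGPLOT to Produce a Vertical Box Plot";
proc sgplot data=example.Blood_Pressure;
   vbox SBP;
run;

title "Quartiles and Interquartile Range for SBP";
/* QRANGE is the interquartile range, Q3 minus Q1 */
proc means data=example.Blood_Pressure q1 median q3 qrange maxdec=1;
   var SBP;
run;
```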
To see the effect of outliers on a box plot, let’s modify two SBP values for subjects 5 and 55 to be 200 and 180, respectively. This modified data set is called Blood_Pressure_Out and is stored in the Work library (making it a temporary SAS data set). You can see the program to create this data set, as well as the request for the box plot, in Program 2.8:
Program 2.8: Displaying Outliers in a Box Plot
*Program to make a temporary SAS data set Blood_Pressure_Out
 that contains two outliers, one for Subj 5, one for Subj 55;
data Blood_Pressure_Out;
   set example.Blood_Pressure(keep=Subj SBP);
   if Subj = 5 then SBP = 200;
   else if Subj = 55 then SBP = 180;
run;

title "Demonstrating How Outliers are Displayed with a Box Plot";
proc sgplot data=Blood_Pressure_Out;
   hbox SBP;
run;
The SET statement is an instruction to read each of the observations from data set example.Blood_Pressure. In parentheses following the data set name is a KEEP= data set option. This option tells the program that you want only two of the variables (Subj and SBP) to be read from the input data set. Finally, the IF-THEN statement is true when the value of Subj is equal to 5. The assignment statement following the keyword THEN is executed and the SBP value is set to 200. In a similar manner, the ELSE-IF statement sets the value of SBP to 180 for subject 55.
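Before you plot the data, you might want to confirm that the two values were changed. This quick check is only a sketch, and it assumes that the DATA step in Program 2.8 has already been run in the current SAS session:

```sas
/* Assumption: Work.Blood_Pressure_Out was created by Program 2.8 */
title "Checking the Modified SBP Values";
proc print data=Blood_Pressure_Out noobs;
   where Subj in (5, 55);   /* list only the two modified subjects */
   var Subj SBP;
run;
```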
The box plot of the modified data set shows the two outliers as small circles:
You can even get a bit fancier and let SAS label the outliers:
Program 2.9: Labeling Outliers on a Box Plot
title "Demonstrating How Outliers are Displayed with a Box Plot";
proc sgplot data=Blood_Pressure_Out;
   hbox SBP / datalabel=Subj;
run;
The option DATALABEL= lets you select a variable to identify specific outliers. If you use the DATALABEL option without naming a label variable, SGPLOT uses the numerical value of the response variable (SBP in this example) to label the outliers. Here is the output:
Notice that the outliers for subjects 5 and 55 are labeled.
Displaying Multiple Box Plots for Each Value of a Categorical Variable
If you want to see a box plot for each value of a categorical variable, you can include the option CATEGORY= on the HBOX or VBOX statement. The example that follows uses the original Blood_Pressure data set (without the outliers) and displays a box plot for each value of Drug.
Program 2.10: Displaying Multiple Box Plots for Each Value of a Categorical Variable
title "Box Plots of SBP for Each Value of Drug";
proc sgplot data=example.Blood_Pressure;
   hbox SBP / category=Drug;
run;
The HBOX option CATEGORY= generates a separate box plot for each of the three Drug values:
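The CATEGORY= option works the same way on the VBOX statement. As a sketch, the following step draws vertical box plots of DBP (chosen here only for variety) for each value of Drug:

```sas
/* Assumption: same folder as in Program 2.1 */
libname example 'c:\books\statistics by example';

title "Box Plots of DBP for Each Value of Drug";
proc sgplot data=example.Blood_Pressure;
   vbox DBP / category=Drug;
run;
```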
Conclusions
Descriptive statistics should be your first step in data analysis so that you can see a summary of the data and better understand their distribution. This chapter showed you how to produce both numerical and graphical output for continuous variables, using a number of SAS procedures.
The next two chapters will show you how to display descriptive statistics for categorical variables and how to investigate bivariate relationships.