SAS Statistics by Example - Ron Cody EdD
Chapter 1 An Introduction to SAS
Introduction
If you are reading this book, you are probably familiar with various statistical techniques but might not have used SAS to analyze data. The primary purpose of this book is to show you how to use SAS to perform a variety of statistical tasks. To that end, this book provides examples of many of the commonly used statistical techniques. Following each example is a discussion of the output. Although this is not a book about SAS programming, many of the examples require some data manipulation tasks, which will be described. If you need to gain more SAS programming skills, see Learning SAS by Example: A Programmer’s Guide, also by this author and published by SAS Press.
This book is divided into five sections: An Introduction to SAS, Descriptive Statistics, Inferential Statistics, Power and Sample Size Calculations, and Selecting Random Samples.
All of the programs and data files in this book are available from SAS Press. To download these programs and files, go to http://support.sas.com/authors.
If you already have some familiarity with SAS data sets and how to run SAS programs, you can skip this chapter and start right in with Chapter 2.
The remainder of this chapter describes what SAS is, the basic structure of SAS programs, how to access some simple data sets, and how to run a SAS program on a Windows platform.
What is SAS
SAS (pronounced sass) is a collection of programs that are used to read data from a variety of sources (text files, Excel workbooks, various databases, etc.), to manipulate data with a very powerful programming language, and to perform various reporting and data analysis tasks. To run all of the examples in this book, you will need access to Base SAS, SAS/STAT, and SAS/GRAPH software.
SAS runs on many different computing platforms: PCs, UNIX and Linux operating systems, and mainframe computers. You can run the examples in this book on any of these platforms, although there are some small differences in how the data are accessed on the various systems. The examples in this book were all run on a PC, because this is by far the most popular platform on which SAS is run.
Statistical Tasks Performed by SAS
SAS, like many other statistical packages, has built-in procedures for analyzing data. These procedures (called PROCs for short) enable you to perform statistical tests and analyses. For example, if you want to perform a Student’s t-test, you use PROC TTEST; to run a regression model, you use PROC REG. The list of statistical PROCs is quite extensive, and not all of them are covered in this book. Even so, this book is a good place to get started, and many of the most popular statistical tasks are covered. The complete guide to all SAS/STAT procedures is available from SAS Institute. The documentation is available in more than five volumes (taking up about two feet of shelf space) or for free in HTML or PDF form on the SAS Web site: http://support.sas.com/documentation. Select SAS Press books are also available on iPad, Kindle, Google eBooks, Mobipocket, Books24x7, netLibrary, and Safari Books online.
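As a rough sketch of what calling these two procedures looks like, the following steps use SASHELP.CLASS, a small sample data set that ships with SAS. It is not one of this book's data files, and the choice of variables here is purely illustrative.

```sas
* Two-sample t-test: compare mean Height between the two values of Sex;
proc ttest data=sashelp.class;
   class Sex;
   var Height;
run;

* Simple linear regression of Weight on Height;
proc reg data=sashelp.class;
   model Weight = Height;
run;
quit;
```

PROC REG is an interactive procedure, so the QUIT statement is included to end it explicitly.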
The Structure of SAS Programs
SAS programs are divided into DATA and PROC steps. With DATA steps you can read text data from files; create new variables from existing variables; perform logical operations on your data; and merge, concatenate, and subset your data files. PROC steps give you the ability to perform pre-defined tasks such as creating frequency distributions or performing a t-test. SAS stores data in data sets that are unique to SAS. Your data might already be in a SAS data set, in which case you might not need to write a DATA step at all. However, even if you are starting with data already in a SAS data set, you might want to write a DATA step to perform some manipulation of the data, such as performing a transformation, grouping values, subsetting your data, or combining data from several data sets.
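Here is a minimal sketch of the two kinds of steps working together. The values are made up and typed directly into the program, and the data set name Demo and the variable AgeGroup are hypothetical names chosen for illustration.

```sas
* DATA step: read three values per line and create a new variable;
data Demo;
   input ID Age Gender $;
   AgeGroup = (Age ge 30);   * 1 if Age is 30 or more, 0 otherwise;
datalines;
1 23 M
2 33 F
3 18 F
;
run;

* PROC step: frequency distributions for two variables;
proc freq data=Demo;
   tables Gender AgeGroup;
run;
```

The DATA step does the reading and manipulation; the PROC step performs a pre-defined task on the result.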
SAS Data Sets
SAS data sets consist of two parts: a descriptor portion, or metadata (information about your data set, such as the variable names and data types), and the data values themselves. SAS can create SAS data sets from almost any source. If you have raw data in a text file, you can use a DATA step to read the file and create your SAS data set. If you have an Excel workbook, or any one of several popular database formats, you can either use a SAS procedure to convert the data into a SAS data set or use the Import Wizard (part of SAS/ACCESS and available from SAS Institute) to point-and-click your way through the conversion.
SAS Display Manager
This section shows you how to run SAS programs on a Windows platform. If you are using UNIX or a mainframe to run your programs, your screens will not look like the images shown in this section. If you choose to use SAS Enterprise Guide to run your SAS programs on a Windows platform, you will also be using a different editor.
When you open SAS on a PC platform, you enter SAS Display Manager. This facility contains three major windows: the Program Editor, where you write, edit, and submit your SAS programs; a Log window, where you see error messages and information about the program you have submitted; and the Output window, where SAS displays your output. If you need the output for a Web page or you want to use the output in Word or WordPerfect, the Output Delivery System (ODS) can send your output to these destinations as HTML, RTF (rich text format), or PDF.
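As an illustration of sending output to one of these destinations, the sketch below wraps a procedure in a pair of ODS statements. The output file name is hypothetical, and SASHELP.CLASS is a sample data set supplied with SAS rather than one of this book's files.

```sas
* Open the RTF destination (the file path is an assumed example);
ods rtf file='c:\books\statistics by example\listing.rtf';

proc print data=sashelp.class;
run;

* Close the destination so that the file is written to disk;
ods rtf close;
```

Replacing RTF with HTML or PDF in both ODS statements (and changing the file extension to match) should route the same output to those destinations instead.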
The following display shows SAS Display Manager with all three windows open:
The top section is the Output window, the middle section is the Log window, and the bottom section is the Program Editor window. You can resize, move, or expand each of these windows.
Excel Workbooks
Because Excel workbooks (or comma-separated values [CSV] files) are so popular on PCs, let’s examine how to use the SAS Import Wizard to convert these files into SAS data sets.
The following display shows an Excel workbook called SAMPLE.XLS:
Each of the columns in this workbook contains information about each of our subjects (ID, Age, and Gender). The first row of the workbook contains variable names. In Excel, these names can be anything. In SAS, variable names must conform to a stricter naming convention. The maximum length of a SAS variable name is 32 characters; the first character of a variable name must be a letter or an underscore, and the remaining characters of the variable name can contain letters (uppercase or lowercase), numbers, and the underscore character. For example, Age, Income2010, and Home_Runs are valid SAS variable names; 1year, Year Income, and Cost% are not valid SAS variable names. Later you will see what happens if the first row of your workbook contains variable names that are not valid SAS variable names.
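If you want to check a candidate name against these rules, one option is the NVALID function, which returns 1 when a string is a valid SAS variable name and 0 otherwise. The short DATA step below is only a sketch; it tests the six example names from the paragraph above and writes the results to the Log window.

```sas
data _null_;
   length Name $ 32;
   * Loop over the example names and test each one;
   do Name = 'Age', 'Income2010', 'Home_Runs', '1year', 'Year Income', 'Cost%';
      Valid = nvalid(Name, 'v7');   * 'v7' requests the standard naming rules;
      put Name= Valid=;
   end;
run;
```

Because the data set name is _NULL_, this step creates no data set; it only writes lines to the log.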
By the way, SAS variable names are not case sensitive. However, SAS remembers the case (uppercase, lowercase, or mixed) that you used the first time you referred to a variable and displays the name that way in SAS reports.
Each of the rows of the workbook, with the exception of the first row, contains data about an individual. SAS calls these rows observations. So, whereas an Excel workbook has columns and rows, SAS data sets have variables and observations.
To convert this Excel file into a SAS data set, you can use the Import Wizard:
1. Click File.
2. Select Import Data.
3. Choose Microsoft Excel.
4. Select the Excel workbook that you want to convert.
5. Name the SAS data set (SampleData in this example).
6. Click Finish (at the bottom right of the window) to complete the conversion.
Naming conventions for SAS data sets are the same as for SAS variable names. The names must be 32 characters or fewer in length, they must start with a letter or an underscore, and the remaining characters must be letters, numbers, or underscores.
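The point-and-click steps above also have a programmatic counterpart, PROC IMPORT. The following is a sketch only: the folder for SAMPLE.XLS is an assumption (the text does not say where the workbook is stored), and DBMS=XLS requires that SAS/ACCESS Interface to PC Files be installed.

```sas
* Convert the workbook into the temporary data set SampleData;
proc import datafile='c:\books\statistics by example\sample.xls'  /* assumed location */
            out=work.SampleData
            dbms=xls
            replace;
   getnames=yes;   * use the first row of the workbook as variable names;
run;
```

The REPLACE option lets the step overwrite SampleData if it already exists.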
Now that you have converted your Excel workbook into a temporary SAS data set, you can list the observations in the data set and inspect the descriptor portion of the data set. SAS provides you with several ways to do this.
One way to see a listing of the data in a SAS data set is to use a SAS procedure called PROC PRINT. The following program demonstrates how to use PROC PRINT to list the observations in the SampleData data set:
Program 1.1: Using PROC PRINT to List the Observations in a SAS Data Set
proc print data=SampleData;
run;
Amazingly enough, this is a complete SAS program. Notice that each statement in this two-line SAS program ends in a semicolon. When you write SAS programs, you can use as many lines as you want to write a statement; you can even put more than one statement on a line (though this is not recommended for stylistic reasons). The semicolon is the logical end of a SAS statement. You are free to add extra spaces on a line or place extra blank lines in your program to make it more readable.
To run this program from Display Manager, click the Submit icon:
Here is the output you get from running Program 1.1:
At the top of the three right-most columns, you see the SAS variable names—the same names that were stored in the first row of your workbook. The first column, labeled Obs (short for Observations), was generated by SAS and shows the observation number.
Each row of the listing represents a row from the workbook.
Next, let’s see how to display the data descriptor portion of this data set. Program 1.2 is one way to do this:
Program 1.2: Using PROC CONTENTS to Display the Data Descriptor Portion of a SAS Data Set
title "Displaying the Descriptor Portion of a SAS Data Set";
proc contents data=SampleData;
run;
Notice that I have added a TITLE statement to this program. With a TITLE statement, you can enter a title that will print across the top of every page of output. TITLE statements are in a class of SAS statements known as GLOBAL statements. The title that you enter stays in effect for the remainder of your SAS session, unless you replace it with another TITLE statement. To remove all titles from your output, submit a null title statement like this:
title;
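To see how a title carries over from one step to the next, consider this short sketch, which assumes that the SampleData data set created earlier still exists in your session:

```sas
title "Listing of SampleData";
proc print data=SampleData;
run;

* No new TITLE statement, so the title above is still in effect here;
proc contents data=SampleData;
run;

* A null TITLE statement removes all titles from later output;
title;
```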
When you submit Program 1.2, you will see the following output:
The first two lines of output show that the data set name is SAMPLEDATA. (The full name is WORK.SAMPLEDATA. The prefix WORK. tells SAS that this is a temporary SAS data set.) Also shown in these lines are the number of observations (5) and the number of variables (3). Let’s skip down to the portion of the output labeled Alphabetic List of Variables and Attributes. Here you see that the variables Age and ID are stored as numeric types and Gender is stored as a character type.
Variable Types in SAS Data Sets
SAS has only two variable types: numeric and character. By default, all numeric values are stored in 8 bytes, allowing for approximately 15 significant figures, depending on your operating system. Character values are stored 1 byte per character and can be from 1 to 32,767 bytes in length.
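A small sketch can make the two types concrete. The data set and variable names below are hypothetical; running PROC CONTENTS on the result should list Name as a character variable of length 10 and Score as a numeric variable of length 8.

```sas
data Types;
   length Name $ 10;   * character variable stored in 10 bytes;
   Name  = 'Cody';
   Score = 98.6;       * numeric variable, stored in 8 bytes by default;
run;

proc contents data=Types;
run;
```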
Temporary versus Permanent SAS Data Sets
SAS data sets can be either temporary or permanent. A temporary SAS data set is one that exists for the duration of your SAS session but is not saved when you exit SAS. Permanent SAS data sets, as the name implies, remain when you exit SAS and can be accessed in future SAS sessions. The Import Wizard example discussed previously used the Work library. Choosing the Work library caused the SAS data set SAMPLEDATA to be a temporary data set.
SAS data set identifiers are divided into two parts, separated by a period. The part before the period is called a library reference (libref for short) and identifies the folder where SAS has stored the data set. The part following the period is the data set name. Both parts of this identifier must satisfy the naming conventions mentioned earlier.
For example, if your data set is called SURVEY and is stored in a library called MYDATA, SAS uses the following notation to identify the file:
mydata.survey
If you wanted to put this file on your disk drive in the C:\MYSASFILES folder, you would write a statement called a LIBNAME statement that associates the c:\mysasfiles folder with the MYDATA library reference, like this:
libname mydata 'c:\mysasfiles';
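Once the libref is defined, using a two-part name is all it takes to make a data set permanent. The sketch below assumes that the c:\mysasfiles folder already exists and that the temporary data set SampleData from earlier in the chapter is available; copying it into SURVEY is done only for illustration.

```sas
libname mydata 'c:\mysasfiles';

* Create a permanent copy of the temporary data set;
data mydata.survey;
   set work.SampleData;
run;

* In a later SAS session, resubmit the LIBNAME statement and then
  refer to the data set by its two-part name;
proc print data=mydata.survey;
run;
```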
Creating a SAS Data Set from Raw Data
If you have your data in a text file, SAS can read the text file and create a SAS data set. The text file can contain either data values separated by delimiters or data values in fixed columns.
Data Values Separated by Delimiters
SAS can read data values from a text file in which each value is separated from the next value by a delimiter. By default, SAS expects one or more spaces between data values. However, it is easy to specify other delimiters, such as commas. Let’s start by reading a small text file in which spaces are used as delimiters. Here’s a listing of this file:
Raw Data with Blanks as Delimiters: File c:\books\Statistics by Example\delim.txt
1 23 M
2 33 F
3 18 F
4 45 M
5 41 M
6 . F
In this file, the three data values on each line represent an ID number, Age, and Gender, respectively. Before you write a SAS program to read this text file, notice that ID = 6 has a missing value for her age. Because you have delimited data, you need a way to specify that the Age value is missing for that subject. When you have blanks as delimiters, you can use a period to specify that you have a missing value. In the next example, which uses a CSV file, you do not need to use periods for missing values.
Program 1.3 will read this text file and create a SAS data set called Sample2:
Program 1.3: Reading Data from a Text File That Uses Spaces as Delimiters
data Sample2;
   infile 'c:\books\statistics by example\delim.txt';
   length Gender $ 1;
   input ID Age Gender $;
run;
The INFILE statement tells SAS where to look for the text file. Following the keyword INFILE, you place the filename in single or double quotes. The LENGTH statement tells SAS that the variable Gender is character (the dollar sign indicates this) and that you want to store Gender in 1 byte (the 1 indicates this). The INPUT statement lists the variable names in the same order as the values in the text file. Because you already told SAS that Gender is a character variable, the dollar sign following the name Gender on the INPUT statement is not necessary. If you had not included a LENGTH statement, the dollar sign following Gender on the INPUT statement would have been necessary. SAS assumes variables are numeric unless you tell it otherwise.
The RUN statement ends the program. Because this program starts with the keyword DATA, it is called a DATA step. The previous two programs demonstrated PROC steps. SAS programs are typically made up of DATA and PROC steps. Each step ends with a RUN statement.
As you did earlier, you can use PROC PRINT to list the observations in the Sample2 data set (as shown in Program 1.4):
Program 1.4: Using PROC PRINT to List the Observations in Data Set Sample2
title "Listing of Data Set Sample2";
proc print data=Sample2;
run;
Here is the listing:
Reading CSV Files
You can make a very small change to Program 1.3 to read the same data from a CSV file. Following is a listing of such a file:
A CSV Text File: c:\books\Statistics by Example\comma.csv
1,23,M
2,33,F
3,18,F
4,45,M
5,41,M
6,,F
Notice that you no longer need the period in subject 6 because, in the tradition of CSV files, two commas in a row indicate a missing value.
The only change you need to make to Program 1.3 is to use an option called DSD on the INFILE statement. The DSD option specifies that two consecutive commas represent a missing value and that the default delimiter is a comma. Here is the modified program:
Program 1.5: Reading a CSV File
data Sample2;
   infile 'c:\books\statistics by example\comma.csv' dsd;
   length Gender $ 1;
   input ID Age Gender $;
run;
This program produces a SAS data set identical to the one created by Program 1.3.
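Delimiters other than blanks and commas can be handled in much the same way by adding the DLM= option to the INFILE statement. As a sketch, suppose the same values were stored in a tab-delimited file; the file name tab.txt and the data set name Sample3 are hypothetical.

```sas
data Sample3;
   * '09'x is the hexadecimal code for the tab character;
   infile 'c:\books\statistics by example\tab.txt' dlm='09'x dsd;
   length Gender $ 1;
   input ID Age Gender $;
run;
```

Keeping DSD alongside DLM= means that two consecutive tabs are treated as a missing value, just as two consecutive commas are in a CSV file.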
If your CSV file contains variable names in the first row, then the Import Wizard uses these variable names when it creates the SAS data set. Actually, you can use the Import Wizard even if the first row does not contain variable names. If you do, SAS will name the variables F1, F2, etc. This approach is not recommended.
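If you prefer to read such a file with a DATA step rather than the Import Wizard, one approach is to skip the header row with the FIRSTOBS= option. In the sketch below, the file name comma_header.csv and the data set name Sample4 are hypothetical; the file is assumed to hold the same values as comma.csv with a row of variable names added at the top.

```sas
data Sample4;
   * firstobs=2 tells SAS to begin reading at the second line of the file;
   infile 'c:\books\statistics by example\comma_header.csv' dsd firstobs=2;
   length Gender $ 1;
   input ID Age Gender $;
run;
```

Here the variable names come from the INPUT statement, not from the first row of the file.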
Data Values in Fixed Columns
You might have a raw text file in which the value for each variable is in a fixed column. SAS has two methods for reading this type of data: column input and formatted input. For column input, you follow each variable name on the INPUT statement with the starting and ending column for that value. If you want to create a character variable, you place a dollar sign between the variable name and the column specifications.
For example, if you have ID data in columns 1–3, Age in columns 4–6, and Gender in column 7 of your raw data file, your input statement might look like this:
input ID $ 1-3 Age 4-6 Gender $ 7;
Stylistically, you might prefer to write this statement on three lines, like this (so that the variable names line up):
input ID     $ 1-3
      Age      4-6
      Gender $ 7;
For formatted input, you specify the starting column for the variable using an at sign (@) (called a column pointer) followed by the starting column number. Next, you put your variable name, followed by a SAS informat—a specification of how to read and interpret the next n columns. An equivalent statement to read the same data for ID, Age, and Gender using formatted input is:
input @1 ID     $3.
      @4 Age    3.
      @7 Gender $1.;
The informat $3. tells SAS to read three columns of character data; the 3. informat says to read three columns of numeric data; the $1. informat says to read one column of character data. The two informats n. and $n. are used to read n columns of numeric and character data, respectively.
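To try both methods without creating a separate text file, you can place a few lines of data in the program itself with a DATALINES statement. This is a sketch with made-up values laid out in the columns described above (ID in columns 1-3, Age in columns 4-6, Gender in column 7); the data set names are hypothetical.

```sas
* Column input;
data FixedCol;
   input ID     $ 1-3
         Age      4-6
         Gender $ 7;
datalines;
001 23M
002 33F
003 18F
;
run;

* Formatted input reading the same layout;
data FixedFmt;
   input @1 ID     $3.
         @4 Age    3.
         @7 Gender $1.;
datalines;
001 23M
002 33F
003 18F
;
run;
```

Both steps should produce data sets with identical values. Note that the data lines must start in column 1 for the column positions to match.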
The INPUT statement is actually quite powerful and enables you to read both simple and complex data structures. For a complete description of how the INPUT statement works, see Learning SAS by Example: A Programmer’s Guide or one of the other publications available from SAS Press.
Excel Files with Invalid SAS Variable Names
What if your Excel file contains variable names in the first row that are not valid SAS names? Take a look at the following spreadsheet:
Three of the four variable names are not valid SAS variable names because they contain either blanks or invalid characters (percent sign and dashes). What happens when you use the Import Wizard to convert this spreadsheet into a SAS data set? SAS substitutes an underscore character in place of each invalid character in the name. A SAS data set created from this spreadsheet would contain the variables ID, Ht_in_Inches, _Fat, and Wt_in_Lbs.
It is possible to use SAS variable names that contain invalid characters. To include such variables, you need to set a system option called VALIDVARNAME and refer to the variable names using a special notation. Using such variables is not recommended, however, because doing so creates added complications.
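For the curious, here is a sketch of what that special notation looks like. In SAS syntax the option is written VALIDVARNAME=ANY, and each nonstandard name is written as a name literal: the name in quotes followed by the letter n. The variable names and values below are invented for illustration and are not taken from the spreadsheet.

```sas
options validvarname=any;   * allow nonstandard variable names;

data Odd;
   'Ht in Inches'n = 68;
   '% Fat'n        = 22.5;
run;

proc print data=Odd;
run;

options validvarname=v7;    * return to the standard naming rules;
```

Every later reference to these variables needs the same quoted form, which is one of the added complications mentioned above.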
Other Sources of Data
The bottom line is that SAS can read data from just about anywhere. Using the Import Wizard, for example, you can read Excel, Access, CSV, tab-delimited, dBASE, JMP (a SAS product), Lotus, SPSS, Stata, and Paradox files. In addition, SAS can read data from most of the major mainframe database systems such as Oracle and DB2.
Conclusions
You now know how to use the Display Manager or other editor to write your SAS programs, and you know how to read your data from a variety of sources. Now you are ready to start using SAS procedures to analyze your data. In the remaining chapters of this book, you will learn how to create descriptive statistics and how to run most of the commonly used inferential statistical tasks.