Fundamentals of Programming in SAS - James Blum - Page 9

Contents

Chapter 2: Foundations for Analyzing Data and Reading Data from Other Sources

2.1 Learning Objectives

2.2 Case Study Activity

2.3 Getting Started with Data Exploration in SAS

2.3.1 Assigning Labels and Using SAS Formats

2.3.2 PROC SORT and BY-Group Processing

2.4 Using the MEANS Procedure for Quantitative Summaries

2.4.1 Choosing Analysis Variables and Statistics in PROC MEANS

2.4.2 Using the CLASS Statement in PROC MEANS

2.5 User-Defined Formats

2.5.1 The FORMAT Procedure

2.5.2 Permanent Storage and Inspection of Defined Formats

2.6 Subsetting with the WHERE Statement

2.7 Using the FREQ Procedure for Categorical Summaries

2.7.1 Choosing Analysis Variables in PROC FREQ

2.7.2 Multi-Way Tables in PROC FREQ

2.8 Reading Raw Data

2.8.1 Introduction to Reading Delimited Files

2.8.2 More with List Input

2.8.3 Introduction to Reading Fixed-Position Data

2.9 Details of the DATA Step Process

2.9.1 Introduction to the Compilation and Execution Phases

2.9.2 Building Blocks of a Data Set: Input Buffers and Program Data Vectors

2.9.3 Debugging the DATA Step

2.10 Validation

2.11 Wrap-Up Activity

2.12 Chapter Notes

2.13 Exercises

2.1 Learning Objectives

At the conclusion of this chapter, mastery of the concepts covered in the narrative includes the ability to:

 Apply the MEANS procedure to produce a variety of quantitative summaries, potentially grouped across several categories

 Apply the FREQ procedure to produce frequency and relative frequency tables, including cross-tabulations

 Categorize data for analyses in either the MEANS or FREQ procedures using internal SAS formats or user-defined formats

 Formulate a strategy for selecting only the necessary rows when processing a SAS data set

 Apply the DATA step to read data from delimited or fixed-position raw text files

 Describe the operations carried out during the compilation and execution phases of the DATA step

 Compare and contrast the input buffer and program data vector

 Apply DATA step statements to assist in debugging

 Apply the COMPARE procedure to compare and validate a data set against a standard

Use the concepts of this chapter to solve the problems in the wrap-up activity. Additional exercises and case studies are also available to test these concepts.

2.2 Case Study Activity

This section introduces a case study that is used as a basis for most of the concepts and associated activities in this book. The data comes from the Current Population Survey via the Integrated Public Use Microdata Series (IPUMS CPS). IPUMS CPS contains a wide variety of information; only a subset of the data collected from 2001-2015 is included in the examples here. Further, the data used is introduced in various segments, starting with simple sets of variables and eventually adding more information that must be assembled to achieve the objectives of each section.

This chapter works with data that includes household-level information from the 2005 and 2010 IPUMS CPS data sets of over one million observations each. Included are variables on state, county, metropolitan area/city, household income, home value, mortgage status, ownership status, and mortgage payment. Outputs 2.2.1 through 2.2.4 show tabular summaries from the 2010 data, including quantitative statistics, frequencies, and/or percentages. Reproducing these tables in the wrap-up activity in Section 2.11 is the primary objective for this chapter.

The first sample output shown in Output 2.2.1 produces a set of six statistics on mortgage payments across metropolitan status for mortgages of $100 per month or more. In order to make this table, and the slightly more complicated Output 2.2.2, several components of the MEANS procedure must be understood.

Output 2.2.1: Basic Statistics on Mortgage Payments Grouped on Metropolitan Status

Analysis Variable : MortgagePayment (Mortgage Payment)

Metro                            N     Mean   Median   Std Dev   Minimum   Maximum
Not Identifiable             42927    970.2    800.0     668.5     100.0    7400.0
Not in Metro Area            97603    815.0    670.0     576.0     100.0    6800.0
Metro, Inside City           56039   1363.5   1100.0     974.8     100.0    7400.0
Metro, Outside City         185967   1480.8   1300.0     974.7     100.0    7400.0
Metro, City Status Unknown  163204   1233.2   1000.0     846.4     100.0    7400.0

Output 2.2.2: Minimum, Median, and Maximum on Mortgage Payments Across Multiple Categories

Metro                       Household Income  Variable         Label             Minimum   Median    Maximum
Metro, Inside City          Negative          MortgagePayment  Mortgage Payment      440     1200       4500
                                              HomeValue        Home Value          70000   250000     675000
                            $0 to $45K        MortgagePayment  Mortgage Payment      100      740       6800
                                              HomeValue        Home Value              0   130000    5303000
                            $45K to $90K      MortgagePayment  Mortgage Payment      100     1000       7400
                                              HomeValue        Home Value              0   180000    4915000
                            Above $90K        MortgagePayment  Mortgage Payment      100     1600       7400
                                              HomeValue        Home Value              0   340000    5303000
Metro, Outside City         Negative          MortgagePayment  Mortgage Payment      100     1450       5400
                                              HomeValue        Home Value          10000   250000    4152000
                            $0 to $45K        MortgagePayment  Mortgage Payment      100      850       7400
                                              HomeValue        Home Value              0   150000    4304000
                            $45K to $90K      MortgagePayment  Mortgage Payment      100     1100       6800
                                              HomeValue        Home Value              0   199000    4915000
                            Above $90K        MortgagePayment  Mortgage Payment      100     1600       7400
                                              HomeValue        Home Value              0   330000    4915000
Metro, City Status Unknown  Negative          MortgagePayment  Mortgage Payment      180     1200       5300
                                              HomeValue        Home Value          17000   245000    2948000
                            $0 to $45K        MortgagePayment  Mortgage Payment      100      720       7400
                                              HomeValue        Home Value              0   125000    4915000
                            $45K to $90K      MortgagePayment  Mortgage Payment      100      960       7400
                                              HomeValue        Home Value              0   160000    4915000
                            Above $90K        MortgagePayment  Mortgage Payment      100     1400       7400
                                              HomeValue        Home Value              0   270000    4915000

In Outputs 2.2.3 and 2.2.4, frequencies and percentages are summarized across combinations of various categories, which requires mastery of the fundamentals of the FREQ procedure.

Output 2.2.3: Income Status Versus Mortgage Payment

Table of HHIncome by MortgagePayment

HHIncome                       MortgagePayment (Mortgage Payment)
(Household Income)
Frequency / Row Pct    $350 and Below    $351 to $1000    $1001 to $1600    Over $1600     Total
Negative                   30 /  9.93        97 / 32.12       92 / 30.46       83 / 27.48     302
$0 to $45K              22929 / 16.37     83125 / 59.33    22617 / 16.14    11436 /  8.16  140107
$45K to $90K            13877 /  6.96    103660 / 51.99    54778 / 27.48    27052 / 13.57  199367
Above $90K               5944 /  2.89     52679 / 25.58    62474 / 30.33    84867 / 41.20  205964
Total                   42780            239561            139961           123438          545740

Output 2.2.4: Income Status Versus Mortgage Payment for Metropolitan Households (Table 1 of 3)

Table 1 of HHIncome by MortgagePayment
Controlling for Metro=Metro, Inside City

HHIncome                       MortgagePayment (Mortgage Payment)
(Household Income)
Frequency / Row Pct    $350 and Below    $351 to $1000    $1001 to $1600    Over $1600     Total
Negative                    0 /  0.00         7 / 30.43        9 / 39.13        7 / 30.43      23
$0 to $45K               1596 / 10.75      8949 / 60.30     2597 / 17.50     1700 / 11.45   14842
$45K to $90K              910 /  4.75      9215 / 48.13     5571 / 29.10     3450 / 18.02   19146
Above $90K                504 /  2.29      4947 / 22.46     6321 / 28.70    10256 / 46.56   22028
Total                    3010             23118            14498            15413           56039

2.3 Getting Started with Data Exploration in SAS

This section reviews and extends some fundamental SAS concepts demonstrated in code supplied for Chapter 1, with these examples built upon a simplified version of the case study data. First, Program 2.3.1 uses the CONTENTS and PRINT procedures to make an initial exploration of the Ipums2005Mini data set. To begin, make sure the BookData library is assigned as done in Chapter 1.

Program 2.3.1: Using the CONTENTS and PRINT Procedures to View Data and Attributes

proc contents data=bookdata.ipums2005mini;
   ods select variables;
run;

proc print data=bookdata.ipums2005mini(obs=5);
   var state MortgageStatus MortgagePayment HomeValue Metro;
run;

 The BookData.Ipums2005Mini data set is a modification of a data set used later in this chapter, BookData.Ipums2005Basic. It subsets the original data set down to a few records and is used for illustration of these initial concepts.

 The ODS SELECT statement limits the output of a given procedure to the chosen tables, with the Variables table from PROC CONTENTS containing the names and attributes of the variables in the chosen data set. Look back to Program 1.4.4, paying attention to the ODS TRACE statement and its results, to review how this choice is made.

 The OBS= data set option limits the number of observations processed by the procedure. It is in place here simply to limit the size of the table shown in Output 2.3.1B. At various times in this text, the output shown may be limited in scope; however, the code given may not include this option for all such cases.

 The VAR statement is used in the PRINT procedure to select the variables to be shown and the column order in which they appear.

Output 2.3.1A: Using the CONTENTS Procedure to View Attributes

Alphabetic List of Variables and Attributes

 #  Variable         Type  Len  Format
 4  CITYPOP          Num     8
 2  COUNTYFIPS       Num     8
10  City             Char   43
 6  HHINCOME         Num     8
 7  HomeValue        Num     8
 3  METRO            Num     8  BEST12.
 5  MortgagePayment  Num     8
 9  MortgageStatus   Char   45
11  Ownership        Char    6
 1  SERIAL           Num     8
 8  state            Char   57

Output 2.3.1B: Using the PRINT Procedure to View Data

Obs  state           MortgageStatus                                 MortgagePayment  HomeValue  METRO
  1  South Carolina  Yes, mortgaged/ deed of trust or similar debt              200      32500      4
  2  North Carolina  No, owned free and clear                                     0       5000      1
  3  South Carolina  Yes, mortgaged/ deed of trust or similar debt              360      75000      4
  4  South Carolina  Yes, contract to purchase                                  430      22500      3
  5  North Carolina  Yes, mortgaged/ deed of trust or similar debt              450      65000      4

2.3.1 Assigning Labels and Using SAS Formats

As seen in Chapter 1, SAS variable names have a certain set of restrictions they must meet, including no special characters other than an underscore. This potentially limits the quality of the display for items such as the headers in PROC PRINT. SAS does permit the assignment of labels to variables, substituting more descriptive text into the output in place of the variable name, as demonstrated in Program 2.3.2.

Program 2.3.2: Assigning Labels

proc print data=bookdata.ipums2005mini(obs=5) noobs label;
   var state MortgageStatus MortgagePayment HomeValue Metro;
   label HomeValue='Value of Home ($)' state='State';
run;

 By default, the output from PROC PRINT includes an Obs column, which is simply the row number for the record—the NOOBS option in the PROC PRINT statement suppresses this column.

 Most SAS procedures use labels when they are provided or assigned; however, PROC PRINT defaults to using variable names. To use labels, the LABEL option is provided in the PROC PRINT statement. See Chapter Note 1 in Section 2.12 for more details.

 The LABEL statement assigns labels to selected variables. The general syntax is: LABEL variable1='label1' variable2='label2' …; where the labels are given as literal values in either single or double quotation marks, as long as the opening and closing quotation marks match.

Output 2.3.2: Assigning Labels

State           MortgageStatus                                 MortgagePayment  Value of Home ($)  METRO
South Carolina  Yes, mortgaged/ deed of trust or similar debt              200              32500      4
North Carolina  No, owned free and clear                                     0               5000      1
South Carolina  Yes, mortgaged/ deed of trust or similar debt              360              75000      4
South Carolina  Yes, contract to purchase                                  430              22500      3
North Carolina  Yes, mortgaged/ deed of trust or similar debt              450              65000      4

In addition to using labels to alter the display of variable names, altering the display of data values is possible with formats. The general form of a format reference is:

<$>format<w>.<d>

The <> symbols denote a portion of the syntax that is sometimes used/required—the <> characters are not part of the syntax. The dollar sign is required for any format that applies to a character variable (character formats) and is not permitted in formats used for numeric variables (numeric formats). The w value is the total number of characters (width) available for the formatted value, while d controls the number of digits displayed after the decimal for numeric formats. The dot is required in all format assignments and, in many cases, is the means by which the SAS compiler distinguishes between a variable name and a format name. The value of format is called the format name; however, standard numeric and character formats have a null name; for example, the 5.2 format assigns the standard numeric format with a total width of 5 and up to 2 digits displayed past the decimal. Program 2.3.3 uses the FORMAT statement to apply formats to the HomeValue, MortgagePayment, and MortgageStatus variables.

Program 2.3.3: Assigning Formats

proc print data=bookdata.ipums2005mini(obs=5) noobs label;
   var state MortgageStatus MortgagePayment HomeValue Metro;
   label HomeValue='Value of Home' state='State';
   format HomeValue MortgagePayment dollar9. MortgageStatus $1.;
run;

 In the FORMAT statement, a list of one or more variables is followed by a format specification. Both HomeValue and MortgagePayment are assigned a dollar format with a total width of nine—any commas and dollar signs inserted by this format count toward the total width.

 The MortgageStatus variable is character and can only be assigned a character format. The $1. format is the standard character format with width one, which truncates the display of MortgageStatus to one letter, but does not alter the actual value. In general, formats assigned in procedures are temporary and only apply to the output for the procedure.

Output 2.3.3: Assigning Formats

State           MortgageStatus  MortgagePayment  Value of Home  METRO
South Carolina  Y                          $200        $32,500      4
North Carolina  N                            $0         $5,000      1
South Carolina  Y                          $360        $75,000      4
South Carolina  Y                          $430        $22,500      3
North Carolina  Y                          $450        $65,000      4

2.3.2 PROC SORT and BY-Group Processing

Rows in a data set can be reordered using the SORT procedure to sort the data on the values of one or more variables in ascending or descending order. Program 2.3.4 sorts the BookData.Ipums2005Mini data set by the HomeValue variable.

Program 2.3.4: Sorting Data with the SORT Procedure

proc sort data=bookdata.ipums2005mini out=work.sorted;
   by HomeValue;
run;

proc print data=work.sorted(obs=5) noobs label;
   var state MortgageStatus MortgagePayment HomeValue Metro;
   label HomeValue='Value of Home' state='State';
   format HomeValue MortgagePayment dollar9. MortgageStatus $1.;
run;

 The default behavior of the SORT procedure is to replace the input data set, specified in the DATA= option, with the sorted data set. To create a new data set from the sorted observations, use the OUT= option.

 The BY statement is required in PROC SORT and must name at least one variable. As shown in Output 2.3.4, the rows are now ordered in increasing levels of HomeValue.

Output 2.3.4: Sorting Data with the SORT Procedure

State           MortgageStatus  MortgagePayment  Value of Home  METRO
North Carolina  N                            $0         $5,000      1
South Carolina  Y                          $430        $22,500      3
North Carolina  Y                          $300        $22,500      3
South Carolina  Y                          $200        $32,500      4
North Carolina  N                            $0        $45,000      1

Sorting on more than one variable gives a nested or hierarchical sorting. In those cases, values are ordered on the first variable, then for groups of records having the same value of the first variable those records are sorted on the second variable, and so forth. A specification of ascending (the default) or descending order is made for each variable. Program 2.3.5 sorts the BookData.Ipums2005Mini data set on three variables present in the data set.

Program 2.3.5: Sorting on Multiple Variables

proc sort data=bookdata.ipums2005mini out=work.sorted;
   by MortgagePayment descending State descending HomeValue;
run;

proc print data=work.sorted(obs=6) noobs label;
   var state MortgageStatus MortgagePayment HomeValue Metro;
   label HomeValue='Value of Home' state='State';
   format HomeValue MortgagePayment dollar9. MortgageStatus $1.;
run;

 The first sort is on MortgagePayment, in ascending order. Since 0 is the lowest value and that value occurs on six records in the data set, Output 2.3.5 shows one block of records with MortgagePayment 0.

 The next sort is on State in descending order—note that the DESCENDING option precedes the variable it applies to. For the six records shown in Output 2.3.5, the first three are South Carolina and the final three are North Carolina—descending alphabetical order. Note, when sorting character data, casing matters—uppercase values are before lowercase in such a sort. For more details about determining the sort order of character data, see Chapter Note 2 in Section 2.12.

 The final sort is on HomeValue, also in descending order—note that the DESCENDING option must precede each variable it applies to. So, within each State group in Output 2.3.5, values of the HomeValue variable are in descending order.

Output 2.3.5: Sorting on Multiple Variables

State           MortgageStatus  MortgagePayment  Value of Home  METRO
South Carolina  N                            $0       $137,500      3
South Carolina  N                            $0        $95,000      4
South Carolina  N                            $0        $45,000      3
North Carolina  N                            $0       $162,500      0
North Carolina  N                            $0        $45,000      1
North Carolina  N                            $0         $5,000      1

Most SAS procedures, including PROC PRINT, can take advantage of BY-group processing for data that is sorted into groups. The procedure must use a BY statement that corresponds to the sorting in the data set. If the data is sorted using PROC SORT, the BY statement in a subsequent procedure does not have to completely match the BY statement in PROC SORT; however, it must match the first level of sorting if only one variable is included, the first two levels if two variables are included, and so forth. It must also match ordering, ascending or descending, on each included variable. Program 2.3.6 groups output from the PRINT procedure based on BY grouping constructed with PROC SORT.

Program 2.3.6: BY-Group Processing in PROC PRINT

proc sort data=bookdata.ipums2005mini out=work.sorted;
   by MortgageStatus State descending HomeValue;
run;

proc print data=work.sorted noobs label;
   by MortgageStatus State;
   var MortgagePayment HomeValue Metro;
   label HomeValue='Value of Home' state='State';
   format HomeValue MortgagePayment dollar9. MortgageStatus $9.;
run;

 The original data is sorted first on MortgageStatus, then on State, and finally in descending order of HomeValue for each combination of MortgageStatus and State.

 PROC PRINT uses a BY statement matching on the MortgageStatus and State variables, which groups the output into sections based on each unique combination of values for these two variables, with the final sorting on HomeValue appearing in each table. Note that a BY statement with only MortgageStatus can be used as well, but a BY statement with only State cannot—the data is not sorted on State primarily.

Output 2.3.6: BY-Group Processing in PROC PRINT (First 2 of 6 Groups Shown)

MortgageStatus=No, owned State=North Carolina

MortgagePayment  Value of Home  METRO
             $0       $162,500      0
             $0        $45,000      1
             $0         $5,000      1

MortgageStatus=No, owned State=South Carolina

MortgagePayment  Value of Home  METRO
             $0       $137,500      3
             $0        $95,000      4
             $0        $45,000      3

The structure of BY groups in PROC PRINT can be altered slightly through use of an ID statement, as shown in Program 2.3.7. Assuming the variables listed in the ID statement match those in the BY statement, BY-group variables are placed as the left-most columns of each table, rather than between tables.

Program 2.3.7: Using BY and ID Statements Together in PROC PRINT

proc print data=work.sorted noobs label;
   by MortgageStatus State;
   id MortgageStatus State;
   var MortgagePayment HomeValue Metro;
   label HomeValue='Value of Home' state='State';
   format HomeValue MortgagePayment dollar9. MortgageStatus $9.;
run;

Output 2.3.7: Using BY and ID Statements Together in PROC PRINT (First 2 of 6 Groups Shown)

MortgageStatus  State           MortgagePayment  Value of Home  METRO
No, owned       North Carolina               $0       $162,500      0
                                             $0        $45,000      1
                                             $0         $5,000      1

MortgageStatus  State           MortgagePayment  Value of Home  METRO
No, owned       South Carolina               $0       $137,500      3
                                             $0        $95,000      4
                                             $0        $45,000      3

PROC PRINT is limited in its ability to do computations (later in this text, the REPORT procedure is used to create various summary tables); however, it can produce sums of numeric variables with the SUM statement, as shown in Program 2.3.8.

Program 2.3.8: Using the SUM Statement in PROC PRINT

proc print data=work.sorted noobs label;
   by MortgageStatus State;
   id MortgageStatus State;
   var MortgagePayment HomeValue Metro;
   sum MortgagePayment HomeValue;
   label HomeValue='Value of Home' state='State';
   format HomeValue MortgagePayment dollar9. MortgageStatus $9.;
run;

Output 2.3.8: Using the SUM Statement in PROC PRINT (Last of 6 Groups Shown)

MortgageStatus  State           MortgagePayment  Value of Home  METRO
Yes, mort       South Carolina             $360        $75,000      4
                                           $500        $65,000      3
                                           $200        $32,500      4
Yes, mort       South Carolina           $1,060       $172,500
Yes, mort                                $2,200       $315,000
                                         $4,230        1200000

Sums are produced at the end of each BY group (and the SUMBY statement is available to modify this behavior), and at the end of the full table. Note that the format applied to the HomeValue column is not sufficient to display the grand total with the dollar sign and comma. If a format is of insufficient width, SAS removes what it determines to be the least important characters. However, it is considered good programming practice to determine the minimum format width needed for all values a format is applied to. If the format does not include sufficient width to display the value with full precision, then SAS may adjust the included format to a different format. See Chapter Note 3 in Section 2.12 for further discussion on format widths.
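To see the width issue resolved, Program 2.3.8 can be rerun with a wider format. The sketch below is not from the text, but it uses only statements already shown; DOLLAR11. leaves room for the ten characters of $1,200,000, so the grand total displays with full formatting.

```sas
proc print data=work.sorted noobs label;
   by MortgageStatus State;
   id MortgageStatus State;
   var MortgagePayment HomeValue Metro;
   sum MortgagePayment HomeValue;
   label HomeValue='Value of Home' state='State';
   /* dollar11. allows up to 11 characters, enough for $1,200,000 */
   format HomeValue MortgagePayment dollar11. MortgageStatus $9.;
run;
```

In general, choose the format width from the largest value the format is ever applied to, including any totals, not just the individual data values.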

2.4 Using the MEANS Procedure for Quantitative Summaries

Producing tables of statistics like those shown for the case study in Outputs 2.2.1 and 2.2.2 uses the MEANS procedure. This section covers the fundamentals of PROC MEANS, including selecting variables for analysis, choosing statistics, and separating analyses across categories.

2.4.1 Choosing Analysis Variables and Statistics in PROC MEANS

To begin, make sure the BookData library is assigned as done in Chapter 1, submit PROC CONTENTS on the IPUMS2005Basic SAS data set from the BookData library, and review the output. Also, to ensure familiarity with the data, open the data set for viewing or run the PRINT procedure to direct it to an output table. Once these steps are complete, enter and submit the code given in Program 2.4.1.

Program 2.4.1: Default Statistics and Behavior for PROC MEANS

options nolabel;

proc means data=BookData.IPUMS2005Basic;
run;

For variables that have labels, PROC MEANS includes them as a column in the output table; using NOLABEL in the OPTIONS statement suppresses their use. Here DATA= is technically an option; however, the default data set in any SAS session is the last data set created. If no data sets have been created during the session, which is the most likely scenario currently, PROC MEANS does not have a data set to process unless this option is provided. Beyond having a data set to work with, no other options or statements are required for PROC MEANS to compile and execute successfully. In this case, the default behavior, as shown in Output 2.4.1, is to summarize all numeric variables on a set of five statistics: number of nonmissing observations, mean, standard deviation, minimum, and maximum.

Output 2.4.1: Default Statistics and Behavior for PROC MEANS

Variable         N        Mean         Std Dev      Minimum    Maximum
SERIAL           1159062  621592.24    359865.41    2.0000000  1245246.00
COUNTYFIPS       1159062  42.2062901   78.9543285   0          810.0000000
METRO            1159062  2.5245354    1.3085302    0          4.0000000
CITYPOP          1159062  2916.66      12316.27     0          79561.00
MortgagePayment  1159062  500.2042634  737.9885592  0          7900.00
HHIncome         1159062  63679.84     66295.97     -29997.00  1739770.00
HomeValue        1159062  2793526.49   4294777.18   5000.00    9999999.00

SAS differentiates variable types as numeric and character only; therefore, variables stored as numeric that are not quantitative are summarized even if those summaries do not make sense. Here, the Serial, CountyFIPS, and Metro variables are stored as numbers, but means and standard deviations are of no utility on these since they are nominal. It is, of course, important to understand the true role and level of measurement (for instance, nominal versus ratio) for the variables in the data set being analyzed.

To select the variables for analysis, the MEANS procedure includes the VAR statement. Any variables listed in the VAR statement must be numeric, but should also be appropriate for quantitative summary statistics. As in the previous example, the summary for each variable is listed in its own row in the output table. (If only one variable is provided, it is named in the header above the table instead of in the first column.) Program 2.4.2 modifies Program 2.4.1 to summarize only the truly quantitative variables from BookData.IPUMS2005Basic, with the results shown in Output 2.4.2.

Program 2.4.2: Selecting Analysis Variables Using the VAR Statement in MEANS

proc means data=BookData.IPUMS2005Basic;
   var Citypop MortgagePayment HHIncome HomeValue;
run;

Output 2.4.2: Selecting Analysis Variables Using the VAR Statement in MEANS

Variable         N        Mean         Std Dev      Minimum    Maximum
CITYPOP          1159062  2916.66      12316.27     0          79561.00
MortgagePayment  1159062  500.2042634  737.9885592  0          7900.00
HHIncome         1159062  63679.84     66295.97     -29997.00  1739770.00
HomeValue        1159062  2793526.49   4294777.18   5000.00    9999999.00

The default summary statistics for PROC MEANS can be modified by including statistic keywords as options in the PROC MEANS statement. Several statistics are available, with the available set listed in the SAS Documentation, and any subset of those may be used. The listed order of the keywords corresponds to the order of the statistic columns in the table, and those replace the default statistic set. One common set of statistics is the five-number summary (minimum, first quartile, median, third quartile, and maximum), and Program 2.4.3 provides a way to generate these statistics for the four variables summarized in the previous example.

Program 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS

proc means data=BookData.IPUMS2005Basic min q1 median q3 max;
   var Citypop MortgagePayment HHIncome HomeValue;
run;

Output 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS

Variable         Minimum    Lower Quartile  Median     Upper Quartile  Maximum
CITYPOP          0          0               0          0               79561.00
MortgagePayment  0          0               0          830.0000000     7900.00
HHIncome         -29997.00  24000.00        47200.00   80900.00        1739770.00
HomeValue        5000.00    112500.00       225000.00  9999999.00      9999999.00

Confidence limits for the mean are included in the keyword set, both as a pair with the CLM keyword and separately with LCLM and UCLM. The default confidence level is 95%, but it can be changed by setting the error rate with the ALPHA= option. Consider Program 2.4.4, which constructs 99% confidence intervals for the means, with the estimated mean between the lower and upper limits.

Program 2.4.4: Using the ALPHA= Option to Modify Confidence Levels

proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01;
   var Citypop MortgagePayment HHIncome HomeValue;
run;

Output 2.4.4: Using the ALPHA= Option to Modify Confidence Levels

Variable         Lower 99% CL for Mean  Mean         Upper 99% CL for Mean
CITYPOP          2887.19                2916.66      2946.12
MortgagePayment  498.4385749            500.2042634  501.9699520
HHIncome         63521.22               63679.84     63838.46
HomeValue        2783250.94             2793526.49   2803802.04

There are also options for controlling the column display; rounding can be controlled by the MAXDEC= option (maximum number of decimal places). Program 2.4.5 modifies the previous example to report the statistics to a single decimal place.

Program 2.4.5: Using MAXDEC= to Control Precision of Results

proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01 maxdec=1;
   var Citypop MortgagePayment HHIncome HomeValue;
run;

Output 2.4.5: Using MAXDEC= to Control Precision of Results

Variable         Lower 99% CL for Mean  Mean       Upper 99% CL for Mean
CITYPOP          2887.2                 2916.7     2946.1
MortgagePayment  498.4                  500.2      502.0
HHIncome         63521.2                63679.8    63838.5
HomeValue        2783250.9              2793526.5  2803802.0

MAXDEC= is limited in that it sets the precision for all columns. Also, no direct formatting of the statistics is available. The REPORT procedure, introduced in Chapter 4 and discussed in detail in Chapters 6 and 7, provides much more control over the displayed table at the cost of increased complexity of the syntax.

2.4.2 Using the CLASS Statement in PROC MEANS

In several instances, it is desirable to split an analysis across a set of categories and, if those categories are defined by a variable in the data set, PROC MEANS can separate those analyses using a CLASS statement. The CLASS statement accepts either numeric or character variables; however, the role assigned to class variables by SAS is special. Any variable included in the CLASS statement (regardless of type) is taken as categorical, which results in each distinct value of the variable corresponding to a unique category. Therefore, variables used in the CLASS statement should provide useful groupings or, as shown in Section 2.5, be formatted into a set of desired groups. Two examples follow, the first (Program 2.4.6) providing an illustration of a reasonable class variable, the second (Program 2.4.7) showing a poor choice.

Program 2.4.6: Setting a Class Variable in PROC MEANS

proc means data=BookData.IPUMS2005Basic;
   class MortgageStatus;
   var HHIncome;
run;

Output 2.4.6: Setting a Class Variable in PROC MEANS

Analysis Variable : HHIncome

MortgageStatus                                 N Obs   N       Mean      Std Dev   Minimum    Maximum
N/A                                            303342  303342  37180.59  39475.13  -19998.00  1070000.00
No, owned free and clear                       300349  300349  53569.08  63690.40  -22298.00  1739770.00
Yes, contract to purchase                        9756    9756  51068.50  46069.11   -7599.00   834000.00
Yes, mortgaged/ deed of trust or similar debt  545615  545615  84203.70  72997.92  -29997.00  1407000.00

In this data, MortgageStatus provides a clear set of distinct categories and is potentially useful for subsetting the summarization of the data. In Program 2.4.7, Serial is used as an extreme example of a poor choice since Serial is unique to each household.

Program 2.4.7: A Poor Choice for a Class Variable

proc means data=BookData.IPUMS2005Basic;
   class Serial;
   var HHIncome;
run;

Output 2.4.7: A Poor Choice for a Class Variable (Partial Table Shown)

Analysis Variable : HHIncome
SERIAL  N Obs  N  Mean       Std Dev  Minimum    Maximum
2       1      1  12000.00   .        12000.00   12000.00
3       1      1  17800.00   .        17800.00   17800.00
4       1      1  185000.00  .        185000.00  185000.00
5       1      1  2000.00    .        2000.00    2000.00

Choosing Serial as a class variable results in each class being a single observation, making the mean, minimum, and maximum the same value and creating a situation where the standard deviation is undefined. Again, this would be an extreme case; however, class variables are best when structured to produce relatively few classes that represent a useful stratification of the data.

Of course, more than one variable can be used in a CLASS statement; the categories are then defined as all combinations of the categories from the individual variables. The order of the variables listed in the CLASS statement only alters the nesting order of the levels; therefore, the same information is produced in a different row order in the table. Consider the two MEANS procedures in Program 2.4.8.

Program 2.4.8: Using Multiple Class Variables and Effects of Order

proc means data=BookData.IPUMS2005Basic nonobs n mean std;
   class MortgageStatus Metro;
   var HHIncome;
run;

proc means data=BookData.IPUMS2005Basic nonobs n mean std;
   class Metro MortgageStatus;
   var HHIncome;
run;

Output 2.4.8A: Using Multiple Class Variables (Partial Listing)

Analysis Variable : HHIncome

MortgageStatus            METRO  N      Mean      Std Dev
N/A                       0      19009  31672.81  32122.89
                          1      48618  29122.73  29160.23
                          2      69201  38749.69  46226.50
                          3      73234  43325.25  42072.78
                          4      93280  36514.56  36974.63
No, owned free and clear  0      30370  46533.14  50232.50
                          1      85696  42541.06  44664.64
                          2      27286  60011.10  76580.75
                          3      76727  63925.99  75404.62
                          4      80270  55915.02  66293.39

Output 2.4.8B: Effects of Order (Partial Listing)

Analysis Variable : HHIncome

METRO  MortgageStatus                                 N      Mean      Std Dev
0      N/A                                            19009  31672.81  32122.89
       No, owned free and clear                       30370  46533.14  50232.50
       Yes, contract to purchase                       1030  46069.26  36225.80
       Yes, mortgaged/ deed of trust or similar debt  41619  71611.01  55966.31
1      N/A                                            48618  29122.73  29160.23
       No, owned free and clear                       85696  42541.06  44664.64
       Yes, contract to purchase                       3034  42394.12  35590.14
       Yes, mortgaged/ deed of trust or similar debt  93427  62656.54  48808.66

The same statistics are present in both tables, but the primary ordering is on MortgageStatus in Output 2.4.8A as opposed to metropolitan status (Metro) in Output 2.4.8B. Two additional items of note from this example: first, note the use of NONOBS in each. By default, using a CLASS statement always produces a column for the number of observations in each class level (NOBS), and this may be different from the statistic N due to missing data, but that is not an issue for this example. Second, the numeric values of Metro really have no clear meaning. Titles and footnotes, as shown in Chapter 1, are available to add information about the meaning of these numeric values. However, a better solution is to build a format and apply it to that variable, a concept covered in the next section.
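Before formats are introduced, a TITLE statement (shown in Chapter 1) can serve as a stopgap for documenting the Metro codes. The sketch below is illustrative rather than code from the text; the code meanings are those given for the case study.

```sas
title 'Household Income Summaries; Metro codes: 0=Not Identifiable, 1=Not in Metro Area,';
title2 '2=Metro Inside City, 3=Metro Outside City, 4=Metro City Status Unknown';
proc means data=BookData.IPUMS2005Basic nonobs n mean std;
   class Metro MortgageStatus;
   var HHIncome;
run;
title;  /* clear the titles so they do not carry over to later output */
```

This documents the codes above the table but leaves the Metro column itself unchanged, which is why a user-defined format is the better solution.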

2.5 User-Defined Formats

As seen in Section 2.3, SAS provides a variety of formats for altering the display of data values. It is also possible to define formats using the FORMAT procedure. These formats assign replacements for individual data values or for groups or ranges of values, and they may be permanently stored in a library for subsequent use. Formats, both native SAS formats and user-defined formats, are an invaluable tool used in a variety of contexts throughout this book.

2.5.1 The FORMAT Procedure

The FORMAT procedure provides the ability to create custom formats, both for character and numeric variables. The principal tool used in writing formats is the VALUE statement, which defines the name of the format and its rules for converting data values to formatted values. Program 2.5.1 gives an example of a format written to improve the display of the Metro variable from the BookData.IPUMS2005Basic data set.

Program 2.5.1: Defining a Format for the Metro Variable

proc format;
  value Metro
    0 = "Not Identifiable"
    1 = "Not in Metro Area"
    2 = "Metro, Inside City"
    3 = "Metro, Outside City"
    4 = "Metro, City Status Unknown"
  ;
run;

 The VALUE statement tends to be rather long given the number of items it defines. Remember, SAS code is generally free-form outside of required spaces and delimiters, along with the semicolon that ends every statement. Adopt a sound strategy for using indentation and line breaks to make code readable.

The VALUE statement requires the format name, which follows the SAS naming conventions of up to 32 characters, but with some special restrictions. Format names must meet an additional restriction of being distinct from the names of any formats supplied by SAS. Also, given that numbers are used to set format widths, a number at the end of a format name would create an ambiguity in setting widths; therefore, format names cannot end with a number. If the format is for character values, the name must begin with $, and that character counts toward the 32-character limit.

In this format, individual values are set equal to their replacements (as literals) for all values intended to be formatted. Values other than 0, 1, 2, 3, and 4 are not converted, so they may not appear as intended. For a discussion of displaying values other than those that appear in the VALUE statement, see Chapter Note 4 in Section 2.12.

The semicolon that ends the VALUE statement is set out on its own line here for readability, making it easy to verify that it is present.

Submitting Program 2.5.1 creates a format named Metro in the format catalog of the Work library. The format takes effect only when used, and it is used in effectively the same manner as a format supplied by SAS. Program 2.5.2 applies the Metro format to the class variable Metro to alter the appearance of its values in Output 2.5.2. Note that since the variable Metro and the format Metro have the same name, and since no width is required, the only syntax element that distinguishes them to the SAS compiler is the required dot (.) at the end of the format name.

Program 2.5.2: Using the Metro Format

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;
  class Metro;
  var HHIncome;
  format Metro Metro.;
run;

Output 2.5.2: Using the Metro Format

Analysis Variable : HHIncome
METRO                            N   Mean  Std Dev  Minimum  Maximum
Not Identifiable             92028  54800    52333   -19998  1076000
Not in Metro Area           230775  47856    45547   -29997  1050000
Metro, Inside City          154368  60328    70874   -19998  1391000
Metro, Outside City         340982  77648    75907   -29997  1739770
Metro, City Status Unknown  340909  64335    66110   -22298  1536000

For this case, a simplified format that distinguishes metro, non-metro, and non-identifiable observations may be desired. Program 2.5.3 contains two approaches, the first of which is clearly the more efficient.

Program 2.5.3: Assigning Multiple Values to the Same Formatted Value

proc format;
  value MetroB
    0 = "Not Identifiable"
    1 = "Not in Metro Area"
    2,3,4 = "In a Metro Area"
  ;
  value MetroC
    0 = "Not Identifiable"
    1 = "Not in Metro Area"
    2 = "In a Metro Area"
    3 = "In a Metro Area"
    4 = "In a Metro Area"
  ;
run;

 A comma-separated list of values is legal on the left side of each assignment, which assigns the formatted value to each listed data value.

This format accomplishes the same result; however, it is important that the literal values on the right side of each assignment are exactly the same. Differences in even simple items like spacing or casing result in different formatted values.

Either format given in Program 2.5.3 can replace the Metro format in Program 2.5.2 to create the result in Output 2.5.3.

Output 2.5.3: Assigning Multiple Values to the Same Formatted Value

Analysis Variable : HHIncome
METRONMeanStd DevMinimumMaximum
Not Identifiable920285480052333-199981076000
Not in Metro Area2307754785645547-299971050000
In a Metro Area8362596902471495-299971739770

It is also possible to use the dash character as an operator in the form ValueA-ValueB to define a range on the left side of an assignment, which assigns the formatted value to every data value between ValueA and ValueB, inclusive. Program 2.5.4 gives an alternate strategy for constructing the formats in Program 2.5.3; this format can also be placed into Program 2.5.2 to produce Output 2.5.3.

Program 2.5.4: Assigning a Range of Values to a Single Formatted Value

proc format;
  value MetroD
    0 = "Not Identifiable"
    1 = "Not in Metro Area"
    2-4 = "In a Metro Area"
  ;
run;

Certain keywords are also available for use on the left side of an assignment, one of which is OTHER. OTHER applies the assigned format to any value not listed on the left side of an assignment elsewhere in the format definition. Program 2.5.5 uses OTHER to give another method for creating a format that can be used to generate Output 2.5.3. It is important to note that using OTHER often requires significant knowledge of exactly what values are present in the data set.

Program 2.5.5: Using the OTHER Keyword to Format the Remaining Values

proc format;
  value MetroE
    0 = "Not Identifiable"
    1 = "Not in Metro Area"
    other = "In a Metro Area"
  ;
run;

In general, value ranges should be non-overlapping, and the < symbol—called an exclusion operator in this context—can be used at either end (or both ends) of the dash to indicate the value should not be included in the range. Overlapping ranges are discussed in Chapter Note 5 in Section 2.12. Using exclusion operators to create non-overlapping ranges allows for the categorization of a quantitative variable without having to know the precision of measurement. Program 2.5.6 gives two variations on creating bins for the MortgagePayment data and uses those bins as classes in PROC MEANS, with the results shown in Output 2.5.6A and Output 2.5.6B.

Program 2.5.6: Binning a Quantitative Variable Using a Format

proc format;
  value Mort
    0 = 'None'
    1-350 = '$350 and Below'
    351-1000 = '$351 to $1000'
    1001-1600 = '$1001 to $1600'
    1601-high = 'Over $1600'
  ;
  value MortB
    0 = 'None'
    1-350 = '$350 and Below'
    350<-1000 = 'Over $350, up to $1000'
    1000<-1600 = 'Over $1000, up to $1600'
    1600<-high = 'Over $1600'
  ;
run;

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;
  class MortgagePayment;
  var HHIncome;
  format MortgagePayment Mort.;
run;

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;
  class MortgagePayment;
  var HHIncome;
  format MortgagePayment MortB.;
run;

 The keywords LOW and HIGH are available so that the maximum and minimum values need not be known. When applied to character data, LOW and HIGH refer to the sorted alphanumeric values. Note that the LOW keyword excludes missing values for numeric variables but includes missing values for character variables.

 In these value ranges, the values used exploit the fact that the mortgage payments are reported to the nearest dollar.

 Using the < symbol to not include the starting ranges allows the bins to be mutually exclusive and exhaustive irrespective of the precision of the data values. The exclusion operator, <, omits the adjacent value from the range so that 350<-1000 omits only 350, 350-<1000 omits only 1000, and 350<-<1000 omits both 350 and 1000.

 When a format is present for a class variable, the format is used to construct the unique values for each category, and this behavior persists in most cases where SAS treats a variable as categorical.

Output 2.5.6A: Binning a Quantitative Variable Using the Mort Format

Analysis Variable : HHIncome
MortgagePayment        N    Mean  Std Dev  Minimum  Maximum
None              603691   45334    53557   -22298  1739770
$350 and Below     59856   47851    42062   -16897   841000
$351 to $1000     283111   64992    45107   -19998  1060000
$1001 to $1600    128801   96107    63008   -29997  1125000
Over $1600         83603  153085   117134   -29997  1407000

Output 2.5.6B: Binning a Quantitative Variable Using the MortB Format

Analysis Variable : HHIncome
MortgagePayment                N    Mean  Std Dev  Minimum  Maximum
None                      603691   45334    53557   -22298  1739770
$350 and Below             59856   47851    42062   -16897   841000
Over $350, up to $1000    283111   64992    45107   -19998  1060000
Over $1000, up to $1600   128801   96107    63008   -29997  1125000
Over $1600                 83603  153085   117134   -29997  1407000

2.5.2 Permanent Storage and Inspection of Defined Formats

Formats can be permanently stored in a catalog (with the default name of Formats) in any assigned SAS library via the use of the LIBRARY= option in the PROC FORMAT statement. As an example, consider Program 2.5.7, which is a revision and extension of Program 2.5.6.

Program 2.5.7: Revisiting Program 2.5.6, Adding LIBRARY= and FMTLIB Options

proc format library=sasuser;
  value Mort
    0 = 'None'
    1-350 = '$350 and Below'
    351-1000 = '$351 to $1000'
    1001-1600 = '$1001 to $1600'
    1601-high = 'Over $1600'
  ;
  value MortB
    0 = 'None'
    1-350 = '$350 and Below'
    350<-1000 = 'Over $350, up to $1000'
    1000<-1600 = 'Over $1000, up to $1600'
    1600<-high = 'Over $1600'
  ;
run;

proc format fmtlib library=sasuser;
run;

Using the LIBRARY= option in this manner places the format definitions into the Formats catalog in the Sasuser library; accessing them in subsequent coding sessions requires that the system option FMTSEARCH=(SASUSER) be specified prior to their use. An alternate format catalog can also be used via two-level naming of the form libref.catalog, with the catalog being created if it does not already exist. Any catalog in any library that contains stored formats to be used in a given session can be listed as a set inside the parentheses following the FMTSEARCH= option. The catalogs listed are searched in the given order, with WORK.FORMATS implicitly defined as the first catalog to be searched unless it is included explicitly in the list.
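A sketch of this storage-and-retrieval pattern follows; the catalog name MyFmts and format name MetroX are illustrative assumptions, not items from the text:

```sas
/* Store a format in a custom catalog via two-level naming;
   the catalog is created if it does not already exist */
proc format library=sasuser.MyFmts;
  value MetroX
    0 = 'Not Identifiable'
    1 = 'Not in Metro Area'
    other = 'In a Metro Area'
  ;
run;

/* In a later session, add the catalogs to the format search path.
   Work.Formats is implicitly searched first unless listed explicitly. */
options fmtsearch=(sasuser sasuser.MyFmts);
```

After the OPTIONS statement runs, a FORMAT statement such as format Metro MetroX. can locate the stored format in any procedure.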

The FMTLIB option shows information about the formats in the chosen library in the Output window; Output 2.5.7 shows the results for this case.

Output 2.5.7: Revisiting Program 2.5.6, Adding LIBRARY= and FMTLIB Options


The top of the table includes general information about the format, including the name, various lengths, and the number of format categories. The default length corresponds to the longest format label set in the VALUE statement. The rows below have columns for each format label and the start and end of each value range. Note that the first category in each of these formats is assigned to a range, even though it only contains a single value, with the start and end values being the same. The use of < as an exclusion operator is also shown in the ranges where it is used, and the keyword HIGH is left-justified in the column where it is used. Note that the exclusion operator is applied to the value 1600 at the low end of the range; it is a syntax error to attempt to apply it to the keyword HIGH (or LOW).
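As catalogs grow, it can help to restrict the FMTLIB listing to specific formats. The sketch below uses the SELECT statement of PROC FORMAT, which limits the display to the named entries:

```sas
/* Display catalog information for only the named formats */
proc format fmtlib library=sasuser;
  select Mort MortB;
run;
```

An EXCLUDE statement is also available and works in the opposite manner, listing everything except the named formats.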

2.6 Subsetting with the WHERE Statement

In many cases, only a subset of the data is used, with the subsetting criteria based on the values of variables in the data set. In these cases, the WHERE statement sets conditions that determine which records a SAS procedure processes, ignoring the others; no modification to the data set itself is required. If the OBS= data set option is in use, the number chosen corresponds to the number of observations meeting the WHERE condition.
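As a brief sketch of that interaction (not a program from the text), the step below asks for five observations while a WHERE condition is active, so all five come from records that satisfy the condition:

```sas
/* OBS=5 counts observations that meet the WHERE condition,
   not the first five rows of the data set */
proc print data=BookData.IPUMS2005Basic (obs=5);
  where Metro eq 1;
  var Metro MortgageStatus HHIncome;
run;
```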

In order to use the WHERE statement, it is important to understand the comparison and logical operators available. Basic comparisons like equality or various inequalities can be done with symbolic or mnemonic operators—Table 2.6.1 shows the set of comparison operators.

Table 2.6.1: Comparison Operators

Operation              Symbol  Mnemonic
Equal                    =       EQ
Not Equal                ^=      NE
Less Than                <       LT
Less Than or Equal       <=      LE
Greater Than             >       GT
Greater Than or Equal    >=      GE

In addition to comparison operators, Boolean operators for negation and compounding (along with some special operators) are also available—Table 2.6.2 summarizes these operators.

Table 2.6.2: Boolean and Associated Operators

Symbol  Mnemonic     Logic
&       AND          True result if both conditions are true
|       OR           True result if either, or both, conditions are true
        IN           True if matches any element in a list
        BETWEEN-AND  True if in a range of values (including endpoints)
~       NOT          Negates the condition that follows

Revisiting Program 2.5.2 and Output 2.5.2, subsetting the results to only include observations known to be in a metro area can be accomplished with any one of the following WHERE statements.

 where Metro eq 2 or Metro eq 3 or Metro eq 4;

 where Metro ge 2 and Metro le 4;

 where Metro in (2,3,4);

 where Metro between 2 and 4;

 where Metro not in (0,1);

Each possible value can be checked by using the OR operator between equality comparisons. When using OR, each comparison must be complete and specific. For example, it is not legal to write Metro eq 2 or eq 3 or eq 4. It is legal, but unhelpful, to write Metro eq 2 or 3 or 4, because SAS uses numeric values for truth (it has no Boolean variables): 0 and missing are false, while any other value is true. Hence, Metro eq 2 or 3 or 4 is a condition that is always true.

 This conditioning takes advantage of the fact that the desired values fall into a range. As with OR, each condition joined by the AND must be complete; again, it is not legal to say: Metro ge 2 and le 4. Also, with knowledge of the values of Metro, this condition could have been simplified to Metro ge 2. However, good programming practice dictates that specificity is preferred to avoid incorrect assumptions about data values.

IN allows for simplification of a set of conditions that might otherwise be written using the OR operator, as in the first WHERE statement above. The list is given as a set of values separated by commas or spaces and enclosed in parentheses.

BETWEEN-AND allows for simplification of a value range that can otherwise be written using AND between appropriate comparisons, as in the second WHERE statement above.

The NOT operator makes the truth condition the opposite of what follows it. This is a slight improvement over the IN comparison above, as the list of values not desired is shorter than the list of those that are.

Adding any of these WHERE statements (or any other logically equivalent WHERE statement) to Program 2.5.2 produces the results shown in Table 2.6.3.

Table 2.6.3: Using WHERE to Subset Results to Specific Values of the Metro Variable

Analysis Variable : HHIncome
METRO                            N   Mean  Std Dev  Minimum  Maximum
Metro, Inside City          154368  60328    70874   -19998  1391000
Metro, Outside City         340982  77648    75907   -29997  1739770
Metro, City Status Unknown  340909  64335    66110   -22298  1536000

The tools available allow for conditioning on more than one variable, and the variable(s) conditioned on need only be in the data set in use and do not have to be present in the output generated. In Program 2.6.1, the output is conditioned additionally on households known to have an outstanding mortgage.

Program 2.6.1: Conditioning on a Variable Not Used in the Analysis

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;
  class Metro;
  var HHIncome;
  format Metro Metro.;
  where Metro in (2,3,4) and MortgageStatus in
    ('Yes, contract to purchase',
     'Yes, mortgaged/ deed of trust or similar debt');
run;

Output 2.6.1: Conditioning on a Variable Not Used in the Analysis

Analysis Variable : HHIncome
METRO                            N   Mean  Std Dev  Minimum  Maximum
Metro, Inside City           57881  86277    82749   -19998  1361000
Metro, Outside City         191021  96319    80292   -29997  1266000
Metro, City Status Unknown  167359  83879    72010   -19998  1407000

The condition on the MortgageStatus variable is a bit daunting, particularly noting that matching character values is a precise operation. Seemingly simple differences like casing or spacing lead to values that are non-matching. Therefore, the literals used in Program 2.6.1 are specified to be an exact match for the data. In Section 3.9, functions are introduced that are useful in creating consistency among character values, along with others that allow for extraction and use of relevant portions of a string. However, the WHERE statement provides some special operators, shown in Table 2.6.4, that allow for simplification in these types of cases without the need to intervene with a function.

Table 2.6.4: Operators for General Comparisons

Symbol  Mnemonic  Logic
?       CONTAINS  True result if the specified value is contained in the data value (character only)
        LIKE      True result if the data value matches the specified value, which may include wildcards: _ is any single character, % is any set of characters

Program 2.6.2 offers two methods for simplifying the condition on MortgageStatus, one using CONTAINS, the other using LIKE. Either reproduces Output 2.6.1.

Program 2.6.2: Conditioning on a Variable Using General Comparison Operators

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;
  class Metro;
  var HHIncome;
  format Metro Metro.;
  where Metro in (2,3,4) and MortgageStatus contains 'Yes';
run;

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;
  class Metro;
  var HHIncome;
  format Metro Metro.;
  where Metro in (2,3,4) and MortgageStatus like '%Yes%';
run;

 CONTAINS checks to see if the data value contains the string Yes; again, note that the casing must be correct to ensure a match. Also, ensure single or double quotation marks enclose the value to search for—in this case, without the quotation marks, Yes forms a legal variable name and is interpreted by the compiler as a reference to a variable.

 LIKE allows for the use of wildcards as substitutes for non-essential character values. Here the % wildcard before and after Yes results in a true condition if Yes appears anywhere in the string and is thus logically equivalent to the CONTAINS conditioning above.

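As a sketch of the two wildcards individually (the conditions assume the MortgageStatus values shown earlier in this chapter):

```sas
proc freq data=BookData.IPUMS2005Basic;
  table MortgageStatus;
  /* % matches any run of characters (including none),
     so this matches values that begin with Yes */
  where MortgageStatus like 'Yes%';
run;

proc freq data=BookData.IPUMS2005Basic;
  table MortgageStatus;
  /* _ matches exactly one character, so N_A matches
     the three-character value N/A */
  where MortgageStatus like 'N_A';
run;
```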
2.7 Using the FREQ Procedure for Categorical Summaries

To produce tables of frequencies and relative frequencies (percentages) like those shown for the case study in Outputs 2.2.3 and 2.2.4, the FREQ procedure is the tool of choice, and this section covers its fundamentals.

2.7.1 Choosing Analysis Variables in PROC FREQ

As in previous sections, the examples here use the IPUMS2005Basic SAS data set, so make sure the BookData library is assigned. As a first step, enter and submit Program 2.7.1. (Note that the use of labels has been re-established in the OPTIONS statement.)

Program 2.7.1: PROC FREQ with Variables Listed Individually in the TABLE Statement

options label;

proc freq data=BookData.IPUMS2005Basic;
  table metro mortgageStatus;
run;

The TABLE statement allows for specification of the variables to summarize, and a space-delimited list of variables produces a one-way frequency table for each, as shown in Output 2.7.1.

Output 2.7.1: PROC FREQ with Variables Listed Individually in the TABLE Statement

Metropolitan status
METRO  Frequency  Percent  Cumulative Frequency  Cumulative Percent
0          92028     7.94                 92028                7.94
1         230775    19.91                322803               27.85
2         154368    13.32                477171               41.17
3         340982    29.42                818153               70.59
4         340909    29.41               1159062              100.00

MortgageStatus                                 Frequency  Percent  Cumulative Frequency  Cumulative Percent
N/A                                               303342    26.17                303342               26.17
No, owned free and clear                          300349    25.91                603691               52.08
Yes, contract to purchase                           9756     0.84                613447               52.93
Yes, mortgaged/ deed of trust or similar debt     545615    47.07               1159062              100.00

The TABLE statement is not required; without it, the default behavior produces a one-way frequency table for every variable in the data set. Both types of SAS variables, character and numeric, are legal in the TABLE statement. Given that variables listed in the TABLE statement are treated as categorical (in the same manner as variables listed in the CLASS statement in PROC MEANS), it is best for the summary variables to be categorical or to be formatted into a set of categories.
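As a quick sketch of that default behavior (not a program from the text), omitting the TABLE statement entirely requests a one-way table for every variable in the data set:

```sas
/* No TABLE statement: one-way frequency table for every variable,
   including quantitative ones such as HHIncome */
proc freq data=BookData.IPUMS2005Basic;
run;
```

With over a million records and quantitative variables in the data, this produces extremely long and largely unhelpful tables, which is why an explicit TABLE statement is preferred in practice.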

The default summaries in a one-way frequency table are: frequency (count), percent, cumulative frequency, and cumulative percent. Of course, the cumulative statistics only make sense if the categories are ordinal, which these are not. Many options are available in the TABLE statement to control what is displayed, and one is given in Program 2.7.2 to remove the cumulative statistics.

Program 2.7.2: PROC FREQ Option for Removing Cumulative Statistics

proc freq data=BookData.IPUMS2005Basic;
  table metro mortgageStatus / nocum;
run;

As with the CLASS statement in the MEANS procedure, variables listed in the TABLE statement in PROC FREQ use the format provided with the variable to construct the categories. Program 2.7.3 uses a format defined in Program 2.5.6 to bin the MortgagePayment variable into categories and, as this is an ordinal set, the cumulative statistics are appropriate.

Program 2.7.3: Using a Format to Control Categories for a Variable in the TABLE Statement

proc format;
  value Mort
    0 = 'None'
    1-350 = '$350 and Below'
    351-1000 = '$351 to $1000'
    1001-1600 = '$1001 to $1600'
    1601-high = 'Over $1600'
  ;
run;

proc freq data=BookData.IPUMS2005Basic;
  table MortgagePayment;
  format MortgagePayment Mort.;
run;

Output 2.7.3: Using a Format to Control Categories for a Variable in the TABLE Statement

First mortgage monthly payment
MortgagePayment  Frequency  Percent  Cumulative Frequency  Cumulative Percent
None                603691    52.08                603691               52.08
$350 and Below       59856     5.16                663547               57.25
$351 to $1000       283111    24.43                946658               81.67
$1001 to $1600      128801    11.11               1075459               92.79
Over $1600           83603     7.21               1159062              100.00

The FREQ procedure is not limited to one-way frequencies—special operators between variables in the TABLE statement allow for construction of multi-way tables.

2.7.2 Multi-Way Tables in PROC FREQ

The * operator constructs cross-tabular summaries for two categorical variables, which includes the following statistics:

 cross-tabular and marginal frequencies

 cross-tabular and marginal percentages

 conditional percentages within each row and column

Program 2.7.4 summarizes all combinations of Metro and MortgagePayment, with Metro formatted to add detail and MortgagePayment formatted into the bins used in the previous example.

Program 2.7.4: Using the * Operator to Create a Cross-Tabular Summary with PROC FREQ

proc format;
  value METRO
    0 = "Not Identifiable"
    1 = "Not in Metro Area"
    2 = "Metro, Inside City"
    3 = "Metro, Outside City"
    4 = "Metro, City Status Unknown"
  ;
  value Mort
    0 = 'None'
    1-350 = '$350 and Below'
    351-1000 = '$351 to $1000'
    1001-1600 = '$1001 to $1600'
    1601-high = 'Over $1600'
  ;
run;

proc freq data=BookData.IPUMS2005Basic;
  table Metro*MortgagePayment;
  format Metro Metro. MortgagePayment Mort.;
run;

 The first variable listed in any request of the form A*B is placed on the rows in the table. Requesting MortgagePayment*Metro transposes the table and the included summary statistics.

 The format applied to the Metro variable is merely a change in display and has no effect on the structure of the table—it is five rows with or without the format. The format on MortgagePayment is essential to the column structure—allowing each unique value of MortgagePayment to form a column does not produce a useful summary table.

Output 2.7.4: Using the * Operator to Create a Cross-Tabular Summary with PROC FREQ

Table of METRO by MortgagePayment
Rows: METRO (Metropolitan status); Columns: MortgagePayment (First mortgage monthly payment)
Cell contents: Frequency / Percent / Row Pct / Col Pct

METRO                       None                      $350 and Below         $351 to $1000           $1001 to $1600          Over $1600              Total
Not Identifiable            49379/4.26/53.66/8.18     6979/0.60/7.58/11.66   25488/2.20/27.70/9.00   7307/0.63/7.94/5.67     2875/0.25/3.12/3.44     92028/7.94
Not in Metro Area           134314/11.59/58.20/22.25  21698/1.87/9.40/36.25  60948/5.26/26.41/21.53  10464/0.90/4.53/8.12    3351/0.29/1.45/4.01     230775/19.91
Metro, Inside City          96487/8.32/62.50/15.98    4410/0.38/2.86/7.37    28866/2.49/18.70/10.20  14049/1.21/9.10/10.91   10556/0.91/6.84/12.63   154368/13.32
Metro, Outside City         149961/12.94/43.98/24.84  12148/1.05/3.56/20.30  79388/6.85/23.28/28.04  56330/4.86/16.52/43.73  43155/3.72/12.66/51.62  340982/29.42
Metro, City Status Unknown  173550/14.97/50.91/28.75  14621/1.26/4.29/24.43  88421/7.63/25.94/31.23  40651/3.51/11.92/31.56  23666/2.04/6.94/28.31   340909/29.41
Total                       603691/52.08              59856/5.16             283111/24.43            128801/11.11            83603/7.21              1159062/100.00

Various options are available to control the displayed statistics. Program 2.7.5 illustrates some of these with the result shown in Output 2.7.5.

Program 2.7.5: Using Options in the TABLE Statement

proc freq data=BookData.IPUMS2005Basic;
  table Metro*MortgagePayment / nocol nopercent format=comma10.;
  format Metro Metro. MortgagePayment Mort.;
run;

 NOCOL and NOPERCENT suppress the column and overall percentages, respectively, with NOPERCENT also applying to the marginal totals. NOROW and NOFREQ are also available, with NOFREQ also applying to the marginal totals.

 A format can be applied to the frequency statistic; however, this only applies to cross-tabular frequency tables and has no effect in one-way tables.

Output 2.7.5: Using Options in the TABLE Statement

Table of METRO by MortgagePayment
Rows: METRO (Metropolitan status); Columns: MortgagePayment (First mortgage monthly payment)
Cell contents: Frequency / Row Pct

METRO                       None           $350 and Below  $351 to $1000  $1001 to $1600  Over $1600    Total
Not Identifiable            49,379/53.66   6,979/7.58      25,488/27.70   7,307/7.94      2,875/3.12    92,028
Not in Metro Area           134,314/58.20  21,698/9.40     60,948/26.41   10,464/4.53     3,351/1.45    230,775
Metro, Inside City          96,487/62.50   4,410/2.86      28,866/18.70   14,049/9.10     10,556/6.84   154,368
Metro, Outside City         149,961/43.98  12,148/3.56     79,388/23.28   56,330/16.52    43,155/12.66  340,982
Metro, City Status Unknown  173,550/50.91  14,621/4.29     88,421/25.94   40,651/11.92    23,666/6.94   340,909
Total                       603,691        59,856          283,111        128,801         83,603        1,159,062

Higher dimensional requests can be made; however, they are constructed as a series of two-dimensional tables. Therefore, a request of A*B*C in the TABLE statement creates the B*C table for each level of A, while a request of A*B*C*D makes the C*D table for each combination of A and B, and so forth. Program 2.7.6 generates a three-way table, where a cross-tabulation of Metro and HomeValue is built for each level of MortgageStatus, as shown in Output 2.7.6. The VALUE statement that defines the character format $MortStatus takes advantage of the fact that value ranges are legal for character variables. Be aware that character values order by their underlying character codes, so uppercase and lowercase letters occupy different positions when defining ranges for a character variable.

Program 2.7.6: A Three-Way Table in PROC FREQ

proc format;
  value MetroB
    0 = "Not Identifiable"
    1 = "Not in Metro Area"
    other = "In a Metro Area"
  ;
  value $MortStatus
    'No'-'Nz' = 'No'
    'Yes'-'Yz' = 'Yes'
  ;
  value Hvalue
    0-65000 = '$65,000 and Below'
    65000<-110000 = '$65,001 to $110,000'
    110000<-225000 = '$110,001 to $225,000'
    225000<-500000 = '$225,001 to $500,000'
    500000-high = 'Above $500,000'
  ;
run;

proc freq data=BookData.IPUMS2005Basic;
  table MortgageStatus*Metro*HomeValue / nocol nopercent format=comma10.;
  format MortgageStatus $MortStatus. Metro MetroB. HomeValue Hvalue.;
  where MortgageStatus ne 'N/A';
run;

Output 2.7.6: A Three-Way Table in PROC FREQ

Table 1 of METRO by HomeValue
Controlling for MortgageStatus=No
Rows: METRO (Metropolitan status); Columns: HomeValue (House value)
Cell contents: Frequency / Row Pct

METRO              $65,000 and Below  $65,001 to $110,000  $110,001 to $225,000  $225,001 to $500,000  Above $500,000  Total
Not Identifiable   10,777/35.49       5,460/17.98          10,415/34.29          2,584/8.51            1,134/3.73      30,370
Not in Metro Area  34,766/40.57       16,261/18.98         26,889/31.38          5,553/6.48            2,227/2.60      85,696
In a Metro Area    34,176/18.55       23,706/12.86         71,133/38.60          33,590/18.23          21,678/11.76    184,283
Total              79,719             45,427               108,437               41,727                25,039          300,349

Table 2 of METRO by HomeValue
Controlling for MortgageStatus=Yes
Cell contents: Frequency / Row Pct

METRO              $65,000 and Below  $65,001 to $110,000  $110,001 to $225,000  $225,001 to $500,000  Above $500,000  Total
Not Identifiable   7,486/17.55        7,142/16.75          19,453/45.61          6,468/15.17           2,100/4.92      42,649
Not in Metro Area  24,443/25.34       19,396/20.11         40,668/42.16          9,164/9.50            2,790/2.89      96,461
In a Metro Area    26,351/6.33        37,345/8.97          175,482/42.16         110,412/26.52         66,671/16.02    416,261
Total              58,280             63,883               235,603               126,044               71,561          555,371

It is also possible for the FREQ procedure to count based on a quantitative variable using the WEIGHT statement, effectively tabulating the sum of the weights. Program 2.7.7 uses the WEIGHT statement with MortgagePayment to summarize total mortgage payments for combinations of HomeValue and Metro.

Program 2.7.7: Using the WEIGHT Statement to Summarize a Quantitative Value.

proc freq data=BookData.IPUMS2005Basic;
  table HomeValue*Metro / nocol nopercent format=dollar14.;
  weight MortgagePayment;
  format Metro MetroB. HomeValue Hvalue.;
run;

Output 2.7.7: Using the WEIGHT Statement to Summarize a Quantitative Value

Table of HomeValue by METRO
Rows: HomeValue (House value); Columns: METRO (Metropolitan status)
Cell contents: Frequency / Row Pct

HomeValue             Not Identifiable   Not in Metro Area  In a Metro Area     Total
$65,000 and Below     $2,737,530/12.20   $8,736,986/38.93   $10,969,600/48.88   $22,444,116
$65,001 to $110,000   $3,770,840/10.74   $9,887,454/28.16   $21,448,052/61.09   $35,106,346
$110,001 to $225,000  $15,896,854/7.82   $30,632,556/15.07  $156,700,074/77.10  $203,229,484
$225,001 to $500,000  $8,192,908/4.80    $10,741,258/6.30   $151,601,294/88.90  $170,535,460
Above $500,000        $3,854,280/2.60    $4,735,288/3.19    $139,862,780/94.21  $148,452,348
Total                 $34,452,412        $64,733,542        $480,581,800        $579,767,754

2.8 Reading Raw Data

Often, data is not available as a SAS data set; in practice, it comes from external sources including raw files such as text files, spreadsheets such as Microsoft Excel®, or relational databases such as Oracle®. In this section, work with external data sources begins by exploring how to read raw data files.

Raw data refers to certain files that contain unprocessed data that is not in a SAS data set. (Certain other structures also qualify as raw data. See Chapter Note 6 in Section 2.12 for additional details.) Generally, these are plain-text files and some common file types are:

 tab-delimited text (.txt or .tsv)

 comma-separated values (.csv)

 fixed-position files (.dat)

Choices for file extensions are not fixed; the extension does not dictate the type of data the file contains, and many other file types exist. Therefore, it is always important to explore the raw data before importing it into SAS. While SAS provides multiple ways to read raw data, this chapter focuses on using the DATA step due to its flexibility and ubiquity; understanding the DATA step is a necessity for a successful SAS programmer.

To assist in finding the column numbers when displaying raw files, a ruler is included in the first line when presenting raw data in the book, but the ruler is not present in the actual data file. Input Data 2.8.1 provides an example of such a ruler. Each dash in the ruler represents a column, while a plus represents multiples of five, and a digit represents multiples of ten. For example, the 1 in the ruler represents column 10 in the raw file and the plus sign between the 1 and the 2 represents column 15 in the raw file.

Input Data 2.8.1: Space Delimited Raw File (Partial Listing)

----+----1----+----2----+
1 1800 9998 9998 9998
2 480 1440 9998 9998
3 2040 360 100 9998
4 3000 9998 360 9998
5 840 1320 90 9998

2.8.1 Introduction to Reading Delimited Files

Delimiters, often used in raw files, are a single character such as a tab, space, comma, or pipe (vertical bar) used to indicate the break between one value and the next in a single record. Input Data 2.8.1 includes a partial representation of the first five records from a space-delimited file (Utility 2001.prn). Reading in this file, or any raw file, requires determining whether the file is delimited and, if so, what delimiters are present. If a file is delimited, it is important to note whether the delimiters also appear as part of the values for one or more variables. The data presented in Input Data 2.8.1 follows a basic structure and uses spaces to separate each record into five distinct values or fields. SAS can read this file correctly using simple list input without the need for additional options or statements using the following rules:

1. At least one blank/space must separate the input values and SAS treats multiple, sequential blanks as a single blank.

2. Character values cannot contain embedded blanks.

3. Character variables are given a length of eight bytes by default.

4. Data must be in standard numeric or character format. Standard numeric values must only contain digits, decimal point, +/-, and E for scientific notation.

Input Data 2.8.1 satisfies these rules using the default delimiter (space). Options and statements are available to help control the behavior associated with rules 1 through 3, which are covered in subsequent sections of this chapter. Violating rule 4 precludes the use of simple list input but is easily addressed with modified list input, as shown in Chapter 3. However, no such options or modifications are required to read Input Data 2.8.1, which is done using Program 2.8.1.
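As a quick check of these rules, the following sketch reads two of the records with simple list input; the records are supplied in-line via the DATALINES statement (a stand-in here for the external file, so the step is self-contained), and the extra blanks in the second record illustrate rule 1.

```sas
data work.UtilityDemo;
   input Serial $ Electric Gas Water Fuel;
   datalines;
1 1800 9998 9998 9998
2     480  1440 9998 9998
;
run;
```

Both records parse into five values; the run of blanks before 480 is treated as a single delimiter.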

Program 2.8.1: Reading the Utility 2001 Data

data Utility2001;
   infile "--insert path here--\Utility 2001.prn";
   input Serial$ Electric Gas Water Fuel;
run;

proc print data = Utility2001 (obs = 5);
run;

 The DATA statement begins the DATA step and here names the data set as Utility2001, placing it in the Work library given the single-level naming. Explicit specification of the library is available with two-level naming, for example, Sasuser.Utility2001 or Work.Utility2001—see Program 1.4.3. If no data set name appears, SAS provides the name as DATAn, where n is the smallest whole number (1, 2, 3, …) that makes the data set name unique.

 The INFILE statement specifies the location of the file via a full path specification to the file—this path must be completed to reflect the location of the raw file for the code to execute successfully.

 The INPUT statement sets the names of each variable from the raw file in the INFILE statement with those names following the conventions outlined in Section 1.6.2. By default, SAS assumes the incoming variables are numeric. One way to indicate character data is shown here – place a dollar sign after each character variable.

 Good programming practice dictates that all steps end with an explicit step boundary, including the DATA step.

 The OBS= option sets the last observation to be processed. Because procedures start with the first observation by default, this step uses the first five observations from the Utility2001 data set, as shown in Output 2.8.1.

Output 2.8.1: Reading the Utility 2001 Data (Partial Listing)

Obs  Serial  Electric   Gas  Water  Fuel
  1  1           1800  9998   9998  9998
  2  2            480  1440   9998  9998
  3  3           2040   360    100  9998
  4  4           3000  9998    360  9998
  5  5            840  1320     90  9998

In Program 2.8.1, Serial is read as a character variable; however, it contains only digits and therefore can be stored as numeric. The major advantage in storing Serial as character is size—its maximum value is six digits long and therefore requires six bytes of storage as character, while all numeric variables have a default size of eight bytes. The major disadvantage to storing Serial as character is ordering—for example, as a character value, 11 comes before 2. While the other four variables can be read as character as well, it is a very poor choice as no mathematical or statistical operations can be done on those values. For examples in subsequent sections, Serial is read as numeric.
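The ordering issue is easy to see by sorting a small data set that stores the same values both ways; the data set and variable names below are hypothetical illustrations, not part of the case study data.

```sas
data work.SerialOrder;
   input SerialChar $ SerialNum;
   datalines;
2 2
11 11
;
run;

proc sort data = work.SerialOrder;
   by SerialChar;
run;
/* Sorted by SerialChar, 11 precedes 2 because character values
   are compared left to right, one character at a time */
```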

In Program 2.8.1, the INFILE statement is used to specify the raw data file that the DATA step reads. In general, the INFILE statement may include references to a single file or to multiple files, with each reference provided one of the following ways:

 A physical path to the files. Physical paths can be either relative or absolute.

 A file reference created via the FILENAME statement.

Program 2.8.1 is set up to use the first method, with either an absolute or relative path chosen. An absolute path starts with a drive letter or name, while any other specification is a relative path. All relative paths are built from the current working directory. (Refer to Section 1.5 for a discussion of the working directory and setting its value.) It is often more efficient to use a FILENAME statement to build references to external files or folders. Programs 2.8.2 and 2.8.3 demonstrate these uses of the FILENAME statement, producing the same data set as Program 2.8.1.

Program 2.8.2: Using the FILENAME Statement to Point to an Individual File

filename Util2001 "--insert path here--\Utility 2001.prn";

data work.Utility2001A;
   infile Util2001;
   input Serial$ Electric Gas Water Fuel;
run;

 The FILENAME statement creates a file reference, called a fileref, named Util2001. Naming conventions for a fileref are the same as those for a libref.

 The path specified, which can be relative or absolute as in Program 2.8.1, includes the file name. SAS assigns the fileref Util2001 to this file.

 The INFILE statement now references the fileref Util2001 rather than the path or file name. Note, quotation marks are not used on Util2001 since it is to be interpreted as a fileref and not a file name or path.

Program 2.8.3: Associating the FILENAME Statement with a Folder

filename RawData '--insert path to folder here--';

data work.Utility2001B;
   infile RawData("Utility 2001.prn");
   input Serial$ Electric Gas Water Fuel;
run;

 It is assumed here that the path, either relative or absolute, points to a folder and not a specific file. In that case, the FILENAME statement associates a folder with the fileref RawData. The path specified should be to the folder containing the raw files downloaded from the author page, much like the BookData library was assigned to the folder containing the SAS data sets.

 The INFILE statement references both the fileref and the file name. Although the file reference can be made without the quotation marks in certain cases, good programming practice includes the quotation marks.

Since Programs 2.8.2 and 2.8.3 each generate the same result as Program 2.8.1 but actually require slightly more code, the benefits of using the FILENAME statement may not be obvious. The form of the FILENAME statement in Program 2.8.2 is useful if a single file needs to be read repeatedly under different conditions, allowing the multiple references to that file to be shortened. More commonly, the form used in Program 2.8.3 is more efficient when reading multiple files from a common location. Again, if the path specified is to the folder containing the raw files downloaded from the author page, the fileref RawData refers to the location for all non-SAS data sets used in examples for Chapters 2 through 7.
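For example, once the RawData fileref from Program 2.8.3 is assigned, any file in that folder can be referenced by name alone; the second file name below is hypothetical and only illustrates the pattern.

```sas
filename RawData '--insert path to folder here--';

data work.Utility2001C;
   infile RawData("Utility 2001.prn");
   input Serial$ Electric Gas Water Fuel;
run;

/* A hypothetical second file in the same folder is read by
   changing only the file name, not the path */
data work.OtherFile;
   infile RawData("Some Other File.prn");
   input Serial$ Electric Gas Water Fuel;
run;
```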

2.8.2 More with List Input

Input Data 2.8.4 includes a partial representation of the first five records from a comma-delimited file (IPUMS2005Basic.csv). Due to the width of the file, Input Data 2.8.4 truncates the third and fifth records.

Input Data 2.8.4: Comma Delimited Raw File (Partial Listing)

----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
2,Alabama,Not in identifiable city (or size group),0,4,73,Rented,N/A,0,12000,9999999
3,Alabama,Not in identifiable city (or size group),0,1,0,Rented,N/A,0,17800,9999999
4,Alabama,Not in identifiable city (or size group),0,4,73,Owned,"Yes, mortgaged/ deed
5,Alabama,Not in identifiable city (or size group),0,1,0,Rented,N/A,0,2000,9999999
6,Alabama,Not in identifiable city (or size group),0,3,97,Owned,"No, owned free and

Not only is this file delimited by commas, but the eighth field on the third and fifth rows also includes data values containing a comma, with those values embedded in quotation marks. (Recall these records are truncated in the text due to their length so the final quote is not shown for these two records.) To successfully read this file, the DATA step must recognize the delimiter as a comma, but also that commas embedded in quoted values are not delimiters. The DSD option is introduced in Program 2.8.4 to read such a file.

Program 2.8.4: Reading the 2005 Basic IPUMS CPS Data

data work.Ipums2005Basic;
   infile RawData("IPUMS2005basic.csv") dsd;
   input Serial State $ City $ CityPop Metro
         CountyFIPS Ownership $ MortgageStatus $
         MortgagePayment HHIncome HomeValue;
run;

proc print data = work.Ipums2005Basic (obs = 5);
run;

 The DSD option included in the INFILE statement modifies the delimiter and some additional default behavior as listed below.

 Again, the INPUT statement names each of the variables read from the raw file in the INFILE statement and sets their types. By default, SAS assumes the incoming variables are numeric; however, State, City, Ownership, and MortgageStatus must be read as character values.

Output 2.8.4 shows that, while Program 2.8.4 executes successfully, the resulting data set does not correctly represent the values from Input Data 2.8.4—the City and MortgageStatus variables are truncated. This truncation occurs due to the default length of 8 assigned to character variables; therefore, SAS did not allocate enough memory to store the values in their entirety. Only the first five records are shown; however, further investigation reveals this truncation occurs for the variable State as well.

Output 2.8.4: Reading the 2005 Basic IPUMS CPS Data (Partial Listing)

Obs  Serial  State    City      CityPop  Metro  CountyFIPS  Ownership
  1       2  Alabama  Not in i        0      4          73  Rented
  2       3  Alabama  Not in i        0      1           0  Rented
  3       4  Alabama  Not in i        0      4          73  Owned
  4       5  Alabama  Not in i        0      1           0  Rented
  5       6  Alabama  Not in i        0      3          97  Owned

Obs  MortgageStatus  MortgagePayment  HHIncome  HomeValue
  1  N/A                           0     12000    9999999
  2  N/A                           0     17800    9999999
  3  Yes, mor                    900    185000     137500
  4  N/A                           0      2000    9999999
  5  No, owne                      0     72600      95000

Program 2.8.4 uses the DSD option in the INFILE statement to change three default behaviors:

1. Change the delimiter to comma

2. Treat two consecutive delimiters as a missing value

3. Treat delimiters inside quoted strings as part of a character value and strip off the quotation marks

For Input Data 2.8.4, the first and third actions are necessary to successfully match the structure of the delimiters in the data since (a) the file uses commas as delimiters and (b) commas are included in the quoted strings in the data for the MortgageStatus variable. Because the file does not contain consecutive delimiters, the second modification has no effect.

Of course, it might be necessary to produce the second and third effects while using blanks—or any other character—as the delimiter. It is also often necessary to change the delimiter without making the other modifications included with the DSD option. In those cases, use the DLM= option to specify one or more delimiters by placing them in a single set of quotation marks, as shown in the following examples.

1. DLM = '/' causes SAS to move to a new field when it encounters a forward slash

2. DLM = ',' causes SAS to move to a new field when it encounters a comma

3. DLM = ',/' causes SAS to move to a new field when it encounters either a comma or forward slash
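The two options can also be combined in one INFILE statement when both behaviors are needed. The following sketch reads hypothetical pipe-delimited in-line records; with DSD, the consecutive pipes in the second record yield a missing value for Name rather than being collapsed.

```sas
data work.PipeDemo;
   /* DLM= sets the delimiter to a pipe; DSD makes consecutive
      delimiters produce missing values and honors quoted values */
   infile datalines dlm = '|' dsd;
   input ID Name $ Value;
   datalines;
1|Smith|42
2||17
;
run;
```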

Introduction to Variable Attributes

In SAS, the amount of memory allocated to a variable is called the variable’s length; length is one of several attributes that each variable possesses. Other attributes include the name of the variable, its position in the data set (1st column, 2nd column, ...), and its type (character or numeric). As with all the variable attributes, the length is set either by use of a default value or by explicitly setting a value.

By default, both numeric and character variables have a length of eight bytes. For character variables, one byte of memory can hold one character in the English language. Thus, the DATA step truncates several values of State, City, and MortgageStatus from Input Data 2.8.4 since they exceed the default length of eight bytes. For numeric variables, the default length of eight bytes is sufficient to store up to 16 decimal digits (commonly known as double-precision). When using the Microsoft Windows® operating system, numeric variables have a minimum allowable length of three bytes and a maximum length of eight bytes. Character variables may have a minimum length of 1 byte and a maximum length of 32,767 bytes. While there are many options and statements that affect the length of a variable implicitly, the LENGTH statement allows for explicit declaration of the length and type attributes for any variables. Program 2.8.5 demonstrates the usage of the LENGTH statement.

Program 2.8.5: Using the LENGTH Statement

data work.Ipums2005Basic;
   length state $ 20 City$ 25 MortgageStatus$50;
   infile RawData("IPUMS2005basic.csv") dsd;
   input Serial State City CityPop Metro
         CountyFIPS Ownership $ MortgageStatus$
         MortgagePayment HHIncome HomeValue;
run;

proc print data = work.Ipums2005Basic(obs = 5);
run;

 The LENGTH statement sets the lengths of State, City, and MortgageStatus to 20, 25, and 50 characters, respectively, with the dollar sign indicating these are character variables. Separating the dollar sign from the variable name or length value is optional, though good programming practices dictate using a consistent style to improve readability.

 Type (character or numeric) is an attribute that cannot be changed in the DATA step once it has been established. Because the LENGTH statement sets these variables as character, the dollar sign is optional in the INPUT statement. However, good programming practices generally dictate including it for readability and so that removal of the LENGTH statement does not lead to a data type mismatch. (This would be an execution-time error.)

 As in , the spacing between the dollar sign and variable name is optional in the INPUT statement as well. Good programming practices still dictate selecting a consistent spacing style.

Output 2.8.5 shows the results of explicitly setting the length of the State, City, and MortgageStatus variables. In addition to the lengths of these three variables changing, their column position in the SAS data set has changed as well. Variables are added to the data set based on the order they are encountered during compilation of the DATA step, so since the LENGTH statement precedes the INPUT statement, it has actually changed two attributes—length and position—for these three variables (while also defining the type attribute as character).

Output 2.8.5: Using the LENGTH Statement (Partial Listing)

Obs  state    City                      MortgageStatus                                 Serial  CityPop  Metro  CountyFIPS
  1  Alabama  Not in identifiable city  N/A                                                 2        0      4          73
  2  Alabama  Not in identifiable city  N/A                                                 3        0      1           0
  3  Alabama  Not in identifiable city  Yes, mortgaged/ deed of trust or similar debt       4        0      4          73
  4  Alabama  Not in identifiable city  N/A                                                 5        0      1           0
  5  Alabama  Not in identifiable city  No, owned free and clear                            6        0      3          97

Obs  Ownership  MortgagePayment  HHIncome  HomeValue
  1  Rented                   0     12000    9999999
  2  Rented                   0     17800    9999999
  3  Owned                  900    185000     137500
  4  Rented                   0      2000    9999999
  5  Owned                    0     72600      95000

Like the type attribute, the position and length attributes cannot be changed after their initial values are set. Attempting to change the length attribute after the INPUT statement, as shown in Program 2.8.6, results in a warning in the log.

Program 2.8.6: Using the LENGTH Statement After the INPUT Statement

data work.Ipums2005Basic;
   infile RawData("IPUMS2005basic.csv") dsd;
   input Serial State $ City $ CityPop Metro
         CountyFIPS Ownership $ MortgageStatus $
         MortgagePayment HHIncome HomeValue;
   length state $20 City $25 MortgageStatus $50;
run;

Log 2.8.6: Warning Generated by Attempting to Reset Length

WARNING: Length of character variable State has already been set. Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

Tab-Delimited Files

If the delimiter is not a standard keyboard character, such as the tab used in tab-delimited files, an alternate method is used to specify the delimiter via its hexadecimal code. While the correct hexadecimal representation depends on the operating system, Microsoft Windows and Unix/Linux machines typically use ASCII codes. The ASCII hexadecimal code for a tab is 09 and is written in the SAS language as '09'x; the x appended to the literal value of 09 instructs the compiler to make the conversion from hexadecimal. Program 2.8.7 uses hexadecimal encoding in the DLM= option to correctly set the delimiter to a tab. The results of Program 2.8.7 are identical to those of Program 2.8.5.

Program 2.8.7: Reading Tab-Delimited Data

data work.Ipums2005Basic;
   length state $ 20 City $ 25 MortgageStatus $ 50;
   infile RawData('ipums2005basic.txt') dlm = '09'x;
   input Serial State $ City $ CityPop Metro
         CountyFIPS Ownership $ MortgageStatus $
         MortgagePayment HHIncome HomeValue;
run;

Because there are no missing values denoted by sequential tabs, nor any tabs included in data values, the DSD option is no longer needed in the INFILE statement for this program.

To specify multiple delimiters that include the tab, each must use a hexadecimal representation—for example, DLM = '2C09'x selects commas and tabs as delimiters since 2C is the hexadecimal value for a comma. For records with different delimiters within the same DATA step, see Chapter Note 7 in Section 2.12.

2.8.3 Introduction to Reading Fixed-Position Data

While delimited data takes advantage of delimiting characters in the data, other files depend on the starting and stopping position of the values being read. These types of files are referred to by several names: fixed-width, fixed-position, and fixed-field, among others. The first five records from a fixed-position file (IPUMS2005Basic.dat) are shown in Input Data 2.8.8. As with Input Data 2.8.4, truncation of this display occurs due to the length of the record—now occurring in each of the five records.

Input Data 2.8.8: Excerpt from a Fixed-Position Data File

----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
       2 Alabama              Not in identifiable city (or size group)     0   4  73
       3 Alabama              Not in identifiable city (or size group)     0   1   0
       4 Alabama              Not in identifiable city (or size group)     0   4  73
       5 Alabama              Not in identifiable city (or size group)     0   1   0
       6 Alabama              Not in identifiable city (or size group)     0   3  97

Since fixed-position files do not use delimiters, reading a fixed-position file requires knowledge of the starting position of each data value. In addition, either the length or stopping position of the data value must be known. Using the ruler, the first displayed field, Serial, appears to begin and end in column 8. However, inspection of the complete raw file reveals that is only the case for the single-digit values of Serial. The longest value is eight digits wide, so the variable Serial truly starts in column 1 and ends in column 8. Similarly, the next field, State, begins in column 10 and ends in column 29. Some text editors, such as Notepad++ and Visual Studio Code, show the column number in a status bar as the cursor moves across lines in the file.

The DATA step for reading fixed-position data looks similar to the DATA step for reading delimited data, but there are several important modifications. For fixed-position files, the syntax of the INPUT statement provides information about column positions of the variable values in the raw file, as it cannot rely on delimiters for separating values. Therefore, delimiter-modifying INFILE options such as DSD and DLM= have no utility with fixed-position data. Two different forms of input are commonly used for fixed-position data: column input or formatted input. This section focuses on column input while Chapter 4 discusses formatted input.

Column Input

Column input takes advantage of the fixed positions in which variable values are found by directly placing the starting and ending column positions into the INPUT statement. Program 2.8.8 shows how to use column input to read the IPUMS CPS 2005 basic data. The results of Program 2.8.8 are identical to Output 2.8.5.

Program 2.8.8: Reading Data Using Column Input

data work.ipums2005basicFPa;
   infile RawData('ipums2005basic.dat');
   input serial 1-8 state $ 10-29 city $ 31-70 cityPop 72-76
         metro 78-80 countyFips 82-84 ownership $ 86-91
         mortgageStatus $ 93-137 mortgagePayment 139-142
         HHIncome 144-150 homeValue 152-158;
run;

 The LENGTH statement is no longer needed—when using column input, SAS assigns the length based on the number of columns read if the length attribute is not previously defined. Here, SAS assigns State a length of 20 bytes, just as was done in the LENGTH statement in Program 2.8.5.

 The first value indicates the column position—31—from which SAS should start reading for the current variable, City. The second number—70—indicates the last column SAS reads to determine the value of City.

 The default length of eight bytes is still used for numeric variables, regardless of the number of columns.

Beyond the differences between column input and list input shown in Program 2.8.8, since column input uses the column positions, the INPUT statement can read variables in any order, and can even reread columns if necessary. Furthermore, the INPUT statement can skip unwanted variables. Program 2.8.9 reads Input Data 2.8.8 and demonstrates the ability to reorder and reread columns.

Program 2.8.9: Reading the Input Variables Differently than Column Order

data work.ipums2005basicFPb;
   infile RawData('ipums2005basic.dat');
   input serial 1-8 hhIncome 144-150 homeValue 152-158
         ownership $ 86-91 ownershipCoded $ 86
         state $ 10-29 city $ 31-70 cityPop 72-76
         metro 78-80 countyFips 82-84
         mortgageStatus $ 93-137 mortgagePayment 139-142;
run;

proc print data = work.ipums2005basicFPb(obs = 5);
   var serial -- state;
run;

 Output 2.8.9 shows that HHIncome and HomeValue are now earlier in the data set. Column input allows for reading variables in a user-specified order.

 Column 86 is read twice: first as part of a full value for Ownership, and second as a simplified version using only the first character as the value of a new variable, OwnershipCoded.

 As discussed in Chapter Note 3 in Section 1.7, the double-dash selects all variables between Serial and State, inclusive.

Output 2.8.9: Reading the Input Variables Differently than Column Order

Obs  serial  hhIncome  homeValue  ownership  ownershipCoded  state
  1       2     12000    9999999  Rented     R               Alabama
  2       3     17800    9999999  Rented     R               Alabama
  3       4    185000     137500  Owned      O               Alabama
  4       5      2000    9999999  Rented     R               Alabama
  5       6     72600      95000  Owned      O               Alabama
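Program 2.8.9 demonstrates reordering and rereading; skipping unwanted fields is equally simple with column input, since any columns not referenced in the INPUT statement are simply never read. The following sketch keeps only three of the eleven fields from the same fixed-position file.

```sas
data work.ipums2005basicSubset;
   infile RawData('ipums2005basic.dat');
   /* Only three fields are read; the columns holding the
      other eight fields are skipped entirely */
   input serial 1-8 state $ 10-29 hhIncome 144-150;
run;
```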

Mixed Input

Programs 2.8.1 through 2.8.7 make use of simple list input for every variable, and Programs 2.8.8 and 2.8.9 use column input for every variable. However, the choice is not always between one style or the other. If files contain some delimited fields while other fields have fixed positions, it is necessary to use multiple input styles simultaneously. This process, called mixed input, requires mastery of two other input methods covered in Chapter 3, modified list input and formatted input, along with a substantial understanding of how the DATA step processes raw data. For a discussion of the fifth and final input style, named input, see the SAS Documentation.

2.9 Details of the DATA Step Process

This section provides further details about how the DATA step functions. While this material can initially be considered optional for many readers, understanding it makes writing high-quality code easier by providing a foundation for how certain coding decisions lead to particular outcomes. This material is also essential for successful completion of the base certification exam.

2.9.1 Introduction to the Compilation and Execution Phases

SAS processes every step in Base SAS, including the DATA step, in two phases: compilation and execution. Each of the DATA steps seen so far in this text has several elements in common: they each read data from one or more sources (for example, a SAS data set or a raw data file), and they each create a data set as a result of the DATA step. For DATA steps such as these, the flowchart in Figure 2.9.1 provides a high-level overview of the actions taken by SAS upon submission of a DATA step. Details about the individual actions are included in this section, in the Chapter Notes in Section 2.12, and in subsequent chapters.

Figure 2.9.1: Flowchart of Default DATA Step Actions


Compilation Phase

During the compilation phase, SAS begins by tokenizing the submitted code and sending complete statements to the compiler. (For more details, see Chapter Note 8 in Section 2.12.) Once a complete statement is sent to the compiler, the compiler checks the statement for syntax errors. If there is a syntax error, SAS attempts to make a substitution that creates legal syntax and prints a warning to the SAS log indicating the substitution made. For example, misspelling the keyword DATA as DAAT produces the following warning.

WARNING 14-169: Assuming the symbol DATA was misspelled as daat.

Be sure to review these warnings and correct the syntax even if SAS makes an appropriate substitution. If there is a syntax error and SAS cannot make a substitution, then an error message is printed to the log, and the current step is not executed. For example, misspelling the keyword DATA as DSTS results in the following error.

ERROR 180-322: Statement is not valid or it is used out of proper order.

If there is not a syntax error, or if SAS can make a substitution to correct a syntax error, then the compilation phase continues to the next statement, tokenizes it, and checks it for syntax errors. This process continues until SAS compiles all statements in the current DATA step.

When reading raw data, SAS creates an input buffer to load individual records from the raw data and creates the program data vector to assign the parsed values to variables for later delivery to the SAS data set. During this process, SAS also creates the shell for the descriptor portion, or metadata, for the data set, which is accessible via procedures such as the CONTENTS procedure from Chapter 1. Of course, not all elements of the descriptor portion, such as the number of observations, are known during the compilation phase. Once the compilation phase ends, SAS enters the execution phase where the compiled code is executed. At the conclusion of the execution phase, SAS populates any such remaining elements of the descriptor portion.

Execution Phase

The compilation phase creates the input buffer (when reading from a raw data source) and creates the program data vector; however, it is the execution phase that populates them. SAS begins by initializing the variables in the program data vector based on data type (character or numeric) and variable origin (for example, a raw data file or a SAS data set). SAS then executes the programming statements included in the DATA step. Certain statements, such as the LENGTH or FORMAT statements shown earlier in this chapter, are considered compile-time statements because SAS completes their actions during the compilation phase. Compile-time statements take effect during the compilation phase, and their effects cannot be altered during the execution phase. Statements active during the execution phase are referred to as execution-time statements.

Finally, when SAS encounters the RUN statement (or any other step boundary) the default actions are as follows:

1. output the current values of user-selected variables to the data set

2. return to the top of the DATA step

3. reset the values in the input buffer and program data vector

At this point, the input buffer (if it exists) is empty, and the program data vector variables are incremented/reinitialized as appropriate so that the execution phase can continue processing the incoming data set. For more information about step boundaries, see Chapter Note 9 in Section 2.12.

When reading in data from various sources, the execution phase ends when it is determined that no more data can or should be read, based on the programming statements in the DATA step. Because there are multiple factors that affect this, an in-depth discussion is not provided here. Instead, as each new technique for reading and combining data is presented, a review of when the DATA step execution phase ends is included. This chapter includes examples of reading a single raw data file using an INFILE statement; in this case, the execution phase ends when SAS encounters an end-of-file (EOF) marker in the incoming data source. For plain text files, the EOF marker is a non-printable character that lets software reading the file know that the file contains no further information. At the conclusion of the execution phase, SAS completes the content portion of the data set, which contains the data values, and finalizes the descriptor portion.

2.9.2 Building blocks of a Data Set: Input Buffers and Program Data Vectors

Input Buffer

When reading raw data, SAS needs to parse the characters from the plain text in order to determine the values to place in the data set. Parsing involves dividing the text into groups of characters and interpreting each group as a value for a variable. To facilitate this, the default is for SAS to read a single line of text from the raw file and place it into the input buffer—a section of logical memory. In the input buffer, SAS places each character into its own column and uses a column pointer to keep track of the column the INPUT statement is currently reading.

Program Data Vector

Regardless of the data source used in a DATA step (raw data files or SAS data sets), a program data vector (PDV) is created. Like the input buffer, the PDV is a section of logical memory; but, instead of storing raw, unstructured data, the PDV is where SAS stores variable values. SAS determines these values in potentially many ways: by parsing information in the input buffer, by reading values from structured sources such as Excel spreadsheets or SAS data sets, or by executing programming statements in the DATA step. Just as the input buffer holds a single line of raw text, the PDV holds only the values of each variable for a single record.

In addition to user-defined variables, SAS places automatic variables into the PDV. Two automatic variables, _N_ and _ERROR_, are present in every PDV. By default, the DATA step acts as a loop that repeatedly processes any executable statements and builds the final data set one record at a time. These loops are referred to as iterations and are tracked by the automatic variable, _N_. _N_ is a counter that keeps track of the number of DATA step iterations—how many times the DATA statement has executed—and is initialized to one at invocation of the DATA step. _N_ is not necessarily the same as the number of observations in the data set since programming elements are available to selectively output records to the final data set. Similarly, certain statements and options are available to only select a subset of the variables in the final data set.

The second automatic variable, _ERROR_, is an indicator that SAS initializes to zero and sets to one at the first instance of certain non-syntax errors. Details about the errors it tracks are discussed in Chapter Note 10 in Section 2.12. Automatic variables are not written to the resulting data set, though their values can be assigned to new variables or used in other DATA step programming statements.
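The following sketch copies the automatic variables into ordinary variables so their values survive into the resulting data set; the input values are in-line and hypothetical.

```sas
data work.AutoVarDemo;
   input X;
   /* _N_ and _ERROR_ are dropped automatically, so copy them
      into new variables to inspect their values: Iteration is
      1, 2, 3 across the three records; ErrorFlag stays 0 since
      no data errors occur */
   Iteration = _N_;
   ErrorFlag = _ERROR_;
   datalines;
10
20
30
;
run;
```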

Example

Some aspects of the compilation and execution phases are demonstrated below using a raw data set having the five variables shown in Input Data 2.9.1: flight number (FlightNum), flight date (Date), destination city (Destination), number of first-class passengers (FirstClass), and number of economy passengers (EconClass). The first line contains a ruler to help locate the values; it is not included in the raw file. Program 2.9.1 reads in this data set.

Input Data 2.9.1: Flights.prn data set

----+----1----+----2----+----3
439 12/11/2000 LAX 20 137
921 12/11/2000 DFW 20 131
114 12/12/2000 LAX 15 170
982 12/12/2000 dfw 5 85
439 12/13/2000 LAX 14 196
982 12/13/2000 DFW 15 116
431 12/14/2000 LaX 17 166
982 12/14/2000 DFW 7 88
114 12/15/2000 LAX 0 187
982 12/15/2000 DFW 14 31

Program 2.9.1: Demonstrating the Input Buffer and Program Data Vector (PDV)

data work.flights;
   infile RawData('flights.prn');
   input FlightNum Date $ Destination $ FirstClass EconClass;
run;

During the compilation phase, SAS scans each statement for syntax errors, and finding none in this code, it creates various elements as each statement compiles. The DATA statement triggers the initial creation of the PDV with the two automatic variables: _N_ and _ERROR_. SAS then determines an input buffer is necessary when it encounters the INFILE statement. SAS automatically allocates the maximum amount of memory, 32,767 bytes, when it creates the input buffer. If explicit control is needed, the INFILE option LRECL= allows specification of a different value. Compilation of the INPUT statement completes the PDV, establishing the five variables named in the INPUT statement, in the order they are encountered, along with their attributes.
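As a hedged sketch of the LRECL= option described above (the value 1000 is arbitrary, and the fileref RawData is assumed to be defined as in earlier sections):

```sas
data work.flights;
   /* request a 1,000-byte input buffer instead of the 32,767-byte default */
   infile RawData('flights.prn') lrecl = 1000;
   input FlightNum Date $ Destination $ FirstClass EconClass;
run;
```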

A visual representation of the input buffer and PDV at various points in the compilation phase is given in Tables 2.9.1 through 2.9.10, with the input buffer showing 26 columns here for the sake of brevity—the actual size is 32,767 columns.

Table 2.9.1: Representation of Input Buffer During Compilation Phase

01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

Table 2.9.2: Representation of the PDV at the Completion of the Compilation Phase

_N_  _ERROR_  FlightNum  Date      Destination  FirstClass  EconClass

Note that while SAS is not case-sensitive when referencing variables, variable names are stored as they are first referenced.

Once the compilation phase has completed, the execution phase begins by initializing the variables in the PDV. Each of the user-defined variables in this example comes from a raw data file, so they are all initialized to missing. Recall missing numeric data is represented with a single period, and missing character values are represented as a null string. The automatic variables _N_ and _ERROR_ are initialized to one and zero, respectively, since this is the first iteration through the DATA step, and no errors tracked by _ERROR_ have been encountered.

Table 2.9.3: Representation of the PDV at the Beginning of the Execution Phase

_N_  _ERROR_  FlightNum  Date      Destination  FirstClass  EconClass
1    0        .                                 .           .

When SAS encounters the INPUT statement on the first iteration of the DATA step, it reads the first line of data and places it in the input buffer with each character in its own column. Table 2.9.4 illustrates this for the first record of Flights.prn.

Table 2.9.4: Illustration of the Input Buffer After Reaching the INPUT Statement on the First Iteration

01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
 4  3  9     1  2  /  1  1  /  2  0  0  0     L  A  X     2  0     1  3  7

To move raw data from the input buffer to the PDV, SAS must parse the character string from the input buffer to determine which characters should be grouped together and whether any of these character groupings must be converted to numeric values. The parsing process uses information found in the INFILE statement (for example, DSD and DLM=), the INPUT statement (such as the $ for character data), and other sources (like the LENGTH statement) to determine the values it places in the PDV.

No delimiter options are present in the INFILE statement in Program 2.9.1; thus, a space is used as the default delimiter, and the first variable, FlightNum, is read using simple list input. SAS uses column pointers to keep track of where the parsing begins and ends for each variable and, with simple list input, SAS begins in the first column and scans until the first non-delimiter character is found. In this record, the first column is non-blank, so the starting pointer is placed there, indicated by the blue triangle below Table 2.9.5. Next, SAS scans until it finds another delimiter (a blank in this case), which is indicated below Table 2.9.5 with the red octagon. Thus, when reading the input buffer to create FlightNum, SAS has read from column 1 up to column 4.

Table 2.9.5: Column Pointers at the Starting and Ending Positions for Parsing FlightNum

01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
 4  3  9     1  2  /  1  1  /  2  0  0  0     L  A  X     2  0     1  3  7

Based on information defined in the descriptor portion of the data during compilation of the INPUT statement, FlightNum is a numeric variable with a default length of eight bytes. There are no additional instructions on how this value should be handled, so SAS converts the extracted character string “439” to the number 439 and sends it to the PDV. Note that the blank found in column 4 is not part of the parsed value—only non-delimiter columns are included. Table 2.9.6 shows the results of parsing FlightNum from the first record.

Table 2.9.6: Representation of the PDV After FlightNum is Read During the First Iteration

_N_  _ERROR_  FlightNum  Date      Destination  FirstClass  EconClass
1    0        439                               .           .

Before parsing begins for the next value, SAS advances the column pointer one position, in this case advancing to column 5. This prevents SAS from beginning the next value in a column that was used to create a previous value. This automatic advancement of a single column occurs regardless of the input style, even though it is only demonstrated here for simple list input.

Since Date is also read in using simple list input, the parsing process is similar to how FlightNum was read. SAS begins at column 5 and reads until it encounters the next delimiter, which is in column 15. Table 2.9.7 shows this with, as before, the blue triangle indicating the starting column and the red octagon the ending column.

Table 2.9.7: Column Pointers at the Starting and Ending Positions for Parsing Date

01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
 4  3  9     1  2  /  1  1  /  2  0  0  0     L  A  X     2  0     1  3  7

The characters read for Date are “12/11/2000”, and SAS must parse this string using the provided instructions, which are given by the dollar sign used in the INPUT statement for this variable. This declares its type as character and, since there are no instructions about the length, the default length of eight is used. Unlike numeric variables, where the length attribute does not control the displayed width, the length of a character variable is typically equal to the number of characters that can be stored. (Check the SAS Documentation to determine how length is related to printed width for character values in various languages and encodings.) The resulting value for date, shown in Table 2.9.8, has been truncated to the first 8 characters. This highlights that while list input reads from delimiter to delimiter, it only stores the value parsed from the input buffer subject to any attributes, such as type and length, previously established.

Table 2.9.8: Representation of the PDV after Date is Read During the First Iteration

_N_  _ERROR_  FlightNum  Date      Destination  FirstClass  EconClass
1    0        439        12/11/20               .           .

As demonstrated in Program 2.8.5, one way to prevent the truncation of Date values is to use a LENGTH statement to set the length of Date to at least 10 bytes. Another means of avoiding this issue with Date is to read it with an informat, a concept covered in Chapter 3. This has the added benefit of converting the Date values to numeric, allowing for easier sorting, computations, and changes in display formats.
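A sketch of the LENGTH-statement approach mentioned above; placing the LENGTH statement before the INPUT statement ensures the 10-byte length is already established when Date enters the PDV:

```sas
data work.flights;
   length Date $ 10;   /* room for all ten characters of 12/11/2000 */
   infile RawData('flights.prn');
   input FlightNum Date $ Destination $ FirstClass EconClass;
run;
```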

SAS continues through the input buffer and, since all variables in this example are read using simple list input, the reading and parsing follows the same process as before. Table 2.9.9 shows the starting and stopping position of each of the remaining variables. Note that because SAS automatically advances the column pointer by one column after every variable and because list input scans for the next non-delimiter, the starting position for Destination in this record is column 17 rather than column 16.

Table 2.9.9: Column Pointers at the Starting and Ending Positions for Parsing the Remaining Values

01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
 4  3  9     1  2  /  1  1  /  2  0  0  0     L  A  X     2  0     1  3  7

Table 2.9.10 contains the final PDV for the first record—the values that are sent to the data set at the end of the first iteration of the DATA step, which corresponds to the location of the RUN statement (or other step boundary).

Table 2.9.10: Representation of the PDV After Reading All Values

_N_  _ERROR_  FlightNum  Date      Destination  FirstClass  EconClass
1    0        439        12/11/20  LAX          20          137

After this record is sent to the data set, the execution phase returns to the top of the DATA step and the variables are reinitialized as needed. _N_ is incremented to 2, _ERROR_ remains at zero since no tracked errors have been encountered, and the remaining variables are set back to missing in this case. Further iterations continue the process: the next row is loaded into the input buffer, values are parsed to the PDV, and those values are sent to the data set at the bottom of the DATA step. This implicit iteration terminates when the end-of-file marker is encountered.

2.9.3 Debugging the DATA Step

Following the process from Section 2.9.2 may seem tedious at first, but understanding how SAS parses data when moving it from the raw file through the input buffer to the PDV (and ultimately to the data set) is crucial for success in more complex cases. Developing a program by trial and error is often inefficient; knowledge of the data-handling process makes for a smoother, more reliable development process. This section discusses several statements SAS provides to help follow aspects of the parsing process through iterations of the DATA step. Program 2.9.2 demonstrates the LIST statement using Input Data 2.9.1.

Program 2.9.2: Demonstrating the LIST Statements

data work.flights;
   infile RawData('flights.prn');
   input FlightNum Date $ Destination $ FirstClass EconClass;
   list;
run;

The LIST statement writes the contents of the input buffer to the log at the end of each iteration of the DATA step, placing a ruler before the first observation is printed. Log 2.9.2 shows the results of including the LIST statement in Program 2.9.2 for the first five records. The complete input buffer, including the delimiters, appears for each record.

Log 2.9.2: Demonstrating the LIST Statements

RULE:     ----+----1----+----2----+----3----+----4
1         439 12/11/2000 LAX 20 137  26
2         921 12/11/2000 DFW 20 131  25
3         114 12/12/2000 LAX 15 170  26
4         982 12/12/2000 dfw 5 85  25
5         439 12/13/2000 LAX 14 196  25

 Before writing the input buffer contents for the first time, the LIST statement prints a ruler to the log.

 The LIST statement writes the complete input buffer for the record.

 If the records have variable length, then the LIST statement also prints the number of characters in the input buffer.

Since the log is a plain-text environment, SAS cannot display non-printable characters such as tabs. However, in these cases, SAS prints additional information to the log to ensure an unambiguous representation of the input buffer. Program 2.9.3 demonstrates the results of using the LIST statement with a tab-delimited file.

Program 2.9.3: LIST Statement Results with Non-Printable Characters

data work.flights;
   infile RawData('flights.txt') dlm = '09'x;
   input FlightNum Date $ Destination $ FirstClass EconClass;
   list;
run;

 Note the different file extension. This data is similar to the data in Program 2.9.2, but in a tab-delimited file.

 The DLM= option uses the hexadecimal representation of a tab, 09, along with hexadecimal literal modifier, x.

 No change is necessary in the LIST statement, regardless of what, if any, delimiter is used in the file.

Because of the increase in information present in the log, Log 2.9.3 only shows the results for the first two records. For each record, the contents from the input buffer now occupy three lines in the log.

Log 2.9.3: LIST Statement Results with Non-Printable Characters

RULE:      ----+----1----+----2----+----3----+----4
1    CHAR  439.12/11/2000.LAX.20.137 25
     ZONE  3330332332333304450330333
     NUMR  439912F11F20009C189209137

2    CHAR  921.12/11/2000.DFW.20.131 25
     ZONE  3330332332333304450330333
     NUMR  921912F11F200094679209131

 The CHAR line represents the printable data from the input buffer. It displays non-printable characters as periods.

 The ZONE and NUMR rows give, for each character, the first and second digits of its two-digit hexadecimal representation, respectively.

 Note the fourth column in the input buffer appears to be a period. However, combining the information from the ZONE and NUMR lines indicates the hexadecimal value is 09—a tab.

Because SAS converts all non-printable characters to a period when writing the input buffer to the log, the ZONE and NUMR lines provide crucial information to determine the actual value stored in the input buffer. In particular, they provide a way to differentiate a period that was in the original data (hexadecimal code 2E) from a period that appears as a result of a non-printable character (for example, a tab with the hexadecimal code 09).
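One way to confirm such a diagnosis is to write suspect values to the log with the $HEX format. A small sketch (the DATA _NULL_ step and variable names are illustrative only):

```sas
data _null_;
   tab = '09'x;                      /* hexadecimal literal for a tab */
   dot = '.';                        /* an ordinary period */
   putlog tab= $hex2. dot= $hex2.;   /* a tab shows as 09, a period as 2E */
run;
```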

When debugging, two other useful statements are the PUT and PUTLOG statements. Both PUT and PUTLOG statements provide a way for SAS to write out information from the PDV along with other user-defined messages. The statements differ only in their destination—the PUTLOG statement can only write to the SAS log, while the PUT statement can write to the log or any file destination specified in the DATA step. The PUT statement is covered in more detail in Chapter 7; this section focuses on the PUTLOG statement. Program 2.9.4 uses Input Data 2.9.1 to demonstrate various uses of the PUTLOG statement, with Log 2.9.4 showing the results for the first record.

Program 2.9.4: Demonstrating the PUTLOG Statement

data work.flights;
   infile RawData('flights.prn');
   input FlightNum Date $ Destination $ FirstClass EconClass;
   putlog 'NOTE: It is easy to write the PDV to the log';
   putlog _all_;
   putlog 'NOTE: Selecting individual variables is also easy';
   putlog 'NOTE: ' FlightNum= Date;
   putlog 'WARNING: Without the equals sign variable names are omitted';
run;

 This PUTLOG statement writes the quoted string to the log once for every record. If the string begins with 'NOTE:', then SAS color-codes the statement just like a system-generated note, and it is indexed in SAS University Edition with other system notes.

 The _ALL_ keyword selects every variable from the PDV, including the automatic variables _N_ and _ERROR_.

 The PUTLOG statement accepts a mix of quoted strings and variable names or keywords.

 A variable name followed by the equals sign prints both the name and current value of the variable.

 Omitting the equals sign only prints the value of the variable, with a potentially adverse effect on readability.

 Beginning the string with 'WARNING:' or 'ERROR:' also ensures SAS color-codes the messages to match the formatting of the system-generated warnings and errors, and indexes them in SAS University Edition. If this behavior is not desired, use alternate terms such as QC_NOTE, QC_WARNING, and QC_ERROR to differentiate these user-generated quality control messages from automatically generated messages.

Log 2.9.4 Demonstrating the PUTLOG Statement

NOTE: It is easy to write the PDV to the log

FlightNum=439 Date=12/11/20 Destination=LAX FirstClass=20 EconClass=137 _ERROR_=0 _N_=1

NOTE: Selecting individual variables is also easy

FlightNum=439 12/11/20

WARNING: Without the equals sign variable names are omitted

When used in conjunction in a single DATA step, the PUTLOG and LIST statements allow for easy access to both the PDV and input buffer contents, providing a simple way to track down the source of data errors. In these examples, the PUTLOG results shown in Log 2.9.4 reveal the truncation in the Date values in the PDV, while the LIST results from Log 2.9.2 or 2.9.3 show the full date is present in the input buffer. Using them together, it is clear the issue with Date values is not present in the raw data and must relate to how the INPUT statement has parsed those values from the input buffer.
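Such checks can also be automated with conditional logic around a PUTLOG statement. A hypothetical quality-control check (the 8-character length test mirrors the truncation issue found above):

```sas
data work.flights;
   infile RawData('flights.prn');
   input FlightNum Date $ Destination $ FirstClass EconClass;
   /* flag records where Date fills its entire length, a sign of possible truncation */
   if length(Date) ge 8 then
      putlog 'QC_WARNING: Date may be truncated ' _N_= Date=;
run;
```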

One additional debugging tool, the ERROR statement, acts exactly as the PUTLOG statement does, while also setting the automatic variable _ERROR_ to one.
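A sketch of the ERROR statement in use; the range check is invented for illustration:

```sas
data work.flights;
   infile RawData('flights.prn');
   input FlightNum Date $ Destination $ FirstClass EconClass;
   /* negative passenger counts are impossible, so record them as data errors */
   if FirstClass lt 0 or EconClass lt 0 then
      error 'QC_ERROR: negative passenger count ' FlightNum=;
run;
```

Because the ERROR statement also sets _ERROR_ to one, SAS additionally writes the PDV contents to the log for any flagged record.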

2.10 Validation

The term validation has many definitions in the context of programming. It can refer to ensuring software is correctly performing the intended actions—for example, confirming that PROC MEANS is accurately calculating statistics on the provided data set. Validation can also refer to the processes of self-validation, independent validation, or both. Self-validation occurs when a programmer performs checks to ensure any data sets read in accurately reflect the source data, data set manipulations are correct, or that any analyses represent the input data sets. Independent validation occurs when two programmers develop code separately to achieve the same results. Once both programmers have produced their intended results, those results (such as listings of data sets, tables of summary statistics, or graphics) are compared and are considered validated when no substantive differences appear between the two sets of results. SAS provides a variety of tools to use for validation, including PROC COMPARE, which is an essential tool for independent validation of data sets.

The COMPARE procedure compares the contents of two SAS data sets, selected variables across different data sets, or selected variables within a single data set. When two data sets are used, one is specified as the base data set and the other as the comparison data set. Program 2.10.1 shows the options used to specify these data sets, BASE= and COMPARE=, respectively.

Program 2.10.1: Comparing the Contents of Two Data Sets

proc compare base = sashelp.fish compare = sashelp.heart;
run;

If no statements beyond those shown in Program 2.10.1 are included, the complete data portions of the two data sets are compared, along with meta-information such as types, lengths, labels, and formats. PROC COMPARE only compares attributes and values for variables with common names across the two data sets. Thus, even though the full data sets are specified for comparison, it is possible for individual variables in one data set to not be compared against any variable in the other set. Program 2.10.1 compares two data sets from the Sashelp library: Fish and Heart. These data sets are not intended to match, so submitting Program 2.10.1 produces a summary of the mismatches to demonstrate the types of output available from the COMPARE procedure.

Output 2.10.1: Comparing the Contents of Two Data Sets


While the output in this text is normally delivered in an RTF format using the Output Delivery System, the output from PROC COMPARE is not well-suited to such an environment, so Output 2.10.1 shows the results as they appear in the Output window in the SAS Windowing Environment. Regardless of the destination, the output from this example of the COMPARE procedure includes sections for the following:

 Data set summary—data set names, number of variables, number of observations

 Variables summary—number of variables in common, along with the number in each data set which are not found in the other set

 Observation summary—location of first/last unequal record and number of matching/nonmatching observations

 Values comparison summary—number of variables compared with matches/mismatches, listing of mismatched variables and their differences

A review of the output provided by PROC COMPARE shows, in this case, only two variables are compared, despite the Fish and Heart data sets containing 7 and 17 variables, respectively. This is because only two variables (Weight and Height) have names in common. As such, even if the results indicate the base and comparison data sets have no mismatches, it is important to confirm that all variables were compared before declaring the data sets are identical. Similarly, the number of records compared is the minimum of the number of records in the two data sets, so the number of records must be compared as well. Several options and statements exist to alter how comparisons are done and to direct some comparison information to data sets.
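For example, a VAR statement restricts the comparison to chosen variables, and listing options control how differences are reported. A sketch, assuming the LISTVAR option as described in the PROC COMPARE documentation (it lists variables found in only one of the two data sets):

```sas
proc compare base = sashelp.fish compare = sashelp.heart listvar;
   var Weight Height;   /* compare only the two variables with common names */
run;
```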

Since the Heart and Fish data sets are not expected to be similar, applying PROC COMPARE to them is a simplistic demonstration of the procedure. A more typical comparison is given in Program 2.10.2, which applies the COMPARE procedure to the data set read in by Program 2.8.8 (using fixed-position data) and the IPUMS2005Basic data set in the BookData library.

Program 2.10.2: Comparing IPUMS 2005 Basic Data Generated from Different Sources

data work.ipums2005basicSubset;
   set work.ipums2005basicFPa;
   where homeValue ne 9999999;
run;

proc compare base = BookData.ipums2005basic compare = work.ipums2005basicSubset
             out = work.diff outbase outcompare outdif outnoequal
             method = absolute criterion = 1E-9;
run;

proc print data = work.diff(obs=6);
   var _type_ _obs_ serial countyfips metro citypop homevalue;
run;

proc print data = work.diff(obs=6);
   var _type_ _obs_ city ownership;
run;

 To create a data set which differs from the provided BookData.IPUMS2005Basic data set, a WHERE statement is used to remove any homes with a home value of $9,999,999.

 OUT= produces a data set containing information about the differences for each pair of compared observations for all matching variables. SAS includes all compared variables and two automatic variables, _TYPE_ and _OBS_.

 OUTBASE copies the record being compared in the BASE= data set into the OUT= data set.

 Like OUTBASE, OUTCOMPARE copies the record being compared in the COMPARE= data set into the OUT= data set.

 OUTDIF produces a record that contains the difference between the OUTBASE and OUTCOMPARE records.
