Applied Data Mining for Forecasting Using SAS - Tim Rey

Contents

Chapter 2: Data Mining for Forecasting Work Process

2.1 Introduction

2.2 Work Process Description

2.2.1 Generic Flowchart

2.2.2 Key Steps

2.3 Work Process with SAS Tools

2.3.1 Data Preparation Steps with SAS Tools

2.3.2 Variable Reduction and Selection Steps with SAS Tools

2.3.3 Forecasting Steps with SAS Tools

2.3.4 Model Deployment Steps with SAS Tools

2.3.5 Model Maintenance Steps with SAS Tools

2.3.6 Guidance for SAS Tool Selection Related to Data Mining in Forecasting

2.4 Work Process Integration in Six Sigma

2.4.1 Six Sigma in Industry

2.4.2 The DMAIC Process

2.4.3 Integration with the DMAIC Process

Appendix: Project Charter

2.1 Introduction

This chapter describes a generic work process for implementing data mining for forecasting in real-world applications. By work process the authors mean a sequence of steps that leads to effective project management. Defining and optimizing work processes is a must in industrial applications. Adopting such a systematic approach is critical in order to solve complex problems and introduce new methods. The result of using work processes is that productivity is increased and experience is leveraged in a consistent and effective way. One common mistake some practitioners make is jumping into real-world forecasting applications while focusing only on technical knowledge and ignoring the organizational and people-related issues. It is the authors' opinion that applying forecasting in a business setting without a properly defined work process is a clear recipe for failure.

The work process presented here includes a broader set of steps than the specific steps related to data mining and forecasting. It includes all necessary action items to define, develop, deploy, and support forecasting models. First, a generic flowchart and description of the key steps are given in the next section, followed by a specific illustration of the work process sequence when using different SAS tools. The last section is devoted to the integration of the proposed work process into one of the most popular business processes widely accepted in industry: Six Sigma.

2.2 Work Process Description

The objective of this section is to give the reader a condensed description of the necessary steps to run forecasting projects in the real world. We begin with a high-level overview of the whole sequence as a generic flowchart. Each key step in the work process is described briefly with its corresponding substeps and specific deliverables.

2.2.1 Generic Flowchart

The generic flowchart of the work process for developing, deploying, and maintaining a forecasting project based on data mining is shown in Figure 2.1. The proposed sequence of action items includes all of the steps necessary for successful real-world applications, from defining the business objectives to organizing a reliable maintenance program for tracking the performance of the applied forecasting models.

Figure 2.1: A generic flowchart of the proposed work process


The forecasting project begins with a project definition phase. It gives a well-defined framework for approving the forecasting effort based on well-described business needs, allocated resources, and approved funding. As most practitioners already know, the next block—data preparation—often takes most of the time and the lion's share of the cost. It usually requires data extraction from internal and external sources and a lot of tricks to transform the initial disarray in the data into a time series database acceptable for modeling and forecasting. The appropriate tricks are discussed in detail in Chapters 5 and 6.

The block for variable reduction and selection captures the corresponding activities, such as various data mining and modeling methods, that are used to reduce the initial broad range of potential inputs (Xs) that drive the targeted forecasting variables (outputs, Ys) to a short list of the most statistically significant factors. The next block includes the various forecasting techniques that generate the models for use. Usually, it takes several iterations along these blocks until the appropriate forecasting models are selected, reliably validated, and presented to the final user. The last step requires an effective consensus-building process with all stakeholders. This loop is called the model development cycle.

The last three blocks in the generic flowchart in Figure 2.1 represent the key activities when the selected forecasting models are transferred from a development environment to production mode. This requires automating some steps in the model development sequence, including the monitoring of data quality and forecasting performance. Of critical importance is tracking the business performance metric as defined by its key performance indicators (KPIs) and tracking the model performance metric as defined by forecasting accuracy criteria. This loop is called the model deployment cycle, in which the fate of the model depends on the rate of model performance degradation. In the worst-case scenario of consistent performance degradation, the whole model development sequence, including project definition, might be revised and executed again.

2.2.2 Key Steps

Each block of the work process is described by defining the related activities and detailed substeps. In addition, the expected deliverables are discussed and illustrated with examples when appropriate.

Project definition steps

The first key step in the work process—project definition—builds the basis for forecasting applications. It is the least formalized step in the sequence and requires proactive communication skills, effective teamwork, and accurate documentation. The key objectives are to define the business motivation for starting the forecasting project and to set up as much structure as possible in the problem by effective knowledge acquisition. This is to be done well before beginning the technical work. The corresponding substeps to accomplish this goal, as well as the expected deliverables from the project definition phase, are described below.

Project objectives definition

This is one of the most important and most often mishandled substeps in the work process. A key challenge is defining the economic impact from the improved forecasts through KPIs such as reduced cost, increased productivity, increased market share, and so on. In the case of demand-driven forecasting, it is all about getting the right product to the right customer at the right time for the right price. Thus, the value benefits can be defined as any of the following (Chase 2009):

 a reduction in the instances when retailers run out of stock

 a significant reduction in customer back orders

 a reduction in the finished goods inventory carrying costs

 consistently high levels of customer service across all products and services

It is strongly recommended to quantify each of these benefits (for example, “a 15% reduction in customer back orders on an annual basis relative to the accepted benchmark”).

An example of an appropriate business objective for a forecasting project follows:

More accurate forecasts will lead to proactive business decisions that will consistently increase annual profit by at least 10% for the next three years.

Another challenge is finding a forecasting performance metric that is measurable, can be tracked, and is appropriate for defining success. An example of an appropriate quantitative objective that satisfies these conditions is the following definition:

The technical objective of the project is to develop, deploy, and support, for at least three years, a quarterly forecasting model that projects the price of Product A for a two-year time horizon and that outperforms the accepted statistical benchmark (naïve forecasting in this case) by 20% based on the average of the last four consecutive quarterly forecasts.

The key challenge, however, is ensuring that the defined technical objective (improved forecasting) will lead to accomplishing the business goal (increased profitability).

Project scope definition

Defining the forecasting project scope also needs to be as specific as possible. It usually includes the business geography boundaries, business envelope, market segments covered, data history limits, forecasting frequency, and work process requirements. For example, the project scope might include boundaries such as the following: the developed forecasting model will predict the prices of Product A in Germany based on internal records of sales. The internal historical data to be used starts in January of 2001 and is quarterly, and the project implementation has to follow Six Sigma, comply with the standard requirements for model deployment, and have the support of the Information Technologies department.

Project roles definition

Identifying appropriate stakeholders is another very important substep to take to ensure the success of forecasting projects. In the case of a typical large-scale business forecasting project, the following stakeholders are recommended as members of the project team:

 the management sponsor who provides the project funding

 the project owner who has the authority to allow changes in the existing business process

 the project leader who coordinates all project activities

 the model developers who develop, deploy, and maintain the models

 the subject matter experts—SMEs—who know the business process and the data

 the users who apply the forecasting models on a regular basis

System structure and data identification

The purpose of this substep is to capture and document the available knowledge about the system under consideration. This step provides a meaningful context for the necessary data and the data mining and forecasting steps. Knowledge acquisition usually takes several brainstorming sessions facilitated by model developers and attended by selected subject matter experts. The documentation may include process descriptions, market structure studies, system diagrams and process maps, relationship maps, etc. The authors' favorite technique for system structure and data identification is mind-mapping, which is a very convenient way of capturing knowledge and representing the system structure during the brainstorming sessions.

Mind-mapping (or concept mapping) involves writing down a central idea and thinking up new and related ideas that radiate out from the center.1 By focusing on key topics written down in the SMEs' own words, and then defining branches and connections between the topics, the knowledge of the SMEs can be mapped in a manner that helps the team understand and document the details necessary for future data and modeling activities. An example of a mind-map2 for system structure and data identification in the case of a forecasting project for Product A is shown in Figure 2.2.

The system structure, shown in the mind-map in Figure 2.2, includes three levels. The first level represents the key topics related to the project by radial branches from the central block named “Product A Price Forecasting.” In this case, according to the subject matter experts, the central topics are: Data, Competitors, Potential drivers, Business structure, Current price decision-making process, and Potential users. Each key topic can be structured in as many levels of detail as necessary. However, beyond three levels down, the overall system structure visualization becomes cumbersome and difficult to understand. An example of an expanded structure of the key topic Data down to the third level of detail is shown in Figure 2.2. The second level includes the two key types of data – internal and external. The third level of detail in the mind-map captures the necessary topics related to the internal and external data. All other key topics are represented in a similar way (not shown in Figure 2.2). The different levels of detail are selected by collapsing or expanding the corresponding blocks or the whole mind-map.

Figure 2.2: An example of a mind-map for Product A price forecasting


Project definition deliverables

The deliverables in this step are: (1) project charter, (2) team composition, and (3) approved funding. The most important deliverable in project definition is the charter. It is a critical document that in many cases defines the fate of the project. Writing a good charter is an iterative process that includes gradually reducing the uncertainty related to objectives, deliverables, and available data. The common rule of thumb is this: the less fuzzy the objectives and the more specific the language, the higher the probability of success. An example of the structure of this document in the case of the Product A forecasting project is given in the Appendix at the end of this chapter.

The ideal team composition is shown in the corresponding charter section in the Appendix. In the case of some specific work processes, such as Six Sigma, the roles and responsibilities are well defined in generic categories like green belts, black belts, master black belts, and so on.

The most important practical deliverable in the project definition step is committed financial support for the project, since this is when the real project work begins. No funding—no forecasting. It is as simple as that.

Data preparation steps

Data preparation includes all necessary procedures to explore, clean, and preprocess the previously extracted data in order to begin model development with the maximum possible information content in the data.3 In reality, data preparation is time-consuming, nontrivial, and difficult to automate. Very often it is also the most expensive phase of applied forecasting in terms of time, effort, and cost. External data might need to be purchased, which can be a significant part of the project cost. The key data preparation substeps and deliverables are discussed briefly below. The detailed description of this step is given in Chapters 5 and 6.

Data collection

The initial data collection is commonly driven by the data structure recommended by the subject matter experts in the system structure and data identification step. Data collection includes identifying the internal and external data sources, downloading the data, and then harmonizing the data in a consistent time series database format.

In the case of the example for Product A price forecasting, data collection includes the following specific actions:

 identifying the data mart that stores the internal data

 identifying the specific services and tags of the external time series available in Global Insights (GI), Chemical Market Associates, Inc. (CMAI), Bloomberg, and so on.

 collecting the internal data, which is generally done by the business data SMEs

 collecting the external data, which is done with the help of local GI or CMAI service experts

 harmonizing the collected internal and external data as a consistent time series database of the prescribed time interval

Data preprocessing

The common methods for improving the information content of the raw data (which very often are messy) include: imputation of missing data, accumulation, aggregation, outlier detection, transformations, expanding or contracting, and so on. All of these techniques are discussed in separate sections in Chapter 6.

Data preparation deliverables

The key deliverable in this step is a clean data set with combined and aligned targeted variables (Ys) and potential drivers (Xs) based on preprocessed internal and external data.

Of equal importance to the preprocessed data set is a document that describes the details of the data preparation along with the scripts to collect, clean and harmonize the data.

Variable reduction/selection steps

The objective of this block of the work process is to reduce the number of potential economic drivers for the dependent variable by various data mining methods. The data reduction process is done in two key substeps: (1) variable reduction and (2) variable selection in static transactional data. The main difference between the two substeps is the relation of the potential drivers or independent variables (Xs) to the targeted or dependent variables (Ys). In the case of variable reduction, the focus is on the similarity between the independent variables, not on their association with the dependent variable. The idea is that some of the Xs are highly related to one another; thus, removing redundant variables reduces data dimensionality. In the case of variable selection, the independent variables are chosen based on their statistical significance or similarity with the dependent variables. The details of the methods for variable reduction and selection are presented in Chapter 7, and a short description of the corresponding substeps and deliverables is given below.

Variable reduction via data mining methods

Since there is already a rich literature in the statistical and machine learning disciplines concerning approaches for variable reduction or selection, this book often refers to and contrasts methods used for "non-time series" or transactional data. New methods specifically for time series data are also discussed in more detail in Chapter 7. In the transactional data approach, the association among the independent variables is explored directly. Typical techniques used in this case are variable cluster analysis and principal component analysis (PCA). In both methods, the analysis can be based on either correlation or covariance matrices. Once the clusters are found, the variable with the highest correlation to the cluster centroid in each cluster is chosen as a representative of the whole cluster. Another frequently used approach is variable reduction via PCA, where a transformed set of new variables (based on the correlation structure of the original variables) is used that describes some minimum amount of variation in the data. This reduces the dimensionality of the problem in the independent variables.

In time series-based variable reduction, the time factor is taken into account. One of the most used methods is similarity analysis, where the data is first phase shifted and time warped. Then a distance metric is calculated to obtain the similarity measure between each pair of time series xi and xj. Variables below some critical distance are considered similar, and one of them can be selected as the representative. In the case of correlated inputs, the dimensionality of the original data set can be significantly reduced after removing the similar variables. PCA can also be used with time series data; an example is the work done by the Chicago Fed, in which a National Activity Index (CFNAI) based on 85 variables representing different sectors of the US economy was developed.4

Variable selection via data mining methods

Again, there is quite a rich literature on variable or feature selection for transactional data mining problems. In variable selection the significant inputs are chosen based on their association with the dependent variable. As in the case of variable reduction, different methods are applied to data with a time series nature as compared to transactional data. The first approach uses traditional transactional data mining variable selection methods. Some of the known methods, discussed in Chapter 7, are correlation analysis, stepwise regression, decision trees, partial least squares (PLS), and genetic programming (GP). In order to use these same approaches on time series data, the time series data has to be preprocessed properly. First, both the Ys and the Xs are made stationary by taking first differences. Second, some dynamics are added to the system by introducing lags for each X. As a result, the number of extended X variables to consider as inputs increases significantly. However, this enables you to capture dynamic dependencies between the independent and the dependent variables. This approach is often referred to as the poor man's approach to time series variable selection, since much of the extra work is done to prepare the data and then non-time series approaches are applied.
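
A minimal sketch of this preparation, assuming a quarterly data set WORK.SERIES with a target Y and a single candidate driver X1 (all names and thresholds are hypothetical), might look like the following:

data work.extended;
   set work.series;
   dy       = dif(y);      /* first difference of the target              */
   dx1      = dif(x1);     /* first difference of the driver              */
   dx1_lag1 = lag1(dx1);   /* one-quarter lag of the differenced driver   */
   dx1_lag2 = lag2(dx1);   /* two-quarter lag of the differenced driver   */
run;

/* Conventional stepwise selection applied to the extended, stationary inputs */
proc reg data=work.extended;
   model dy = dx1 dx1_lag1 dx1_lag2 / selection=stepwise slentry=0.15 slstay=0.15;
run;
quit;

In a real project the DATA step would loop over many candidate drivers and lags, but the idea is the same: difference, lag, and then let a transactional selection method do the screening.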

The second approach is more specifically geared toward time series. There are four methods in this category. The first one is the correlation coefficient method. The second one is a special version of stepwise regression for time series models. The third method is similarity, as discussed earlier in the variable reduction substep, but in this case the distance metric is between the Y and the Xs. Thus, the smaller the similarity metric, the better the relationship of the corresponding input to the output variable. The fourth approach is called co-integration, a specialized test of whether two time series variables move together in the long run. Much more detail is presented in Chapter 7 concerning these analyses.

One important addition to the variable selection is to be sure to include the SMEs' favorite drivers, or those discussed as such in market studies (such as CMAI in the chemical industry) or by the market analysts.

Event selection

Events are specific class variables in forecasting. These class variables help describe large discrete shifts and deviations in the time series. Examples of such variables are advertising campaigns before Christmas and Mother's Day, mergers and acquisitions, natural disasters, and so on. It is very important to clarify and define the events and their types in this phase of project development.

Variable reduction and selection deliverables

The key deliverable from the variable reduction and selection step is a reduced set of Xs that are less correlated to one another. It is assumed that it includes only the most relevant drivers or independent variables, selected by consensus based on their statistical significance and expert judgment. However, additional variable reduction is possible during the forecasting phase. Selected events are another important deliverable before beginning the forecasting activities.

As always, document the variable reduction/selection actions. The document should include a detailed description of all steps for variable reduction and selection as well as the arguments for the final selection based on statistical significance and subject matter experts' approval.

Forecasting model development steps

This block of the work process includes all necessary activities for delivering forecasting models with the best performance based on the available preprocessed data, given the reduced number of potential independent variables. Among the numerous options for designing forecasting models, the focus in this book is on the most used practical approaches for univariate and multivariate models. The related techniques and development methodologies are described in Chapters 8 through 11 with minimal theory and sufficient detail for practitioners. The basic substeps and deliverables are described below.

Basic forecasting steps: identification, estimation, forecasting

Even the most complex forecasting models are based on three fundamental steps: (1) identification, (2) estimation, and (3) forecasting. The first step is identifying a specific model structure based on the nature of the time series and the modeler's hypothesis. Examples of the most used forecasting model structures are exponential smoothing, autoregressive models, moving average models, their combination, the autoregressive moving average (ARMA) model, and unobserved component models (UCM). The second step is estimating the parameters of the selected model structure. The third step is applying the developed model with the estimated parameters for forecasting.
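
The three steps map directly onto the statements of a typical SAS/ETS forecasting procedure. The following PROC ARIMA sketch assumes a quarterly data set WORK.SERIES with a DATE variable and a PRICE target (hypothetical names) and an arbitrarily chosen ARMA(1,1) structure:

proc arima data=work.series;
   /* (1) Identification: examine ACF/PACF of the first-differenced series */
   identify var=price(1);
   /* (2) Estimation: fit an ARMA(1,1) structure to the differenced series */
   estimate p=1 q=1 method=ml;
   /* (3) Forecasting: project eight quarters ahead with confidence limits */
   forecast lead=8 id=date interval=qtr out=work.price_fcst;
run;
quit;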

Univariate forecasting model development

This substep represents the classical forecasting modeling process for a single variable. The future forecast is based on discovering trend, cyclicality, or seasonality in the past data. The developed composite forecasting model includes individual components for each of these identified patterns. The key hypothesis is that the patterns discovered in the past will persist into the future. In addition to the basic forecasting steps, univariate forecasting model development includes the following sequence (a minimal sketch follows the list):

 Dividing the data into in-sample set (for model development) and out-of-sample set (for model validation)

 Applying the basic forecasting steps for the selected method on an in-sample set

 Validating the model through appropriate residuals tests

 Comparing the performance by applying the model to an out-of-sample set where possible

 Selecting the best model
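
A minimal sketch of this in-sample/out-of-sample sequence follows. The data set, variable names, and cutoff date are hypothetical, and the additive Winters exponential smoothing model is just one of the candidate structures that would be compared:

/* Split the history into in-sample and out-of-sample (holdout) sets */
data work.insample work.outsample;
   set work.series;
   if date < '01jan2010'd then output work.insample;
   else output work.outsample;
run;

/* Fit an exponential smoothing model on the in-sample set and
   forecast over the eight-quarter holdout horizon */
proc esm data=work.insample outfor=work.fcst lead=8;
   id date interval=qtr;
   forecast price / model=addwinters;
run;

/* Out-of-sample accuracy: absolute percentage error against holdout actuals */
data work.accuracy;
   merge work.fcst(keep=date predict) work.outsample(keep=date price);
   by date;
   if price ne . and predict ne . then ape = abs(price - predict) / abs(price);
run;

proc means data=work.accuracy mean;
   var ape;   /* mean APE = MAPE over the holdout period */
run;

The same holdout MAPE computed for each candidate model provides a simple basis for the final model selection.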

Multivariate (in Xs) forecasting model development

This substep captures all the necessary activities to design forecasting models based on causal variables (economic drivers, input variables, exogenous variables, independent variables, Xs). One possible option is to develop the multivariate model as a time series model by using multiple regression. A limitation of this approach, however, is that the regression coefficients of the forecasting model are based on static relationships between the independent variables (Xs) and the dependent variable (Y). Another option is to use dynamic multiple regression, which represents the dynamic dependencies between the independent variables (Xs) and the dependent variable (Y) with transfer functions. In both cases, the same modeling sequence described in the previous section is followed. However, different model structures, such as the autoregressive integrated moving average with exogenous input (ARIMAX) model or the unobserved components model (UCM), are selected. Note that the forecasted values of each independent variable selected in the multivariate model are required for calculating the dependent variable forecast. In most cases the forecasted values are delivered via univariate models for the corresponding input variables; that is, developing univariate models is part of the multivariate forecasting model development substep.
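
A minimal dynamic regression sketch with PROC ARIMA is shown below. It assumes one selected driver X1 whose future values (obtained from its own univariate forecast) have already been appended to the data set; all names are hypothetical:

proc arima data=work.series_with_future_x;
   /* Identification of the differenced target with the differenced
      driver declared as a cross-correlated input */
   identify var=price(1) crosscorr=(x1(1));
   /* ARMA errors plus a simple transfer function term for the driver */
   estimate p=1 q=1 input=( x1 ) method=ml;
   /* Forecasts require future values of x1 to be present in the data set */
   forecast lead=8 id=date interval=qtr out=work.arimax_fcst;
run;
quit;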

Consensus planning

In one specific area of forecasting—demand-driven forecasting—it is of critical importance that each functional department (sales, planning, and marketing) reach consensus on the final demand forecast. In this case, consensus planning is a good practice. It takes into account the future trends, overrides, knowledge of future events, and so on that are not contained in the history.

Forecasting model development deliverables

The selected forecasting models with the best performance are the key deliverable not only of this step but of the whole project. In order to increase the performance, the final deliverable is often a combined forecast from several models, derived from different methods. In many applications the forecasting models are linked in a hierarchy, reflecting the business structure. In this case, reconciliation of the forecasts in the different hierarchical levels is recommended.

Another deliverable is documentation of the selected models' performance. The document summarizing the performance of the final models must include key statistics as well as a detailed description of the model validation and selection process. If sufficient data are available, it is recommended to test the robustness of the performance while changing key modeling process parameters, that is, by varying the sizes of the in-sample and out-of-sample sets.

The most important deliverable, however, is to convince the user to apply the forecasting models on a regular basis and to accomplish the business objectives. One option is to compare the model-generated and judgmental forecasts. Another option is to give the user the chance to test the model with different “What-If” scenarios. For final acceptance, however, a consistent record of forecasts within the expected performance metric for some specified time period is needed. It is also critical to prove the pre-defined business impact, that is, to demonstrate the value created by the improved forecasting.

Forecasting model deployment steps

This block of the work process includes the procedures for transferring the forecasting solution from the development to the production environment. The assumption is that beyond this phase the models will be put into the hands of the final users. Some users actively apply the forecasting models to accomplish the defined business objectives, either in an interactive mode by running "What-If" scenarios or by exploring optimal solutions. Other users are interested only in the forecasting reports delivered periodically or on demand. In both cases, a special version of the solution in a system-like production environment has to be developed and tested. The important substeps and deliverables for this block of the work process are discussed briefly below.

Production mode model deployment

It is assumed that in production mode the selected forecasting models can deliver automatic forecasts from updated data when invoked by the user or by another program. In order to accomplish this, the necessary data collection scripts, data preprocessing programs, and model code are combined into one entity. (In the SAS environment this entity is called a stored process.) In addition to the software from the model development cycle, code for testing the consistency of future data collections has to be designed and integrated into the entity. Usually, the test checks for large differences between the new data sample and the current historical values in the data. By default, the new forecast is based on applying the selected models with the existing model parameters over the updated data. In most cases the user interface in production mode is a user-friendly environment.
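
One simple form of such a consistency test is sketched below under hypothetical data set and variable names; it flags a newly arrived observation that deviates from the historical mean by more than an arbitrarily chosen number of standard deviations:

/* Summarize the historical series */
proc means data=work.history noprint;
   var price;
   output out=work.hist_stats mean=hist_mean std=hist_std;
run;

/* Flag any new observation more than four standard deviations from the
   historical mean; the threshold of four is only an illustrative choice */
data work.new_checked;
   if _n_ = 1 then set work.hist_stats(keep=hist_mean hist_std);
   set work.new_sample;
   suspect = (abs(price - hist_mean) > 4 * hist_std);
run;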

Forecasting decision-making process definition

In the end, the results from the forecasting models are used in business decisions, which create the real value. Unfortunately, with the exception of demand-driven forecasting (see examples in Chase, 2009), this substep is usually either ignored or implemented in an ad hoc manner. It is strongly recommended to specify the decision-making process as precisely as possible. Then the quality of the decisions should be tracked in the same way as the forecasting performance. Using the method of forecast value analysis (FVA) is strongly recommended.5 Even a perfect forecast can be buried by a bad business decision.

Forecasting model deployment deliverables

The ideal deliverable from this block of the work process is a user interface designed for the final users in an environment that they like. In most cases that environment is the ubiquitous Microsoft Excel. Fortunately, it is relatively easy to build such an interface with the SAS Add-In for Microsoft Office, as shown in Section 2.3.4.

Documenting the forecasting decision-making process is a deliverable of equal importance. The purpose of such a document is to define specific business rules that determine how to use the forecasting results. Initially the rule base can be created via brainstorming sessions with the subject matter experts. Another source of business rules definition could be a well-planned set of “What-If” scenarios generated by the forecasting models and analyzed by the experts. The end result is a set of business rules that link the forecasting results with specific actions and a value metric.

Training the user is a deliverable that is often forgotten by developers. The training includes demonstrating the production version of the software. It is also expected that a help menu is integrated into the software.

Forecasting model maintenance steps

The final block of the proposed work process includes the activities for tracking model performance and taking proper corrective actions if the performance deteriorates below some specified critical limit. This is one of the least developed areas in practical forecasting in terms of available tools and experience. It is strongly recommended to discuss the model support issue in advance, during the project definition phase. In the best-case scenario the project sponsor signs a service contract for a specified period of time. The users must understand that, due to continuous changes in the economic environment, forecasting models deteriorate with time and professional service is needed to maintain high-quality forecasts. A short description of the corresponding substeps and deliverables is given below.

Statistical baseline definition

The necessary precondition for performance assessment is to define a statistical baseline. The accepted baseline is called the naïve forecast, which assumes that the current observation can be used as the future forecast. It is also very important to explain to the final user the meaning of a forecast, since untrained users tend to look at the predicted number at the end of the forecast horizon as the only performance metric. A forecast is defined as the combination of: (1) predictions, (2) prediction standard errors, and (3) confidence limits at each time sample in the forecast horizon (Makridakis et al. 1998). The performance metric can be based on the difference between the defined forecast of the selected model and the accepted benchmark (the naïve forecast).
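
As a rough sketch, a one-step-ahead variant of the naïve benchmark and its MAPE can be computed as follows (data set, variable names, and the holdout window are hypothetical); the selected model's MAPE over the same window can then be compared with this value to report the relative improvement:

/* One-step-ahead naive forecast: the previous actual is the forecast */
data work.naive;
   set work.series;
   naive = lag(price);
   if naive ne . then ape_naive = abs(price - naive) / abs(price);
run;

/* MAPE of the naive benchmark over the holdout window */
proc means data=work.naive mean;
   where date >= '01jan2010'd;
   var ape_naive;
run;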

Performance tracking

Performance monitoring is usually scheduled on a regular basis after every new data update. The tracking process includes two key evaluation metrics: (1) data consistency checks and (2) forecast performance metric evaluation. The data consistency check validates that the new data sample does not differ from the most recent data beyond some defined threshold. The forecast performance check is based on comparing the forecast of the selected model with the naïve forecast. Based on these two metrics, a set of decision rules is defined for appropriate corrective actions. The potential changes include either re-estimating the model parameters while keeping the existing structure or completely redesigning the model and identifying a new forecasting model structure.

Of critical importance is also tracking the business impact of the forecast-driven decisions on the KPIs. One possible solution for doing so is using business intelligence portals and dashboards (Chase 2009).

Forecasting model maintenance deliverables

The key deliverable in this final block of the work process is a performance status report. It includes the corresponding tables and trend charts to track the discussed metrics as well as the action items if corrective actions are taken.

2.3 Work Process with SAS Tools

The objective of this section is to specify how the proposed generic work process can be implemented with the wide range of software tools developed by SAS. A generic overview of the key SAS software tools related to data mining and forecasting is shown in Figure 2.3.

The SAS tools are divided into two categories depending on the requirements for programming knowledge: (1) tools that require programming skills and (2) tools that are based on functional block schemes and do not require programming skills. The first category consists of the software kernel of all SAS products—Base SAS with its set of operators and functions, as well as specific toolboxes of specialized procedures in selected areas. Examples of such toolboxes related to data mining and forecasting are SAS/ETS (includes the key procedures for time series analysis), SAS/STAT (includes procedures for a wide range of statistical methodologies), SAS/GRAPH (allows creating various high-resolution color graphics plots and charts), SAS/IML (enables programming of new methods based on the powerful Interactive Matrix Language, IML), and SAS High-Performance Forecasting (includes a set of procedures for high-performance forecasting).

The second category of SAS tools, based on functional block schemes, shown in Figure 2.3, includes three main products: SAS Enterprise Guide, SAS Enterprise Miner, and SAS Forecast Server. SAS Enterprise Guide allows high-efficiency data preprocessing and development, basic statistical analysis, and forecasting by linking functional blocks. SAS Enterprise Miner is the main tool for developing data mining models based on built-in functional blocks, and SAS Forecast Server is a highly productive forecasting environment with a very high level of automation. Business clients can interact with all model development tools via the SAS Add-In for Microsoft Office.

Figure 2.3: SAS software tools related to data mining in forecasting


SAS also has another product with statistical, data mining, and forecasting capabilities. It is called JMP. However, because its functionality is similar to SAS Enterprise Guide and SAS Enterprise Miner, it is not discussed in this book. For those readers interested in the forecasting capabilities of JMP, a good starting point is JMP Start Statistics: A Guide to Statistics and Data Analysis Using JMP (Sall, J., Creighton, L., and Lehman, A. 2009).

2.3.1 Data Preparation Steps with SAS Tools

The wide range of SAS tools gives the developer many options to effectively implement all of the data preparation steps. Good examples at the Base SAS level are the DATA step for generic data collection and PROC SQL for writing specific data extracts.6 The specific functions or built-in functional blocks for data preparation in the SAS tools related to data mining and forecasting are discussed briefly below.
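
For example, a hypothetical PROC SQL extract that rolls transactional sales of Product A up to a quarterly average price might look like this (library, table, and column names are assumptions for illustration):

proc sql;
   create table work.product_a_price as
   select intnx('qtr', sale_date, 0, 'b') as qtr format=yyq6.,
          avg(unit_price)                 as avg_price
   from sales.orders
   where product_id = 'A'
     and sale_date >= '01jan2001'd
   group by calculated qtr
   order by qtr;
quit;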

Data preparation using SAS/ETS

The key SAS/ETS procedures for data preparation are as follows:

DATASOURCE provides seamless access to time series data from commercial and governmental data vendors, such as Haver Analytics, Standard & Poor's Compustat Service, the U.S. Bureau of Labor Statistics, and so on. It enables you to select time series with a specific frequency over a selected time range and across cross sections of the data.

EXPAND provides different types of time interval conversions, such as converting irregular observations into periodic format or constructing quarterly estimates from annual data. Another important capability of this procedure is interpolating missing values in time series via the following methods: cubic splines, linear splines, step functions, and simple aggregation.
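
A minimal PROC EXPAND sketch that converts a monthly series to quarterly averages and interpolates embedded missing values with the default cubic spline (data set and variable names are hypothetical):

proc expand data=work.monthly out=work.quarterly from=month to=qtr;
   id date;
   /* quarterly value = average of the monthly observations */
   convert price / observed=average method=spline;
run;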

TIMESERIES has the ability to process large amounts of time-stamped data. It accumulates transactional data to time series and performs correlation, trend, and seasonal analysis on the accumulated time series. It also delivers descriptive statistics for the corresponding time series data.
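
A sketch of accumulating time-stamped transactions to a monthly series with PROC TIMESERIES, with trend and seasonal summaries requested (names are hypothetical):

proc timeseries data=work.transactions
                out=work.monthly
                outtrend=work.trend
                outseason=work.season;
   id txn_date interval=month accumulate=total setmissing=0;
   var qty;
run;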

X11 and X12 both provide seasonal adjustment of time series by decomposing monthly or quarterly data into trend, seasonal, and irregular components. The procedures are based on slightly different methods that were developed by the U.S. Census Bureau as the result of years of work by census researchers. X12 includes additional diagnostic tests to be run after the decomposition and the ability to remove the effect of input variables before the decomposition.7

Data preparation using SAS Enterprise Guide

SAS Enterprise Guide has built-in functional blocks that enable you to automate many data manipulation procedures (such as filtering, sorting, transposing, ranking, and comparing) without writing programming code. The two functional blocks for time series data preparation are Create Time Series Data and Prepare Time Series Data. Each block is a functional user interface to SAS/ETS procedures. Create Time Series Data is the user interface to TIMESERIES and Prepare Time Series Data is the corresponding user interface to EXPAND.

The advantage of using functional block flows for implementing different steps of the proposed work process is clearly demonstrated with a simple example in Figure 2.4. The SAS Enterprise Guide flow shows the process of developing ARIMA forecasting models from the transactional data of 42 products. The original 42 transactional series are transformed into monthly time series by the Create Time Series block, and the forecasting models are generated by the ARIMA Modeling functional block. The results with the corresponding graphical plots are summarized and output in a Word document.

Figure 2.4: An example of a SAS Enterprise Guide flow for time series data preparation and modeling


Another advantage of SAS Enterprise Guide is that it can access all SAS procedures either as separate blocks or as additional code within the existing blocks.

Data preparation using SAS Enterprise Miner

SAS Enterprise Miner is another SAS tool based on functional blocks, but its focus is on data mining. An additional advantage of this product is that it also imposes a work process. The work process abbreviation SEMMA (Sample-Explore-Modify-Model-Assess) includes the following key steps:

Sample the data by creating informationally rich data sets. This step includes data preparation blocks for importing, merging, appending, partitioning, and filtering, as well as statistical sampling and converting transactional data to time series data.

Explore the data by searching for clusters, relationships, trends, and outliers. This step includes functional blocks for association discovery, cluster analysis, variable selection, statistical reporting, and graphical exploration.

Modify the data by creating, imputing, selecting, and transforming the variables. This step includes functional blocks for removing variables, imputation, principal component analysis, and defining transformations.

Model the data by using various statistical or machine learning techniques. This step includes the use of functional blocks for linear and logistic regression, decision trees, neural networks, and partial least squares, among others, as well as importing models defined by other developers even outside SAS Enterprise Miner.

Assess the generated solutions by evaluating their performance and reliability. This step includes functional blocks for comparing models, cutoff analysis, decision support, and score code management.

The data preparation functionality is implemented in the Sample and Modify sets of functional blocks.

Recently, a special set of SAS Enterprise Miner functional blocks related to Time Series Data Mining (TSDM) has been released by SAS. Its functionality covers most of the needed procedures for exploring forecasting data. The data preparation step is delivered by the Time Series Data Preparation (TSDP) node, which provides data aggregation, summarization, differencing, merging, and the replacement of missing values.

2.3.2 Variable Reduction and Selection Steps with SAS Tools

Variable reduction and selection steps using specialized SAS subroutines

The key procedures for variable reduction and selection based on SAS/ETS and SAS/STAT are discussed briefly below.

AUTOREG (SAS/ETS) estimates and predicts linear regression models with autoregressive errors as well as stepwise regression. It also combines autoregressive models with autoregressive conditionally heteroscedastic (ARCH) and generalized autoregressive conditionally heteroscedastic (GARCH) models and generates a variety of model diagnostic tests, tables, and plots.

MODEL (SAS/ETS) analyzes and simulates systems of nonlinear regression equations. It supports dynamic nonlinear models of multiple equations and includes a full range of nonlinear parameter estimation methods, such as nonlinear ordinary least squares, generalized method of moments, nonlinear full information maximum likelihood, and so on.

PLS (SAS/STAT) fits models by extracting successive linear combinations of the predictors, called factors (also called components or latent variables), which optimally address one or both of these two goals: explaining response or output variation and explaining predictor variation. In particular, the method of partial least squares balances the two objectives, seeking factors that explain both response and predictor variation. The contribution of the original variables to the factors is important to variable selection.

PRINCOMP (SAS/STAT) provides PCA on the input data. The results contain eigenvalues, eigenvectors, and standardized or unstandardized principal component scores.

REG (SAS/STAT) is used for linear regression with options for forward and backward stepwise regression. It provides all necessary diagnostic statistics.

SIMILARITY (SAS/ETS) computes similarity measures associated with time-stamped data, time series, and other sequentially ordered numeric data. A similarity measure is a metric that measures the distance between the input and target sequences while taking into account the ordering of the data.
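
A sketch of PROC SIMILARITY computing squared-deviation distances between a target series and several candidate inputs (data set and variable names are hypothetical); smaller summary distances in the OUTSUM= table suggest closer association with the target:

proc similarity data=work.series outsum=work.sim_summary;
   id date interval=qtr;
   input x1 x2 x3;                 /* candidate driver series          */
   target y / measure=sqrdev;      /* squared-deviation distance to y  */
run;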

VARCLUS (SAS/STAT) divides a set of variables into clusters. Associated with each cluster is a linear combination of the variables in the cluster. This linear combination can be generated by two options: as a first principal component or as a centroid component. The VARCLUS procedure creates an output data set with component scores for each cluster. A second output data set can be used to draw a decision tree diagram of hierarchical clusters. The VARCLUS procedure is very useful as a variable-reduction method since a large set of variables can be replaced by the set of cluster components with little loss of information.
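
A minimal PROC VARCLUS sketch for reducing a block of candidate drivers (names and the eigenvalue threshold are hypothetical); splitting stops when every cluster's second eigenvalue falls below the threshold, and one representative variable, typically the one with the lowest 1-R-squared ratio, is kept per cluster:

/* Cluster 30 candidate independent variables */
proc varclus data=work.drivers maxeigen=0.7 short;
   var x1-x30;
run;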

Variable reduction and selection steps using SAS Enterprise Miner

The data mining capabilities in SAS Enterprise Miner for variable reduction and selection are spread across the Explore, Modify, and Model tabs. It is not a surprise that the functional blocks are based on the SAS procedures discussed in the previous section. The functional blocks or nodes of interest are the following:

In the Explore tab:

Variable Clustering node implements the VARCLUS procedure in SAS Enterprise Miner—that is, it assigns input variables to clusters and allows variable reduction with a small set of cluster-representative variables.

Variable Selection node evaluates the importance of potential input variables in predicting the output variable based on R-squared and Chi-squared selection criteria. The variables that are not related to the output variable are assigned a rejected status and are not used in the model building.

In the Modify tab:

Principal Components node implements the PRINCOMP procedure and, in the case of linear relationships, reduces the dimensionality of the original input data to the most important principal components that capture a significant part of the data variability.

In the Model tab:

Decision Tree node splits the data in the form of a decision tree. Decision tree modeling is based on performing a series of if-then decision rules that sequentially divide the target variable into a small number of homogeneous groups that form a tree-like structure. One of the advantages of this block, in the case of variable selection, is that it automatically ranks the input variables based on the strength of their relationship to the target.

Partial Least Squares node implements the PLS procedure.

Gradient Boosting node uses a specific partitioning algorithm, developed by Jerome Friedman, called a gradient boosting machine.8

Regression node generates either linear regression models or logistic regression models. It supports stepwise, forward, and backward variable selection methods.

Two SAS Enterprise Miner nodes—TS Similarity (TSSIM) and TS Dimension Reduction (TSDR), which are part of the new Time Series Data Mining tab—can be used for variable reduction as well. The TS Similarity node implements the SIMILARITY procedure based on four distance metrics: squared deviation, absolute deviation, mean square deviation, and mean absolute deviation, and delivers a similarity map. The TS Dimension Reduction node applies four reduction techniques to the original data: singular value decomposition (SVD), discrete Fourier transformation (DFT), discrete wavelet transformation (DWT), and line segment approximations.

2.3.3 Forecasting Steps with SAS Tools

Forecasting using SAS/ETS

The key SAS/ETS forecasting procedures are described briefly below.

ARIMA generates ARIMA and ARIMAX models as well as seasonal models, transfer function models, and intervention models. The modeling process includes identification, parameter estimation, and forecasting, with generation of a variety of diagnostic statistics and model performance metrics, such as Akaike's information criterion (AIC) and Schwarz's Bayesian criterion (SBC or BIC).

ESM can generate forecasts for time series and transactional data based on exponential smoothing methods. It also includes several data transformation methods, such as log, square root, logistic, and Box-Cox.

FORECAST is the old version of ESM.

STATESPACE generates multivariate models based on a state space representation of the system. It includes automatic model structure selection, parameter estimation, and forecasting.

UCM provides a development tool for unobserved component models. It generates the corresponding trend, seasonal, cyclical, and regression effects components, estimates the model parameters, performs model diagnostics, and calculates the forecasts and confidence limits of all the model components and the composite series.
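
A sketch of a quarterly UCM with level, slope, and trigonometric seasonal components (data set and variable names are hypothetical):

proc ucm data=work.series;
   id date interval=qtr;
   model price;
   irregular;                  /* irregular (noise) component          */
   level;                      /* stochastic level                     */
   slope;                      /* stochastic slope (local linear trend) */
   season length=4 type=trig;  /* quarterly trigonometric seasonality  */
   estimate;
   forecast lead=8 alpha=0.05 outfor=work.ucm_fcst;
run;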

VARMAX is very useful for forecasting multivariate time series, especially when the economic or financial variables are correlated to each other's past values. The VARMAX procedure enables modeling the dynamic relationship both between the dependent variables and between the dependent and independent variables. It uses a variety of modeling techniques, criteria for automatic determination of the autoregressive and moving average orders, model parameter estimation methods, and several diagnostic tests.

Forecasting using SAS Enterprise Guide

The forecasting capabilities of the SAS Enterprise Guide built-in blocks are very limited. However, all the SAS/ETS functionality can be used via SAS Enterprise Guide code nodes. The key built-in forecasting blocks in the Time Series Tasks are described briefly below.

Basic Forecasting generates forecasting models based on exponential smoothing and stepwise autoregressive fit of time trend.

ARIMA Modeling and Forecasting generates ARIMA models, but the identification and parameter estimation methods have to be selected by the modeler.

Regression Analysis with Autoregressive Errors provides linear regression models for time series data in the case of correlated errors and heteroscedasticity.

Forecasting using SAS Forecast Studio

SAS Forecast Studio is one of the most powerful engines for large-scale forecasting available in the market. It generates automatic forecasts in batch mode or executes custom-built models through an interactive graphical interface. SAS Forecast Studio enables the user to interactively set up the forecasting process, hierarchy, parameters, and business rules as well as to enter specific events. Another very useful feature is hierarchical reconciliation with the ability to reconcile the hierarchy bottom-up, middle-out, or top-down.

SAS Forecast Studio does not require programming skills, and the whole forecasting step can be done in an easy-to-use GUI where the model selection list includes exponential smoothing models with optimized parameters, ARIMA models, unobserved components models, dynamic regression, and intermittent demand models. It is also possible to define a model repository and events. The automatic model generation includes outlier detection, event identification, and automatic variable selection. The forecasting results are represented in numerous graphical reports. In the case of multivariate models, you can explore different "What-If" scenarios to determine the influence of the key drivers on the dependent variable forecast. SAS Forecast Studio is a highly productive environment. It can generate thousands of time series forecasts in minutes.

2.3.4 Model Deployment Steps with SAS Tools

Model deployment on model development tools

Some forecasting applications use the development environment for model deployment. An obvious disadvantage of this option is that the user must be familiar with at least some of the capabilities of the development software. Using the development environment for model deployment is appropriate only in specialized cases with educated end users. In the case of large-scale industrial forecasting, however, this option is not recommended.

Model deployment via stored processes

One of the advantages of using various SAS tools is that you can communicate the results to the user via stored processes. A stored process is a SAS program that is stored centrally on a server. The final validated models are usually packaged as stored processes in the development environment (SAS Enterprise Guide, SAS Enterprise Miner, or SAS Forecast Server) and saved on a specified server. A client application can then execute the program and receive and process the results locally.

The most popular application for using the SAS stored processes on the client side is the SAS Add-In for Microsoft Office. After installing the Add-In, the user can select the corresponding stored process and invoke the forecasting application. In the case of Excel, the results are represented in spreadsheets with the rich graphic capabilities of this popular tool. Another option for model deployment is via SAS Web Report Studio.

2.3.5 Model Maintenance Steps with SAS Tools

One option for model maintenance is using SAS Model Manager, which manages and monitors analytical model performance in a central repository. It enables users to monitor the performance of a large number of models by defining different performance metrics and then generating model performance and comparison reports. Unfortunately, managing models generated by SAS Forecast Studio is currently not possible with SAS Model Manager.

Another option for forecasting model performance tracking is to develop corresponding stored processes that generate periodic reports. In the case of performance deterioration, the user can contact the developer to perform the proper corrective actions of parameter re-fitting or complete model re-development. These options are available in SAS Forecast Studio.

2.3.6 Guidance for SAS Tool Selection Related to Data Mining in Forecasting

This section closes with some generic guidelines on how to select the appropriate SAS tools for applying the work process discussed.

SAS/ETS includes all the generic procedures for forecasting, such as FORECAST, AUTOREG, ARIMA, VARMAX, X11, X12, SPECTRA, and so on.9 It is an appropriate solution if the model developers have good programming skills in Base SAS and are knowledgeable in forecasting methods.

SAS/STAT provides the generic statistical procedures, such as REG, PRINCOMP, PLS, VARCLUS, and so on. From an implementation point of view, it has the same requirements as SAS/ETS. Both tools are appropriate for small-scale applications and prototype development and require skilled SAS programmers.

SAS Enterprise Guide enables fast system development based on a combination of built-in functional blocks using Base SAS procedures. It is a very good environment to integrate data preprocessing and some data mining and forecasting functions. SAS Enterprise Guide requires minimal Base SAS programming experience. Another advantage of SAS Enterprise Guide is its impressive reporting and graphical capabilities.

SAS Enterprise Miner is the ideal non-programming development tool for data preprocessing and data mining activities. An additional advantage in the case of data mining in forecasting is the recently released set of nodes for Time Series Data Mining (TSDM), which enables fully functional time series preprocessing, variable reduction, and selection.

SAS Forecast Server provides automatic model and report generation for a wide range of forecasting algorithms. For large-scale industrial forecasting, this is the tool.

2.4 Work Process Integration in Six Sigma

The best-case scenario for implementing the proposed work process for data mining in forecasting is to integrate it into some existing work process. Doing so minimizes cultural change since the organization of interest has already introduced work processes according to its strategy. One popular work process in many organizations using demand-driven forecasting is Sales & Operations Planning (S&OP), in which the operations planning and financial departments match supply to a demand forecast and generate supply plans.10

In this book we discuss a more generic integration of the proposed work process into the most widespread work process in industry—Six Sigma. The advantages of this integration are as follows: fast acceptance (Six Sigma is used in more than 50% of Fortune 500 companies; see below), reduced organizational effort, established project management, well-defined stakeholders' roles, good opportunities for training and leveraging, and so on.

2.4.1 Six Sigma in Industry

What it is

Six Sigma, a method or set of techniques, has become a movement and a management religion for business process improvement.11 It is a quality measurement and improvement program, originally developed by Motorola in the 1980s, that focuses on controlling a process to the point of ± six sigma (standard deviations) from a centerline. The systematic Six Sigma quality program provides businesses with the tools to improve the capability of their business processes. At the basis of the Six Sigma methodology is the simple observation that customers feel the variance, not the mean. In other words, reducing the variance of product defects is the key to making customers happy. What is important about Six Sigma is that it provides not only technical solutions but also a consistent work process for pursuing continuous improvement in profit and customer satisfaction. This is one of the reasons for the enormous popularity of this methodology in industry.

Industrial acceptance

According to iSixSigma Magazine,12 about 53% of Fortune 500 companies are currently using Six Sigma, and that figure rises to 82% for the Fortune 100. Over the past 20 years, the use of Six Sigma has saved Fortune 500 companies an estimated $427 billion. Companies that properly implement Six Sigma have seen profit margins grow 20% year after year for each sigma shift (up to about 4.8 to 5.0 sigma). Since most companies start at about 3 sigma, virtually every employee trained in Six Sigma will return, on average, $230,000 per project to the bottom line until the company reaches 4.7 sigma. After that, the cost savings are not as dramatic.

Key roles

One of the key advantages of Six Sigma is its use of well-defined roles in project development. The typical roles are Champion, Black Belt, Green Belt, and Master Black Belt. The Champion is responsible for the success of the projects, provides the necessary resources, and breaks down organizational barriers. The project leader is called a Black Belt. Project team members are called Green Belts; they do not spend all of their time on projects and receive training similar to that of Black Belts, but for less time. There is also a Master Black Belt level. Master Black Belts are experienced Black Belts who have worked on many projects; they typically know the more advanced tools and the business, have had leadership training, and often have teaching experience. A primary responsibility of Master Black Belts is mentoring new Black Belts.

2.4.2 The DMAIC Process

The classic Six Sigma methodology includes the following key phases, known as the Define-Measure-Analyze-Improve-Control (DMAIC) process:

Define: Understand the problem.
Measure: Collect data on the problem.
Analyze: Find the root causes of the problem.
Improve: Make changes to eliminate root causes.
Control: Ensure that the problem is solved.

A brief description of each phase is given below.

Define

The objective of the define phase is to clearly identify the problem to be solved and to communicate it to all stakeholders. The team members and the timelines are laid down. Project objectives are based on needs identified by collecting the voice of the customer, and opportunities are defined by understanding the flaws of the existing (as-is) process. The key document in this phase is the project charter, which includes the financial and technical objectives, an assessment of the necessary resources, an allocation of the stakeholders' roles, and a project plan.

Measure

The goal of the measure phase is to understand the problem in more detail by collecting all available data around it. The following questions need to be answered: what the problem really is, where it occurs, when it occurs, what causes it, and how it occurs. The key deliverables in this phase are identifying the factors (inputs) that can influence the defect (output) and collecting and preparing the data for analysis. Other very important deliverables are the statistical measures derived from the data, such as the sigma level of the defect and the process capability (the ability of a process to satisfy customer expectations, measured by the sigma range of the process's variation).
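Process capability is often summarized with the Cp and Cpk indices, which are not named explicitly above but are simple functions of the specification limits and the estimated process mean and standard deviation. A minimal sketch with hypothetical values:

data _null_;
   /* hypothetical specification limits and process estimates */
   usl = 110; lsl = 90; xbar = 98; s = 2.5;
   cp  = (usl - lsl) / (6 * s);                      /* potential capability */
   cpk = min(usl - xbar, xbar - lsl) / (3 * s);      /* capability allowing for an off-center mean */
   put cp= cpk=;
run;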

Analyze

The objective of the analyze phase is to analyze the collected data and to identify the root causes (critical Xs) of the problem. The analyses are mostly statistical and are designed to identify the critical inputs affecting the defect or output, Y = f(X). The potential cause-and-effect relationships or models are discussed and prioritized by the experts. As a result, several potential solutions to the problem are identified for deployment.
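When the collected data support a regression model, one common (though by no means the only) way to screen for the critical Xs in SAS is stepwise selection; the data set and the variable names y and x1-x10 below are hypothetical:

proc reg data=work.measure_data;
   /* screen candidate inputs for the critical few that drive the output */
   model y = x1-x10 / selection=stepwise slentry=0.10 slstay=0.05;
run;
quit;

The surviving inputs are then reviewed with the subject matter experts, since statistical significance alone does not establish a cause-and-effect relationship.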

Improve

The purpose of the improve phase is to remove the impact of the root causes by implementing changes in the process. Its goal is to prioritize the various solutions suggested during brainstorming and to select the best one, the solution with the highest impact and the lowest cost and effort. Before implementation begins, however, the selected solution is tested with data to validate the predicted improvements. At the end of this phase a pilot solution is implemented, and the initial results are communicated to all stakeholders.

Control

The objective of this final phase is to complete the implementation of the selected solutions and to validate that the problem has been eliminated. A measurement system is usually set up to determine whether the problem has been solved and the expected performance has been met. One of the key deliverables in the control phase is a process control and monitoring plan. Part of the plan is the transition of ownership from the development team to the final user.

2.4.3 Integration with the DMAIC Process

Two options for integration with the DMAIC Six Sigma process are discussed briefly below: (1) the integration of a generic data mining work process and (2) the integration of the proposed work process for data mining in forecasting.

Data mining within DMAIC

The key blocks of a data mining process within the Six Sigma framework (defined in Kalos and Rey 2005) are shown in Figure 2.5.

The purpose of the first key block, Strategic Intent, is to ensure the relevance of the proposed data mining project to the strategic business goals, enterprise-wide initiatives, and management improvement plans. Another objective of this block is to identify the business success criteria, including various measurements (customer, process, and financial measurements).

The second key block, System & Data Identification, has objectives similar to those of the system structure and data identification substep in the Project Definition block. The third key block, Data Preprocessing, includes activities such as preliminary data analysis, variable selection, data transformation, and documenting the results. The fourth block, Opportunity Discovery, combines data analysis strategy development, exploratory data analysis, and model development and performance assessment. The fifth and last block of the data mining process within Six Sigma, Opportunity Deployment, is characterized by three main activities: (1) immediately using the developed models for business decisions, (2) integrating the developed models in other projects, and (3) triggering other projects based on the discovered opportunities and generating preliminary Six Sigma project charters.

Figure 2.5: Key blocks of data mining in Six Sigma


Figure 2.5 is from Alex Kalos and Tim Rey's paper, “Data mining in the chemical industry” (2005). The details of using this data mining process within the Six Sigma framework are also given in this paper.

Data mining in forecasting within DMAIC

The other option, integrating the proposed work process for data mining in forecasting within Six Sigma, is illustrated in Figure 2.6, where we can see the corresponding links between the key blocks of both methodologies. The project definition steps, including system identification, are part of the define phase of DMAIC. The data preparation steps belong to the measure phase, and both the variable selection and reduction steps and the forecasting model development steps are included in the analyze phase. The forecasting model deployment steps are part of the improve phase of DMAIC, and the last part of the forecasting work process, the forecasting model maintenance steps, is linked to the control phase of DMAIC.

Figure 2.6: Correspondence of the Data Mining in Forecasting Work Process with DMAIC


Because of the clear link between the proposed work process (based on the requirements for developing high-performance forecasting) and a work process such as Six Sigma (which is almost universally adopted in industry), you can integrate the two processes with minimal effort and cultural change. As a result, you have greater opportunities to introduce the proposed methodology and can more efficiently manage projects and develop forecasting systems.

Appendix: Project Charter

Opportunity Statement:

The current forecast is judgmental, with an average mean absolute percent error (MAPE, defined after this list) of 16.5% across four quarterly forecasts.

 The opportunity is to improve the forecast by using statistical methods.

The key hypothesis is that more accurate forecasts will lead to proactive business decisions that will consistently increase profit by at least 10%.
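For reference, MAPE over n forecast periods is the average absolute percentage deviation of the forecasts from the actuals:

MAPE = (100 / n) × Σ |A(t) − F(t)| / |A(t)|, summed over t = 1, …, n,

where A(t) is the actual value and F(t) is the forecast for period t. The accuracy figures in this charter are expressed in terms of this measure.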

Project Goal and Objective:

The technical objective of the project is to develop, deploy, and support, for at least three years, a quarterly forecasting model that projects the price of Product A over a two-year time horizon and that outperforms the accepted statistical benchmark (naïve forecasting in this case) by 20%, based on the average of four consecutive quarterly forecasts.

Project Scope and Boundaries:

 The project will focus on Product A price in Germany.

Deliverables:

 a forecasting model with user interface in Excel

 a decision scheme with proactive action items to increase profits

Timeline:

Estimated duration of the key steps of the project:

Project definition: 40 hours
Data preparation: 80 hours
Model development: 60 hours
Model deployment: 20 hours
Model maintenance: 10 hours per year

Team Composition:

The ideal team includes:

Management sponsor

Project owner

Project leader

Technical subject matter experts

Model developers

End users

1 A good starting point for developing mind-maps is Tony Buzan's The Mind-map Book (2003).

2 The mind-maps in this book are based on the Mindjet product MindManager 8, available from http://www.mindjet.com/.

3 A classic book about data preparation is Dorian Pyle's Data Preparation for Data Mining (1999).

4 Evans, C., Liu, C. and Pham-Kanter, G. “The 2001 recession and the Chicago Fed National Activity Index: Identifying business cycle turning points,” Economic Perspectives 26, no. 3 (2002): 26–43.

5 The FVA method is described in Michael Gilliland's book, The Business Forecasting Deal (2010).

6 A book with many examples of using different SAS solutions for data preparation is Gerhard Svolba's Data Preparation for Analytics Using SAS (2006).

7 A good explanation of X11 and X12 is given by Spyros G. Makridakis et al. in Forecasting: Methods and Applications (1997).

8 Friedman, J. H. “Greedy function approximation: A gradient boosting machine,” Annals of Statistics 29 (2001): 1189–1232.

9 A useful classification of the SAS/ETS functions is given in Table 1.1 in the book SAS for Forecasting Time Series (2003) by John Brocklebank and David Dickey.

10 A detailed description of S&OP is given in Charles Chase's Demand-Driven Forecasting: A Structured Approach to Forecasting (2009).

11 The reader can find more information about Six Sigma in Implementing Six Sigma: Smarter Solutions Using Statistical Methods (2003) by Forrest Breyfogle III.

12 January/February 2007 Issue at http://www.isixsigma-magazine.com/
