Applied Data Mining for Forecasting Using SAS - Tim Rey
Chapter 3: Data Mining for Forecasting Infrastructure
Contents
3.1 Introduction
3.2 Hardware Infrastructure
3.2.1 Personal Computers Network Infrastructure
3.2.2 Client/Server Infrastructure
3.2.3 Cloud Computing Infrastructure
3.3 Software Infrastructure
3.3.1 Data Collection Software
3.3.2 Data Preparation Software
3.3.3 Data Mining Software
3.3.4 Forecasting Software
3.3.5 Software Selection Criteria
3.4 Data Infrastructure
3.4.1 Internal Data Infrastructure
3.4.2 External Data Infrastructure
3.5 Organizational Infrastructure
3.5.1 Developers Infrastructure
3.5.2 Users Infrastructure
3.5.3 Work Process Implementation
3.5.4 Integration with IT
3.1 Introduction
Applying data mining for forecasting in a business requires serious investments in hardware, software, and training, but a cultural change must also take place. It is very important to estimate the size of the investment based on technical requirements and the products that are available in the market. The four main components of any forecasting infrastructure are hardware, software, data, and organization. The first three components build the technical basis to support applied data mining for forecasting, and the fourth component is critical to effectively change the culture of the organization. This chapter is focused on an enterprise-wide implementation strategy of data mining for forecasting. The importance of integrating the selected options into the existing corporate infrastructure is discussed at the end of the chapter.
3.2 Hardware Infrastructure
The objective of this section is to give the reader a condensed overview of the potential hardware architectures for implementing data mining for forecasting systems in an industrial setting. Three options are discussed briefly below: (1) PC network, (2) client/server, and (3) cloud computing infrastructures. However, because technology changes rapidly, today's recommendations can easily become obsolete tomorrow.
3.2.1 Personal Computers Network Infrastructure
The least expensive hardware solution for implementing data mining for forecasting systems in an industrial setting is to avoid any additional hardware expenses and use the existing information system infrastructure. Usually, this is based on a PC network. The key advantages of this option are as follows:
low cost
easy integration in the existing information system infrastructure
minimal installation and maintenance efforts
robust performance due to the decentralized architecture
The main limitations of the PC network infrastructure solution for implementing data mining for forecasting systems are as follows:
limitations for large data set processing
slower processing speed relative to servers
limited operating systems options
3.2.2 Client/Server Infrastructure
The client/server model assumes a division of the computing resources between clients or workstations with local processing capabilities and servers with large memory and disk space and more powerful processors. The clients request services such as data, and the servers retrieve resources and deliver the requested information. The number of servers required depends on the number of clients, network speed and capacity, global and local operation, reliability, and so on.
An example of a minimal client/server infrastructure based on SAS is shown in Figure 3.1. The example includes four types of servers and two types of clients: a modeler PC and a final user PC. One server is allocated to handle metadata. A data mart server, based on Oracle, interacts with the large database cluster containing the corporate data. The third server hosts the SAS server software and is devoted to intensive computing tasks. Several clients can share the server resources either for developing new models or running developed models as stored processes.
The key advantages of the client/server infrastructure for implementing data mining for forecasting are given below:
very powerful processing capabilities
large memory and high-throughput disk
the use of different operating systems
capacity to process large data sets
Figure 3.1: An example of client/server infrastructure based on SAS
The disadvantages of this option are as follows:
high cost
more complex maintenance and support
lower reliability if servers are down
The advantages, however, outweigh the disadvantages and the client/server infrastructure is the standard solution for large-scale industrial applications of data mining and forecasting.
3.2.3 Cloud Computing Infrastructure
Another potential solution, called cloud computing, uses powerful external and internal computing resources, and includes grid computing for parallel processing, multi-tiered computer architecture, and the capacity to handle super-large data sets. Such services are currently offered by a number of vendors including well-established industry leaders. Some of the advantages of using this option are as follows:
low implementation and maintenance cost
super-computer power, which is continuously upgraded by the cloud owner
data consolidation in very large data sets
increased reliability
The disadvantages of using a cloud computing infrastructure are summarized as follows:
proprietary data security
initial transfer of very large corporate data to the cloud
limited software
trust issues
information technology (IT) management resistance
This option is still in an exploratory phase and has generated a lot of hype. However, if the technical and economic advantages are proved with more industrial applications, it could become a popular hardware infrastructure in the near future.
3.3 Software Infrastructure
The lion's share of the costs for implementing data mining for forecasting systems, especially for the PC network infrastructure, is not the cost of hardware but the cost of software infrastructure. One of the key decisions to make in advance is the scale of the efforts. In the case of large-scale forecasting on a corporate level that is to be implemented across the globe, an integrated software environment made up of all necessary components with global support is strongly recommended. An example of such infrastructure (based on SAS software) is discussed in this book.
3.3.1 Data Collection Software
This part of the infrastructure strongly depends on the existing corporate information system architecture. Unfortunately, it could be very diverse with different database platforms. In most cases, however, the data are organized in relational databases and stored in separate tables for each entity. The relationship between the tables is defined by two columns—primary key and foreign key columns (Svolba 2006). Data that are accessed from a relational database are usually extracted table by table and are merged according to the primary or foreign keys.
The software basis for handling data in relational database systems is the Structured Query Language (SQL). It includes the necessary operators for searching data pieces as well as different aggregations and joins of tables. The leading relational database systems include Oracle, SAP MaxDB and Sybase, Microsoft SQL Server, and IBM DB2. The good news is that the existing key software programs for data mining, such as SAS Enterprise Miner, IBM SPSS [1], and StatSoft STATISTICA Data Miner [2], include all necessary software interfaces to collect data from diverse sources [3]. For example, SAS offers a specialized tool, SAS/ACCESS, that has almost universal capabilities for access, retrieval, and integration with any available data source [4].
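The table-by-table extraction and key-based merge described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than any of the database systems named in the text, and the tables (product, sales), their columns, and the sample rows are hypothetical names and values invented for illustration.

```python
import sqlite3

# Hypothetical schema: "product" holds one row per entity (primary key
# product_id); "sales" refers back to it through a foreign key column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        month      TEXT NOT NULL,
        units      INTEGER NOT NULL
    );
    INSERT INTO product VALUES (1, 'Resin A'), (2, 'Resin B');
    INSERT INTO sales VALUES
        (1, 1, '2011-01', 120),
        (2, 1, '2011-01', 30),
        (3, 2, '2011-01', 75),
        (4, 1, '2011-02', 140);
""")

# Merge the two tables on the key columns, then aggregate units
# per product and month to obtain a small time series extract.
monthly_units = conn.execute("""
    SELECT p.name, s.month, SUM(s.units) AS total_units
    FROM sales AS s
    JOIN product AS p ON p.product_id = s.product_id
    GROUP BY p.name, s.month
    ORDER BY p.name, s.month
""").fetchall()
conn.close()

for row in monthly_units:
    print(row)
```

The same join and aggregation pattern applies regardless of the database platform; only the connection layer changes.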
3.3.2 Data Preparation Software
It is recommended that the selected software has the following functionality for data preparation:
Data manipulation capabilities that include functions for summary tables generation, data split, concatenation, transposition, stacking, sorting, flexible filtering, joining tables, and so on.
Missing data handling that includes different options to impute missing data.
Data description capabilities that are usually based on basic descriptive statistics, frequency tables, histograms, and so on.
Data visualization capabilities that include a broad spectrum of graphics, such as 3-D scatter plots, contour plots, parallel plots, and so on.
Data pre-processing capabilities that include filtering, outlier detection and removal, data sampling, data partitioning, data transformation, and so on.
Examples of software tools with these capabilities are SAS Enterprise Guide, JMP, IBM SPSS, and StatSoft STATISTICA Data Miner.
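Two of the listed functions, missing data imputation and outlier detection, can be sketched in a few lines. The sketch below is not taken from any of the tools named above: it assumes linear interpolation for imputation and a modified z-score (median and median absolute deviation, with the commonly used 3.5 cutoff) for outlier flagging, and the sample series is invented.

```python
from statistics import median

def impute_linear(values):
    """Fill None gaps by linear interpolation between the nearest observed
    neighbors; leading or trailing gaps take the nearest observed value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled

def flag_outliers(values, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold.
    The score uses the median and the median absolute deviation (MAD),
    so a single extreme value does not distort the scale estimate."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# Invented monthly series with two gaps and one suspicious spike.
raw = [100.0, 102.0, None, 106.0, 500.0, 108.0, None, 112.0]
clean = impute_linear(raw)
flags = flag_outliers(clean)
print(clean)
print(flags)
```

In practice the choice of imputation method and outlier rule depends on the series; a flagged point may be a genuine event rather than an error and should be reviewed before removal.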
3.3.3 Data Mining Software
From the broad range of available data mining methods and functions, the following capabilities for variable reduction and selection are needed for forecasting applications:
Basic statistical capabilities that include building and analyzing linear regression models with options for variable selection by forward and backward stepwise regression.
Multivariate analysis capabilities that include cross-correlation analysis, PCA, and PLS.
Clustering capabilities that include dividing variables into clusters by linear or nonlinear methods, similarity analysis, and building decision trees.
Variable selection capabilities that include different algorithms for variable selection, such as stepwise regression, decision trees, gradient boosting, singular value decomposition (SVD), and so on.
The three most popular software options for industrial applications that offer most of these capabilities are SAS Enterprise Miner, IBM SPSS, and StatSoft STATISTICA Data Miner.
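As a rough illustration of variable reduction by cross-correlation analysis, the sketch below ranks candidate drivers by their strongest lagged correlation with the target and drops those below a cutoff. It is a simplified stand-in for the capabilities listed above, not the procedure of any named product; the function names, the 0.5 cutoff, the maximum lag, and the data are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def best_lag(target, driver, max_lag):
    """Return (lag, r) with the largest |r| when the driver leads the target by lag periods."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]   # driver at time t - lag
        y = target[lag:]                  # target at time t
        r = pearson(x, y)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

def rank_drivers(target, drivers, max_lag=3, min_abs_corr=0.5):
    """Keep drivers whose best lagged |r| passes the cutoff, strongest first."""
    scored = []
    for name, series in drivers.items():
        lag, r = best_lag(target, series, max_lag)
        if abs(r) >= min_abs_corr:
            scored.append((name, lag, r))
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)

# Invented data: the target follows "leading_index" with a two-period delay;
# "flat_tariff" never changes and therefore carries no information.
target = [5, 9, 12, 16, 14, 20, 18, 22, 26, 24]
drivers = {
    "leading_index": [1, 3, 2, 5, 4, 6, 8, 7, 9, 10],
    "flat_tariff": [3] * 10,
}
ranked = rank_drivers(target, drivers)
for name, lag, r in ranked:
    print(f"{name}: lag={lag}, r={r:.3f}")
```

A simple correlation screen like this only captures linear, pairwise relationships; methods such as stepwise regression or decision trees are needed when drivers interact or overlap.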
3.3.4 Forecasting Software
The recommended capabilities for effective development of forecasting models in industrial applications are as follows:
Time series analysis capabilities that include generating time series, different time plots, correlations, seasonality adjustments, decompositions, and so on.
Forecasting model generation capabilities that include the most popular methods, such as exponential smoothing, ARIMA, unobserved components, and so on with a variety of diagnostic statistics and model performance metrics.
Event modeling capabilities that allow large discrete shifts (events) to be introduced during model development.
Hierarchical forecasting capabilities that include developing a model hierarchy at the desired level based on the existing business structure and reconciling this with the final forecast.
Scenario generation capabilities for multivariate-based forecasting models—these different “what if” scenarios can show the impact of the key inputs on the final forecast.
The most powerful software tools that offer these capabilities are SAS Forecast Studio, Automatic Forecasting Systems Autobox, and Business Forecast Systems Forecast Pro.
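To make the simplest of the listed methods concrete, here is a minimal sketch of simple exponential smoothing together with one performance metric, the mean absolute error. It assumes the level is initialized with the first observation and uses a fixed smoothing weight; commercial tools typically optimize the weight and offer many more diagnostics. The demand figures are invented.

```python
def simple_exponential_smoothing(series, alpha):
    """Return (one-step-ahead forecasts, forecast for the next period).

    The level is initialized with the first observation, so the first
    fitted value equals the first data point by construction.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    fitted = []
    for y in series:
        fitted.append(level)                      # forecast made before seeing y
        level = alpha * y + (1 - alpha) * level   # update the level with y
    return fitted, level

def mean_absolute_error(actual, forecast):
    """Average absolute difference between actual values and forecasts."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Invented monthly demand history.
demand = [120.0, 132.0, 128.0, 141.0, 150.0, 147.0]
fitted, next_forecast = simple_exponential_smoothing(demand, alpha=0.3)

# Skip the first point, whose forecast is the initialization itself.
mae = mean_absolute_error(demand[1:], fitted[1:])
print(f"next-period forecast: {next_forecast:.1f}, in-sample MAE: {mae:.1f}")
```

A larger alpha makes the forecast react faster to recent observations; a smaller one smooths out noise at the cost of slower adaptation.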
3.3.5 Software Selection Criteria
In addition to the specific technical capabilities of the key software components for a data mining for forecasting system, the following generic selection criteria are recommended:
Cost depends mostly on the ease-of-use of the corresponding packages. Most of the time the tools based on building blocks (such as SAS Enterprise Miner) or the high-performance forecasting tools (such as SAS Forecast Server) cost more. However, the increased productivity they deliver is significant. An additional advantage is the shorter learning and product adaptation time, which lowers the total cost.
Functionality: it is strongly recommended that you carefully check whether the necessary technical functionality described in the previous sections is available, and that you avoid any compromises. The capability to add new methods is also recommended.
Ease-of-use is enhanced by programming based on building blocks, a high level of automation for data pre-processing and model generation, an interactive graphic interface, and minimal programming necessary to deploy models (all features of SAS Enterprise Guide, for example).
Report generation is a significant step during model development as well as during model deployment and when transferring ownership to clients. During the model building phase many detailed reports with time series analysis, model diagnostics, or variable selection results are needed for successful decision-making. For model deployment, good reporting capabilities for model performance and value tracking are critical in order to keep the client happy.
The learning effort required depends on the software's ease-of-use, users' experience in statistics and forecasting within the organization, and the training courses and materials offered by the vendor. Products with a steep learning curve can significantly delay implementation efforts and reduce the impact of the technology for data mining in forecasting.
Global support 24/7—a fast, professional response to model development and implementation issues that is available globally is critical for the success of data mining for forecasting in industry. This is one of the key factors to consider when selecting the proper software vendor. Very few have the capacity to provide this type of service.
3.4 Data Infrastructure
Developing and maintaining a data infrastructure that can reliably supply the data needed to develop and apply forecasting models is a critical step for the final success. The data infrastructure for data mining in forecasting consists of two key parts: internal data from the business and external data from various sources, such as Global Insight, Bloomberg, CMAI, and so on. The essence of both cases is described briefly in the following sections.
3.4.1 Internal Data Infrastructure
Very often creating an internal data infrastructure for data mining in forecasting is the key bottleneck of the whole effort. There are several issues that contribute to this situation. The first issue is the diverse nature of data sources in different parts of the business. This issue is especially difficult to resolve during the transition period after mergers and acquisitions when various types of databases need to be integrated. The second issue is the different time interval and duration with which historical data are kept in the system. Very often the time interval (week, month, or quarter) is different and inconsistent for the historical periods of interest. A similar situation is observed with the duration of historical data. In many cases time history is too short to represent the patterns necessary to build and validate a good forecasting model. The third issue is the structural changes in the business since corresponding models need to be rebuilt with revised history after each significant change.
The internal data infrastructure depends on the corporate data infrastructure. One option to communicate and synchronize the extracts is by using a separate server. (See the example in Figure 3.1.) The basis of data infrastructure design is the definition of the metadata (the data about the data). The cost for maintenance and support of the internal data infrastructure depends on the internal cost structure set by corporate IT.
3.4.2 External Data Infrastructure
Usually, the data about potential economic drivers are not available internally and need to be delivered by external sources. Examples of such sources are the Bloomberg services [5], with various types of financial data such as equities, commodities, and foreign exchange rates, and the Global Insight services [6], with more than 30 million time series of different kinds from across the globe, such as prices, economic indicators, and labor costs. The external data are generally consistent and collected in a timely manner, and some include forecast values for a given forecasting horizon. The last feature is very beneficial when these data are used as inputs to the multivariate in X forecasting models.
There are two options for delivering external data. The first one is based on accessing the necessary data by direct extracts from the key sources. The second option is based on building an internal database of the most frequently used external data. The advantages of the second approach are the synchronized update of all needed external data, fast search of the specific economic drivers, and more reliable maintenance of deployed models. However, this option requires allocating internal resources for the design and maintenance of the database and training of potential users.
An example of integrating different external and internal data sets in a data set that is appropriate for data mining in forecasting is shown in Figure 3.2. It includes three external data sets (Bloomberg, Global Insight, and CMAI) and two internal data sets. The different data are integrated in the forecasting data set based on a selected starting time and time interval (month, quarter, or year). Those time series with different time intervals are appropriately expanded or contracted in a previous step as described in Chapter 6.
Figure 3.2: Integrating external and internal data in a data set ready for data mining in forecasting analysis
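The integration step can be sketched as follows, under stated assumptions. This is not the procedure from Chapter 6: it simply contracts a monthly internal series to quarters by summation (appropriate for flows such as units sold, while levels or indices would usually be averaged), and then keeps only the periods from a chosen starting time that are present in every series. All series names and values are hypothetical.

```python
from collections import defaultdict

def month_to_quarter(month):
    """Map a 'YYYY-MM' label to a 'YYYYQn' label."""
    year, mm = month.split("-")
    return f"{year}Q{(int(mm) - 1) // 3 + 1}"

def contract_to_quarterly(monthly, how="sum"):
    """Aggregate a {'YYYY-MM': value} series to quarters.
    Incomplete quarters (fewer than three months) are dropped."""
    buckets = defaultdict(list)
    for month, value in monthly.items():
        buckets[month_to_quarter(month)].append(value)
    result = {}
    for quarter, values in buckets.items():
        if len(values) != 3:
            continue
        result[quarter] = sum(values) if how == "sum" else sum(values) / 3
    return result

def integrate(series_by_name, start):
    """Build one row per period, keeping periods >= start found in every series."""
    common = set.intersection(*(set(s) for s in series_by_name.values()))
    periods = sorted(p for p in common if p >= start)
    return [
        {"period": p, **{name: s[p] for name, s in series_by_name.items()}}
        for p in periods
    ]

# Hypothetical internal data (monthly units sold; July starts an incomplete quarter).
internal_sales = {
    "2011-01": 100, "2011-02": 110, "2011-03": 120,
    "2011-04": 130, "2011-05": 125, "2011-06": 135,
    "2011-07": 140,
}
# Hypothetical external data, already quarterly.
external_price_index = {"2010Q4": 98.0, "2011Q1": 101.5, "2011Q2": 103.0, "2011Q3": 104.2}

dataset = integrate(
    {
        "sales_units": contract_to_quarterly(internal_sales, how="sum"),
        "price_index": external_price_index,
    },
    start="2011Q1",
)
for row in dataset:
    print(row)
```

Dropping incomplete quarters is a conservative choice; an alternative is to expand the quarterly series to months, which requires an assumption about how values are distributed within each quarter.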
The cost of maintaining and supporting the external data infrastructure depends on the subscription services cost, the cost of developing and maintaining an internal database, and the internal cost of corporate IT.
3.5 Organizational Infrastructure
The objective of this section is to give the reader possible ways to build an organizational infrastructure for data mining in forecasting in a business. We briefly discuss organizing model developers and forecasting users, selecting a proper work process, and integrating everything into the corporate IT environment.
3.5.1 Developers Infrastructure
A key strategic business decision related to a forecasting organization is deciding how much to invest in people who can develop forecasting models. The type of the forecasting development effort and its size depend on the projected demand for forecasting projects in the organization. Other factors that have to be taken into account are as follows:
the available internal personnel in corporate IT who can support forecasting models by managing the data, infrastructure, and operations
the strategic commitment of key users for time and resources
the available internal skills in the area of modeling, statistics, data mining, and forecasting
the level of experience in carrying out forecasting projects
Below we briefly discuss three ways to organize developers: (1) external consultant services, (2) distributed developers in organizations (key users of forecasting services), and (3) a centralized group of developers.
External consultant services
This is the minimum-investment solution for when you have low expected demand, no strategic commitment, and a lack of internal resources. The only allocated internal resources are for project management and interaction with the external consultants. However, even in this case, some basic training for forecasting and statistics is recommended. It is preferable to have a well-prepared test case when you begin the working relationship with the external consultants. (Some suggestions on how to prepare an effective test case are given by Michael Gilliland in his book The Business Forecasting Deal.) The key advantage of this solution is the minimum cost. The key disadvantage is the total dependence on external resources.
Distributed developers
This organizational structure is appropriate in small or medium-size businesses when the demand for forecasting services is concentrated in several key users, such as marketing and sales, supply chain, and purchasing. Often they prefer to own the whole model development and deployment process and hire experts with forecasting knowledge. In many cases they do not invest in the high-end hardware and software infrastructure, such as SAS Forecast Studio. The key advantage of this solution is the ability to implement forecasting capabilities with internal resources in appropriate business functions at an affordable cost. The key disadvantage is the limited capacity for growth.
Centralized developers group
The best-case scenario for applying data mining in forecasting in larger organizations is to build a centralized group of developers. The group must have the capacity to respond quickly to the growing demand for forecasting projects from various sections of a large corporation. The skill set of the developers' team must strike a proper balance between system and data support expertise and modeling capabilities in the area of statistics, data mining, and forecasting. An example of key roles in a centralized group of data mining for forecasting is given below.
The system administrator maintains servers, upgrades software, handles security issues, and interacts with IT.
The data administrator maintains data integrity, identifies internal and external data sources, and collects and harmonizes data.
The modeler interacts with clients, identifies system structure and data, pre-processes the data, performs variable reduction and selection, and develops, validates, implements, and maintains forecasting models.
The manager manages the group, delivers needed resources, and brings in projects.
The proper place of this group within a large organization is in the centralized corporate business services. This group serves all potential users so that the return on investment is maximized. The size of the group depends on projected demand. However, at least five to seven developers are needed to be efficient. It is assumed that a period of at least two to three years is needed for the group to establish itself by building infrastructure, hiring, learning, promoting to potential clients, and developing test projects. The funding during this period is centralized and gradually gives way to a self-support mode where projects are supported directly by their clients. The key issue that will determine the fate of this group is whether a sustainable project pipeline can be maintained.
3.5.2 Users Infrastructure
Forecasting users come from different parts of the organization. Typical clients for statistical forecasting services are the marketing, sales, financial, purchasing, and operations planning departments. Forecasting users can be classified in the following four categories, briefly discussed below: (1) forecasting reports users, (2) planners, (3) decision-makers, and (4) top level managers. (A similar user classification for demand-driven forecasting is described in detail in Charles Chase's book Demand-Driven Forecasting: A Structured Approach to Forecasting.)
Forecasting reports users
These are the users who passively use the delivered forecasts for information purposes only without making direct business decisions based on specific forecasting results or participating in judgmental forecasting or process planning. Most of the top managers are in this category. Recently many businesses have included forecasts in their regular performance tracking reports distributed to middle-to-top-level managers. The value of forecasting for this category of users is in giving them an awareness of the projected directions of the key performance indicators of interest.
Planners
In contrast to the previous category, planners actively use the delivered statistically based forecasting models in developing their sales, marketing, or operations plans. Very often they also have the right to override the statistical forecasts with their judgmental estimates. In the case of demand-driven forecasting, these are the users in marketing and sales who “shape the demand” based on analytics and domain knowledge. Of all the categories of forecasting users, planners are the most educated and the most directly involved in the model development and deployment loops. They have the decisive role in introducing expert knowledge by defining events, evaluating model performance, and making the final forecast adjustments. Planners also have the responsibility to recommend to the decision-makers which developed plans, based on the delivered and adjusted forecasts, should get final approval.
Decision-makers
This category of forecasting users includes the middle-layer managers at the departmental level who are responsible for the results of the plans recommended by the planners. They also make the final decision for implementing the plans. Part of the decision-making process is balancing the recommended statistically-driven forecasts from the experts (planners and model developers) and the top management push. Often the decision-making process goes through several iterations until a consensus is reached. This category of users is critical for the success of specific forecasting projects and the overall forecasting activities in the business. Success for decision-makers is not based on model performance measured by forecasting accuracy but is based on the expected value measured by the key performance indicators (KPIs).
Top level managers
This category includes the top executives related to finances, IT, and operations. As users, they might have different roles. One critical role is to establish and support financially, for some period of time, the forecasting capabilities in the organization. Executives might request forecasting projects for developing a business strategy as well. It is expected that at any moment the top executives can access the forecasting reports at any level of the organization and keep track of the KPIs. Finally, they can actively influence what decisions are made regarding the implementation of the action plans based on forecasting models.
3.5.3 Work Process Implementation
A key component in developing the organizational infrastructure is selecting and implementing an appropriate work process for data mining in forecasting. An example of such a work process is given in Chapter 2. It is also very important to integrate the selected work process with the existing corporate culture. The best-case scenario is to consolidate the data mining for forecasting work process with the existing standard work processes in the organization. If you can do so, the implementation cost and the time for integration into the corporate culture will be significantly reduced. An example of integration with the most popular work process in industry, Six Sigma, is described in Chapter 2. Another example, a popular work process in demand-driven forecasting called Sales & Operations Planning (S&OP), is given in Chase 2009.
3.5.4 Integration with IT
An organizational issue of critical importance for the final success of applying data mining for forecasting is the smooth integration with corporate IT services. Unfortunately, the integration process can be bumpy, largely due to the different mode of operation of IT. The IT department is often focused primarily on implementing standard solutions across the business. The focus of data mining for forecasting is on delivering custom and, therefore, nonstandard solutions using specialized software. It is a well-known fact that maintenance and support of data mining for forecasting systems require specialized expertise rather than the typical skill sets in corporate IT. One potential solution to this problem is allocating the specialized system support within the developers group. Part of the responsibilities of the developers' group system administrator is to coordinate all activities with IT. While establishing the developers group, however, support from top IT management is needed to promote the necessary changes beyond the IT standards.
[3] A good comparison between SAS Enterprise Miner, IBM SPSS, and StatSoft STATISTICA Data Miner is given in the Handbook of Statistical Analysis & Data Mining Applications (Nisbet et al. 2009).
[4] http://www.sas.com/technologies/dw/etl/access