Chapter 3
Designing the Infrastructure for Scorecard Development
As more banks around the world realize the value of analytics and credit scoring, we see a corresponding high level of interest in setting up analytics and modeling disciplines in-house. This is where some planning and long-term vision are needed. Many banks hired well-qualified modelers and bought high-powered data mining software, thinking that their staff would soon be churning out models at a regular clip. For many of them, this did not materialize. Models and analytics took just as long to produce as before, or took significantly longer than expected. The problem was not that their staff didn't know how to build models, or that the model fitting was taking too long. It was the fact that the actual modeling is the easiest and sometimes the fastest part of the entire data mining process. The major problems, which were not addressed, lay in all the other activities before and after the modeling. Problems with accessing data, data cleansing, getting business buy-in, model validation, documentation, producing audit reports, implementation, and other operational issues made the entire process slow and difficult.
In this chapter, we look at the most common problems organizations face when setting up infrastructure for analytics and suggest ways to reduce problems through better design.
The discussion in this chapter will be limited to the tasks involved in building, using, and monitoring scorecards. Exhibit 3.1 is a simplified example of the end-to-end tasks that would take place during scorecard development projects. The tasks shown are less exhaustive than those covered in the rest of the book; they serve only to illustrate points associated with creating an infrastructure to facilitate the entire process.
Exhibit 3.1 Major Tasks during Scorecard Development
Based on the most common problems lending institutions face when building scorecards, we suggest considering the following main issues when designing an architecture to enable analytics:
● One version of the truth. Two people asking the same question, or repeating the same exercise, should get the same answer. One way to achieve this is by sharing and reusing, for example, data sources, data extraction logic, conditions such as filters and segmentation logic, models, parameters, and variables, including the logic for derived ones (see the sketch after this list).
● Transparency and audit. Given the low level of regulatory tolerance for black-box models and processes, everything from the creation of data to the analytics, deployment, and reporting should be transparent. Anyone who needs to see details on each phase of the development process should be able to do so easily. For example, how data is transformed to create aggregated and derived variables, the parameters chosen for model fitting, how variables entered the model, validation details, and other parameters should preferably be stored in graphical user interface (GUI) format for review. Although all of the above can be done through coding, auditing code is somewhat more complex. In addition, one should also be able to produce an unbroken audit chain across all the tasks shown in Exhibit 3.1: from the point where data is created in source systems, through all the data transformations and analytics, to scoring and the production of validation reports as well as regulatory capital and other calculations. Model documentation should include details on the methods used and also provide effective challenge around the choice of those methods as well as the final scorecard. That means covering not just the final scorecard but also the scorecards that were tested and rejected, as well as the methods that competed with the one used to produce it.
● Retention of corporate intellectual property (IP)/knowledge. Practices such as writing unique code for each project and keeping it on individual PCs make it harder to retain IP when key staff leave. Using programming-based modeling tools makes it more difficult to retain this IP, as departing staff take their coding skills with them. Most modelers/coders also choose to rewrite code rather than sort through partial code previously written by someone else. This results in delays, and often ends with different answers to the same question. To counter this loss and to introduce standardization, many banks have shifted to GUI-based software.
● Integration across the model development tasks. Integration across the continuum of activities shown in Exhibit 3.1, from data set creation to validation, means that the output of each phase seamlessly gets used in the next. Practices such as rewriting Extract-Transform-Load (ETL) and scoring code, as well as the code for deriving and creating variables, into different languages are not efficient, as they lengthen the production cycle. They also present model risk, as recoding into a different language may alter the interpretation of the original variable or condition. This applies to parameters and conditions for both data sets and models. An integrated infrastructure for analytics also means lowered implementation risk, as all the components across the continuum will likely work together. This is in addition to the integration and involvement of the various stakeholders/personas discussed in the previous chapter.
● Faster time to results. In many institutions it can take months to build a model and implement it, resulting in the use of inefficient or unstable models for longer than necessary. Efficient infrastructure design can make this process much faster through integrated components, faster learning cycles for users, and the reduction of repetition (such as recoding).
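To make the "one version of the truth" and recoding-risk points above concrete, below is a minimal sketch of one way derived-variable logic could be kept in a single shared registry, so that development, scoring, and validation all reuse one definition instead of rewriting it. This is an illustration only, assuming a Python/pandas environment; the function, registry, and column names (derive_debt_to_income, VARIABLE_REGISTRY, monthly_debt, monthly_income) are hypothetical and not from the book.

```python
# A minimal sketch, assuming a Python analytics stack with pandas.
# All names here are hypothetical and chosen for illustration only.
import pandas as pd

# Single shared definition of a derived variable. Development, scoring,
# and validation all import this one function instead of recoding it.
def derive_debt_to_income(df: pd.DataFrame) -> pd.Series:
    """Monthly debt payments divided by gross monthly income."""
    return df["monthly_debt"] / df["monthly_income"]

# A registry entry keeps the logic auditable: who owns it and which
# version is in use, supporting the audit chain discussed above.
VARIABLE_REGISTRY = {
    "debt_to_income": {
        "function": derive_debt_to_income,
        "owner": "risk_analytics",
        "version": "1.0",
    },
}

def apply_derived_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Apply every registered derivation so all users get the same answer."""
    out = df.copy()
    for name, entry in VARIABLE_REGISTRY.items():
        out[name] = entry["function"](out)
    return out
```

Because each derivation exists in exactly one place, two analysts asking the same question get the same answer, and a reviewer can trace which version of the logic produced a given score.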
In discussing the points to consider when designing an architecture/infrastructure to enable in-house scorecard development and analytics, we will work through the major tasks associated with performing analytics in any organization.
Data Gathering and Organization
This critical phase involves collecting and collating data from disparate (original) data sources and organizing them. This includes merging and matching of records for different products, channels, and systems.
The result of this effort is a clean, reliable data source that ideally has one complete record for each customer and includes all application and performance data for all products owned. This would mean that a customer's data from their mortgage, credit card, auto loan, savings and checking accounts, and ATM usage would all be in the same place. Later in the book, we will refer to using such variables in scorecard development. In some banks this is known as the enterprise data warehouse (EDW).
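As a concrete illustration of the merging and matching described above, the sketch below joins hypothetical product-level tables into a single customer-level record. It assumes a Python/pandas environment and a shared customer_id key; all table and column names are invented for illustration.

```python
# A minimal sketch of collating product-level records into one
# customer-level view; table and column names are hypothetical.
import pandas as pd

# Hypothetical product tables, each keyed on a shared customer_id.
mortgage = pd.DataFrame({"customer_id": [1, 2], "mortgage_balance": [250000, 180000]})
cards = pd.DataFrame({"customer_id": [1, 3], "card_utilization": [0.45, 0.72]})
checking = pd.DataFrame({"customer_id": [1, 2, 3], "avg_balance": [3200, 1500, 800]})

# Outer joins keep customers who hold only some of the products;
# missing products simply appear as NaN for later data cleansing.
customer_view = (
    checking
    .merge(mortgage, on="customer_id", how="outer")
    .merge(cards, on="customer_id", how="outer")
)
print(customer_view)
```

In practice the matching keys are rarely this clean; much of the effort in this phase goes into resolving records that lack a common identifier.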
In any analytics infrastructure project, this is typically the most difficult and lengthy phase. Organizations have dirty data; disparate data spread across dozens and sometimes hundreds of databases with no matching keys; incomplete and missing data; and, in some cases, coded data that cannot be interpreted. But this is the most important phase of the project, and fixing it has the biggest long-term positive impact on the whole process. Without clean, trusted data, everything else happening downstream is less valuable. We recognize, however, that waiting for perfectly matched, clean data for all products before starting scorecard development, especially in large banks with many legacy systems, is not realistic. There is a reason the EDW is known as the "endless data warehouse" in far too many places. In order to get "quick hits," organizations often take silo approaches: they fill the data warehouse with information on one product, build and deploy scorecards for that product, and then move on to the next set of products in a sequential manner. This shows some benefit from the data warehouse in the short term and is a better approach than waiting for all data for all products to be loaded in.