Data Management: a gentle introduction - Bas van Gils


Synopsis - In this chapter, I will give a high-level overview of the distinction between five different types of data: transaction data, master data, business intelligence data, reference data, and metadata. For each, I will also provide links to other chapters.

■ 8.1 CLASSIFYING DATA

Most organizations have large amounts of data. This is a well-known fact and one of the reasons why DM is such an important topic. What’s more, they typically also have many different types of data. Classifying data can be useful for different purposes. For example, it may help to decide on the approach to DM, or to decide what type of media it should be stored on. Many different classification schemes have been proposed. This is illustrated in example 16.

Example 16. Data classification

Data can be classified to indicate the type of use: descriptive data (describe a state of affairs in the real world), diagnostic data (show how well something, e.g. a process, is functioning), predictive data (make predictions about a future state of affairs), or prescriptive data (define parameters to ensure that a certain process or system performs as desired).

Another way to classify data is to consider what it describes, e.g. geographic data (what a specific area looks like), weather data (past/present/future weather for a specific area), and people data (such as names, addresses, and relationships to other people).

While useful, these types of classifications are not the main topic of this chapter. Instead, I will look a level deeper and consider five related types of data. I already hinted at these in table 7.1 where I gave an overview of the DM topics that I will discuss in this book.

■ 8.2 FIVE FUNDAMENTALLY DIFFERENT TYPES OF DATA

In this section, I will give a high-level overview of five fundamentally different data types and indicate in which chapter I will discuss these further. The point is not so much to give an extensive discussion here but to make the reader aware that there are different types of data before launching into detailed discussions about governance, architecture, etc. in future chapters. Figure 8.1 outlines the five types of data.

Figure 8.1 Five types of data

■ 8.3 TRANSACTION DATA

The first type of data is transaction data. This type of data usually describes some event that took place in the real world, such as a purchase or the payment of an invoice. Assuming business goes well, you will typically create many records of this type every day: one every time someone makes a purchase or payment, for example. Also note that these records tend to be highly structured, and that you want to keep track of all of them so that you can later analyze how business is really going.

■ 8.4 MASTER DATA

The second type of data is master data. To understand what this is about, consider a situation where you have half a dozen systems where you store data about your customers. One of your customers calls with a complaint. In which system are you going to look to find out what is going on? Even more, how are you going to deal with the situation where systems are in disagreement (one system says this customer has his office in Amsterdam, whereas the other claims it is in Rotterdam)?

To tackle challenges of this type, organizations typically want to establish a “golden record” or “single version of the truth” that shows what the organization believes to be true. There are many ways to implement this, as we will see in chapter 15. Doing so is both complex and costly, so organizations typically only do this for their most important business concepts, such as Party/Customer and Product. Typically, this type of data does not change all that often (ask yourself this: how often do people move or change their name? How often do you introduce new products or retire old ones?). Example 17 shows that transaction data may also contain (references to) master data objects.

Example 17. Master data & transaction data

Suppose that you have just sold a product called Cool8 to a customer whose name is John Doe. The record of this transaction will show such things as a time stamp, the actual store where the purchase was made, which employee was involved and so on.

From a master data perspective, two business concepts are of interest: the customer and the product. This customer may have made previous purchases at this store, or perhaps at other stores. If this customer purchases a lot of our Cool8 product then this may be useful to know. If this customer used to purchase Cool7 and has now switched to Cool8 then it may also be useful to find out why and what that implies for future sales.

Now, suppose that John Doe did, in fact, make purchases at various stores but under different names (John Doe, John H. Doe, John Howard Doe). Can we reconcile this? Can we figure out with any degree of certainty who is who and which products were purchased when Mr. Doe calls with a complaint?
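The reconciliation question in example 17 can be made concrete with a minimal Python sketch. The record layouts, the identifier C001, the store names, the dates, and above all the naive matching rule (compare first and last name only) are assumptions made for illustration; they are not a description of how master data management is actually implemented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Customer:
    """Master data: one golden record per customer (hypothetical layout)."""
    customer_id: str
    name: str


@dataclass(frozen=True)
class Transaction:
    """Transaction data: one record per purchase event (hypothetical layout)."""
    timestamp: datetime
    store: str
    customer_name: str  # the name as it was captured at the point of sale
    product: str


def match_key(name: str) -> tuple[str, str]:
    """Naive matching rule (an assumption): use first and last name only."""
    parts = name.replace(".", "").lower().split()
    return parts[0], parts[-1]


def purchases_of(customer: Customer, transactions: list[Transaction]) -> list[Transaction]:
    """Return the transactions whose captured name matches the golden record."""
    key = match_key(customer.name)
    return [t for t in transactions if match_key(t.customer_name) == key]


golden = Customer("C001", "John Howard Doe")
transactions = [
    Transaction(datetime(2020, 1, 5, 10, 30), "Amsterdam", "John Doe", "Cool7"),
    Transaction(datetime(2020, 3, 9, 14, 0), "Rotterdam", "John H. Doe", "Cool8"),
    Transaction(datetime(2020, 3, 9, 15, 0), "Rotterdam", "Jane Doe", "Cool8"),
]
products = [t.product for t in purchases_of(golden, transactions)]
```

A rule this simple would also merge two different people who happen to share a first and last name, which is exactly why the degree of certainty matters.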

■ 8.5 BUSINESS INTELLIGENCE DATA

Most transaction systems only hold the latest version of data. This means that when a customer moves from A to B, the fact that he used to live at A is often lost. Transaction systems typically also store data at the finest level of granularity. For (historic) reporting and (predictive) analytics this may not always be the best solution. This is where the category of business intelligence (BI) data comes into play. The idea is to create a data set that combines transaction data and master data. This data set contains historic data for timeline analysis and is structured in such a way that data can be easily aggregated and summarized for reporting and analysis purposes. Chapter 18 will discuss BI in more detail. Example 18 illustrates this point.
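The difference between holding only the latest version and keeping history can be sketched in a few lines of Python. The customer identifier, the dates, and the two in-memory stores are hypothetical; real BI environments use far richer structures.

```python
from datetime import date
from typing import Optional

# Transaction system (assumed): only the latest address per customer is kept.
current_address = {"C001": "A"}

# BI data set (assumed): each address carries a start date, so history survives.
address_history = [("C001", "A", date(2019, 1, 1))]


def move(customer_id: str, new_address: str, moved_on: date) -> None:
    """Record a move in both stores."""
    current_address[customer_id] = new_address                    # overwrites the old value
    address_history.append((customer_id, new_address, moved_on))  # keeps the old value


def address_on(customer_id: str, day: date) -> Optional[str]:
    """Look up where a customer lived on a given day, using the history."""
    known = [(start, address) for cid, address, start in address_history
             if cid == customer_id and start <= day]
    return max(known)[1] if known else None


move("C001", "B", date(2020, 6, 1))
```

After the move, `current_address` only knows about B, whereas `address_on` can still answer where the customer lived before 1 June 2020.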

Example 18. BI data

Suppose your company has a number of product lines: the CoolX line of products as well as several others. What's more, the company also offers various services to customers. Separate systems keep track of all purchases, service requests, payments and so on.

From a reporting perspective, management may be interested in questions such as: how many products of a certain type did we sell per store and how does that deviate from previous quarters? Do we retain our customers when they move? A similar line of reasoning applies to analytics questions such as: what would be a good service to cross-sell with our CoolX product to a specific group of customers? To be able to answer these questions, data must be consolidated. Often this means looking at the data from a historic perspective. Even more, individual records are less important in this situation than the patterns that are present in the data.

Whether this type of data is updated frequently depends on the architecture of your information systems landscape. In some cases, updates in data from transaction systems and master data systems are pushed to the BI environment once or twice a day. In other situations, this is done in (near) real-time.
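As a rough illustration of the reporting question "how many products of a certain type did we sell per store and how does that deviate from previous quarters?", the following Python sketch aggregates consolidated sales records. The records, store names, and dates are invented for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical sales records consolidated from several transaction systems:
# (date of sale, store, product)
sales = [
    (date(2020, 2, 3), "Amsterdam", "Cool8"),
    (date(2020, 2, 17), "Amsterdam", "Cool8"),
    (date(2020, 3, 1), "Rotterdam", "Cool8"),
    (date(2020, 5, 12), "Amsterdam", "Cool8"),
]


def quarter(d: date) -> str:
    """Label a date with its calendar quarter, e.g. 2020-Q1."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


# Aggregate: units sold per (store, product, quarter).
units = Counter((store, product, quarter(d)) for d, store, product in sales)

# Compare one quarter with the previous one for a given store and product.
change = (units[("Amsterdam", "Cool8", "2020-Q2")]
          - units[("Amsterdam", "Cool8", "2020-Q1")])
```

Note that the individual sales records no longer matter once the counts exist; only the pattern per store and quarter is used.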

■ 8.6 REFERENCE DATA

The fourth type of data is reference data, which is perhaps the most elusive of all. Reference data is used to make sense of other data, often through codes or hierarchies of codes. The idea is that by using a code, you give a very precise meaning to something that is potentially very complex. Example 19 gives two simple examples.

Example 19. Reference data

The simplest examples are look-up lists such as zip codes in the US, or a list of all valid country names. By comparing the zip code or country that a customer gives us against such a list, we can immediately assess whether the data he provides is at least valid^a.

As a more complex example, consider the use of industry classification codes to label organizations you do business with. For example, code 440000 covers all retail traders; 445000 is a child of 440000 and is the code for food and beverage stores. Code 445200 is a child of 445000 and signifies specialty food stores, with sub-codes such as 445210 (meat markets), 445220 (fish and seafood markets), and 445290 (other specialty stores). Using such codes consistently allows us to easily find all specialty food stores by looking for all stores that are labelled 445200 or one of its sub-codes.

^a The issue of data quality dimensions such as validity (is a value allowed according to some criteria) versus correctness (is it a true representation of the real world) is part of the discussion in chapter 16.

Reference data may seem like a really simple and straightforward concept, yet in practice this is hardly the case. In chapter 14, I will discuss the relevant theory in more detail. Also note that reference data tends to be static. Using reference data in real-world situations will be discussed in more detail in the examples in part II of this book.
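Both uses of reference data from example 19, checking validity against a look-up list and selecting everything under a code in a hierarchy, can be sketched as follows. The codes and their parent relationships are taken from the example; the organization names and the dictionary-based representation are assumptions for illustration.

```python
# Reference data: each industry code mapped to its parent code (from example 19).
PARENT = {
    "440000": None,      # retail traders
    "445000": "440000",  # food and beverage stores
    "445200": "445000",  # specialty food stores
    "445210": "445200",  # meat markets
    "445220": "445200",  # fish and seafood markets
    "445290": "445200",  # other specialty stores
}


def is_valid(code: str) -> bool:
    """A code is valid if it appears in the reference list."""
    return code in PARENT


def falls_under(code: str, ancestor: str) -> bool:
    """True if code equals ancestor or is one of its direct or indirect sub-codes."""
    current = code
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)  # unknown codes end the walk
    return False


# Hypothetical organizations, each labelled with an industry code.
organizations = {"Butcher Bob": "445210", "Fresh Fish": "445220", "SuperMart": "445000"}

# All specialty food stores: code 445200 or one of its sub-codes.
specialty = sorted(name for name, code in organizations.items()
                   if falls_under(code, "445200"))
```

A valid code is not necessarily the correct one: `is_valid` would happily accept 445220 for a butcher.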

■ 8.7 METADATA

The fifth and last type of data that I will discuss is metadata. Loosely defined, metadata is “data about data”. Anything you can know about your data is metadata. Through metadata you can answer questions such as: what is the definition of “customer”? In which processes do we create customer data? How does customer data flow through our information systems? The list goes on and on. As an organization you can (and perhaps should) collect metadata about all other types of data. Having a good set of metadata available is foundational for managing and governing your data. Metadata is discussed in more detail in chapter 10.
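To make "data about data" slightly more tangible, here is a minimal Python sketch of a metadata entry that answers the three questions above for the concept "customer". The field names, the definition text, and the process and system names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEntry:
    """Metadata about one business concept (all field names are assumptions)."""
    concept: str
    definition: str
    created_in: list[str] = field(default_factory=list)         # processes that create the data
    flows: list[tuple[str, str]] = field(default_factory=list)  # (source system, target system)


customer = MetadataEntry(
    concept="customer",
    definition="A party that has purchased at least one product or service.",
    created_in=["Onboarding", "Web shop registration"],
    flows=[("CRM", "Billing"), ("CRM", "BI environment")],
)


def systems_holding(entry: MetadataEntry) -> set[str]:
    """All systems that the concept's data passes through."""
    return {system for flow in entry.flows for system in flow}
```

Calling `systems_holding(customer)` lists every system that would need attention if the definition of "customer" were to change.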

■ 8.8 VISUAL SUMMARY
