Data Management: a gentle introduction - Bas van Gils
Synopsis - Professionals in the field of DM/IT use a specific lingo. Unfortunately, the terminology is not as standardized as I would like. In this chapter, I will give an overview of the most important terms as they are used in this book.
■ 6.1 INTRODUCTION
Professionals in the DM/IT field have a reputation for being precise and consistent. Since this is the case, you would expect that the terminology used would also be highly standardized and precisely defined. A careful study of several International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) standards such as [ISO07, ISO12, ISO15] and comparison with the DMBOK and TOGAF [Hen17, The11] shows that this is far from the truth: the terminology is vaguely the same but precisely different.
In my view, this is a bad thing: how can we successfully engage people who are not so data-literate when we cannot agree on basic terminology? At the same time, I am very much aware that changing how people use language is far from easy. The purpose of this chapter is to introduce the terminology that is used in this book. The guiding principle is to align as much as possible with the aforementioned standards.
A small warning: defining terminology is both an art and a science. The text that follows is a little more academic in nature than the rest of this book.
■ 6.2 DATA CODIFIES WHAT WE KNOW ABOUT THE WORLD
In chapter 2, I briefly discussed the data/information dichotomy and claimed that, at least for purposes of this book, there is little difference between these two terms. In my view, data codifies what we know about the world in the form of text, numbers, graphs, images and so on¹. In more technical terms, this means that data can be either structured (which typically means it is in tabular form), unstructured (such as a random piece of text), or semi-structured (such as an e-mail, which consists of a header and main body).
Linguistically, the term data is the plural form of the term datum, which at least suggests that there is something such as a “basic building block for data”. I’ll call this a data point. I’ll use the term record to signify a (semi-)structured group of data points that belong together, similar to paper records in physical catalogs used prior to today’s digital data storage. Note that records typically consist of several standardized fields that are filled in with actual data points. The easiest way to understand what a field is, is to think of a record as a form with predefined fields that can be filled in. Last but not least, a group of records together forms a data set. Example 10 illustrates these definitions.
Example 10. Data, data point, record, field and data set
The diagram shows data that is stored in a system (outer box). The small inner boxes signify the records in this system. Each record has three fields: the name, birthday and birth city of a person. In this example there are six records in total, each having three data points matching the three fields that make up a typical record. The top row is grouped (dashed box): this signifies the data set with records about people that were born in Tilburg. Another potential data set would be: the group of all records for people born before 1960.
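For readers who prefer code to diagrams, the same definitions can be sketched in a few lines of Python. This is only an illustration: the field names and all values below are invented, since the text does not list the contents of the diagram. Each dictionary stands for one record, each value in it for one data point, and each filtered list for one data set.

```python
from datetime import date

# The three fields that every record in this sketch has (names are assumptions).
FIELDS = ("name", "birthday", "birth_city")

# Each dict is one record; each value inside it is one data point.
# All names and dates are invented for illustration.
records = [
    {"name": "Anna", "birthday": date(1955, 3, 14), "birth_city": "Tilburg"},
    {"name": "Bram", "birthday": date(1972, 8, 2), "birth_city": "Tilburg"},
    {"name": "Carla", "birthday": date(1948, 11, 30), "birth_city": "Tilburg"},
    {"name": "Daan", "birthday": date(1981, 1, 9), "birth_city": "Utrecht"},
    {"name": "Eva", "birthday": date(1959, 6, 21), "birth_city": "Breda"},
    {"name": "Finn", "birthday": date(1990, 12, 5), "birth_city": "Utrecht"},
]

# A data set is a group of records that belong together.
born_in_tilburg = [r for r in records if r["birth_city"] == "Tilburg"]
born_before_1960 = [r for r in records if r["birthday"] < date(1960, 1, 1)]

# Six records with three fields each.
data_point_count = sum(len(r) for r in records)
```

Note that the same record can belong to more than one data set: a data set is defined by the selection criterion, not by where the record is stored.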
■ 6.3 STORING DATA IN SYSTEMS
In the example, I use the word system. In this context, I use the term to signify a (digital) information system. In this section, I will introduce the terminology that is related to how data is stored in systems². Systems typically have one or more data stores: parts of the system that are concerned with storing data. Defining different areas for storing data can be useful for different reasons such as privacy and security (a data store with privacy-sensitive data requires more security measures)³ or performance (data stores that are critical to the performance of a process may have extra computing power assigned to them).
Data is stored in systems in various ways. By far the most common way to structure data in systems is through tables such that each row of the table maps to a record (see also section 4.5). More precisely put: the column headings of the table match the names of the fields in the record, and the intersections of rows and columns (the “cells” of the table) contain the individual data points. Example 11 builds on the previous example and illustrates these definitions.
The diagram uses a model fragment to show how tables in a data store are defined. Modeling is an important part of DM. Data models – as well as other types of models – are explained in more detail in chapter 11.
Example 11. Storing data in tables
The lower part of the diagram is taken from the previous example and shows three person-records. However, this time each record also has a unique ID. The top part of the diagram shows the definition of what a typical record looks like. It shows that each record has four fields and also shows the data type. Last but not least, it shows whether a field is automatically generated or not.
The example has two tables that are related through a dependency. These links between tables make it possible to answer questions such as “show me all orders where the customer was born before 1960”.
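A minimal sketch of such a pair of tables, using Python's built-in sqlite3 module, could look as follows. The table names, column names, data types and all rows are assumptions made for illustration; the diagram itself may define them differently. The sketch shows an automatically generated ID, a link from the order table to the person table, and the query quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Table definitions: four fields per person record, each with a data type.
# The id field is generated automatically by the database.
conn.executescript("""
CREATE TABLE person (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    name       TEXT NOT NULL,
    birthday   TEXT NOT NULL,  -- ISO 8601 date such as 1955-03-14
    birth_city TEXT NOT NULL
);
CREATE TABLE customer_order (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id INTEGER NOT NULL REFERENCES person(id),  -- link between the tables
    order_date  TEXT NOT NULL
);
""")

# Invented rows; the ids 1, 2 and 3 are assigned by the database.
conn.executemany(
    "INSERT INTO person (name, birthday, birth_city) VALUES (?, ?, ?)",
    [
        ("Anna", "1955-03-14", "Tilburg"),
        ("Bram", "1972-08-02", "Tilburg"),
        ("Carla", "1948-11-30", "Tilburg"),
    ],
)
conn.executemany(
    "INSERT INTO customer_order (customer_id, order_date) VALUES (?, ?)",
    [(1, "2021-05-01"), (2, "2021-05-03"), (3, "2021-06-12")],
)

# "Show me all orders where the customer was born before 1960."
orders_before_1960 = conn.execute("""
    SELECT o.id, p.name
    FROM customer_order AS o
    JOIN person AS p ON p.id = o.customer_id
    WHERE p.birthday < '1960-01-01'
    ORDER BY o.id
""").fetchall()
```

Storing the birthday as ISO 8601 text is a simplification that keeps plain string comparison consistent with date order; a production system would more likely use a dedicated date type.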
■ 6.4 DATA IN PROCESSES
The previous section discussed data from an IT perspective. In this section, I will switch gears and discuss data from a business (process) perspective. This is a major shift to another level of abstraction: rather than considering exactly how data is structured and stored in systems, this perspective is all about understanding which type of data is required to make processes run.
Every process has inputs and outputs which may be data or something physical. These inputs and outputs can be described using business concepts⁴. Business concepts are defined as “the things that business stakeholders talk about”. When talking about business concepts, you completely ignore how data is structured and stored in systems.
One of the things that is key for good data management is that these business concepts are clearly defined. This often leads to the creation of a (business) glossary. The glossary is discussed in further detail in chapters 10 and 28. By studying these definitions, it often becomes clear which business concepts are related. These relationships can be documented in a conceptual data model, which will be discussed in chapter 11 (see also section 4.4 on information/data analysis).
Example 12 illustrates the main points from this discussion.
Example 12. Data in processes
The diagram shows a single invoicing process which has an order as input and an invoice as output. These business concepts are related to each other, as well as to other business concepts. The solid arrows indicate these relationships. The labels on these relationships give an indication of how to interpret them.
■ 6.5 CONNECTING THE BUSINESS AND IT PERSPECTIVE
The questions that remain are: how are business concepts stored in systems? How are the business and IT perspectives connected? When database systems became popular in the 1970s, a technique was developed to analyze and “normalize” data structures in an effective manner: the relational model [Cod70, Cod79, Dat12] (see also section 4.5). Around the same time, various modeling approaches were developed to visualize what these data structures should look like. Chief among them was the Entity Relationship Model [Che76]. The main idea behind this type of modeling approach is to analyze how business concepts should be structured in such a way that they can efficiently be stored in database systems. This level of analysis straddles the business and IT perspectives. Models at this level of abstraction are often called logical data models, something which will be discussed in more detail in chapter 11.
What is relevant for purposes of this chapter is that business concepts and their relationships are transformed into a logical structure of data elements, which can be either entities or attributes of these entities. As with business concepts, entities can also be connected through relationships (hence the name Entity Relationship Diagram (ERD) that is frequently used). Example 13 explains this further.
Example 13. Data elements
The diagram shows four entities, each with several attributes. Moreover, the entities are related and there is a verbalization attached to each relationship. Compare this diagram, which lists data elements, to the diagram in example 12, which lists business concepts. The diagram with business concepts lists the things that the business talks about. Apparently, order line is not something business stakeholders talk about, or else it would have shown up as a business concept. However, in order to store data in the system in an effective manner, the order line is needed as it stores the combination of products and required quantity for a specific order.
This small example, of course, doesn’t show all the intricacies of going from the level of business concepts to the level of data elements. The purpose of the example is only to show that the relationship between business concepts and data elements is complicated at best⁵. Mapping business concepts to data elements is only one part of the analysis, though. The second part consists of mapping the data elements to tables and columns. This is a far more straightforward process: typically, entities map to tables and attributes map to columns⁶.
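The second, more mechanical mapping can be sketched in Python as follows. The `Entity` and `Attribute` classes and the `to_create_table` helper are hypothetical names introduced for this illustration, and the attributes chosen for the order line are assumptions based on the description in example 13. The sketch covers only the typical case in which one entity becomes one table and each attribute becomes one column.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: str  # SQL type used for the resulting column


@dataclass(frozen=True)
class Entity:
    name: str
    attributes: tuple  # tuple of Attribute objects


def to_create_table(entity: Entity) -> str:
    """Map one entity to a table and each of its attributes to a column."""
    columns = ", ".join(f"{a.name} {a.data_type}" for a in entity.attributes)
    return f"CREATE TABLE {entity.name} ({columns})"


# The order line entity: which product, in what quantity, for which order.
# Attribute names are assumptions made for this sketch.
order_line = Entity(
    name="order_line",
    attributes=(
        Attribute("order_id", "INTEGER"),
        Attribute("product_id", "INTEGER"),
        Attribute("quantity", "INTEGER"),
    ),
)

ddl = to_create_table(order_line)

# Create the table in an in-memory database and read back its columns.
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
column_names = [row[1] for row in conn.execute("PRAGMA table_info(order_line)")]
```

Keys, constraints and technology-specific exceptions are deliberately left out; a real mapping would have to take the underlying database technology into account.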
■ 6.6 OUTLOOK
The goal of this chapter was to discuss base terminology in the field of data management. Important terms are business concept, data element, entity, attribute, table, column, field, and record. In addition to introducing important terminology, this chapter expanded on definitions with examples and created links to other chapters. By doing so, this chapter provides a basis for a consistent and complete framework for data management that can be used in practice.
■ 6.7 VISUAL SUMMARY
1 The classic works on information theory such as [Sha48] provide more insight into the use of the word codifies.
2 For the tech-savvy readers: in this chapter, I will mainly focus on data that is stored in relational databases. The terminology mostly fits with other structures (e.g. NoSQL [RW12]) as well.
3 A more extensive discussion of data security can be found in chapter 17.
4 Many good words are being used in literature, such as “business term”, “business object” and “business concept”. I went with the latter because this makes it easy to align with the notion of conceptual data models that is introduced in chapter 11.
5 If you are interested in this process, look up a good reference work on normalization in database systems such as [Dat04].
6 There are exceptions to the rule and the underlying database technology should be taken into account. This is, however, beyond the scope of this discussion.