Читать книгу Data Cleaning - Ihab F. Ilyas - Страница 10

Оглавление

Introduction

Enterprises have been acquiring large amounts of data from a variety of sources in order to build large data repositories that power their applications, with the goal of enabling richer and more informed analytics. Data collection and acquisition often introduce errors in data, e.g., missing values, typos, mixed formats, replicated entries for the same real-world entity, and violations of business and data integrity rules. A survey about the state of data science and machine learning (ML) reveals that dirty data is the most common barrier faced by workers dealing with data.¹ With the popularity of data science, it has become increasingly evident that data curation, unification, preparation, and cleaning are key enablers in unleashing the value of data.² According to another survey of about 80 data scientists conducted by CrowdFlower and published in Forbes,³ data scientists spend more than 60% of their time in cleaning and organizing data, and 57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work. Not surprisingly, developing effective and efficient data cleaning solutions is challenging and is rife with deep theoretical and engineering problems.

Regardless of the type of data errors to be fixed, data cleaning activities usually consist of two phases: (1) error detection, where various errors and violations are identified and possibly validated by experts; and (2) error repair, where updates to the database are applied (or suggested to human experts) to bring the data to a cleaner state suitable for downstream applications and analytics. Error detection techniques can be either quantitative or qualitative. Specifically, quantitative error detection techniques often involve statistical methods to identify abnormal behaviors and errors [Hellerstein 2008] (e.g., “a salary that is three standard deviations away from the mean salary is an error”), and hence have been mostly studied in the context of outlier detection [Aggarwal 2013]. On the other hand, qualitative error detection techniques rely on descriptive approaches to specify patterns or constraints of a consistent data instance, and for that reason these techniques identify those data that violate such patterns or constraints as errors. For example, in a descriptive statement about a company HR database, “for two employees working at the same branch of the company, the senior employee cannot earn less salary than the junior employee,” if we find two employees with a violation of the rule, it is likely that there is an error in at least one of them.

Various surveys and books detail specific aspects of data quality and data cleaning. For example, Rahm and Do [2000] classify different types of errors occurring in an Extract-Transform-Load (ETL) process, and survey the tools available for cleaning data in an ETL process. Some work focuses on the effect of incompleteness data on query answering [Grahne 1991] and the use of a Chase procedure [Maier et al. 1979] for dealing with incomplete data [Greco et al. 2012]. Hellerstein [2008] focuses on cleaning quantitative numerical data using mainly statistical techniques. Bertossi [2011] provides complexity results for repairing inconsistent data and performing consistent query answering on inconsistent data. Fan and Geerts [2012] discuss the use of data quality rules in data consistency, data currency, and data completeness, and their interactions. Dasu and Johnson [2003] summarize how techniques in exploratory data mining can be integrated with data quality management. Ganti and Sarma [2013] focus on an operator-centric approach for developing a data cleaning solution, involving the development of customizable operators that can be used as building blocks for developing common solutions. Ilyas and Chu [2015] provide taxonomies and example algorithms for qualitative error detection and repairing techniques. Multiple surveys and tutorials have been published to summarize different definitions of outliers and the algorithms for detecting them [Hodge and Austin 2004, Chandola et al. 2009, Aggarwal 2013]. Data deduplication, a long-standing problem that has been studied for decades [Fellegi and Sunter 1969], has also been extensively surveyed [Koudas et al. 2006, Elmagarmid et al. 2007, Herzog et al. 2007, Dong and Naumann 2009, Naumann and Herschel 2010, Getoor and Machanavajjhala 2012].

This book, however, focuses on the end-to-end data cleaning process, describing various error detection and repair methods, and attempts to anchor these proposals with multiple taxonomies and views. Our goals are (1) to allow researchers and general readers to understand the scope of current techniques and highlight gaps and possible new directions of research; and (2) to give practitioners and system implementers a variety of choices and solutions for their data cleaning activities. In what follows, we give a brief overview of the book’s scope as well as a chapter outline.

Figure 1.1 A typical data cleaning workflow with an optional discovery step, error detection step, and error repair step.

1.1 Data Cleaning Workflow

Figure 1.1 shows a typical data cleaning workflow, consisting of an optional discovery and profiling step, an error detection step, and an error repair step. To clean a dirty dataset, we often need to model various aspects of this data, e.g., schema, patterns, probability distributions, and other metadata. One way to obtain such metadata is by consulting domain experts, typically a costly and time-consuming process. The discovery and profiling step is used to discover these metadata automatically. Given a dirty dataset and the associated metadata, the error detection step finds part of the data that does not conform to the metadata, and declares this subset to contain errors. The errors surfaced by the error detection step can be in various forms, such as outliers, violations, and duplicates. Finally, given the errors detected and the metadata that generate those errors, the error repair step produces data updates that are applied to the dirty dataset. Since there are many uncertainties in the data cleaning process, external sources such as knowledge bases and human experts are consulted whenever possible and feasible to ensure the accuracy of the cleaning workflow.

Example 1.1 Consider Table 1.1 containing employee records for a U.S. company. Every tuple specifies a person in a company with her id (GID), name (FN, LN), level (LVL), zip code (ZIP), state (ST), and salary (SAL). Suppose a domain expert supplies two data quality rules for this table. The first rule states that if two employees have the same zip code, they must be in the same state. The second rule states that among employees working in the same state, a senior employee cannot earn a smaller salary than a junior employee.

Table 1.1 An example employee table

Given these two data quality rules, the error detection step detects two violations. The first violation consists of four cells {t₁[ZIP], t₁[ST], t₃[ZIP], t₃[ST]}, which together violate the first data quality rule. The second violation consists of six cells {t₁[ROLE], t₁[ST], t₁[SAL], t₂[ROLE], t₂[ST], t₂[SAL]}, which together violate the second data quality rule. The data repair step takes the violations and produces an update that changes t₁[ST] from “NM” to “NY”, and the new data now has no violation with respect to the two rules.

1.2 Book Scope

The aforementioned data cleaning workflow describes a general purpose data cleaning process, but there are different data cleaning topics that address one or multiple steps in the workflow. We cover some of the most common and practical cleaning topics in this book: outlier detection, data deduplication, data transformation, rule-based data cleaning, ML guided cleaning, and human involved data cleaning. We briefly explain these topics in the following subsections; we also highlight the book structure in Section 1.2.7.

1.2.1 Outlier Detection

Outlier detection refers to detecting “outlying” values. While an exact definition of an outlier depends on the application, there are some commonly used definitions, such as “an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [Hawkins 1980] and “an outlier observation is one that appears to deviate markedly from other members of the sample in which it occurs” [Barnett and Lewis 1994]. For example, for a company whose employees’ salaries are around $100,000, an employee with a salary of $10,000 can be considered to be an outlier.

Applications of outlier detection include network intrusion detection, financial fraud detection, and abnormal medical condition detection. As a concrete example, imagine a company that is interested in improving road safety by making drivers more aware of their driving habits. To achieve this, data is collected from many hundreds of thousands of vehicles equipped with sensors connected with smart-phones. The collected data includes phone model, trip length, battery drain, and so on. Outlier detection and explanation engines such as Macrobase [Bailis et al. 2017] can be used to analyze the collected data. For example, Macrobase can find that some trips showed abnormally short lengths, common in smartphones with Apple iOS 9.0 beta 1. This reason was reported to the engineers, who discovered that a buggy Bluetooth stack was introduced in OS 9.0 beta 1, preventing iOS devices from connecting to in-car sensors.

Outlier detection faces two main challenges. First, defining what is a normal data pattern and what is an outlier can be difficult as different data and applications differ in what is considered normal. Many different detection techniques have been proposed to define normal behavior. Second, many outlier detection techniques lose their effectiveness when the number of dimensions (attributes) of the dataset is large; this effect is commonly known as the curse of dimensionality.

1.2.2 Data Deduplication

Duplicate records occur for many reasons. Data deduplication, also known as duplicate detection, record linkage, record matching, or entity resolution, refers to the process of identifying tuples in one or more relations that refer to the same real-world entity. For example, a customer might be recorded multiple times in a customer database if the customer used different names when purchasing; a single item might be represented multiple times in an online shopping site; and duplicate records might appear after a data integration project because that record had different representations in original data sources. A data deduplication process usually involves many steps and choices, including designing similarity metrics to evaluate the similarity for a pair of records, training classifiers to determine whether a pair of records are duplicates, clustering all records to obtain clusters of records that represent the same real-world entity, consolidating clusters of records to unique representations, designing blocking or distributed strategies to scale up the deduplication process, and involving humans to decide whether a record pair are duplicates when machines are uncertain.

To address this, many open-source and commercial data deduplication tools have been built, such as Magellan [Konda et al. 2016] and Data Tamer [Stonebraker et al. 2013] (later commercialized as Tamr⁴). Fortune 500 companies use these tools to make sense of their large procurement data. Large companies often have multiple business units, and each unit buys many parts and products from many suppliers. These units might be buying the same part and product from the same supplier without each other’s knowledge, and therefore cannot get the best pricing. Furthermore, the same part and product might be named differently across different units, which complicates the deduplication process. A great deal of money can saved by identifying multiple records from the same supplier.

Achieving good precision and recall at the same time is difficult in data deduplication—declaring all pairs are duplicates achieves perfect recall but poor precision while declaring no pairs as duplicates achieves perfect precision, but poor recall. The problem is especially challenging given the myriad of design choices in designing a data deduplication workflow. Furthermore, data deduplication is inherently a combinatorial task that has quadratic complexity. For example, when done naively, comparing all pairs of only 1000 records requires 499,500 comparisons. In addition, grouping records that refer to the same real-world entity can be even harder. For example, correlation clustering used for grouping tuples that represent the same real-world entity is an NP-hard problem [Elsner and Schudy 2009].

1.2.3 Data Transformation

Data transformation refers to the task of transforming data from one format to another, for example, transforming phone numbers to a standard format by adding “-” in between digits. Data transformations can be seen as error repair activities and are used at various stages of data analytics. For example, before running a data integration project, transformations are often used to standardize data formats, enforce standard patterns, or trim long strings. Transformations are also used at the end of the ETL process, for example, to merge clusters of duplicate records.

Transform-Data-by-Example is an Excel add-in that finds the desired transformation easily for a given transformation task.⁵ Only a few examples of the desired output are needed, and Transform-Data-by-Example automatically finds relevant data transformation functions from a large collection that it has already indexed. This collection is acquired from a variety of sources, including Github source codes, Stackoverflow code snippets, and .Net libraries. Users can also extend the collection to cover domain-specific transformation capabilities by providing their own transformation functions.

Often, no ground truth is available to train or create accurate transformation solutions, so there might be an infinite number of applicable transformation programs. Although the desired transformation might be clear to a human expert, running all of these possibilities is expensive. Thus, practical data transformation tools must effectively prune the space interactively to minimize cost.

1.2.4 Rule-Based Data Cleaning

Another common way to detect and rectify errors in databases is through the enforcement of data quality rules, often expressed as integrity constraints (ICs) [Chu et al. 2013b, Kolahi and Lakshmanan 2009, Fan and Geerts 2012]. An example data quality is “For two employees working at the same branch of a company, the senior employee cannot have less vacation time than the junior employee.” Given a dataset, data quality rules can either be manually designed by domain experts who understand the semantics of the data, or be automatically mined from the data [Chu et al. 2013a, Chiang and Miller 2008]. In this context, (qualitative) error detection is the process of identifying violations of ICs, namely, subsets of records that cannot co-exist, and error repair is the exercise of modifying the database, such that the violations are resolved and the new data conforms to the given ICs.

Rule-based data cleaning techniques using denial constraints [Chu et al. 2013a, Chu et al. 2013b] have been proven to be extremely valuable in detecting and repairing errors in real data in consultation with animal scientists at UC Berkeley studying the effects of firewood cutting on small terrestrial vertebrates (birds, amphibians, reptiles, and small mammals) [Tietje et al. 2018]. Each record in the dataset contains information about capturing an animal, including the time and the location of the capture, the tag ID of the animal captured, and the properties of the animal at the time of the capture, such as weight, species type, gender, and age group. The dataset was collected over the past 20 years. Since every capture was first recorded in a log book and then later transcribed into Excel sheets, there are missing values, typos, and transcribing errors in the dataset. A dozen discovered data quality rules, expressed in the form of denial constraints [Baudinet et al. 1999], are used to detect and repair data errors. This helped the animal data scientists to correct hundreds of errors that would otherwise be very difficult to spot by humans.

Deriving a comprehensive set of integrity constraints that accurately reflects an organization’s policies and domain semantics is a difficult task. Data is often scattered across silos; for example, different departments may keep their personnel records separate and may have different policies to determine an employee’s salary. Furthermore, data quality rules of varying expressiveness can be found in an organization, ranging from ad-hoc program scripts to simple sanity checks. Rather than consulting domain experts, techniques to automatically discover ICs can be used, which need to balance the trade-off between the expressiveness of the ICs and the efficiency of discovery. Given a set of defined ICs and their violations in a dirty data set, the number of possible ways to update the data to solve the violations is often too large for humans to examine exhaustively. Thus, data cleaning algorithms must be able to search through the huge space of possible updates efficiently to suggest the most plausible updates.

1.2.5 ML-Guided Cleaning

One can easily recognize, from the discussion so far, that ML is a natural tool to use in several tasks: classifying duplicate pairs in deduplication, estimating the most likely value for a missing value, predicting a transformation, or classifying values as normal or outliers. Indeed, several of the proposals that we discuss in this book use ML as a component in the proposed cleaning pipeline. We describe these in the corresponding chapters, and in Chapter 7 detail the use of ML for cleaning. However, with the rapid advancement in scaling inference and deploying large-scale ML models, new opportunities arise, namely, treating data cleaning as an ML task (mainly statistical inference), and dealing with all the aforementioned problems holistically. HoloClean [Rekatsinas et al. 2017a] is one of the first ML-based holistic cleaning frameworks, built on a probabilistic model of unclean databases [De Sa et al. 2019]. We discuss this direction in detail in Chapter 7.

1.2.6 Human-Involved Cleaning

It is usually impossible for machines to clean data errors with 100% confidence; therefore, human experts are often consulted in a data cleaning workflow to resolve various kinds of uncertainties whenever possible and feasible. For example, the metadata (e.g., keys and integrity constraints) generated by the discovery and profiling step need to be verified by humans before they can be used for detecting and repairing errors, since the metadata is discovered based on one instance of the data set and may not necessarily hold on any data instance of the same schema (e.g., future data). Another example of “judicious” human involvement is in the error detection step: to build a high-quality classifier to determine whether a pair of records are duplicates, we need enough training data, namely, labeled records (as duplicates or as non-duplicates). However, there are far more non-duplicate record pairs than duplicate record pairs in a data set; asking humans to label a random set of record pairs would lead to a set of unbalanced training examples. Therefore, humans need to consulted in an intelligent way to solicit training examples that are most beneficial to the classifier.

Humans are also needed in the error repair step: data updates generated by automatic data repair techniques are mostly based on heuristics and/or statistics, such as minimal repair distance or maximum likelihood estimation, and thus are not necessarily correct repairs. Therefore, they must be verified by humans before they can be applied to a data set. We highlight when and how humans need to be involved when we discuss various cleaning proposals in this book.

1.2.7 Structure of the Book

We present the details of current proposals addressing the topics discussed earlier in multiple chapters organized along two main axes, as shown in Figure 1.2: (1) task: detection vs. repair; and (2) approach: qualitative vs. quantitative. For example, outlier detection is a quantitative error detection task (Chapter 2); data deduplication is a qualitative data cleaning task (Chapter 3), which has both detection of duplicates and repair (also known as record consolidation); data transformation is considered a repair task (Chapter 4) that can be used to fix many types of errors, including outliers and duplicates; and rule-based data cleaning is a qualitative cleaning task (Chapter 6). In addition, a discovery step is usually necessary to obtain data quality rules necessary for rule-based cleaning (Chapter 5). Finally, we include a chapter that discusses the recent advances in ML and data cleaning (Chapter 7) spanning both dimensions.

Figure 1.2 Structure of the book organized along two dimensions: task and approach.

Due to the diverse challenges of data cleaning tasks, different cleaning solutions draw principles and tools from many fields, including statistics, formal logic systems such as first-order logic, distributed data processing, data mining, and ML. Although a basic familiarity with these areas should facilitate understanding of this book, we try to include necessary background and references whenever appropriate.

1. https://www.kaggle.com/surveys/2017

2. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

3. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/

4. https://www.tamr.com

5. https://www.microsoft.com/en-us/research/project/transform-data-by-example/

Подняться наверх