Data Deduplication
In this chapter, we discuss a specific data cleaning task, namely, data deduplication. Duplicate records can occur for many reasons. For example, a customer might be recorded multiple times in a customer database if the customer used different names at the time of purchase; a single item might be represented multiple times in an online shopping site; or a record might appear multiple times after a data integration project because that record had different representations in the original data sources. Data deduplication, also known as duplicate detection, record linkage, record matching, or entity resolution, refers to the process of identifying tuples in one or more relations that refer to the same real-world entity. It is often followed by an entity consolidation or fusion process to find the unified representation for duplicate records that best represents the real-world entity.
Example 3.1 Figure 3.1 illustrates a simple example of data deduplication. The similarities between pairs of records are computed and shown in the similarity graph (upper-right graph in Figure 3.1). A missing edge between two records indicates that they are non-duplicates. An edge between two entities is labeled with a similarity score, e.g., the similarity score between P1 and P2 in the figure is 0.9. We assume in this example that the score is between 0 and 1, and that the higher the score between two entities, the more similar those two entities are.
Records are then clustered together based on the similarity graph. Suppose the user sets the threshold to 0.5, i.e., any pair of records with similarity greater than 0.5 is considered a duplicate pair. Although P1 and P5 have a similarity score less than 0.5, they are clustered together due to transitivity; that is, each of them is considered a duplicate of P2.
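To make the clustering step concrete, the following minimal sketch groups records by computing connected components over the similarity graph, keeping only edges whose score exceeds the threshold. The record identifiers and pair scores are illustrative assumptions modeled on Figure 3.1, not actual data from the book; note that P1 and P5 land in the same cluster through P2 even though their own score is below 0.5.

```python
# Minimal sketch: threshold the similarity graph, then cluster records by
# transitive closure (connected components) using a simple union-find.
# Record ids and scores below are hypothetical, loosely following Figure 3.1.

def cluster_by_threshold(records, pair_scores, threshold=0.5):
    """Two records end up in the same cluster if they are connected by a
    chain of edges whose similarity score exceeds the threshold."""
    parent = {r: r for r in records}  # union-find forest

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path compression
            r = parent[r]
        return r

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for (a, b), score in pair_scores.items():
        if score > threshold:          # keep only edges above the threshold
            union(a, b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())

records = ["P1", "P2", "P3", "P4", "P5"]
pair_scores = {("P1", "P2"): 0.9, ("P2", "P5"): 0.7, ("P1", "P5"): 0.4,
               ("P3", "P4"): 0.8}
print(cluster_by_threshold(records, pair_scores))
# [['P1', 'P2', 'P5'], ['P3', 'P4']] -- P1 and P5 join via transitivity through P2
```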
Finally, all records in the same cluster are consolidated into one record in the final clean relation. Many different strategies can be used to consolidate multiple records. For instance, in consolidating P1, P2, and P5 into one record C1, majority voting is used to obtain the name and the ZIP attributes, and numerical averaging is used to obtain the income attribute.
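A minimal sketch of this consolidation step is shown below, assuming each record is a dictionary and that majority voting is applied to categorical attributes (name, ZIP) while averaging is applied to the numeric income attribute. The attribute names and sample values are illustrative, not the book's actual example data.

```python
# Consolidate a cluster of duplicate records into a single record:
# majority vote for categorical attributes, numerical average for numeric ones.
from collections import Counter

def consolidate(cluster, categorical=("name", "zip"), numeric=("income",)):
    merged = {}
    for attr in categorical:
        values = [r[attr] for r in cluster if r.get(attr) is not None]
        merged[attr] = Counter(values).most_common(1)[0][0]  # majority voting
    for attr in numeric:
        values = [r[attr] for r in cluster if r.get(attr) is not None]
        merged[attr] = sum(values) / len(values)             # numerical averaging
    return merged

cluster = [
    {"name": "John Doe", "zip": "53703", "income": 50000},
    {"name": "John Doe", "zip": "53703", "income": 54000},
    {"name": "J. Doe",   "zip": "53703", "income": 52000},
]
print(consolidate(cluster))
# {'name': 'John Doe', 'zip': '53703', 'income': 52000.0}
```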
Figure 3.1 A typical deduplication task.
The simple example above uses a single similarity function to produce a score. Oftentimes, multiple similarity functions are used to produce a similarity score between a pair of entities; machine learning classifiers or other aggregation methods are then used to combine these similarity functions into one score, as shown in Section 3.2.
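As a rough illustration of combining multiple similarity functions, the sketch below computes two simple string similarities per attribute (a character-based ratio from Python's difflib and a token-based Jaccard) and combines them with fixed, hand-picked weights. The weights, attributes, and records are assumptions for illustration; Section 3.2 discusses replacing such fixed weights with a trained classifier.

```python
# Combine several per-attribute similarity functions into one pair score.
from difflib import SequenceMatcher

def char_sim(a, b):
    # character-based similarity in [0, 1]
    return SequenceMatcher(None, a, b).ratio()

def token_jaccard(a, b):
    # token-based similarity: overlap of word sets
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def pair_score(r1, r2, attrs=("name", "address"), weights=(0.6, 0.4)):
    features = []
    for attr in attrs:
        features.append(0.5 * char_sim(r1[attr], r2[attr])
                        + 0.5 * token_jaccard(r1[attr], r2[attr]))
    # fixed weighted sum; a classifier would learn this combination instead
    return sum(w * f for w, f in zip(weights, features))

r1 = {"name": "John A. Doe", "address": "120 Main St Madison"}
r2 = {"name": "Jon Doe",     "address": "120 Main Street, Madison"}
print(round(pair_score(r1, r2), 3))
```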
The topic has been extensively covered in many surveys [Koudas et al. 2006, Elmagarmid et al. 2007, Herzog et al. 2007, Dong and Naumann 2009, Naumann and Herschel 2010, Getoor and Machanavajjhala 2012]: some of the surveys provide an extensive overview of all the steps involved in data deduplication [Elmagarmid et al. 2007, Herzog et al. 2007, Getoor and Machanavajjhala 2012]; some focus on the design of similarity metrics [Koudas et al. 2006, Naumann and Herschel 2010]; some discuss the efficiency aspect of data deduplication [Naumann and Herschel 2010]; and some focus on how to consolidate multiple records [Dong and Naumann 2009].
We cover the various aspects of designing a data deduplication workflow and the different choices available in each aspect. Section 3.1 describes multiple similarity metrics and classifies them into three categories: character-based similarity metrics, token-based similarity metrics, and phonetics-based similarity metrics. Section 3.2 discusses different classifiers used to predict whether a pair of entities are duplicates, including Bayes classifiers and active learning approaches. Section 3.3 surveys different clustering algorithms used to group identified pairs into clusters that represent the same real-world entities as shown in the previous example. Section 3.4 discusses different optimization strategies to reduce the number of comparisons by avoiding comparing record pairs that are unlikely to be matches. Section 3.5 presents different distribution strategies to scale out the data deduplication process in a distributed shared-nothing environment. Section 3.6 provides a classification of different fusion strategies to consolidate multiple records that refer to the same entity into one representation. Section 3.7 explores how humans can be involved in the deduplication system/workflow. Finally, Section 3.8 highlights some open-source and commercial tools for data deduplication.
3.1 Similarity Metrics
Measuring the similarity between two values in the same column is an essential component in determining whether two records are duplicates, and a variety of similarity metrics have been proposed to handle different types of errors and different data types (e.g., numerical data and string data). Obviously, the similarity between two numerical values should depend directly on the distance between those two values; for example, the similarity between two numerical values v1 and v2 in attribute A can be defined as NumSim(v1, v2) = 1 - |v1 - v2| / (max(A) - min(A)), where max(A) and min(A) denote the maximum value and the minimum value in A, respectively. In contrast to numerical values, there are many choices for defining the similarity between two string values. In what follows, we describe three categories of similarity metrics that deal with three different common errors, namely, character-based similarity metrics, token-based similarity metrics, and phonetics-based similarity metrics.
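A direct transcription of this numerical similarity is sketched below: the similarity of v1 and v2 is one minus their distance, normalized by the value range of attribute A. The sample column values are made up for illustration.

```python
# Numerical similarity: 1 - |v1 - v2| / (max(A) - min(A))
def num_sim(v1, v2, column):
    value_range = max(column) - min(column)
    if value_range == 0:
        return 1.0                       # all values in A are equal
    return 1 - abs(v1 - v2) / value_range

incomes = [30000, 45000, 52000, 80000, 120000]   # hypothetical attribute A
print(num_sim(45000, 52000, incomes))            # ~0.922
```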
3.1.1 Character-based