
CHAPTER 3

Anonymization Methods for Microdata

As discussed in Section 2.5, the protected data set Y is generated either by masking the original data set X or by building it from scratch based on a model of the original data. Microdata masking techniques were further classified into perturbative masking (which distorts the original data and leads to the publication of non-truthful data) and non-perturbative masking (which reduces the amount of information, either by suppressing some of the data or by reducing the level of detail, but preserves truthfulness). This chapter classifies and reviews some well-known SDC techniques. These techniques are not only useful on their own, but they also constitute the basis for enforcing the privacy guarantees required by privacy models.

3.1 NON-PERTURBATIVE MASKING METHODS

Non-perturbative methods do not alter data; rather, they produce partial suppressions or reductions of detail in the original data set.

Sampling

Instead of publishing the original microdata file X, what is published is a sample S of the original set of records [104]. Sampling methods are suitable for categorical microdata [58], but for continuous microdata they should probably be combined with other masking methods. The reason is that sampling alone leaves a continuous attribute unperturbed for all records in S. Thus, if any continuous attribute is present in an external administrative public file, unique matches with the published sample are very likely: indeed, given a continuous attribute and two respondents xi and xj, it is unlikely that both respondents take the same value for the continuous attribute unless xi = xj (this is true even if the continuous attribute has been truncated to represent it digitally). If, for a continuous identifying attribute, the score of a respondent is only approximately known to an attacker, it might still make sense to use sampling methods to protect that attribute. However, assumptions about limited attacker resources are risky and may prove far too optimistic if good-quality external administrative files are at hand.
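To make the idea concrete, here is a minimal Python sketch of record sampling; the data set, attribute names, and function name are illustrative choices of ours, not from the source.

```python
import pandas as pd

def sample_microdata(X: pd.DataFrame, fraction: float, seed: int = 0) -> pd.DataFrame:
    """Publish a simple random sample S of the records in X.

    The attribute values of sampled records are left unperturbed,
    which is why sampling alone is weak for continuous attributes
    that also appear in external files.
    """
    return X.sample(frac=fraction, random_state=seed)

# Hypothetical microdata set with a categorical and a continuous attribute.
X = pd.DataFrame({
    "marital_status": ["single", "married", "widowed", "married", "single"],
    "income": [21000.0, 54000.5, 33017.2, 47800.9, 28950.3],
})
S = sample_microdata(X, fraction=0.6)
```

Because the sampled income values pass through unchanged, an exact match against an external administrative file remains possible for every record in S.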

Generalization

This technique is also known as global recoding in the statistical disclosure control literature. For a categorical attribute Xi, several categories are combined to form new (less specific) categories, thus resulting in a new Yi with |Dom(Yi)| < |Dom(Xi)| where |·| is the cardinality operator and Dom(·) is the domain where the attribute takes values. For a continuous attribute, generalization means replacing Xi by another attribute Yi which is a discretized version of Xi. In other words, a potentially infinite range Dom(Xi) is mapped onto a finite range Dom(Yi). This is the technique used in the μ-Argus SDC package [45]. This technique is more appropriate for categorical microdata, where it helps disguise records with strange combinations of categorical attributes. Generalization is used heavily by statistical offices.

Example 3.1 If there is a record with “Marital status = Widow/er” and “Age = 17,” generalization could be applied to “Marital status” to create a broader category “Widow/er or divorced,” so that the probability of the above record being unique would diminish. Generalization can also be used on a continuous attribute, but the inherent discretization very often leads to an unacceptably high loss of information. Also, arithmetical operations that were straightforward on the original Xi are no longer easy or intuitive on the discretized Yi.
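A minimal sketch of global recoding in Python, assuming a hypothetical recoding table for “Marital status” and arbitrary age intervals:

```python
import pandas as pd

# Hypothetical generalization: several categories are merged into a broader one.
MARITAL_RECODING = {
    "widowed": "widowed_or_divorced",
    "divorced": "widowed_or_divorced",
}

def generalize_categorical(col: pd.Series, recoding: dict) -> pd.Series:
    """Global recoding: map categories onto less specific ones,
    so that |Dom(Y_i)| < |Dom(X_i)|."""
    return col.replace(recoding)

def generalize_continuous(col: pd.Series, bins: list) -> pd.Series:
    """Discretize a continuous attribute into a finite set of intervals."""
    return pd.cut(col, bins=bins)

ages = pd.Series([17, 34, 58, 71])
print(generalize_continuous(ages, bins=[0, 20, 40, 60, 120]))
# Intervals such as (0, 20] replace the exact ages.
```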

Top and bottom coding

Top and bottom coding are special cases of generalization which can be used on attributes that can be ranked, that is, continuous or categorical ordinal. The idea is that top values (those above a certain threshold) are lumped together to form a new category. The same is done for bottom values (those below a certain threshold).
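As a sketch (thresholds and names are arbitrary assumptions of ours), top and bottom coding can be implemented by lumping the tails of a rankable attribute into two new categories:

```python
import pandas as pd

def top_bottom_code(col: pd.Series, bottom: float, top: float) -> pd.Series:
    """Replace values above `top` and below `bottom` with the lumped
    categories ">top" and "<bottom"; intermediate values are kept."""
    coded = col.astype(object)
    coded[col > top] = f">{top}"
    coded[col < bottom] = f"<{bottom}"
    return coded

incomes = pd.Series([8000.0, 25000.0, 52000.0, 310000.0])
print(top_bottom_code(incomes, bottom=10000, top=100000))
```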

Local suppression

This is a masking method in which certain values of individual attributes are suppressed so as to increase the number of records that agree on a combination of key values. Ways to combine local suppression and generalization are implemented in the μ-Argus SDC package [45].

If a continuous attribute Xi is part of a set of key attributes, then each combination of key values is probably unique. Since it does not make sense to systematically suppress the values of Xi, local suppression is better suited to categorical attributes.
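The following Python sketch shows one naive way to suppress values. This is a toy heuristic under our own assumptions; real packages such as μ-Argus choose which values to suppress so as to minimize the number of suppressions.

```python
import pandas as pd

def locally_suppress(X: pd.DataFrame, keys: list, k: int) -> pd.DataFrame:
    """Suppress the last key attribute in every record whose combination
    of key values is shared by fewer than k records, so that the
    published combinations are less likely to be unique."""
    Y = X.copy()
    # Size of the group of records sharing each combination of key values.
    combination_size = Y.groupby(keys)[keys[0]].transform("size")
    Y.loc[combination_size < k, keys[-1]] = pd.NA
    return Y
```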

3.2 PERTURBATIVE MASKING METHODS

Noise addition

Additive noise is a family of perturbative masking methods. The values in the original data set are masked by adding random noise. The statistical properties of the added noise determine the effect of noise addition on the original data set. Several noise addition procedures have been developed, each designed to better preserve the statistical properties of the original data.

Masking by uncorrelated noise addition. The vector of observations $x_i$ for the $i$-th attribute $X_i$ of the original data set is replaced by a vector $y_i = x_i + e_i$, where $e_i$ is a vector of normally distributed errors. Let $e_i^k$ and $e_i^l$ be, respectively, the $k$-th and $l$-th components of the vector $e_i$. We have that $e_i^k$ and $e_i^l$ are independent and drawn from a normal distribution $N(0, \sigma_{e_i}^2)$. The usual approach is for the variance of the noise added to attribute $X_i$ to be proportional to the variance of $X_i$; that is, $\sigma_{e_i}^2 = \alpha\,\mathrm{Var}(X_i)$. The term “uncorrelated” is used to mean that there is no correlation between the noise added to different attributes.

This method preserves means and covariances:

$$E(Y_i) = E(X_i) + E(e_i) = E(X_i),$$

$$\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(X_i, X_j), \quad i \neq j.$$

However, neither variances nor correlations are preserved:

$$\mathrm{Var}(Y_i) = \mathrm{Var}(X_i) + \sigma_{e_i}^2 = (1 + \alpha)\,\mathrm{Var}(X_i),$$

$$\rho(Y_i, Y_j) = \frac{\mathrm{Cov}(Y_i, Y_j)}{\sqrt{\mathrm{Var}(Y_i)\,\mathrm{Var}(Y_j)}} = \frac{\rho(X_i, X_j)}{1 + \alpha}, \quad i \neq j.$$
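A short numpy sketch of masking by uncorrelated noise addition; the proportionality constant alpha and the function name are our own:

```python
import numpy as np

def add_uncorrelated_noise(X: np.ndarray, alpha: float, seed: int = 0) -> np.ndarray:
    """Mask each attribute X_i with independent N(0, alpha * Var(X_i)) noise.

    Means and covariances are preserved in expectation, but every
    variance is inflated by the factor (1 + alpha).
    """
    rng = np.random.default_rng(seed)
    noise_std = np.sqrt(alpha * X.var(axis=0))   # per-attribute noise std dev
    return X + rng.normal(0.0, noise_std, size=X.shape)
```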
Masking by correlated noise addition. Noise addition alone always modifies the variance of the original attributes. Thus, if we want to preserve the correlation coefficients of the original data, the covariances must be modified accordingly. This is what masking by correlated noise addition does: the covariance matrix of the noise is taken to be proportional to the covariance matrix of the original data, that is, $e \sim N(0, \Sigma_e)$ with $\Sigma_e = \alpha \Sigma$. We then have

$$\mathrm{Var}(Y_i) = (1 + \alpha)\,\mathrm{Var}(X_i), \qquad \mathrm{Cov}(Y_i, Y_j) = (1 + \alpha)\,\mathrm{Cov}(X_i, X_j),$$

so that the correlation coefficients are preserved:

$$\rho(Y_i, Y_j) = \frac{(1 + \alpha)\,\mathrm{Cov}(X_i, X_j)}{(1 + \alpha)\sqrt{\mathrm{Var}(X_i)\,\mathrm{Var}(X_j)}} = \rho(X_i, X_j).$$
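A companion sketch for correlated noise addition, drawing the noise from a multivariate normal with covariance matrix alpha * Sigma, where Sigma is estimated from the data (names are ours):

```python
import numpy as np

def add_correlated_noise(X: np.ndarray, alpha: float, seed: int = 0) -> np.ndarray:
    """Mask X with noise e ~ N(0, alpha * Sigma), where Sigma is the
    covariance matrix of X; correlation coefficients are preserved."""
    rng = np.random.default_rng(seed)
    sigma = np.cov(X, rowvar=False)              # sample covariance matrix of X
    noise = rng.multivariate_normal(
        mean=np.zeros(X.shape[1]), cov=alpha * sigma, size=X.shape[0])
    return X + noise
```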
Masking by noise addition and linear transformation. In [48], a method is proposed that ensures by additional transformations that the sample covariance matrix of the masked attributes is an unbiased estimator for the covariance matrix of the original attributes.

Masking by noise addition and nonlinear transformation. Combining simple additive noise with a nonlinear transformation has also been proposed, in such a way that application to discrete attributes is possible and univariate distributions are preserved. Unfortunately, applying this method is very time-consuming and requires expert knowledge of the data set and the algorithm. See [44] for more details.

Noise addition methods with normal distributions are naturally meant for continuous data, even though some adaptations to categorical data have also been proposed [76]. Moreover, the introduction of the differential privacy model for disclosure control has motivated the use of other noise distributions. The focus there is on preserving the privacy guarantees of the model rather than the statistical properties of the data. The addition of uncorrelated Laplace-distributed noise is the most common approach to attain differential privacy [29]. For discrete data, the geometric distribution [33] is a better alternative to Laplace-distributed noise. It has also been shown that the Laplace distribution is not the optimal noise for attaining differential privacy on continuous data [92].
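For illustration, a generic sketch of the standard Laplace mechanism for a numerical query; the sensitivity parameter must be derived from the query at hand, and the function name is ours:

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float,
                      epsilon: float, seed=None) -> float:
    """Return true_answer + Lap(sensitivity / epsilon) noise, the usual
    way to attain epsilon-differential privacy for a numerical query."""
    rng = np.random.default_rng(seed)
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
```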

Data/rank swapping

Data swapping was originally presented as an SDC method for databases containing only categorical attributes. The basic idea behind the method is to transform a database by exchanging values of confidential attributes among individual records. Records are exchanged in such a way that low-order frequency counts or marginals are maintained.

Although the original procedure is not widely used in practice, its basic idea has clearly influenced subsequent methods. A variant of data swapping for microdata is rank swapping, which is described next in some detail. Although originally described only for ordinal attributes [36], rank swapping can also be used for any numerical attribute. See Algorithm 1. First, the values of an attribute A are ranked in ascending order; then each ranked value of A is swapped with another ranked value randomly chosen within a restricted range (e.g., the ranks of two swapped values cannot differ by more than p% of the total number of records, where p is an input parameter).

This algorithm is independently used on each original attribute in the original data set. It is reasonable to expect that multivariate statistics computed from data swapped with this algorithm will be less distorted than those computed after an unconstrained swap.
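A Python sketch of the restricted rank swap of Algorithm 1 applied to one attribute; the tie-breaking and the way the swapping partner is chosen are our own simplifications:

```python
import numpy as np

def rank_swap(values: np.ndarray, p: float, seed: int = 0) -> np.ndarray:
    """Restricted rank swapping of a single attribute.

    Values are ranked in ascending order; each still-unswapped value is
    swapped with a randomly chosen unswapped value whose rank differs
    by at most p% of the number of records.
    """
    rng = np.random.default_rng(seed)
    n = len(values)
    order = np.argsort(values)             # order[r] = position of the rank-r value
    swapped = values.copy()
    max_dist = max(1, int(n * p / 100))    # maximum allowed rank distance
    used = np.zeros(n, dtype=bool)         # per-rank "already swapped" flags
    for r in range(n):
        if used[r]:
            continue
        candidates = [s for s in range(r + 1, min(n, r + max_dist + 1))
                      if not used[s]]
        if not candidates:
            continue
        s = rng.choice(candidates)
        i, j = order[r], order[s]
        swapped[i], swapped[j] = swapped[j], swapped[i]
        used[r] = used[s] = True
    return swapped
```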

Microaggregation

Microaggregation is a family of SDC techniques for continuous microdata. The rationale behind microaggregation is that confidentiality rules in use allow publication of microdata sets if records correspond to groups of k or more individuals, where no individual dominates (i.e., contributes too much to) the group and k is a threshold value. Strict application of such confidentiality rules leads to replacing individual values with values computed on small aggregates (microaggregates) prior to publication. This is the basic principle of microaggregation. To obtain microaggregates in a microdata set with n records, these are combined to form g groups of size at least k. For each attribute, the average value over each group is computed and is used to replace each of the original averaged values. Groups are formed using a criterion of maximal similarity. Once the procedure has been completed, the resulting (modified) records can be published. The optimal k-partition (from the information loss point of view) is defined to be the one that maximizes within-group homogeneity; the higher the within-group homogeneity, the lower the information loss, since microaggregation replaces the values in a group by the group centroid. The sum-of-squares criterion is commonly used to measure homogeneity in clustering. The within-groups sum of squares SSE is defined as

$$SSE = \sum_{i=1}^{g} \sum_{x_{ij} \in C_i} (x_{ij} - \bar{x}_i)^\top (x_{ij} - \bar{x}_i),$$

where $g$ is the number of groups, $C_i$ is the $i$-th group, and $\bar{x}_i$ is the average record (centroid) over $C_i$.
Algorithm 1 Rank swapping with swapping restricted to p%

1: rank the values of attribute A in ascending order
2: mark all values as unswapped
3: for each unswapped value a, taken in rank order, do
4:    randomly and uniformly choose an unswapped value a′ whose rank differs from that of a by at most p% of the total number of records
5:    swap a with a′ and mark both values as swapped
6: end for
The between-groups sum of squares SSA is

$$SSA = \sum_{i=1}^{g} n_i\,(\bar{x}_i - \bar{x})^\top (\bar{x}_i - \bar{x}),$$

where $n_i$ is the number of records in group $C_i$ and $\bar{x}$ is the average record over the entire data set.
The total sum of squares is $SST = SSA + SSE$ or, explicitly,

$$SST = \sum_{i=1}^{g} \sum_{x_{ij} \in C_i} (x_{ij} - \bar{x})^\top (x_{ij} - \bar{x}).$$
The lower the SSE, the higher the within-group homogeneity. Thus, in terms of sums of squares, the optimal k-partition is the one that minimizes SSE (or, equivalently, maximizes SSA).
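The decomposition SST = SSA + SSE can be checked numerically; the sketch below computes the three sums of squares for an arbitrary partition (the labels array and all names are ours):

```python
import numpy as np

def sums_of_squares(X: np.ndarray, labels: np.ndarray):
    """Within-groups (SSE), between-groups (SSA), and total (SST) sums
    of squares for the partition of the records of X given by `labels`."""
    centroid = X.mean(axis=0)
    sse = ssa = 0.0
    for g in np.unique(labels):
        group = X[labels == g]
        g_centroid = group.mean(axis=0)
        sse += ((group - g_centroid) ** 2).sum()
        ssa += len(group) * ((g_centroid - centroid) ** 2).sum()
    sst = ((X - centroid) ** 2).sum()
    assert np.isclose(sst, sse + ssa)      # SST = SSA + SSE
    return sse, ssa, sst

X = np.random.default_rng(0).normal(size=(9, 2))
print(sums_of_squares(X, np.repeat([0, 1, 2], 3)))
```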

Given a microdata set consisting of p attributes, these can be microaggregated together or partitioned into several groups of attributes and then microaggregated. Also, the way to form groups may vary. Several taxonomies are possible to classify the microaggregation algorithms in the literature: (i) fixed group size [17, 26, 45] vs. variable group size [21, 49, 57]; (ii) exact optimal (only for the univariate case, [41]) vs. heuristic microaggregation (the rest of the microaggregation literature); (iii) categorical [26, 57] vs. continuous (the rest of the references cited in this paragraph). Also, depending on whether they deal with one or several attributes at a time, microaggregation methods can be classified into univariate and multivariate.

• Univariate methods deal with multi-attribute data sets by microaggregating one attribute at a time. Input records are sorted by the first attribute, then groups of k successive values of the first attribute are created, and all values within each group are replaced by the group representative (e.g., the centroid). The same procedure is repeated for the rest of the attributes. Notice that all attribute values of each record are moved together when sorting records by a particular attribute; hence, the relation between the attribute values within each record is preserved. This approach is known as individual ranking [16, 17]; a code sketch of it is given after this list. Individual ranking just reduces the variability of attributes, thereby providing some anonymization. In [25] it was shown that individual ranking causes low information loss and, thus, its output better preserves analytical utility. However, the disclosure risk in the anonymized output remains unacceptably high [22].

• To deal with several attributes at a time, the trivial option is to map multi-attribute data sets to univariate data by projecting the former onto a single axis (e.g., using the sum of z-scores or the first principal component, see [16]) and then use univariate microaggregation on the univariate data. Another option, which avoids the information loss due to single-axis projection, is to use multivariate microaggregation able to deal with unprojected multi-attribute data [21]. If we define optimal microaggregation as finding a partition into groups of size at least k such that within-group homogeneity is maximum, it turns out that optimal univariate microaggregation can be solved in polynomial time [41], whereas optimal multivariate microaggregation is unfortunately NP-hard [70]. This justifies the use of heuristic methods for multivariate microaggregation, such as MDAV (Maximum Distance to Average Vector) [20, 26]. In any case, multivariate microaggregation leads to higher information loss than individual ranking [25].
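As announced in the first item above, here is a sketch of individual ranking (univariate microaggregation); the handling of a trailing group smaller than k is our own choice:

```python
import numpy as np

def individual_ranking(X: np.ndarray, k: int) -> np.ndarray:
    """Univariate microaggregation: attribute by attribute, sort the
    values, form groups of k consecutive values, and replace each value
    by the mean of its group. Values are replaced in place, so the links
    between the attributes of each record are preserved."""
    Y = X.astype(float)
    n = X.shape[0]
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])        # ranks of attribute j
        for start in range(0, n, k):
            group = order[start:start + k]
            if len(group) < k and start >= k:
                group = order[start - k:]  # merge a short tail into the last group
            Y[group, j] = X[group, j].mean()
    return Y
```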

To illustrate these approaches, we next give the details of the MDAV heuristic algorithm for multivariate fixed group size microaggregation on unprojected continuous data (a code sketch follows the steps).

1. Compute the average record x̄ of all records in the data set. Consider the record xr most distant from x̄ (using the squared Euclidean distance).

2. Find the most distant record xs from the record xr considered in the previous step.

3. Form two groups around xr and xs, respectively. One group contains xr and the k – 1 records closest to xr. The other group contains xs and the k – 1 records closest to xs.

4. If there are at least 3k records which do not belong to any of the two groups formed in Step 3, go to Step 1 taking as new data set the previous data set minus the groups formed in the last instance of Step 3.

5. If there are between 3k – 1 and 2k records which do not belong to any of the two groups formed in Step 3: a) compute the average record x̄ of the remaining records; b) find the most distant record xr from x̄; c) form a group containing xr and the k – 1 records closest to xr; d) form another group containing the rest of the records. Exit the algorithm.

6. If there are fewer than 2k records which do not belong to any of the two groups formed in Step 3, form a new group with those records and exit the algorithm.
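Putting the steps together, a compact Python sketch of MDAV follows. To keep the two groups of Step 3 disjoint we form them sequentially; this and all names are implementation choices of ours:

```python
import numpy as np

def mdav(X: np.ndarray, k: int) -> np.ndarray:
    """MDAV heuristic (fixed group size k) on unprojected continuous data.
    Each record is replaced by the centroid of its group; distances are
    squared Euclidean, as in the steps above."""
    Y = X.astype(float)
    remaining = list(range(len(X)))

    def closest_k(center_idx, pool):
        d = ((X[pool] - X[center_idx]) ** 2).sum(axis=1)
        return [pool[i] for i in np.argsort(d)[:k]]

    def farthest_from(point, pool):
        d = ((X[pool] - point) ** 2).sum(axis=1)
        return pool[int(np.argmax(d))]

    while len(remaining) >= 3 * k:
        xbar = X[remaining].mean(axis=0)
        r = farthest_from(xbar, remaining)           # Step 1
        s = farthest_from(X[r], remaining)           # Step 2
        for center in (r, s):                        # Step 3 (groups formed sequentially)
            group = closest_k(center, remaining)
            Y[group] = X[group].mean(axis=0)
            remaining = [i for i in remaining if i not in group]

    if len(remaining) >= 2 * k:                      # Step 5
        xbar = X[remaining].mean(axis=0)
        r = farthest_from(xbar, remaining)
        group = closest_k(r, remaining)
        Y[group] = X[group].mean(axis=0)
        remaining = [i for i in remaining if i not in group]

    if remaining:                                    # Step 6: fewer than 2k records left
        Y[remaining] = X[remaining].mean(axis=0)
    return Y
```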
