CHAPTER 2
Privacy in Data Releases
References to privacy were already present in the writings of Greek philosophers, who distinguished between the outer (public) and the inner (private) spheres. Nowadays privacy is considered a fundamental right of individuals [34, 101]. Despite this long history, the formal description of the “right to privacy” is quite recent. It was coined by Warren and Brandeis, back in 1890, in an article [103] published in the Harvard Law Review. These authors presented laws as dynamic systems for the protection of individuals, whose evolution is triggered by social, political, and economic changes. In particular, the conception of the right to privacy was triggered by the technical advances and new business models of their time. Quoting Warren and Brandeis:
Instantaneous photographs and newspaper enterprise have invaded the sacred precincts of private and domestic life; and numerous mechanical devices threaten to make good the prediction that what is whispered in the closet shall be proclaimed from the house-tops.
Warren and Brandeis argue that the “right to privacy” was already implicitly present in many areas of the common law; they merely gathered these scattered legal concepts and brought them into focus under a common denominator. Within the legal framework of the time, the “right to privacy” was part of the right to life, one of the three fundamental individual rights recognized by the U.S. constitution.
Privacy concerns revived with the invention of computers [31] and information exchange networks, which dramatically increased the capabilities for collecting, storing, and processing information and made large-scale population surveys commonplace. The focus then shifted to data protection.
Nowadays, privacy is widely considered a fundamental right, and it is supported by international treaties and many constitutional laws. For example, the Universal Declaration of Human Rights (1948) devotes its Article 12 to privacy. In fact, privacy has gained worldwide recognition, and it applies to a wide range of situations, such as avoiding external meddling at home, limiting the use of surveillance technologies, and controlling the processing and dissemination of personal data.
As far as the protection of individuals’ data is concerned, privacy legislation is based on several principles [69, 101]: collection limitation, purpose specification, use limitation, data quality, security safeguards, openness, individual participation, and accountability. However, with the appearance of big data, it is unclear whether any of these principles remains really effective [93].
Among all the aspects that relate to data privacy, we are especially interested in data dissemination. Dissemination is, for instance, the primary task of National Statistical Institutes, which aim at offering an accurate picture of society; to that end, they collect and publish statistical data on a wide range of aspects, such as economy and population. Legislation usually equates privacy violations in data dissemination with individual identifiability [1, 2]; for instance, Title 13, Chapter 1.1 of the U.S. Code states that “no individual should be re-identifiable in the released data.”
For a more comprehensive review of the history of privacy, see [43]. A more visual perspective of privacy is given by the timelines in [3, 4]. In [3], key privacy-related events between 1600 (when it was a civic duty to keep an eye on your neighbors) and 2008 (after the U.S. Patriot Act and the inception of Facebook) are listed. In [4], key moments that have shaped privacy-related laws are depicted.
2.1 TYPES OF DATA RELEASES
The type of data being released determines the potential threats to privacy as well as the most suitable protection methods. Statistical databases come in three main formats.
• Microdata. The term “microdata” refers to a record that contains information related to a specific individual (a citizen or a company). A microdata release aims at publishing raw data, that is, a set of microdata records.
• Tabular data. Cross-tabulated values showing aggregate values for groups of individuals are released. The term “contingency (or frequency) table” is used when counts are released, and the term “magnitude table” is used for other aggregate magnitudes. This type of data is the classical output of official statistics.
• Queryable databases, that is, interactive databases to which the user can submit statistical queries (sums, averages, etc.).
Our focus in subsequent chapters is on microdata releases. Microdata offer the greatest level of flexibility among all types of data releases: data users are not confined to a specific prefixed view of data; they are able to carry out any kind of custom analysis on the released data. However, microdata releases are also the most challenging for the privacy of individuals.
2.2 MICRODATA SETS
A microdata set can be represented as a table (matrix) where each row refers to a different individual and each column contains information regarding one of the attributes collected. We use X to denote the collected microdata file. We assume that X contains information about n respondents and m attributes. We use x_i to refer to the record contributed by respondent i, and x^j (or X^j) to refer to attribute j. The value of attribute j for respondent i is denoted by x_i^j.
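To make the notation concrete, here is a minimal sketch (in Python, using pandas) of a toy microdata matrix; the attribute names and values are invented for illustration only.

```python
# A minimal sketch of the matrix view of a microdata set X (toy data).
import pandas as pd

# n = 3 respondents (rows), m = 4 attributes (columns)
X = pd.DataFrame(
    {
        "ZIP": ["08001", "08002", "08001"],
        "BirthYear": [1980, 1975, 1990],
        "Sex": ["F", "M", "F"],
        "Salary": [32000, 41000, 28000],
    }
)

x_2 = X.iloc[1]                   # record x_i contributed by respondent i = 2
x_salary = X["Salary"]            # attribute X^j, here j = "Salary"
x_2_salary = X.iloc[1]["Salary"]  # value x_i^j of attribute j for respondent i = 2
print(x_2_salary)                 # 41000
```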
The attributes in a microdata set are usually classified in the following non-exclusive categories.
• Identifiers. An attribute is an identifier if it provides unambiguous re-identification of the individual to which the record refers. Some examples of identifier attributes are the social security number, the passport number, etc. If a record contains an identifier, any sensitive information contained in other attributes may immediately be linked to a specific individual. To avoid direct re-identification of an individual, identifier attributes must be removed or encrypted. In the following chapters, we assume that identifier attributes have previously been removed.
• Quasi-identifiers. Unlike an identifier, a quasi-identifier attribute alone does not lead to record re-identification. However, in combination with other quasi-identifier attributes, it may allow unambiguous re-identification of some individuals. For example, [99] shows that 87% of the population in the U.S. can be unambiguously identified by combining a 5-digit ZIP code, birth date, and sex. Removing quasi-identifier attributes, as proposed for identifiers, is not possible, because quasi-identifiers are most of the time needed to perform any useful analysis of the data. Deciding whether a specific attribute should be considered a quasi-identifier is a thorny issue. In practice, any information an intruder has about an individual can be used in record re-identification. For uninformed intruders, only the attributes available in an external non-anonymized data set should be classified as quasi-identifiers; in the presence of informed intruders, any attribute may potentially be a quasi-identifier. Thus, in the strictest case, to make sure all potential quasi-identifiers have been removed, one ought to remove all attributes (!). A sketch of a simple empirical check for candidate quasi-identifier combinations is given after this list.
• Confidential attributes. Confidential attributes hold sensitive information on the individuals that took part in the data collection process (e.g., salary, health condition, sex orientation, etc.). The primary goal of microdata protection techniques is to prevent intruders from learning confidential information about a specific individual. This goal involves not only preventing the intruder from determining the exact value that a confidential attribute takes for some individual, but also preventing accurate inferences on the value of that attribute (such as bounding it).
• Non-confidential attributes. Non-confidential attributes are those that do not belong to any of the previous categories. As they do not contain sensitive information about individuals and cannot be used for record re-identification, they do not affect our discussion on disclosure limitation for microdata sets. Therefore, we assume that none of the attributes in X belong to this category.
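As anticipated in the discussion of quasi-identifiers, a simple empirical way to gauge how identifying a candidate combination of quasi-identifiers is consists in counting how many records are unique on that combination. The following sketch uses invented data and column names.

```python
# A minimal, hypothetical sketch: count records that are unique on a candidate
# combination of quasi-identifiers. Data and column names are invented.
import pandas as pd

X = pd.DataFrame(
    {
        "ZIP": ["08001", "08002", "08001", "08002", "08001"],
        "BirthDate": ["1980-05-01", "1975-07-12", "1980-05-01", "1990-01-30", "1962-11-03"],
        "Sex": ["F", "M", "F", "M", "F"],
        "Diagnosis": ["flu", "diabetes", "asthma", "flu", "cancer"],
    }
)

quasi_identifiers = ["ZIP", "BirthDate", "Sex"]
group_sizes = X.groupby(quasi_identifiers).size()
unique_records = (group_sizes == 1).sum()  # records singled out by the combination
print(f"{unique_records} of {len(X)} records are unique on {quasi_identifiers}")
```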
2.3 FORMALIZING PRIVACY
A first attempt to come up with a formal definition of privacy was made by Dalenius in [14]. He stated that access to the released data should not allow any attacker to increase his knowledge about confidential information related to a specific individual. In other words, the prior and the posterior beliefs about an individual in the database should be similar. Because the ultimate goal in privacy is to keep the secrecy of sensitive information about specific individuals, this is a natural definition of privacy. However, Dalenius’ definition is too strict to be useful in practice. This was illustrated with two examples [29]. The first one considers an adversary whose prior view is that everyone has two left feet. By accessing a statistical database, the adversary learns that almost everybody has one left foot and one right foot, thus modifying his posterior belief about individuals to a great extent. In the second example, the use of auxiliary information makes things worse. Suppose that a statistical database teaches the average height of English men, and that it is not possible to learn this information in any other way. Suppose also that the actual height of a person is considered to be a sensitive piece of information. Let the attacker have the following side information: “Adam is one centimeter taller than the average English man.” Access to the database, combined with this side information, reveals Adam’s height, whereas the side information alone teaches much less. Thus, Dalenius’ view of privacy is not feasible in the presence of background information (if any utility is to be provided).
The privacy criteria used in practice offer only limited disclosure control guarantees. Two main views of privacy are used for microdata releases: anonymity (it should not be possible to re-identify any individual in the published data) and confidentiality or secrecy (access to the released data should not reveal confidential information related to any specific individual).
The confidentiality view of privacy is closer to Dalenius’ proposal, the main difference being that it limits the amount of information provided by the data set rather than the change between prior and posterior beliefs about an individual. There are several approaches to attain confidentiality. A basic example of a statistical disclosure control (SDC) technique that provides confidentiality is noise addition. By adding random noise to a confidential data item, we mask its value: we report a value drawn from a random distribution rather than the actual value. The amount of noise added determines the level of confidentiality.
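As an illustration, here is a minimal sketch of additive noise masking applied to a numerical confidential attribute; the data and the noise level are invented, and in practice the noise distribution and its variance are chosen to balance confidentiality against utility.

```python
# A minimal sketch of noise addition for confidentiality (numerical attribute).
import numpy as np

rng = np.random.default_rng(seed=42)

salaries = np.array([32000.0, 41000.0, 28000.0, 55000.0])  # confidential values

# Uncorrelated additive noise with standard deviation proportional to the
# attribute's own standard deviation (an illustrative parameterization).
noise_factor = 0.1
noise = rng.normal(loc=0.0, scale=noise_factor * salaries.std(), size=salaries.shape)
masked_salaries = salaries + noise

print(masked_salaries)  # reported instead of the actual values
```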
The anonymity view of privacy seeks to hide each individual in a group. This is indeed quite an intuitive view of privacy: the privacy of an individual is protected if we are not able to distinguish her from other individuals in a group. This view of privacy is commonly used in legal frameworks. For instance, the U.S. Health Insurance Portability and Accountability Act (HIPAA) of 1996 requires removing several attributes that could potentially identify an individual; in this way, the individual stays anonymous. However, we should keep in mind that if the value of the confidential attribute has a small variability within the group of indistinguishable individuals, disclosure still happens for these individuals: even if we are not able to tell which record belongs to each of the individuals, the low variability of the confidential attribute gives us a good estimate of its actual value.
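The caveat above can be checked directly on a released data set: for each group of records sharing the same (generalized) quasi-identifier values, one can inspect both the group size and the variability of the confidential attribute within the group. The following sketch uses invented data and column names.

```python
# A minimal sketch: hiding an individual in a group does not help if the
# confidential attribute barely varies within the group. Toy data.
import pandas as pd

released = pd.DataFrame(
    {
        "AgeRange": ["30-40", "30-40", "30-40", "40-50", "40-50"],
        "Diagnosis": ["flu", "flu", "flu", "diabetes", "asthma"],
    }
)

# Group size measures anonymity; the number of distinct confidential values
# within the group indicates how much is actually revealed.
per_group = released.groupby("AgeRange")["Diagnosis"].agg(["size", "nunique"])
print(per_group)
# A group where nunique == 1 discloses the diagnosis of everyone in it,
# even though no single record can be re-identified.
```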
The Health Insurance Portability and Accountability Act (HIPAA)
The Privacy Rule allows a covered entity to de-identify data by removing all 18 elements that could be used to identify the individual or the individual’s relatives, employers, or household members; these elements are enumerated in the Privacy Rule. The covered entity also must have no actual knowledge that the remaining information could be used alone or in combination with other information to identify the individual who is the subject of the information. Under this method, the identifiers that must be removed are the following:
• Names.
• All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geographical codes, except for the initial three digits of a ZIP code if, according to the current publicly available data from the Bureau of the Census:
– The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people.
– The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people are changed to 000.
• All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.
• Telephone numbers.
• Facsimile numbers.
• Electronic mail addresses.
• Social security numbers.
• Medical record numbers.
• Health plan beneficiary numbers.
• Account numbers.
• Certificate/license numbers.
• Vehicle identifiers and serial numbers, including license plate numbers.
• Device identifiers and serial numbers.
• Web universal resource locators (URLs).
• Internet protocol (IP) address numbers.
• Biometric identifiers, including fingerprints and voiceprints.
• Full-face photographic images and any comparable images.
• Any other unique identifying number, characteristic, or code, unless otherwise permitted by the Privacy Rule for re-identification.
2.4 DISCLOSURE RISK IN MICRODATA SETS
When publishing a microdata file, the data collector must guarantee that no sensitive information about specific individuals is disclosed. Usually two types of disclosure are considered in microdata sets [44].
• Identity disclosure. This type of disclosure violates privacy viewed as anonymity. It occurs when the intruder is able to associate a record in the released data set with the individual that originated it. After re-identification, the intruder associates the values of the confidential attributes for the record to the re-identified individual. Two main approaches are usually employed to measure identity disclosure risk: uniqueness and record linkage.
– Uniqueness. Roughly speaking, the risk of identity disclosure is measured as the probability that rare combinations of attribute values in the released protected data are indeed rare in the original population the data come from.
– Record linkage. This is an empirical approach to evaluate the risk of disclosure. In this case, the data protector (also known as data controller) uses a record linkage algorithm (or several such algorithms) to link each record in the anonymized data with a record in the original data set. Since the protector knows the real correspondence between original and anonymized records, he can determine the percentage of correctly linked pairs, which he uses to estimate the number of re-identifications that might be obtained by a specialized intruder. If this number is unacceptably high, then more intense anonymization by the controller is needed before the anonymized data set is ready for release.
• Attribute disclosure. This type of disclosure violates privacy viewed as confidentiality. It occurs when access to the released data allows the intruder to determine the value of a confidential attribute of an individual with enough accuracy.
The above two types of disclosure are independent. Even if identity disclosure happens, there may not be attribute disclosure if the confidential attributes in the released data set have been masked. On the other hand, attribute disclosure may still happen even without identity disclosure. For example, imagine that the salary is one of the confidential attributes and the job is a quasi-identifier attribute; if an intruder is interested in a specific individual whose job he knows to be “accountant” and there are several accountants in the data set (including the target individual), the intruder will be unable to re-identify the individual’s record based only on her job, but he will be able to lower-bound and upper-bound the individual’s salary (which lies between the minimum and the maximum salary of all the accountants in the data set). Specifically, attribute disclosure happens if the range of possible salary values for the matching records is narrow.
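As an illustration of the empirical record linkage approach described above, the following sketch links each anonymized record to its nearest original record on the quasi-identifiers and reports the fraction of correct links; the data, the noise level, and the choice of Euclidean distance are illustrative assumptions only.

```python
# A minimal sketch of distance-based record linkage as an empirical estimate
# of identity disclosure risk, assuming numerical quasi-identifiers.
import numpy as np

rng = np.random.default_rng(seed=0)

# Original quasi-identifier values (100 records, 2 attributes) and a noisy version.
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
Y = X + rng.normal(loc=0.0, scale=2.0, size=X.shape)  # anonymized by noise addition

# For each anonymized record, find the closest original record (Euclidean distance).
distances = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)
linked_to = distances.argmin(axis=1)

# The data protector knows the true correspondence (record i of Y came from record i of X).
correct_links = (linked_to == np.arange(len(Y))).sum()
print(f"Estimated re-identification rate: {correct_links / len(Y):.2%}")
```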
2.5 MICRODATA ANONYMIZATION
To avoid disclosure, data collectors do not publish the original microdata set X, but a modified version Y of it. This data set Y is called the protected, anonymized, or sanitized version of X. Microdata protection methods can generate the protected data set by either masking the original data or generating synthetic data.
• Masking. The protected data Y are generated by modifying the original records in X. Masking induces a relation between the records in Y and the original records in X. When applied to quasi-identifier attributes, the identity behind each record is masked (which yields anonymity). When applied to confidential attributes, the values of the confidential data are masked (which yields confidentiality, even if the subject to whom the record corresponds might still be re-identifiable). Masking methods can in turn be divided into two categories depending on their effect on the original data.
– Perturbative masking. The microdata set is distorted before publication. The perturbation method used should be such that the statistics computed on the perturbed data set do not differ significantly from the statistics that would be obtained on the original data set. Noise addition, microaggregation, data/rank swapping, microdata rounding, resampling, and PRAM are examples of perturbative masking methods.
– Non-perturbative masking. Non-perturbative methods do not alter data; rather, they produce partial suppressions or reductions of detail/coarsening in the original data set. Sampling, global recoding, top and bottom coding, and local suppression are examples of non-perturbative masking methods.
• Synthetic data. The protected data set Y consists of randomly simulated records that do not directly derive from the records in X; the only connection between X and Y is that the latter preserves some statistics from the former (typically a model relating the attributes in X). The generation of a synthetic data set takes three steps [27, 77]: (i) a model for the population is proposed, (ii) the model is adjusted to the original data set X, and (iii) the synthetic data set Y is generated by drawing from the model. There are three types of synthetic data sets:
– Fully synthetic [77], where every attribute value for every record has been synthesized. The population units (subjects) contained in Y are not the original population units in X but a new sample from the underlying population.
– Partially synthetic [74], where only the data items (the attribute values) with high risk of disclosure are synthesized. The population units in Y are the same population units in X (in particular, X and Y have the same number of records).
– Hybrid [19, 65], where the original data set is mixed with a fully synthetic data set.
In a fully synthetic data set, any dependency between X and Y must come from the model. In other words, X and Y are conditionally independent given the adjusted model. The disclosure risk in fully synthetic data sets is usually low, as we justify next. On the one hand, the population units in Y are not the original population units in X. On the other hand, the information about the original data X conveyed by Y is only that incorporated by the model, which is usually limited to some statistical properties. In a partially synthetic data set, the disclosure risk is reduced by replacing the values in the original data set at a higher risk of disclosure with simulated values. The simulated values assigned to an individual should be representative but are not directly related to her. In hybrid data sets, the level of protection we get is the lowest: mixing original and synthetic records breaks the conditional independence between the original data and the synthetic data. The parameters of the mixture determine the amount of dependence.
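To make the three-step generation of a fully synthetic data set concrete, here is a minimal sketch that uses a multivariate normal distribution as a deliberately simplistic population model; the data and the choice of model are illustrative assumptions only.

```python
# A minimal sketch of fully synthetic data generation:
# (i) propose a model, (ii) fit it to X, (iii) draw Y from the fitted model.
import numpy as np

rng = np.random.default_rng(seed=1)

# Original numerical microdata X (200 records, 2 attributes), simulated here.
X = rng.multivariate_normal(mean=[50.0, 30000.0],
                            cov=[[25.0, 1500.0], [1500.0, 4.0e6]],
                            size=200)

# (i)-(ii) Propose a model and fit it to X (here: estimate mean and covariance).
mu_hat = X.mean(axis=0)
cov_hat = np.cov(X, rowvar=False)

# (iii) Draw the synthetic data set Y from the fitted model. Y only preserves
# the statistics captured by the model (here, first and second moments).
Y = rng.multivariate_normal(mean=mu_hat, cov=cov_hat, size=200)

print(Y.mean(axis=0), mu_hat)  # similar means, but no record-level link to X
```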
2.6 MEASURING INFORMATION LOSS
The evaluation of the utility of the protected data set must be based on the intended uses of the data. The closer the results obtained for these uses on the protected data are to those obtained on the original data, the more utility is preserved. However, very often, microdata protection cannot be performed in a data-use-specific manner, for the following reasons.
• Potential data uses are very diverse and it may even be hard to identify them all at the moment of the data release.
• Even if all the data uses could be identified, releasing several versions of the same original data set so that the i-th version has been optimized for the i-th data use may result in unexpected disclosure.
Since data must often be protected with no specific use in mind, it is usually more appropriate to refer to information loss rather than to utility. Measures of information loss provide generic ways for the data protector to assess how much harm is being inflicted on the data by a particular data masking technique.
Information loss measures for numerical data. Assume a microdata set X with n individuals (records) x_1,…,x_n and m continuous attributes X^1,…,X^m. Let Y be the protected microdata set. The following tools are useful to characterize the information contained in the data set:
• Covariance matrices V (on X) and V′ (on Y).
• Correlation matrices R and R′.
• Correlation matrices RF and RF′ between the m attributes and the factors PC1, PC2,…,PCp obtained through principal components analysis.
• Communality between each of the m attributes and the first principal component PC1 (or other principal components PCi’s). Communality is the percent of each attribute that is explained by PC1 (or PCi). Let C be the vector of communalities for X, and C′ the corresponding vector for Y.
• Factor score coefficient matrices F and F′. Matrix F contains the factors that should multiply each attribute in X to obtain its projection on each principal component. F′ is the corresponding matrix for Y.
There does not seem to be a single quantitative measure which completely reflects the structural differences between X and Y. Therefore, in [25, 87] it was proposed to measure information loss through the discrepancies between the matrices X, V, R, RF, C, and F obtained on the original data and the corresponding Y, V′, R′, RF′, C′, and F′ obtained on the protected data set. In particular, the discrepancy between correlations is related to the information loss for data uses such as regressions and cross-tabulations. Matrix discrepancy can be measured in at least three ways, listed next (a small computational sketch follows the list).
• Mean square error. Sum of squared componentwise differences between pairs of matrices, divided by the number of cells in either matrix.
• Mean absolute error. Sum of absolute componentwise differences between pairs of matrices, divided by the number of cells in either matrix.
• Mean variation. Sum of absolute percent variation of components in the matrix computed on the protected data with respect to components in the matrix computed on the original data, divided by the number of cells in either matrix. This approach has the advantage of not being affected by scale changes of attributes.
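Here is a minimal sketch of the three discrepancy measures above, applied to the covariance matrices V and V′ as an example; the same functions can be applied to the other matrices (R, RF, C, F). The data are invented, and the protected set is simulated by simple noise addition.

```python
# A minimal sketch of the three matrix-discrepancy measures, applied to V and V'.
import numpy as np

def mean_square_error(A, B):
    return np.mean((A - B) ** 2)

def mean_absolute_error(A, B):
    return np.mean(np.abs(A - B))

def mean_variation(A, B):
    # Absolute percent variation of protected components w.r.t. original ones;
    # insensitive to attribute scale changes (assumes nonzero original components).
    return np.mean(np.abs((B - A) / A))

rng = np.random.default_rng(seed=2)
X = rng.normal(size=(500, 3))
Y = X + rng.normal(scale=0.1, size=X.shape)  # e.g., a noise-added version of X

V, V_prime = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
print(mean_square_error(V, V_prime),
      mean_absolute_error(V, V_prime),
      mean_variation(V, V_prime))
```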
Information loss measures for categorical data. These have usually been based on direct comparison of categorical values, comparison of contingency tables, or on Shannon’s entropy [25]. More recently, the importance of the semantics underlying categorical data for data utility has been recognized [60, 83]. As a result, semantically grounded information loss measures that exploit the formal semantics provided by structured knowledge sources (such as taxonomies or ontologies) have been proposed, both to measure the practical utility and to guide the sanitization algorithms in terms of the preservation of data semantics [23, 57, 59].
Bounded information loss measures. The information loss measures discussed above are unbounded, i.e., they do not take values in a predefined interval. On the other hand, disclosure risk measures are naturally bounded: the risk of disclosure lies between 0 and 1. Defining bounded information loss measures may be convenient to enable the data protector to trade off information loss against disclosure risk. In [61], probabilistic information loss measures bounded between 0 and 1 are proposed for continuous data.
Propensity scores: a global information loss measure for all types of data. In [105], an information loss measure U applicable to continuous and categorical microdata was proposed. It is computed as follows.
1. Merge the original microdata set X and the anonymized microdata set Y, and add to the merged data set a binary attribute T with value 1 for the anonymized records and 0 for the original records.
2. Regress T on the rest of the attributes of the merged data set and call the adjusted attribute T̂. Since T is binary, logistic regression can be used.
3. Let the propensity score p̂i of record i of the merged data set be the value of T̂ for record i. Then the utility of Y is high if the propensity scores of the anonymized and original records are similar (this means that, based on the regression model used, anonymized records cannot be distinguished from original records).
4. Hence, if the number of original and anonymized records is the same, say N, a utility measure is
$$ U = \frac{1}{2N} \sum_{i=1}^{2N} \left( \hat{p}_i - \frac{1}{2} \right)^2. $$
The farther U is from 0, the higher the information loss, and conversely.
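The following sketch puts the four steps together using scikit-learn's logistic regression; the data are simulated, and the equal record counts assumed by the formula above hold by construction.

```python
# A minimal sketch of the propensity-score utility measure (invented data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=3)

N = 500
X = rng.normal(loc=0.0, scale=1.0, size=(N, 3))        # original records
Y = X + rng.normal(loc=0.0, scale=0.3, size=X.shape)   # anonymized records

# Step 1: merge and add the binary indicator T (1 = anonymized, 0 = original).
merged = np.vstack([X, Y])
T = np.concatenate([np.zeros(N), np.ones(N)])

# Step 2: regress T on the remaining attributes (logistic regression).
model = LogisticRegression(max_iter=1000).fit(merged, T)

# Step 3: propensity score of each merged record.
p_hat = model.predict_proba(merged)[:, 1]

# Step 4: utility measure U; U close to 0 means anonymized records are
# indistinguishable from original ones under this model.
U = np.mean((p_hat - 0.5) ** 2)
print(U)
```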
2.7 TRADING OFF INFORMATION LOSS AND DISCLOSURE RISK
The goal of SDC, namely to modify data so that sufficient protection is provided at minimum information loss, suggests that a good anonymization method is one that comes close to optimizing the trade-off between disclosure risk and information loss. Several approaches have been proposed to handle this trade-off. Here we discuss SDC scores and R-U maps.
SDC scores
An SDC score is a formula that combines the effects of information loss and disclosure risk in a single figure. Having adopted an SDC score as a good trade-off measure, the goal is to optimize the score value. Following this idea, [25] proposed a score for method performance rating based on the average of information loss and disclosure risk measures. For each method M and parameterization P, the following score is computed:
$$ \mathrm{Score}(X, Y) = \frac{IL(X, Y) + DR(X, Y)}{2}, $$
where IL is an information loss measure, DR is a disclosure risk measure, and Y is the protected data set obtained after applying method M with parameterization P to an original data set X. In [25] IL and DR were computed using a weighted combination of several information loss and disclosure risk measures. With the resulting score, a ranking of masking methods (and their parametrizations) was obtained. Using a score permits regarding the selection of a masking method and its parameters as an optimization problem: a masking method can be applied to the original data file and then a post-masking optimization procedure can be applied to decrease the score obtained (that is, to reduce information loss and disclosure risk). On the negative side, no specific score weighting can do justice to all methods. Thus, when ranking methods, the values of all measures of information loss and disclosure risk should be supplied along with the overall score.
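As a small illustration of how such a score can be used to rank masking methods and parameterizations, here is a sketch with invented IL and DR values, assumed to be expressed on a common 0–100 scale.

```python
# A minimal sketch of ranking masking methods by the average-based SDC score.
# The IL and DR values below are invented for illustration.
candidates = [
    {"method": "noise addition", "param": 0.1, "IL": 12.0, "DR": 35.0},
    {"method": "noise addition", "param": 0.5, "IL": 30.0, "DR": 10.0},
    {"method": "microaggregation", "param": 3, "IL": 18.0, "DR": 20.0},
]

for c in candidates:
    c["score"] = (c["IL"] + c["DR"]) / 2  # lower is better

for c in sorted(candidates, key=lambda c: c["score"]):
    print(c["method"], c["param"], c["score"])
```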
R-U maps
A tool which may be enlightening when trying to construct a score or, more generally, optimize the trade-off between information loss and disclosure risk is a graphical representation of pairs of measures (disclosure risk, information loss) or their equivalents (disclosure risk, data utility). Such maps are called R-U confidentiality maps [28]. Here, R stands for disclosure risk and U for data utility. In its most basic form, an R-U confidentiality map is the set of paired values (R, U) of disclosure risk and data utility that correspond to the various strategies for data release (e.g., variations on a parameter). Such (R, U) pairs are typically plotted in a two-dimensional graph, so that the user can easily grasp the influence of a particular method and/or parameter choice.
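A minimal sketch of how an R-U confidentiality map can be plotted follows; the (R, U) pairs correspond to a hypothetical noise-addition method evaluated at several noise levels, and all values are invented for illustration.

```python
# A minimal sketch of an R-U confidentiality map: each point is the (risk,
# utility) pair obtained for one parameterization of a masking method.
import matplotlib.pyplot as plt

noise_levels = [0.05, 0.1, 0.2, 0.5, 1.0]
risk = [0.80, 0.55, 0.35, 0.15, 0.05]      # R: disclosure risk (illustrative)
utility = [0.95, 0.90, 0.75, 0.50, 0.20]   # U: data utility (illustrative)

plt.plot(risk, utility, marker="o")
for level, r, u in zip(noise_levels, risk, utility):
    plt.annotate(f"noise={level}", (r, u))
plt.xlabel("Disclosure risk (R)")
plt.ylabel("Data utility (U)")
plt.title("R-U confidentiality map for a noise-addition method")
plt.show()
```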
2.8 SUMMARY
This chapter has presented a broad overview of disclosure risk limitation. We have identified the privacy threats (identity and/or attribute disclosure), and we have introduced the main families of SDC methods (data masking via perturbative and non-perturbative methods, as well as synthetic data generation). Also, we have surveyed disclosure risk and information loss metrics and we have discussed how risk and information loss can be traded off in view of finding the best SDC method and parameterization.