Database Anonymization - David Sánchez
CHAPTER 1
Introduction
The current social and economic context increasingly demands open data to improve planning, scientific research, market analysis, and so on. In particular, the public sector is pushed to release as much information as possible for the sake of transparency. Organizations releasing data include national statistical institutes (whose core mission is to publish statistical information), healthcare authorities (which occasionally release epidemiologic information), and even private organizations (which sometimes publish consumer surveys). When published data refer to individual respondents, care must be exercised so that the privacy of the latter is not violated: it should be de facto impossible to relate the published data to specific individuals. Indeed, supplying data to national statistical institutes is compulsory in most countries but, in return, these institutes commit to preserving the privacy of the respondents. Hence, rather than publishing accurate information for each individual, the aim should be to provide useful statistical information, that is, to preserve in the released data as much as possible of the statistical properties of the original data.
Disclosure risk limitation has a long tradition in official statistics, where privacy-preserving databases on individuals are called statistical databases. Inference control in statistical databases, also known as Statistical Disclosure Control (SDC), Statistical Disclosure Limitation (SDL), database anonymization, or database sanitization, is a discipline that seeks to protect data so that they can be published without revealing confidential information that can be linked to specific individuals among those to whom the data correspond.
Disclosure limitation has also been a topic of interest in the computer science research community, which refers to it as Privacy Preserving Data Publishing (PPDP) and Privacy Preserving Data Mining (PPDM). The latter focuses on protecting the privacy of the results of data mining tasks, whereas the former focuses on the publication of data of individuals.
Although SDC and PPDP pursue the same objective, they differ in emphasis: SDC proposes protection mechanisms that are more concerned with the utility of the data and offer only vague (i.e., ex post) privacy guarantees, whereas PPDP seeks to attain an ex ante privacy guarantee (by adhering to a privacy model) but offers no utility guarantees.
In this book we provide an exhaustive overview of the fundamentals of privacy in data releases, including the privacy models, anonymization/SDC methods, and utility and risk metrics that have been proposed so far in the literature. Moreover, as a more advanced topic, we discuss in detail the connections between several proposed privacy models (how to accumulate the guarantees offered by different privacy models to achieve more robust protection, and when such guarantees are equivalent or complementary). We also propose bridges between SDC methods and privacy models (i.e., how specific SDC methods can be used to satisfy specific privacy models and thereby offer ex ante privacy guarantees).
The book is organized as follows.
• Chapter 2 details the basic notions of privacy in data releases: types of data releases, privacy threats and metrics, and families of SDC methods.
• Chapter 3 offers a comprehensive overview of SDC methods, classified into perturbative and non-perturbative ones.
• Chapter 4 describes how disclosure risk can be empirically quantified via record linkage.
• Chapter 5 discusses the well-known k-anonymity privacy model, which is focused on preventing re-identification of individuals, and details which data protection mechanisms can be used to enforce it.
• Chapter 6 describes two extensions of k-anonymity (l-diversity and t-closeness) focused on offering protection against attribute disclosure.
• Chapter 7 presents in detail how t-closeness can be attained on top of k-anonymity by relying on data microaggregation (i.e., a specific SDC method based on data clustering).
• Chapter 8 describes the differential privacy model, which mainly focuses on providing sanitized answers with robust privacy guarantees to specific queries. We also explain SDC techniques that can be used to attain differential privacy, and discuss in detail the relationship between differential privacy and k-anonymity-based models (t-closeness, specifically).
• Chapters 9 and 10 present two state-of-the-art approaches to offer utility-preserving differentially private data releases by relying on the notion of k-anonymous data releases and on multivariate and univariate microaggregation, respectively.
• Chapter 11 summarizes general conclusions and introduces some topics for future research. More specific conclusions are given at the end of each chapter.
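To give a first, informal flavor of what an ex ante privacy model looks like in practice, the short Python sketch below checks the condition usually associated with k-anonymity: every combination of quasi-identifier values must be shared by at least k records. It is only an illustration under stated assumptions, not the method developed in later chapters; the function name `k_anonymity_level`, the toy table, and the attribute names `age_range`, `zip_prefix`, and `diagnosis` are all hypothetical.

```python
from collections import Counter


def k_anonymity_level(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    values on the given quasi-identifier attributes.

    A table is k-anonymous (with respect to those attributes) for every k
    up to the returned value. An empty table yields 0.
    """
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical toy table: "age_range" and "zip_prefix" are assumed to be the
# quasi-identifiers; "diagnosis" plays the role of the confidential attribute.
toy_table = [
    {"age_range": "30-39", "zip_prefix": "430", "diagnosis": "flu"},
    {"age_range": "30-39", "zip_prefix": "430", "diagnosis": "asthma"},
    {"age_range": "40-49", "zip_prefix": "431", "diagnosis": "flu"},
    {"age_range": "40-49", "zip_prefix": "431", "diagnosis": "diabetes"},
    {"age_range": "40-49", "zip_prefix": "431", "diagnosis": "flu"},
]

# Smallest group size over the assumed quasi-identifiers.
level = k_anonymity_level(toy_table, ["age_range", "zip_prefix"])
print(f"The toy table is k-anonymous for k <= {level}")
```

In this toy table the smallest group of records sharing the same quasi-identifier values has two members, so the table would satisfy 2-anonymity but not 3-anonymity. Which attributes count as quasi-identifiers is a modeling decision that the sketch simply takes as input.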