Administrative Records for Survey Methodology
2.2 Paradigms of Protection
There are no methods for disclosure limitation and confidentiality protection specifically designed for linked data. Protecting data constructed by linking administrative records, survey responses, and “found” transaction records relies on the same methods as might be applied to each source individually. It is the richness inherent in the linkages, and in the administrative information available to some potential intruders, that poses novel challenges.
Statistical confidentiality can be viewed as “a body of principles, concepts, and procedures that permit confidentiality to be afforded to data, while still permitting its use for statistical purposes” (Duncan, Elliot, and Salazar-González 2011, p. 2). In order to protect the confidentiality of the data they collect, national statistical offices (NSOs) and survey organizations (henceforth referred to generically as data custodians) employ many methods. Very often, data are released to the public as tabular summaries. Many of the protection mechanisms in use today evolved to protect published tables against disclosure. Generically, the idea is to limit the publication of cells with “too few” respondents, where the notion of “too few” is assessed heuristically.
We will not provide a detailed history or taxonomy of statistical disclosure limitation (SDL) and formal privacy models; instead, we refer the reader to other publications on the topic (Duncan, Elliot, and Salazar-González 2011; Dwork and Roth 2014; FCSM 2005). We do need to set up the problem, which we will do by reviewing suppression, coarsening, swapping, and noise infusion (input and output). These are widely used techniques, and the main issues that arise in applications to linked data can be understood with reference to these methods.
Suppression is widely used to protect published tables against statistical disclosure. Suppression describes the removal of sub-tables, cells, or items in a cell from a published collection of tables if the item’s publication would pose a high risk of disclosure. This method attempts to forge a middle ground between the users of tabular summaries, who want increasingly detailed disaggregation, and publication rules based on cell count thresholds. The Bureau of Labor Statistics (BLS) uses suppression as its primary SDL technique for data releases based on business establishment censuses and surveys. From the outset, it was understood that primary suppression – not publishing easily identified data items – did not protect anything if the agency published the rest of the data, including summary statistics. Users could infer the missing items from what was published (Fellegi 1972). The BLS, and other agencies that rely on suppression, make “complementary suppressions” to reduce the probability that a user can infer the sensitive items from the published data (Holan et al. 2010). But there is no optimal complementary suppression technology – there are usually multiple complementary suppression strategies that achieve the same protection.
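The inference problem described above can be made concrete with a minimal sketch. The table, the threshold of 3, and the function names below are hypothetical and chosen only for illustration; they do not reproduce any agency's actual rules. The sketch applies primary suppression to a small table of counts and then shows how a user who also sees the row totals can recover the suppressed cell by subtraction.

```python
# Hypothetical table of respondent counts (rows: regions, columns: industries).
# The threshold is an assumed rule of thumb, not an actual agency standard.
THRESHOLD = 3

table = [
    [12, 2, 9],
    [7, 15, 4],
]


def primary_suppress(table, threshold):
    """Replace every cell below the threshold with None (suppressed)."""
    return [[c if c >= threshold else None for c in row] for row in table]


def recover_from_row_totals(published, row_totals):
    """Fill in any row with exactly one suppressed cell by subtraction."""
    recovered = [row[:] for row in published]
    for i, row in enumerate(recovered):
        missing = [j for j, c in enumerate(row) if c is None]
        if len(missing) == 1:
            known = sum(c for c in row if c is not None)
            row[missing[0]] = row_totals[i] - known
    return recovered


# The custodian publishes the suppressed table together with the row totals.
row_totals = [sum(row) for row in table]
published = primary_suppress(table, THRESHOLD)

# A user combines the two publications and undoes the suppression.
recovered = recover_from_row_totals(published, row_totals)
```

In this sketch the single suppressed cell is fully recovered, which is why a second cell in the same row (and, if column totals are also published, in the same column) would have to be suppressed as well. Which additional cells to remove is exactly the choice among complementary suppression strategies discussed in the text.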
Researchers, however, are not indifferent to these strategies. A researcher who needs detailed geographic variation will benefit from data in which the complementary suppressions are based on removing detailed industries. A researcher who needs detailed industry variation will prefer data with complementary suppression based on geography. Ultimately, the committee that chooses the complementary suppression strategy will determine which research uses are possible and which are ruled out.
But the problem is deeper than this: suppression is a very ineffective SDL technique. Researchers working with the cooperation of the BLS have shown that the suppression strategy used in major BLS business data publications provides almost no protection if it is applied, as is currently the case, to each data release separately (Holan et al. 2010). Some agencies may use cumulative suppression strategies in their sequential data releases. In this case, once an item has been designated for either primary or complementary suppression, it would disappear from the release tables until the entire product is redesigned.
Many social scientists believe that suppression can be complemented by restricted access agreements that allow the researcher to use all of the confidential data but limit what can be published from the analysis. Such a strategy is not a complete solution because SDL must still be applied to the output of the analysis, which quickly brings the problem of which output to suppress back to the forefront.
Custom tabulations and data enclaves. Another traditional response by data custodians to researchers' demand for more extensive and detailed summaries of confidential data was to create a custom tabulation: a table not previously published, but generated by data custodian staff with access rights to the confidential data, and typically subject to the same suppression rules. As these requests increased, the tabulation and analysis work was offloaded onto researchers by providing them with access to protected microdata. This approach has expanded rapidly in the last two decades, and is widely used around the world. We discuss it in detail later in this chapter.
Coarsening is a method for protecting data that involves mapping confidential values into broader categories. The simplest method is a histogram, which maps values into (fixed) intervals. Intuitively, the broader the interval, the more protection is provided.
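As a rough illustration of the histogram case, the sketch below maps hypothetical income values into fixed-width intervals of two different widths. The data, the interval widths, and the helper name `coarsen` are assumptions made for the example.

```python
from collections import Counter


def coarsen(value, width):
    """Map a value to the half-open interval [lo, lo + width) containing it."""
    lo = (value // width) * width
    return (lo, lo + width)


# Hypothetical confidential incomes (illustrative values only).
incomes = [18_500, 21_000, 22_750, 47_300, 49_900, 52_000, 58_000]

# Narrow intervals keep more detail but leave one value alone in its cell.
narrow = Counter(coarsen(v, 10_000) for v in incomes)

# Broader intervals pool more respondents into each published cell.
wide = Counter(coarsen(v, 50_000) for v in incomes)
```

With the narrow intervals, the smallest cell holds a single respondent; with the broad ones, every cell holds at least two, at the cost of losing most of the detail in the distribution.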
Sampling is a protection mechanism that can be applied either at the collection stage or at the data publication stage. At the collection stage, it is a natural part of conducting surveys. In combination with coarsening and the use of statistical weights, the basic idea is simple: if a table cell is based on only a few sampled individuals who collectively represent the underlying population, then statistical inference will not reveal the attributes of any particular individual with any precision, as long as the identity of the sampled individuals is not revealed. Both coarsening and sampling underlie the release of public use microdata samples.
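The role of weights can be sketched as follows, assuming a simple random sample drawn without replacement from a hypothetical population of 1,000 records. The population, the sample size, the seed, and the function name are all illustrative assumptions; real public use samples typically rely on more complex designs.

```python
import random


def sample_with_weights(population, n, seed=0):
    """Draw a simple random sample and attach the design weight N / n."""
    rng = random.Random(seed)
    weight = len(population) / n
    return [(unit, weight) for unit in rng.sample(population, n)]


# Hypothetical confidential population: 750 of 1,000 units are employed.
population = [{"id": i, "employed": i % 4 != 0} for i in range(1000)]

sample = sample_with_weights(population, 50)

# The released records drop the identifier and keep only attribute and weight.
released = [{"employed": u["employed"], "weight": w} for u, w in sample]

# A weighted total estimates the population count of employed units.
estimated_employed = sum(r["weight"] for r in released if r["employed"])
```

Each released record stands in for 20 population units, so a user can estimate population totals without knowing which 50 units were drawn. The protection depends on the identifiers staying out of the released file.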