Administrative Records for Survey Methodology (group of authors)

2.4.4 Disclosure Avoidance Methods


Data enclaves exist to allow researchers to perform analyses within a restricted environment and then extract or publish some form of statistical summary that can be released from that environment. These summaries are usually estimates from a statistical model. Model-based output is generally evaluated against the same criteria traditionally used for tabular output (minimum number of units within a reporting cell, minimum percentage of global activity within a reporting cell). In contrast to licensing arrangements, which allow researchers to self-monitor, statistical data enclaves have regimented output monitoring, typically by staff of the data provider. Released statistical outputs are generally registered in some fashion, but documentation of the full provenance chain may be limited.

No systematic attempt has been made, to our knowledge, to measure formally the cumulative privacy impact of model-based releases because the science and technology for doing so are rudimentary. Remote processing facilities that use automated mechanisms, on the other hand, rely on several practices to reduce the risk of disclosure. First, they limit the scope of possible analyses to those for which the agency has developed safe procedures. Second, the number of times a researcher may request releases may also be limited. Nevertheless, most agencies recognize that this review system does not scale, because a full accounting of all possible query combinations over time is infeasible. In general, they apply basic disclosure avoidance techniques such as suppression, perturbation, masking, recoding, and bootstrap sampling of the input data to each project separately. Some systems apply automated analysis of log and output files (Schouten and Cigrang 2003), although a manual review is often included as well (O’Keefe et al. 2013). Some systems provide for self-monitored release of model results, either under licensing or remote access. In those cases, there are also limitations on the quantity and frequency of self-released results, combined with sampling by human reviewers. More sophisticated tools, such as perturbation or synthesizing of estimated model parameters, have been proposed (Reiter 2003). Finally, such systems require review of the draft research paper before submission to any publication medium, including online preprint repositories such as arXiv.org.
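Three of the basic techniques named above (suppression, recoding, and perturbation) can be illustrated on a small frequency table. The following is a minimal sketch, not any agency's actual procedure: the function names, the threshold of 3, and the rounding base of 5 are hypothetical choices made for illustration, and deterministic rounding stands in here for the random perturbation that production systems may use instead.

```python
def top_code(value, cap):
    """Recoding: collapse all values above `cap` into the cap itself."""
    return min(value, cap)


def protect_counts(table, threshold=3, base=5):
    """Apply primary suppression, then rounding, to a table of cell counts.

    `table` maps a cell label to an integer count. Cells with fewer than
    `threshold` units are suppressed (returned as None). All other counts
    are rounded to the nearest multiple of `base` using integer arithmetic.
    Both parameter defaults are illustrative assumptions.
    """
    protected = {}
    for cell, count in table.items():
        if count < threshold:
            protected[cell] = None  # suppressed: too few units
        else:
            protected[cell] = base * ((count + base // 2) // base)
    return protected


# Hypothetical cell counts for three reporting cells.
example_table = {"A": 1, "B": 7, "C": 13}
protected_table = protect_counts(example_table)
```

Note that primary suppression alone does not protect a cell whose value can be recovered by subtraction from published marginal totals, which is one reason such rules are typically combined with other techniques and with human review.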

All three examples of linked data provided in this paper rely on some version of secure data enclaves to provide microdata access to approved researchers. HRS data are made available to tenure-track researchers who sign a data use agreement and provide documentation of a secure local computing environment. An additional option for HRS data is to visit the Michigan Center on the Demography of Aging data enclave, which makes data accessible to researchers in a physical data enclave at “headquarters,” like many NSOs. More recently, HRS has started to offer secure VDI access to researchers. The confidential data underlying the SSB, against which validation requests are run, are also available either within the FSRDC network or by sending validation requests by email to staff at Census headquarters (a form of “remote processing”). LEHD microdata are available only through the FSRDC.

An open question is whether the disclosure risks addressed through physical security measures are greater for linked data. Enabling researchers to compute heuristic disclosure risk measures, such as the n cell count or the p-percent rule (O’Keefe et al. 2013), becomes more important when any possible combination of k variables (k large) leads to small cells or dominated cells. Even subject matter experts cannot assess these situations a priori.
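To make the combinatorial problem concrete, the sketch below enumerates every combination of k classification variables and flags cells that fail either heuristic. It is an illustration under stated assumptions rather than a reference implementation: the record fields (`state`, `industry`, `payroll`), the thresholds `n_min=3` and `p=10`, and the function names are all hypothetical. The p-percent rule is taken in its common form, where a cell is sensitive if the contributions other than the two largest sum to less than p percent of the largest contribution.

```python
from collections import defaultdict
from itertools import combinations


def p_percent_sensitive(values, p=10.0):
    """True if the cell total minus its two largest contributions
    is less than p percent of the largest contribution."""
    ordered = sorted(values, reverse=True)
    largest = ordered[0] if ordered else 0
    remainder = sum(ordered[2:])
    return remainder < (p / 100.0) * largest


def risky_cells(records, variables, value_key, k, n_min=3, p=10.0):
    """Tabulate every k-way combination of `variables` and return a list of
    (combination, cell_key, reason) for cells that are small or dominated."""
    findings = []
    for combo in combinations(variables, k):
        cells = defaultdict(list)
        for rec in records:
            cells[tuple(rec[v] for v in combo)].append(rec[value_key])
        for key, values in sorted(cells.items()):
            if len(values) < n_min:
                findings.append((combo, key, "small"))
            elif p_percent_sensitive(values, p):
                findings.append((combo, key, "dominated"))
    return findings


# Hypothetical establishment records, invented for illustration only.
records = [
    {"state": "NY", "industry": "retail", "payroll": 100},
    {"state": "NY", "industry": "retail", "payroll": 120},
    {"state": "NY", "industry": "retail", "payroll": 90},
    {"state": "NY", "industry": "mining", "payroll": 5000},
    {"state": "CA", "industry": "retail", "payroll": 1000},
    {"state": "CA", "industry": "retail", "payroll": 10},
    {"state": "CA", "industry": "retail", "payroll": 5},
    {"state": "CA", "industry": "mining", "payroll": 20},
]

two_way = risky_cells(records, ["state", "industry"], "payroll", k=2)
```

With m candidate variables there are m-choose-k tabulations to check, each containing many cells, which is why an exhaustive check like this one quickly becomes expensive as m and k grow and why a priori judgment by subject matter experts is unreliable.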

