2.2.1 Input Noise Infusion

Protection mechanisms for microdata are often similar in spirit, though not in their details, to the methods employed for tabular data. Consider coarsening, in which the more detailed response to a question (say, about income) is classified into a much smaller set of bins (for instance, income categories such as “[10 000; 25 000]”). In fact, many tables can be viewed as a coarsening of the underlying microdata, with a subsequent count of the coarsened cases.
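
To fix ideas, the short Python sketch below coarsens a handful of income responses and then tabulates them; the bin edges, labels, and use of pandas are purely illustrative and do not reflect any agency's actual classification.

# A minimal sketch of coarsening followed by tabulation (illustrative bins only).
import pandas as pd

incomes = pd.Series([8_500, 14_200, 23_900, 61_000, 112_000])

# Map detailed responses into a small set of categories; the edges below
# are chosen for illustration, not drawn from any agency's classification.
bins = [0, 10_000, 25_000, 50_000, 100_000, float("inf")]
labels = ["<10k", "[10k; 25k]", "[25k; 50k]", "[50k; 100k]", ">100k"]
coarsened = pd.cut(incomes, bins=bins, labels=labels, right=False)

# A frequency table of the coarsened variable is exactly the kind of
# tabulation described in the text: a coarsening followed by a count.
print(coarsened.value_counts(sort=False))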

Many microdata methods are based on input noise infusion: distorting the value of some or all of the inputs before any publication data are built. The Census Bureau uses this technique before building publication tables for many of its business establishment products and in the American Community Survey (ACS) publications, and we will discuss it in more detail for one of those data products later in this chapter. The noise infusion parameters can be set such that all of the published statistics are formally unbiased – the expected value of the published statistic equals the value of the confidential statistic with respect to the probability distribution of the infused noise – or nearly so. Hence, the disclosure risk and data quality can be conveniently summarized by two parameters: one measuring the absolute distortion in the data inputs and the other measuring the mean squared error of publication statistics (either overall for censuses or relative to the undistorted survey estimates).
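
The following Python sketch illustrates the logic with multiplicative, mean-one noise factors, so that a published total is unbiased in expectation. The distortion level and the uniform noise distribution are assumptions chosen for exposition, not the parameters of any production system.

# A hedged sketch of multiplicative input noise infusion.
import numpy as np

rng = np.random.default_rng(42)
payroll = rng.lognormal(mean=10, sigma=1, size=1_000)   # confidential inputs

delta = 0.10                                            # illustrative absolute distortion
# Noise factors drawn uniformly from [1 - delta, 1 + delta]; E[factor] = 1,
# so the expected value of the noisy total equals the confidential total.
factors = rng.uniform(1 - delta, 1 + delta, size=payroll.size)
noisy_payroll = payroll * factors

true_total = payroll.sum()
published_total = noisy_payroll.sum()

# The two summary parameters described in the text:
print("absolute distortion per record:", delta)
print("relative error of the published total:",
      abs(published_total - true_total) / true_total)

Averaged over many draws of the noise, the published total equals the confidential total, even though every individual record is distorted by up to ten percent.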

From the viewpoint of empirical social sciences, however, not all input distortion systems with the same risk-quality parameters are equivalent. In a regression discontinuity design, for example, there will now be a window around the break point in the running variable that reflects the uncertainty associated with the noise infusion. If the effect is not large enough, it will be swamped by noise even though all the inputs to the analysis are unbiased, or nearly so. Once again, using the unmodified confidential data via a restricted access agreement does not completely solve the problem: once the noisy data have been published, the agency must consider the consequences of allowing a clean regression discontinuity estimate to be released, since a plot of the unprotected outcomes against the running variable can be compared to the corresponding plot produced from the public noisy data.
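
The toy simulation below, with purely illustrative parameters, shows the mechanism: when noise is infused into the running variable, units near the cutoff can land on the wrong side of it, so a small discontinuity is attenuated and easily swamped.

# A toy regression discontinuity with noise infused into the running variable.
import numpy as np

rng = np.random.default_rng(0)
n, cutoff, jump = 10_000, 0.0, 0.05          # hypothetical small effect

x = rng.uniform(-1, 1, n)                    # confidential running variable
y = 0.5 * x + jump * (x >= cutoff) + rng.normal(0, 0.1, n)

x_noisy = x + rng.normal(0, 0.05, n)         # infused noise, sd chosen for illustration

def naive_rd(run, out, h=0.05):
    # Difference in mean outcomes just above vs. just below the cutoff.
    above = out[(run >= cutoff) & (run < cutoff + h)]
    below = out[(run < cutoff) & (run >= cutoff - h)]
    return above.mean() - below.mean()

print("estimate with clean running variable:", naive_rd(x, y))
print("estimate with noisy running variable:", naive_rd(x_noisy, y))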

An even more invasive input noise technique is data swapping. Sensitive data records (usually households) are identified based on a priori criteria. Then, sensitive records are compared to “nearby” records on the basis of a few variables. If there is a match, the values of some or all of the other variables are swapped (usually the geographic identifiers, thus effectively relocating the records in each other’s location). The formal theory of data swapping was developed shortly after the theory of primary/complementary suppression (Dalenius and Reiss 1982, first presented at the American Statistical Association (ASA) Meetings in 1978). Basically, the marginal distribution of the variables used to match the records is preserved at the cost of distorting all joint and conditional distributions involving the swapped variables. In general, very little is published about the swapping rates, the matching variables, or the definition of “nearby,” making analysis of the effects of this protection method very difficult. Furthermore, even arrangements that permit restricted access to the confidential files still require the use of the swapped data; some providers destroy the unswapped data. Data swapping is used by the Census Bureau, NCHS, and many other agencies (FCSM 2005). The Census Bureau does not allow analysis of the unswapped decennial and ACS data except under extraordinary circumstances that usually involve the preparation of linked data from outside sources, followed by reimposition of the original swap (so the records acquire the correct linked information, but the geographies are swapped according to the original algorithm before any analysis is performed). NCHS allows the use of unswapped data in its restricted access environment but prohibits publication of results for most subnational geographies when the research is published.
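
The stylized sketch below walks through the mechanics of a single swap in Python; the sensitivity rule, the matching variables, and the choice of swap partner are all hypothetical, since, as noted above, agencies publish very little about their production choices.

# A stylized sketch of data swapping (hypothetical rules throughout).
import pandas as pd

df = pd.DataFrame({
    "household_id": [1, 2, 3, 4],
    "county":       ["A", "A", "B", "B"],
    "size":         [2, 6, 6, 3],
    "income_bin":   ["mid", "high", "high", "low"],
})

# 1. Flag sensitive records by an a priori rule (here, unusually large households).
sensitive = df[df["size"] >= 6]

swapped = set()
for i in sensitive.index:
    if i in swapped:
        continue
    row = df.loc[i]
    # 2. Look for a "nearby" record in another county that matches on a
    #    few variables (here, household size and income bin).
    partners = df[(df["size"] == row["size"]) &
                  (df["income_bin"] == row["income_bin"]) &
                  (df["county"] != row["county"]) &
                  ~df.index.isin(swapped | {i})]
    if not partners.empty:
        j = partners.index[0]
        # 3. Swap the geographic identifiers, effectively relocating each
        #    record into the other's county; the matched variables are untouched.
        df.loc[i, "county"], df.loc[j, "county"] = df.loc[j, "county"], df.loc[i, "county"]
        swapped.update({i, j})

print(df)

Because the partners match on size and income bin, the county totals for those variables are unchanged, but any relationship between county and the remaining attributes of the swapped households is distorted, which is exactly the trade-off described above.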

The basic problem for empirical social scientists is that agencies must have a general-purpose data publication strategy in order to provide the public good that is the reason for incurring the cost of data collection in the first place. But this publication strategy inherently advantages certain analyses over others. Statisticians and computer scientists have developed two related ways to address this problem: synthetic data combined with validation servers, and privacy-protected query systems. Statisticians define “synthetic data” as samples drawn from the joint probability distribution of the confidential data, which are released for analysis in place of the original records. After the researcher analyzes the synthetic data, the validation server is used to repeat some or all of the analyses on the underlying confidential data. Conventional SDL methods are used to protect the statistics released from the validation server.
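
The minimal sketch below illustrates this workflow with a deliberately simple synthesizer (a bivariate normal fit) standing in for the far richer models used in practice; the analyst's statistic and all parameters are chosen only for exposition.

# A minimal sketch of the synthetic data / validation workflow.
import numpy as np

rng = np.random.default_rng(7)

# "Confidential" microdata held by the agency: two correlated variables.
confidential = rng.multivariate_normal([0.0, 0.0],
                                       [[1.0, 0.6], [0.6, 1.0]], size=5_000)

# Agency step: estimate the joint distribution and release samples from it.
mu_hat = confidential.mean(axis=0)
sigma_hat = np.cov(confidential, rowvar=False)
synthetic = rng.multivariate_normal(mu_hat, sigma_hat, size=5_000)

def slope(data):
    # The analyst's statistic: an OLS slope of the second column on the first.
    x, y = data[:, 0], data[:, 1]
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# The researcher first analyzes the synthetic file ...
print("slope on synthetic data:   ", slope(synthetic))
# ... and a validation run later repeats the analysis on the confidential
# file, with conventional SDL applied before the result is released.
print("slope on confidential data:", slope(confidential))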
