Administrative Records for Survey Methodology (multiple authors)
2.3.2.3 Disclosure Avoidance Assessment
Linking administrative earnings and benefits data to the SIPP adds a significant amount of information to an already very detailed survey and could pose disclosure risks beyond those originally managed as part of the regular SIPP public use file disclosure avoidance process. The synthesis of the earnings data meets the IRS disclosure officer’s criteria for properly protecting the federal tax information found in the summary and detailed earnings histories used to create the longitudinal earnings variables.
The Census Bureau Disclosure Review Board at the time of release used two standards for disclosure avoidance in partially synthetic data. First, using the best available matching technology, the percentage of true matches relative to the size of the files should not be excessively large. Second, the ratio of true matches to the total number of matches (true and false) should be close to one-half.
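The two standards can be read as two simple ratios. The following is a minimal sketch, not the Disclosure Review Board's actual procedure: the function name `disclosure_metrics`, the pair format, and the toy numbers are all assumptions made for illustration. A declared match counts as true when the synthetic and confidential identifiers coincide.

```python
def disclosure_metrics(declared_matches, n_records):
    """Compute the two ratios described in the text.

    declared_matches: list of (confidential_id, synthetic_id) pairs that a
        matcher declared to be links (hypothetical format).
    n_records: size of the file being attacked.
    """
    true_matches = sum(1 for conf_id, syn_id in declared_matches if conf_id == syn_id)
    total_matches = len(declared_matches)
    # Standard 1: true matches relative to the size of the file.
    true_match_rate = true_matches / n_records
    # Standard 2: true matches relative to all declared matches (true and false).
    true_share = true_matches / total_matches if total_matches else 0.0
    return true_match_rate, true_share


# Toy example: four declared links in a file of 100 records, two of them correct.
pairs = [(1, 1), (2, 5), (3, 3), (4, 9)]
rate, share = disclosure_metrics(pairs, n_records=100)
print(rate, share)
```

In this toy case the intruder recovers 2% of the file, and exactly half of the declared links are correct.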
The disclosure avoidance analysis (Abowd, Stinson, and Benedetto 2006) uses the principle that a potential intruder would first try to reidentify the source record for a given synthetic data observation in the existing SIPP public use files. Two distinct matching exercises – one probabilistic (Fellegi and Sunter 1969), one distance-based (Torra, Abowd, and Domingo-Ferrer 2006) – were conducted between the synthetic data and the harmonized confidential data.⁴ The harmonized confidential data – actual values of the data items as released in the original SIPP public use files – are the equivalent of the best available information for an intruder attempting to reidentify a record in the synthetic data. Successful matches between the harmonized confidential data and the synthetic data represent potential disclosure risks. In practice, the intruder would also need to make another successful link to exogenous data files that contain direct identifiers such as names, addresses, and telephone numbers. The results from the experiments are therefore conservative estimates of reidentification risk. For the probabilistic matching, the assessment matched the synthetic and confidential files exactly on the unsynthesized variables of gender and marital status, and the success of the matching exercise was assessed using a person identifier (personid) that is not, in fact, available in the released version of the synthetic data. Without the personid, an intruder would have to compare many more record pairs to find true matches, would not find any more true matches (the true match is guaranteed to be in the blocks being compared), and would almost certainly find more false matches. In fact, the records that can be reidentified represent only a very small proportion (less than 3%) of candidate records, and correct reidentifications are swamped by a sea of false reidentifications (Abowd, Stinson, and Benedetto 2006, p. 6).
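To make the probabilistic exercise concrete, here is a heavily simplified sketch in the spirit of Fellegi and Sunter (1969). It is not the matching software used in the assessment. The comparison fields, the m- and u-probabilities, the single threshold, and the toy records are all hypothetical. The full Fellegi-Sunter rule uses two thresholds with an undecided region between them. Records are compared only within blocks defined by exact agreement on gender and marital status, and the personid is used only afterwards to score the links.

```python
import math

# Hypothetical agreement probabilities per comparison field:
# m = P(fields agree | same person), u = P(fields agree | different persons).
M_U = {"birth_year": (0.9, 0.05), "education": (0.8, 0.2)}


def fs_weight(a, b, m_u=M_U):
    """Sum of log2 likelihood-ratio weights over the comparison fields."""
    weight = 0.0
    for field, (m, u) in m_u.items():
        if a[field] == b[field]:
            weight += math.log2(m / u)              # agreement weight
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement weight
    return weight


def link_within_blocks(confidential, synthetic, block_keys, threshold):
    """Declare a link for every within-block pair whose weight reaches the threshold."""
    links = []
    for c in confidential:
        for s in synthetic:
            same_block = all(c[k] == s[k] for k in block_keys)
            if same_block and fs_weight(c, s) >= threshold:
                links.append((c["personid"], s["personid"]))
    return links


def rec(pid, gender, marital, birth_year, education):
    return {"personid": pid, "gender": gender, "marital": marital,
            "birth_year": birth_year, "education": education}


confidential = [rec(1, "F", "M", 1960, "HS"), rec(2, "F", "M", 1960, "BA"),
                rec(3, "M", "S", 1975, "BA")]
synthetic = [rec(1, "F", "M", 1960, "BA"), rec(2, "F", "M", 1960, "BA"),
             rec(3, "M", "S", 1975, "BA")]

links = link_within_blocks(confidential, synthetic, ["gender", "marital"], threshold=5.0)
# The personid is used only to score the result, as in the assessment.
true_links = [pair for pair in links if pair[0] == pair[1]]
print(links, len(true_links))
```

In this toy data, synthesis changed the education value of person 1. That person's true link falls below the threshold, while person 2 picks up one false link alongside the true one.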
In distance-based matching, records in the harmonized confidential and synthetic data are blocked in a similar way, and distances (or similarity scores) are computed between a given confidential record and every synthetic record within its block. The three closest records are declared matches, and the personid is again checked to verify how often a true match is obtained. A putative intruder who treated the closest record as a match would correctly link about 1% of all synthetic records, and less than 3% in the worst-case subgroup (Abowd, Stinson, and Benedetto 2006, p. 8).
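The distance-based procedure can be sketched as follows. This is an illustration under stated assumptions rather than the method of Torra, Abowd, and Domingo-Ferrer (2006): it uses plain Euclidean distance on two unscaled numeric variables, and the function name `nearest_in_block`, the variable names, and the records are hypothetical.

```python
import math
from collections import defaultdict


def nearest_in_block(confidential, synthetic, block_keys, numeric_keys, k=3):
    """For each confidential record, return the personids of the k closest
    synthetic records within the same block, nearest first."""
    blocks = defaultdict(list)
    for s in synthetic:
        blocks[tuple(s[b] for b in block_keys)].append(s)

    result = {}
    for c in confidential:
        candidates = blocks.get(tuple(c[b] for b in block_keys), [])
        c_point = [c[v] for v in numeric_keys]
        ranked = sorted(
            candidates,
            key=lambda s: math.dist(c_point, [s[v] for v in numeric_keys]),
        )
        result[c["personid"]] = [s["personid"] for s in ranked[:k]]
    return result


def rec(pid, gender, marital, earnings, age):
    return {"personid": pid, "gender": gender, "marital": marital,
            "earnings": earnings, "age": age}


confidential = [rec(1, "F", "M", 30.0, 40), rec(2, "F", "M", 55.0, 52),
                rec(3, "M", "S", 20.0, 25)]
synthetic = [rec(1, "F", "M", 50.0, 50), rec(2, "F", "M", 33.0, 41),
             rec(3, "M", "S", 22.0, 27)]

matches = nearest_in_block(confidential, synthetic,
                           ["gender", "marital"], ["earnings", "age"])
# Score with the personid: how often is the single closest record the true one?
top1_hits = sum(1 for pid, ranked in matches.items() if ranked and ranked[0] == pid)
# How often is the true record among the k declared matches?
topk_hits = sum(1 for pid, ranked in matches.items() if pid in ranked)
print(matches, top1_hits, topk_hits)
```

With real data the numeric variables would normally be standardized before distances are computed, so that a large-scale variable such as earnings does not dominate. That step is omitted here for brevity.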
Figure 2.1 Probability density function of the ramp distribution used in the LEHD disclosure avoidance system.