Administrative Records for Survey Methodology
1.2.2 Measurement
Consider now the measurement side in Figure 1.1. Relevance error refers to the discrepancy between the target measure, which may be a theoretical construct, and the measure that is achievable based on the available data. In a widespread scenario for combining register and survey data, the survey variable is treated as the target measure and the register proxy as an auxiliary variable, which can be used either to adjust the survey sampling weights or to build a prediction model of the survey variable.
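As a rough illustration of the first use (adjusting survey weights with a register proxy), the sketch below computes a post-stratified estimate of a survey total. It is only one of several possible weight adjustments and is not taken from the text; the function name `poststratified_total`, the class labels, and all figures are hypothetical.

```python
from collections import defaultdict


def poststratified_total(sample, register_counts):
    """Post-stratified estimate of the population total of a survey variable.

    sample: list of (proxy_class, y, design_weight) tuples for sampled units,
            where proxy_class is the register proxy and y the survey variable.
    register_counts: dict mapping proxy_class to the number of population
            units in that class according to the register.
    """
    w_sum = defaultdict(float)   # weighted sample count per proxy class
    wy_sum = defaultdict(float)  # weighted sum of y per proxy class
    for proxy, y, w in sample:
        w_sum[proxy] += w
        wy_sum[proxy] += w * y

    total = 0.0
    for proxy, n_pop in register_counts.items():
        if w_sum[proxy] == 0:
            raise ValueError(f"no sampled units in proxy class {proxy!r}")
        # Weighted class mean of y, scaled up to the register count of the class.
        total += n_pop * wy_sum[proxy] / w_sum[proxy]
    return total


# Hypothetical data: y = 1 if employed according to the survey, 0 otherwise;
# the proxy class is the register-based status.
sample = [
    ("emp", 1, 10.0), ("emp", 1, 10.0), ("emp", 0, 10.0),
    ("unemp", 0, 10.0), ("unemp", 1, 10.0),
]
register_counts = {"emp": 60, "unemp": 40}
estimate = poststratified_total(sample, register_counts)
```

Here the survey variable remains the target measure; the register proxy only determines the classes within which the design weights are rescaled.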
Sometimes, however, all the available measures entail relevance error, regardless of the source of the data, and there does not exist a way in which they can be combined to derive the target measure directly. For instance, Meijer, Rohwedder, and Wansbeek (2012) adopt such a viewpoint and study earnings data from register and survey sources using a mixture model approach, whereas Pavlopoulos and Vermunt (2015) apply latent class models to analyze income-based labor market mobility. It is also possible to formulate an adjusted measure as the solution of an appropriately defined constrained optimization problem, without explicitly introducing a model that spells out the relationship between the true measure and the observed proxy measures. For instance, Mushkudiani, Daalmans, and Pannekoek (2014) apply such an approach to aggregated Census tables and to the turnover variable from different sources.
Mapping error due to reclassification of input register data is highly common, since a register proxy variable often arises by means of reclassification. For instance, inferring the mother tongue from the birth country is a reclassification of the input variable birth country to the outcome variable mother tongue. As another example, to classify someone receiving unemployment benefit as unemployed is to reclassify the input variable "receives benefit or not" to the outcome variable "unemployed or not". Such examples are numerous.
It is worth noting that mapping error may be caused by delays or mistakes in the administrative sources, even where reclassification has no conceptual difficulties. Register data may be progressive in the sense that the observations for a particular reference time point may differ depending on when the observations are compiled. Following Zhang and Fosen (2012) and Zhang and Pritchard (2013), let t be the reference time point of interest and t + d the measurement time point, for d ≥ 0. Let U(t) and y(t) be the target population and value at t, respectively. For a unit i, let I_i(t; t + d) = 1 if it is to be included in the target population and 0 otherwise, based on the register data available at t + d, and let y_i(t; t + d) be the observed value for t at t + d. The data are said to be progressive if, for d ≠ d′ > 0, one can have I_i(t; t + d) ≠ I_i(t; t + d′) and y_i(t; t + d) ≠ y_i(t; t + d′). Progressiveness is a distinct feature of register data compared to survey data.
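The definition of progressiveness can be made concrete by comparing two extracts of the same register that refer to the same reference time t but were compiled at different measurement times. The sketch below is illustrative only: the representation of an extract as a dictionary from unit identifier to observed value (with absence meaning that the inclusion indicator is 0) and the function name `progressive_units` are assumptions, not part of the cited work.

```python
def progressive_units(extract_a, extract_b):
    """Units whose inclusion or observed value for reference time t differs
    between two register extracts compiled at t + d and t + d'.

    Each extract maps unit id -> observed value for the units included in the
    target population at that compilation time. A unit missing from an
    extract is treated as excluded (inclusion indicator 0).
    """
    changed = {}
    for unit in sorted(set(extract_a) | set(extract_b)):
        a = extract_a.get(unit)  # None: unit not included in this extract
        b = extract_b.get(unit)
        if a != b:
            changed[unit] = (a, b)
    return changed


# Hypothetical extracts for the same reference time t.
at_t_plus_1 = {1: "employed", 2: "unemployed"}
at_t_plus_6 = {1: "employed", 2: "employed", 3: "employed"}
diff = progressive_units(at_t_plus_1, at_t_plus_6)
# Unit 2 changed value; unit 3 entered the population only in the later extract.
```

A non-empty result indicates that the data for t are progressive between the two compilation times.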
The observed proxy measures may need to be adjusted in order to satisfy micro- as well as macro-level constraints, so as to resolve incompatibility across the data sources. For instance, register data from corporate tax returns may be used to impute missing items in the Structural Business Survey. If this results in numerical inconsistency with the items observed from the survey, then re-imputation or adjustment of some of the items will be necessary in order to produce a clean and coherent dataset. See, e.g., Pannekoek, Shlomo, and De Waal (2013) and Pannekoek and Zhang (2015) for relevant instances.
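To illustrate what a micro-level adjustment can look like, the sketch below handles a single additive constraint: a set of items must sum to a known total, the items observed in the survey are kept fixed, and only the imputed items are changed. It uses a plain least-squares criterion, for which the minimal change is an equal shift of all adjustable items. This is a simplified stand-in chosen for illustration, not the procedure of the papers cited above; the function name `adjust_to_total` and the item names are hypothetical.

```python
def adjust_to_total(items, adjustable, total):
    """Adjust the items named in `adjustable` so that all items sum to `total`.

    Minimizing the sum of squared changes subject to one additive constraint
    spreads the discrepancy equally over the adjustable items. Items not in
    `adjustable` (e.g. those observed in the survey) are left untouched.
    """
    free = [name for name in items if name in adjustable]
    if not free:
        raise ValueError("no adjustable items to absorb the discrepancy")
    gap = total - sum(items.values())
    shift = gap / len(free)
    return {
        name: value + shift if name in adjustable else value
        for name, value in items.items()
    }


# Hypothetical record: 'wages' observed in the survey, the rest imputed
# from tax returns; total costs of 100.0 are observed in the survey.
record = {"wages": 50.0, "materials": 30.0, "other": 30.0}
adjusted = adjust_to_total(record, {"materials", "other"}, total=100.0)
```

An equal shift can produce negative values when the discrepancy is large relative to the adjustable items, so a practical procedure would also need inequality constraints; that is outside the scope of this sketch.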
Imposing macro-level survey estimates as benchmarks, when micro-adjusting a register proxy variable, can be regarded as a means to achieve statistical relevance at the level where the unbiased benchmarks are introduced (Zhang and Giusti 2016), though one is unable to remove the relevance bias at the micro-level. The Norwegian register-based employment status provides an example of such uses of proxy variables. Initially, the register proxy variable is rule-processed based on several input administrative registers, covering employee benefit, self-employment, tax, military or civilian service, leave of absence, etc. This results in a tripartition of the target population: (I) the compatible part, where the register data are compatible across the sources and allow for unequivocal reclassification accordingly; (II) the resolved part, where reclassification can be determined after making room for administrative regulations and progressiveness of the data; and (III) the unsolved part, where register data are either lacking or incompatible, beyond what can be rule-processed. The Labor Force Survey (LFS) estimate of the yearly total of employed persons is then introduced to define an income threshold in the different subsets of part (III), whereby everyone above the threshold is reclassified as employed, such that the register total of employed persons coincides with the LFS estimate. As shown by Fosen and Zhang (2011), the resulting adjusted register proxy variable entails smaller mean squared error at the municipality level, compared to the survey estimates where the register proxy is used as an auxiliary variable.
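The benchmarking step for the unsolved part can be sketched as follows. The sketch assumes a single subset of the unsolved part, a benchmark already rounded to a whole number of persons, and the convention that a unit is classified as employed when its income is at or above the threshold. The function names and all figures are hypothetical, and the actual production rules are not described in enough detail here to reproduce.

```python
import math


def benchmark_threshold(incomes, n_needed):
    """Income threshold such that the n_needed highest incomes are at or above it."""
    if not 0 <= n_needed <= len(incomes):
        raise ValueError("benchmark cannot be met within this subset")
    if n_needed == 0:
        return math.inf  # nobody in the subset is reclassified as employed
    return sorted(incomes, reverse=True)[n_needed - 1]


def reclassify(incomes, threshold):
    """Employment flag per unit: True if income is at or above the threshold."""
    return [income >= threshold for income in incomes]


# Hypothetical figures.
employed_resolved = 950          # employed persons in the compatible and resolved parts
lfs_total = 953                  # LFS benchmark, rounded to whole persons
incomes_unsolved = [0, 120, 40, 300, 15, 210]

n_needed = lfs_total - employed_resolved
threshold = benchmark_threshold(incomes_unsolved, n_needed)
flags = reclassify(incomes_unsolved, threshold)
register_total = employed_resolved + sum(flags)
```

With tied incomes at the threshold, more than `n_needed` units would be flagged, so an exact match to the benchmark would require a tie-breaking rule.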