Administrative Records for Survey Methodology - Group of authors - Page 22

1.3.1 Asymmetric Setting


The two most common approaches under the asymmetric-linked setting are survey weighting and prediction modeling, where the register proxy variable is used as an auxiliary variable or a covariate. See, e.g. Särndal, Swensson, and Wretman (1992) for the design-based approach to survey weighting that makes use of auxiliary variables; Valliant, Dorfman, and Royall (2000) and Chambers and Clark (2012) for the model-based approach to finite population prediction; and Rao and Molina (2015) for relevant methods of small area estimation. We make two observations. Firstly, when the overlapping survey variable is deemed necessary despite the presence of a register proxy, the latter is typically the most powerful of all the auxiliary variables when it comes to weighting adjustment and regression modeling. See, e.g. Djerf (1997) and Thomsen and Zhang (2001) for the use of register economic activity status in the LFS, and its effects on reducing sampling and nonresponse errors. Secondly, applications to remedy Representation errors are much less common. See, however, e.g. survey weighting under dependent sampling for the estimation of coverage errors (Nirel and Glickman 2009), mixed-effects models for assessing register coverage errors (Mancini and Toti 2014), and misclassification models for register NACE (Van Delden et al. 2016) and register households (Zhang 2011).

The nature of a proxy variable implies a special use that is beyond what is feasible with a non-proxy auxiliary variable, no matter how good an auxiliary the latter is: under suitable conditions, it is possible to substitute (or replace) the target measure by the proxy value. However, substitution can only be acceptable for a subset of the units, not for all of them, since had it been acceptable for all the units, one would have had "direct tabulation" instead.

It follows that adjustment, or imputation in the case of a rejected value, will be necessary. Macro-level survey estimates can be imposed as benchmarks to achieve statistical relevance at the corresponding level. Linked datasets are typically not necessary here – recall the Norwegian register-based employment status described earlier. This yields many methods under what may be referred to as the benchmarked adjustment approach for combining register and survey proxy variables under the asymmetric-unlinked setting.

Repeated weighting and constrained (mass) imputation are two common approaches to benchmarked adjustment; see e.g. de Waal (2016) for a discussion. Repeated weighting is a technique initially presented for sample reweighting in the presence of overlapping survey estimates (Renssen and Nieuwenbroek 1997). It has been used for the reconciliation of Dutch virtual census output tables (Houbiers 2004). But it can equally be applied to adjust register datasets, so that afterwards, e.g. the weighted register proxy total agrees with the imposed target totals. This does not require linking the register datasets and the external datasets from which the benchmark totals are obtained. An inconvenience arises when there are multiple proxy variables to be benchmarked and the variables are available for different subsets of units. This may be the case due to partial missing data in a single register file, or when merging multiple register files. Some imputation will then be necessary if one would like to have a single set of weights for the whole dataset.
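The basic idea of reweighting a register dataset to agree with imposed benchmark totals can be sketched with simple raking; this is only a crude illustration, not the repeated-weighting estimator of Renssen and Nieuwenbroek (1997), and all data and totals below are invented.

```python
import numpy as np

def rake(z, totals, w0=None, tol=1e-8, max_iter=500):
    """Iteratively scale the weights so that, for each 0/1 indicator
    column z[:, j], the weighted total w @ z[:, j] equals totals[j]."""
    w = np.ones(z.shape[0]) if w0 is None else np.asarray(w0, float).copy()
    for _ in range(max_iter):
        for j in range(z.shape[1]):
            w[z[:, j] == 1] *= totals[j] / (w @ z[:, j])
        if np.allclose(z.T @ w, totals, atol=tol):
            break
    return w

# Five register records; column 0 flags membership of the population,
# column 1 the register proxy "employed".
z = np.array([[1, 1], [1, 1], [1, 1], [1, 0], [1, 0]])
totals = np.array([5.0, 2.4])   # known N and a survey-based employed total
w = rake(z, totals)             # initial weights all 1, as in the text
```

After convergence, the weighted register proxy total matches the imposed benchmark without any record linkage between the register and the survey.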

The one-number census imputation provides an example of the alternative imputation-based benchmarked adjustment methods (Brown et al. 1999). In the case of multiple proxy variables observed on different subsets of units, imputation is applied not only to the units with partially missing data, but also to the units with no observed variables at all, or possibly the units with completely observed data. The result is a complete dataset that guarantees numerical consistency for any tabulation across the variables and population domains. Constrained imputation for population datasets is discussed by, e.g. Shlomo, de Waal, and Pannekoek (2009) and Zhang (2009a). Methods that incorporate micro-data edit constraints are studied by, e.g. Coutinho, de Waal, and Shlomo (2013), Pannekoek, Shlomo, and de Waal (2013), and Pannekoek and Zhang (2015). Chambers and Ren (2004) consider a method of benchmarked outlier robust imputation. Obviously, it may be difficult to generate a single population dataset that is fit for all possible statistical uses. de Waal (2016) discusses the use of "repeated imputation." Notice that there are many relevant works on the generation of benchmarked synthetic populations in Spatial Demography, Econometrics, and Sociology.
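The numerical-consistency property can be illustrated in miniature: values are imputed for the incomplete records in such a way that the completed category counts reproduce the imposed benchmark counts exactly. This sketch is not the one-number census methodology; it ignores donors, covariates, and edit constraints entirely, and all names and numbers are made up.

```python
import random
from collections import Counter

def constrained_impute(observed, n_missing, bench, seed=1):
    """Assign a category to each of the n_missing records so that the
    completed counts per category equal the benchmark counts exactly."""
    slots = Counter(bench)
    slots.subtract(Counter(observed))      # remaining slots per category
    if any(v < 0 for v in slots.values()) or sum(slots.values()) != n_missing:
        raise ValueError("benchmarks incompatible with observed data")
    pool = [c for c, k in slots.items() for _ in range(k)]
    random.Random(seed).shuffle(pool)      # random allocation of the slots
    return pool

observed = ["low", "low", "mid", "high"]   # records with income class seen
bench = {"low": 3, "mid": 3, "high": 2}    # benchmark counts, total 8
imputed = constrained_impute(observed, 4, bench)
```

Any tabulation of the completed dataset now agrees with the benchmarks by construction, which is the point of constrained (mass) imputation.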

The distinction between weighting and imputation can be somewhat blurred when it comes to the adjustment of cross-classified proxy contingency tables, because an adjusted cell count is just the number of individuals with the corresponding cross-classification one would have in an imputed dataset. Take, e.g. a two-way table, where the rows represent population domains at some detailed level, say, by local area and sex-age group, and the columns a composition of interest, say, income class. Let X denote the table based on combining population and tax register data. Let Y(r) denote the known vector of population domain sizes, and let Ŷ(c) denote the survey-based estimates of the population totals by income class, which are the row and column benchmarks of the target table Y, respectively. Starting with X and by means of iterative proportional fitting (IPF) until convergence, one may obtain a table that sums to Y(r) and Ŷ(c) marginally. The technique has many applications including small area estimation (Purcell and Kish 1980) and statistical matching (D'Orazio, Di Zio, and Scanu 2006; Zhang 2015a) – more in Section 1.3.2.
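The IPF adjustment just described can be written out in a few lines. The toy starting table and benchmarks below are invented, with rows as population domains and columns as income classes; the row and column targets play the roles of the known domain sizes and the survey-based marginal estimates, respectively.

```python
import numpy as np

def ipf(X, row_bench, col_bench, tol=1e-10, max_iter=1000):
    """Rescale X alternately by rows and columns until its row sums
    equal row_bench and its column sums equal col_bench."""
    Y = np.asarray(X, float).copy()
    for _ in range(max_iter):
        Y *= (row_bench / Y.sum(axis=1))[:, None]   # fit the row margins
        Y *= col_bench / Y.sum(axis=0)              # fit the column margins
        if np.allclose(Y.sum(axis=1), row_bench, atol=tol):
            break
    return Y

X = np.array([[30., 20., 10.],     # register-based starting table
              [15., 25., 20.]])
Yr = np.array([70., 50.])          # known domain sizes (row benchmarks)
Yc = np.array([45., 45., 30.])     # survey-based income-class totals
Y = ipf(X, Yr, Yc)
```

Note that the two sets of benchmarks must agree on the grand total (here 120) for the iteration to converge to a table satisfying both margins.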

A key difference between the asymmetric-linked setting and the asymmetric-unlinked setting discussed above is that one generally does not expect a benchmarked adjustment method based on unlinked data to yield unbiased results below the level at which the benchmarks are imposed. For instance, the repeated weighting of Renssen and Nieuwenbroek (1997) can yield design-consistent domain estimates subject to population benchmark totals, because the overlapping survey variables are both considered to be the target measure there and no relevance bias is admitted. However, when the same technique is applied to reweight a register dataset, e.g. with the initial weights all set to 1, one cannot generally claim design- or model-based consistency below the level of the imposed benchmarks, regardless of whether the benchmarks themselves are true or unbiased from either the design- or model-based perspective. Similarly, under suitable assumptions, the one-number census imputation can yield model-consistent estimates below the level of the imposed constraints, because the donor records are taken from the enumerated census records, which are considered to provide the target measures. However, the model consistency would fall apart if the donor pool were a register dataset that suffers from relevance bias, even if all the other "suitable" assumptions were retained. Assessment of the statistical uncertainty associated with benchmarked adjustment is therefore an important research topic. An illustration in the contingency table case is given in Section 1.3.2.

