Administrative Records for Survey Methodology
1.3.3 Symmetric Setting
In the symmetric setting none of the proxy variables is ideal due to errors of relevance, measurement, or coverage. The two most common approaches under the symmetric-linked setting are capture–recapture methodology for population size estimation and Structural Equation Modeling (SEM) that covers the latent class models mentioned earlier.
Capture–recapture methods, which originate from wildlife, social, and medical applications, are traditionally used for under-count adjustment. Imagine catching fish in a pond on two separate occasions, where one marks and identifies the fish that happen to be caught on both occasions (i.e. the recaptures). Then, under a number of simplifying assumptions, including independent and constant-probability captures, it becomes possible to estimate the total number of fish in the pond (i.e. the target population), for which the captures on each occasion generally entail undercounts. The method can be generalized to multiple captures, which allows the independence assumption to be relaxed. The capture probability can be modeled using covariates to allow for heterogeneity across different subpopulations. See, e.g. Böhning, Van der Heijden, and Bunge (2017), for some recent developments.
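The two-occasion fish example can be made concrete with a small sketch. It uses the classical Lincoln–Petersen estimator and Chapman's small-sample variant, which are standard textbook forms rather than anything prescribed in this section; the function names and the counts in the usage lines are hypothetical illustrations.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Two-occasion estimate of population size N.

    n1: units caught on the first occasion
    n2: units caught on the second occasion
    m:  units caught on both occasions (the recaptures)

    Assumes independent captures with constant capture probability
    on each occasion, and a closed population.
    """
    if m <= 0:
        raise ValueError("at least one recapture is required")
    if m > min(n1, n2):
        raise ValueError("recaptures cannot exceed either capture count")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's variant, which is less biased when m is small."""
    if m < 0 or m > min(n1, n2):
        raise ValueError("recaptures must lie between 0 and min(n1, n2)")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts: 200 fish on occasion 1, 150 on occasion 2, 30 on both.
n_hat = lincoln_petersen(200, 150, 30)
n_hat_chapman = chapman(200, 150, 30)
```

Each occasion on its own undercounts the pond (200 and 150 fish), while the overlap of 30 is what allows the total to be estimated. Relaxing independence or allowing heterogeneous capture probabilities requires more than two captures or covariate models, which this sketch does not attempt.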
Combining survey and register-based enumerations for population size estimation has attracted growing interest in recent years, under the assumption that none of the sources can yield the true target population enumeration directly. We refer to the Journal of Official Statistics (2015, vol. 31, issue 3) for several useful references in this regard. There is plenty of scope for developing a range of models in order to address the different problems, including erroneous enumerations that are not dealt with in the traditional capture–recapture methodology. The potential impact can be huge if it enables one to produce census-like population statistics without the traditional census.
SEM is often considered to have evolved from the genetic path modeling of Sewall Wright. See, e.g. Kline (2016) for a general introduction. The approach is popular in many social science disciplines that share a common interest in “latent constructs” such as intelligence, attitude, well-being, living standard, and so on. The postulated latent constructs cannot be measured directly and are only manifested through observable indicators. The SEM consists of two main components: the structural model showing potentially causal dependencies among the latent variables, and the measurement model relating the latent variables and their indicators. The approach can be referred to in different ways depending on the continuous-categorical nature of the variables involved, the presence of causality or stochastic process on the latent level, etc.
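The two components can be written compactly. The following is a sketch in one common linear notation (not taken from this section), in which the symbols are assumptions: \eta denotes the vector of latent variables, y the vector of observed indicators, B the matrix of structural coefficients, \Lambda the matrix of loadings, and \zeta and \varepsilon the error terms.

```latex
\begin{aligned}
\eta &= B\,\eta + \zeta            && \text{(structural model: dependencies among latent variables)} \\
y    &= \Lambda\,\eta + \varepsilon && \text{(measurement model: indicators of the latent variables)}
\end{aligned}
```

When the latent and observed variables are categorical rather than continuous, the linear measurement equation is replaced by conditional response probabilities, which leads to the latent class models mentioned earlier.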
The SEM approach is applicable under the symmetric-linked setting, where the proxy variables are treated as the indicators of the unobserved target measure. In the context of combining register and survey data, this can serve a number of purposes, including assessing potential relevance bias of proxy measures, detection and possible treatment of measurement errors in editing and estimation, and statistical analysis of latent relationships using proxy indicators. For examples of data types that have been studied recently, see e.g. Pavlopoulos and Vermunt (2015) for temporary employment, Guarnera and Varriale (2015) for labor cost, and Burger et al. (2015) for turnover.
Di Cecco et al. (2018) apply latent class models for population size estimation based on multiple register enumerations that entail both over- and under-counts. It is intriguing to notice the connection with some recent developments in record linkage. Imagine K lists of records, where each record may or may not refer to a target population unit (i.e. latent entity). Provided the union of the lists entails only over-counts of the target population, a potential alternative approach is record linkage, also referred to as entity resolution or co-reference – see e.g. Steorts, Hall, and Fienberg (2015). The records in the same list that refer to the same entity represent duplicated enumerations; the records in the different lists that refer to the same entity can be conceived as the target for record linkage. The errors in compiling the population total are then the potential de-duplication and record linkage errors, which are traditionally the topics of computerized record linkage.
Multiple macro-level proxy totals may need to be reconciled under the symmetric-unlinked setting. A typical example is multiple time series with different frequencies, e.g. with register-based yearly figures and survey-based sub-annual figures. Another example is the Supply-and-Use Tables for the production of GDP, where the initial estimates generally do not balance out because they are derived from different sources, or when the GDP is compiled using different approaches. Census output tables derived from fragmented data sources instead of a one-number file are yet another example. See e.g. Bikker, Daalmans, and Mushkudiani (2013) and Mushkudiani, Daalmans, and Pannekoek (2014, 2015).
Reconciliation is often achieved as the solution to a constrained optimization problem. The approach requires the specification of two components. First, a loss function is defined to measure the changes from the initial proxy estimates to the final reconciled estimates. Second, the constraints that the final estimates must satisfy need to be explicitly stated; these may include both equality and inequality constraints. Minimizing the loss function subject to the constraints then yields the final estimates. The approach is feasible without linked data across the sources. Notice that there are many advanced techniques of constrained optimization in Applied Mathematics, Engineering, and Computer Science.
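As a minimal sketch of this idea, assume a weighted quadratic loss and linear equality constraints only (inequality constraints would need a general-purpose solver). Under those assumptions the reconciled estimates have a closed form. The function name `reconcile`, the `variances` weighting, and the quarterly and annual figures are hypothetical illustrations, not a method from the cited papers.

```python
import numpy as np


def reconcile(x0, A, b, variances=None):
    """Minimize sum((x - x0)**2 / variances) subject to A @ x = b.

    x0:        initial proxy estimates
    A, b:      linear equality constraints the final estimates must satisfy
    variances: relative uncertainty of each initial estimate; larger values
               let that estimate absorb more of the adjustment
               (defaults to equal weights)
    """
    x0 = np.asarray(x0, dtype=float)
    A = np.atleast_2d(np.asarray(A, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    v = np.ones_like(x0) if variances is None else np.asarray(variances, dtype=float)

    VA_t = v[:, None] * A.T                       # V A', with V = diag(variances)
    lam = np.linalg.solve(A @ VA_t, A @ x0 - b)   # Lagrange multipliers
    return x0 - VA_t @ lam


# Hypothetical example: survey-based quarterly figures that must add up
# to a register-based annual figure of 100.
quarterly = [24.0, 26.0, 25.0, 27.0]
A = np.ones((1, 4))
equal_weights = reconcile(quarterly, A, [100.0])
proportional = reconcile(quarterly, A, [100.0], variances=quarterly)
```

With equal weights the discrepancy of 2 is spread evenly over the four quarters; with variances proportional to the initial figures the adjustment becomes proportional (pro rata). The choice of loss function therefore determines how the inconsistency is distributed among the sources.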
Mushkudiani, Pannekoek, and Zhang (2016) develop scalar uncertainty measures for macro accounts to replace, say, the entire variance–covariance matrix of all the estimates involved. Devising simple summary statistical uncertainty measures for an accounting system such as the System of National Accounts can be helpful in at least two respects: (i) it can inform the choice among alternative adjustment methods that seem equally viable to start with, and (ii) it can identify and assess the changes, or potential improvements, that are most effective in terms of the final estimated account directly. Implementation of the approach to the System of National Accounts is currently under development.