Administrative Records for Survey Methodology

1.2.1 Representation


We start with the Representation side in Figure 1.1, which concerns the target population and units. Let us consider coverage error first. For instance, one may have a Population Register that is not sufficiently accurate to allow for direct tabulation of census-like population counts at detailed aggregation levels, so that Population Coverage Surveys are carried out in order to obtain the desired population estimates. The Population Register and Coverage Survey enumerations are proxies of the true population enumeration. This was the situation in Switzerland in 2000 (Renaud 2007) and Israel in 2008 (Nirel and Glickman 2009). Other instances may involve one or several register enumerations, a Census enumeration and a Census Coverage Survey enumeration. Capture–recapture methodology is a commonly used estimation approach that combines two or more proxy enumerations subject to under-counts (Fienberg 1972; Wolter 1986; Hogan 1993). Adjustment for erroneous over-counts has attracted increasing attention recently, in situations where one does not have a Population Register and over-coverage errors are found to be large in the available register enumerations (ONS 2013). See, e.g. Zhang (2015b) for an extension of the capture–recapture modeling approach, Zhang and Dunne (2017) for trimmed dual-system estimation, and Di Cecco et al. (2018) for a latent class modeling approach.
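As an illustration of the capture–recapture idea with two enumerations, the following is a minimal sketch of the classical Lincoln-Petersen form of the dual-system estimator. It is not taken from the references cited above. It assumes that the two enumerations are independent, that capture probabilities are homogeneous, that there are no erroneous over-counts and that records are matched without linkage error. The function name and the toy identifiers are hypothetical.

```python
def dual_system_estimate(n_first, n_second, n_both):
    """Lincoln-Petersen estimate of the population size.

    n_first  : number of units enumerated in the first list
    n_second : number of units enumerated in the second list
    n_both   : number of units found in both lists after matching
    """
    if n_both <= 0:
        raise ValueError("the two enumerations must overlap")
    return n_first * n_second / n_both


# Toy data (hypothetical): integers stand for persons.
register_enumeration = set(range(1, 9))        # persons 1..8
survey_enumeration = {6, 7, 8, 9, 10, 11}      # persons 6..11
matched = register_enumeration & survey_enumeration

estimate = dual_system_estimate(
    len(register_enumeration), len(survey_enumeration), len(matched)
)
print(estimate)
```

Both enumerations under-count the population in this toy example, and the estimate exceeds the number of distinct persons observed in either list. Over-coverage or linkage errors would violate the stated assumptions, which is what motivates the extensions cited above.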


Figure 1.1 Second phase of integrated statistical micro data.

Source: Zhang (2012).

All the methods mentioned above require the matching of records in separate sources. In reality, linkage errors may be unavoidable, unless a unique identifier exists in the different files and facilitates exact matching. A linkage error occurs if either a pair of linked records does not actually belong to the same entity, or records that belong to the same entity fail to be linked. Both multi-pass deterministic and probabilistic record linkage procedures are common in practice and are often used in tandem. See, e.g. Fellegi and Sunter (1969) and Herzog, Scheuren, and Winkler (2007). The records in different files are compared to each other in terms of key variables such as name, birth date, address, etc. One can regard a concatenated string of key variables as a proxy of the true identifier, insofar as the key variables involved could in principle lead to unique combinations. Distortion of the key variables would then result in erroneous proxy identifiers and potentially cause linkage errors. For population size estimation, see, e.g. Di Consiglio and Tuoto (2015) for a study of linkage errors and dual-system estimation, and Zhang and Dunne (2017) for a discussion regarding trimmed dual-system estimation. More generally, since record linkage is a prerequisite for combining multisource data at the individual level, the matter of linkage errors due to imperfect proxy identifiers can be relevant in many other situations.
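The role of the concatenated key variables as a proxy identifier can be sketched as follows. This is a toy single-pass deterministic linkage under stated assumptions, not the procedure of any of the cited authors. The field `true_id` is hypothetical and would not be available in practice; it is included only so that the two kinds of linkage error can be counted. All records are invented.

```python
KEY_VARIABLES = ("name", "birth_date", "address")


def proxy_identifier(record):
    """Concatenate lightly normalised key variables into a proxy identifier."""
    return "|".join(record[key].strip().lower() for key in KEY_VARIABLES)


def deterministic_link(file_a, file_b):
    """Link every pair of records whose proxy identifiers agree exactly."""
    index = {}
    for rec in file_b:
        index.setdefault(proxy_identifier(rec), []).append(rec["true_id"])
    return [
        (rec["true_id"], other_id)
        for rec in file_a
        for other_id in index.get(proxy_identifier(rec), [])
    ]


file_a = [
    {"true_id": 1, "name": "Anna Berg", "birth_date": "1980-01-02", "address": "1 Main St"},
    {"true_id": 2, "name": "Jon Smith", "birth_date": "1975-05-05", "address": "2 Oak Ave"},
    {"true_id": 3, "name": "Ola Hansen", "birth_date": "1990-03-03", "address": "3 Elm Rd"},
]
file_b = [
    {"true_id": 1, "name": "anna berg ", "birth_date": "1980-01-02", "address": "1 Main St"},
    # Distorted name: the proxy identifier no longer agrees with file_a.
    {"true_id": 2, "name": "John Smith", "birth_date": "1975-05-05", "address": "2 Oak Ave"},
    # A different entity that happens to share all key variables with entity 3.
    {"true_id": 4, "name": "Ola Hansen", "birth_date": "1990-03-03", "address": "3 Elm Rd"},
]

links = deterministic_link(file_a, file_b)

# Linked pairs that do not belong to the same entity.
false_links = [(a, b) for a, b in links if a != b]

# Entities present in both files that failed to be linked to themselves.
in_both = {r["true_id"] for r in file_a} & {r["true_id"] for r in file_b}
missed_links = sorted(in_both - {a for a, b in links if a == b})

print(links, false_links, missed_links)
```

Entity 2 is missed because a distorted key variable produces a different proxy identifier, while entities 3 and 4 are falsely linked because their key variables do not lead to a unique combination.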

In a frame that is constructed from combining multiple population datasets, one can often find several related classification variables. Identification errors (Zhang 2012) arise if the classification variables or the relationships between them are mistaken based on the input datasets. For instance, the variable address is central for population and household statistics. Multiple addresses can be collected by combining the Population Register with resident address, the Post Register with postal address, the Higher-Education Student Register with term-time address, the various Utility datasets with occupant address, etc. Each person may be assigned a unique de jure address based on all these sources, in a way that is judged to be most appropriate, which would then yield a proxy variable for the de facto address that is of interest in many socio-economic statistics.

The economic activity classification, e.g. NACE in Europe, is a well-known example in business statistics. The NACE code in the Business Register is generally a proxy of the target “pure” economic classification that has its root in the System of National Accounts. Several issues contribute to this, such as inconsistent operational rules of the Business Register, misreporting, lack of updating, etc. It is common in sample surveys to observe that for some units the NACE code based on the updated survey returns differs from the existing one in the Business Register. Such a domain classification error is a kind of identification error. See, e.g. Brion and Gros (2015) for an example of how the matter is dealt with in the French Structural Business Surveys, and Van Delden, Scholtus, and Burger (2016) for an analysis of the NACE-classification errors in the Dutch context.

For survey data, the statistical unit can be identified in fieldwork. Based on register data, however, it is sometimes necessary to construct a proxy for the statistical unit of interest, in which case unit errors may be unavoidable even if all the input data are error-free. For instance, consider the register-based household. Provided all dwellings (or addresses) in the Population Register are correct, one may define a dwelling household to consist of all the persons who de jure share the same dwelling. We do not consider such a dwelling household to be a constructed statistical unit, precisely because it can be obtained from error-free input data directly. The perfection is another way of saying that there are no identification errors. An example of a constructed unit in this context is the living household, which does not have to include everyone registered at the same dwelling nor be limited to these persons. An error in a constructed living household occurs if two persons in different living households are placed in the same constructed living household, or if two persons in the same living household are placed in different constructed living households.

Constructed or not, unit errors can arise either from lack of data or from errors in data. Zhang (2011) devises a mathematical representation of unit error. It is assumed that each statistical unit of interest can consist of one or several so-called base units, but never cuts across a base unit. For example, person can be the base unit for household. The mapping from the set of base units to the set of statistical units can then be specified in terms of an allocation matrix, where each element takes value 1 or 0 depending on whether or not the corresponding base unit (arranged by column) belongs to the statistical unit (arranged by row). In the case where a base unit can be assigned to one and only one statistical unit, as when a person can belong to only one household, every column sum of the allocation matrix is equal to 1. Zhang (2011) develops a unit error theory for household statistics. Although unit error is clearly one of the most fundamental difficulties in business statistics, a statistical theory has so far been lacking. This may be partly due to the prominence of the identification error mentioned above. Another important reason may simply be the lack of a commonly acknowledged choice of base unit in business statistics.
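The allocation matrix described above can be sketched in a few lines. This is only an illustration of the representation, not the unit error theory of Zhang (2011). The function names and the toy household assignments are hypothetical, and the sketch assumes that the rows of the true and the constructed matrices refer to households in a common order.

```python
def allocation_matrix(assignment, n_statistical_units):
    """Build a 0/1 allocation matrix from a base-unit assignment.

    assignment[j] is the row index of the statistical unit (household)
    to which base unit j (person) belongs. Rows are statistical units,
    columns are base units.
    """
    matrix = [[0] * len(assignment) for _ in range(n_statistical_units)]
    for base_unit, statistical_unit in enumerate(assignment):
        matrix[statistical_unit][base_unit] = 1
    return matrix


def column_sums(matrix):
    """Number of statistical units to which each base unit is allocated."""
    return [sum(column) for column in zip(*matrix)]


# Five persons, three households (toy data).
true_assignment = [0, 0, 1, 1, 2]
constructed_assignment = [0, 0, 0, 1, 2]   # the third person is misplaced

A_true = allocation_matrix(true_assignment, 3)
A_constructed = allocation_matrix(constructed_assignment, 3)

# Each person belongs to exactly one household, so every column sum is 1.
sums_true = column_sums(A_true)
sums_constructed = column_sums(A_constructed)

# Row sums give the household sizes implied by each allocation.
sizes_true = [sum(row) for row in A_true]
sizes_constructed = [sum(row) for row in A_constructed]

has_unit_error = A_true != A_constructed
print(sizes_true, sizes_constructed, has_unit_error)
```

Both matrices satisfy the column-sum condition, so the condition alone does not reveal the unit error. The error shows up only when the constructed allocation is compared with the true one, here as different household sizes.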

