Administrative Records for Survey Methodology
Preface
Sample surveys are used by governments to describe the populations of their countries and provide estimates for use in policy decision making. Surveys can focus on individuals, households, businesses, students and schools, patients and hospitals, plots of land, or other entities. For surveys to be useful for official purposes they must cover the target population, represent the entirety of the population, collect information on key variables with accurate measurement methods, and have large enough sample sizes so that estimates are sufficiently precise at national and subnational levels. Achieving these four goals in a nationwide sample survey conducted on a limited budget and within a short time interval is very challenging. The purpose of this book is to explore developments in the use of administrative records for improving sample surveys.
Sample surveys aim to gather information on a population. The target population is the specific part of the population that one aims to survey. Some parts of the broader population typically are excluded from the target population based on contact mode, data collection mode, the survey frame or list, or convenience. Individuals without a regular address, residing in some forms of group quarters, or without phone or Internet access, for example, might be effectively ineligible to serve as respondents. Survey frames record contact information and some other variables on members of a population, but of course they do not necessarily include all members of the population or have up-to-date information on everyone. Some individuals with accurate contact information in the frame will prove harder than others to contact, and some will refuse to participate. Surveys then are potentially limited to reporting about respondents and the population to which they are similar. Surveys cannot be overly long, or else they risk deterring potential respondents and raising the cost per respondent. As a result, surveys can accommodate only so many questions. For sensitive and complex items, self-report and less detailed questions, with their inherent limitations, often must be used for expediency. Budgets for national surveys compete with other government interests. Even large surveys typically have smaller-than-desired sample sizes in local areas and in subsets of the population. Despite these significant challenges, official statistical agencies around the world gather critically useful data on a myriad of topics.
The conditions for conducting sample surveys have changed immensely in the past 100 years. There is little chance that change will slow down. In-person surveys have been replaced or augmented by surveys by mail, by phone, and by Internet. Contact and data collection via multiple modes now are standard. The social environment, too, has evolved. Response rates are lower. Despite technological advances, people are increasingly busy. Official government surveys compete for attention with ever more marketing and polling. Concerns over privacy and confidentiality have been elevated, rightly so, in the public consciousness. Simultaneously, government, researchers, and the public want more from data and surveys. Official surveys contribute to identifying challenges and to improvements in society. It is not practical, and may not even be possible, to get more out of old ways of conducting surveys.
Administrative records in a general sense are records kept for administrative purposes of the government. Administrative records can pertain to almost all aspects of life, including taxes, wages, education, health, residence, voting, crime, and property and business ownership. Does an individual have a license for a dog, for fishing at public lakes, to drive a car or motorcycle, or to own a gun? Does an individual receive public assistance through a government program? Administrative records, essential for government operations, contain a wealth of information on large segments of the population, but there are limitations. The records contain information on only some variables for subsets of the overall population. Information is collected so that a government can execute its program, but not typically for other purposes. Additional variables that might be interesting for study purposes likely are not recorded. Methods of recording variables might not be those that would be used in a scientific study. Those included in an administrative data file are not a random sample from the population. Some administrative records are collected over the course of several months or years, instead of only during a short time interval.
The use of administrative records has been part of the survey process for many decades. Survey textbooks since at least the 1950s (Cochran 1977; Kish 1967; Hansen, Hurwitz, and Madow 1953; Särndal, Swensson, and Wretman 1992) present methods for using auxiliary variables. It typically is assumed that values of auxiliary variables are available for all members of the population without error, or at least that aggregate totals are known. The values might have come from a census, from a large survey at a previous time, or as part of the sample frame. Auxiliary variables are used for stratified surveys, probability proportional to size sampling, difference estimation, and ratio estimation. Often, they are treated in classic literature as known, fixed values.
Despite the limitations of administrative records, researchers, including the authors in this book, have been exploring how “adrecs” can be used to improve sample surveys in today’s world and build on the record of past successes. They have examined new possibilities for using administrative record information to address four goals (coverage, response, variables, and accuracy) of official surveys. Increasing timeliness and decreasing costs through use of administrative records also are of continuing interest.
The book is organized into four sections. The first section contains two chapters. Chapter 1, by Li-Chun Zhang, presents fundamental challenges and approaches to integrating survey and administrative data for statistical purposes. The chapter focuses on administrative data, also called register or registry data, as a source for proxy variables. The proxy variables obtained from administrative sources can, for example, enhance a survey by providing additional information, be used for quality assessment of responses, and provide substitutes for missing values. Chapter 2, by John Marion Abowd, Ian Schmutte, and Lars Vilhuber, addresses confidentiality protection and disclosure limitation in linked data. Linking data on population elements is an essential step for many uses of administrative records in conjunction with survey data. If individuals from a survey can be located uniquely in administrative records, then variables in those administrative records can be meaningfully associated with their originating units, thereby generating useful proxy variables. Data files from surveys, both those linked to administrative information and those not, are made available to researchers and policy analysts. In standard practice, values of personally identifying information, such as names, fine-level geographic information including addresses, birthdates, and identification numbers, are suppressed. A data file containing a rich set of variables for analysis, however, increases the chance that someone could identify a unique individual from the survey in the population based on the values for several variables. The concern is that such an identification violates legal promises of confidentiality, causes harm to individuals who view their survey responses and administrative information as sensitive, and endangers future survey operations. Chapter 2 describes three applications, traditional statistical disclosure limitation methods, and new developments. The chapter includes discussion of how researchers access data (access modalities) and the usefulness (analytic validity) of data made available after modification for enhanced disclosure limitation.
Section 2 groups together five chapters on data quality and record linkage. Chapter 3, by Piet Daas, Eric Schulte Nordholt, Martijn Tennekes, and Saskia Ossen, examines the quality of administrative data used in the Dutch virtual census. A challenge in assessing quality of a data source is having better information on some variables for at least a subset of the population. Coen Hendriks, in Chapter 4, reports on improving the quality of data going into Norwegian register-based statistics. In Chapter 5, William Winkler considers a wide range of topics, including initial cleaning of data files, record linkage, and integrated modeling, editing, and imputation. The impact of cleaning data files through standardizing variables, parsing variables such as addresses into separable components, and checking for logical errors cannot be overstated. Various approaches are in use for linking records from two files on the same population. Dr. Winkler reviews several enhancements, including variations in string comparator metrics and memory indexing, that have been put into practice at the U.S. Census Bureau. Jerry Reiter writes about assessing uncertainty when using administrative records in Chapter 6. Along with survey estimates, one typically needs to provide estimates of standard error. How do the quality of administrative records and the performance of the linkage to the survey impact the accuracy of estimates? Multiple imputation (Rubin 1986, 1987) could be one area for further exploration. In Chapter 7, Joseph Sakshaug addresses the specific question of measuring and controlling non-consent bias when surveys and administrative data are linked together. It is increasingly common for surveys that plan to link respondents to administrative data to ask for permission to do so. Some individuals refuse to give permission for linkage or cannot be linked for other reasons, such as refusing to provide information on key linkage variables. Those whose records are not linkable can be different in many ways from those whose records are. Bias due to non-consent to linkage and failed linkage is therefore a novel contributing factor to total survey error.
Section 3 contains three chapters on uses of administrative records in surveys and official statistics. Chapter 8, by Ingegerd Jansson, Martin Axelson, Anders Holmberg, Peter Werner, and Sara Westling, describes experiences in the first Swedish register-based census of the population. In a register-based census, the population is counted and characteristics are gathered directly from administrative records, which, in this case, are referred to as population registers. Chapter 9, by Vincent Tom Mule and Andrew Keller of the U.S. Census Bureau, presents research on administrative records applications for the U.S. 2020 Decennial Census of the population. In the U.S., there is no universal population register and the census involves enumerating and gathering basic information on every person in the country. Administrative records have been used to improve the data gathering process in the past. This chapter describes expanded options for improved design, quality and accuracy assessment, and dealing with missing information. Chapter 10, by Andrea Erciulescu, Carolina Franco, and Partha Lahiri, concerns methods for improving small area estimation using administrative records. Surveys are designed to provide accurate estimates at a national or large subnational level, but not typically for small geographic areas or groups. Small area estimation uses models that provide a rationale for borrowing strength across small areas for local estimation. The methodology relies on an advantageous bias–variance trade-off and estimation admissibility ideas (e.g. Efron and Morris 1975). Administrative records can provide key variables for use in such models.
Section 4 looks beyond statistical methodology for use of administrative records with surveys and provides three chapters about using administrative data in evidence-based policymaking. The applications are in health, economics, and education. Chapter 11, by Cordell Golden and Lisa Mirel, focuses on enhancement of health surveys at the U.S. National Center for Health Statistics through data linkage. Chapter 12, by Bruce Meyer and Nikolas Mittag, concerns economic policy analysis, with an emphasis on using administrative records to improve income measurements. Chapter 13, by Peter Siegel, Darryl Creel, and James Chromy, discusses combining data from multiple sources in the context of education studies.
The book is intended for a diverse audience. It should provide insight into developments in many areas and in many countries for those conducting surveys and their partners who manage and seek to improve administrative records. Several articles present theory as well as application and advice based on practical experience. Many chapters in the book include exercises for reflection on the material presented. The book could be of interest to students of statistics, survey sampling and methodology, and quantitative applications in government. Certainly, the book will have useful chapters for a variety of courses.
Data science has emerged as a term for the integration of statistics, mathematics, and computing in the effort to solve complex problems. Administrative records along with large-scale sample surveys provide a setting for some of the best applications in data science. We hope this book will motivate those in the data science community to learn about survey sampling, official statistics, and a rich body of work aiming to utilize administrative records for sample surveys and survey methodology.
23 May 2020
Asaph Young Chun
Statistics Research Institute
Statistics Korea, Republic of Korea
Michael D. Larsen
Department of Mathematics and Statistics
Saint Michael’s College, United States
Gabriele Durrant
Department of Social Statistics and Demography
Southampton University, UK
Jerome P. Reiter
Department of Statistical Science
Duke University, United States