Handbook of Web Surveys - Jelke Bethlehem

1.2.5 WEB SURVEYS AND OTHER SOURCES

The current digital environment and technology trends provide a huge amount of data about most phenomena. These data are available on the web, and they are often free of charge unless protected for privacy reasons.

Examples of data available in digital format are credit card transactions, tax data, social chatting, telephone use (call details: time, location, length of the call, etc.), social security payments, GPS data, and videos. Using this type of data for statistical purposes is both appealing and challenging. The term big data is currently used to characterize data with high volume, velocity, and variety. There is a debate on the definition of big data and on their use for statistical purposes.

Roughly speaking, big data are based on the automatic collection of everything that people do; they are not subject to statistical classification criteria or to statistical treatment for representativity. Administrative data, i.e., information that is collected for registering units (people, businesses, sales, and so on) in an activity process, might also be included in big data.

Practitioners and researchers are now wondering whether big data could substitute for web surveys in providing information for social and economic decision making. For a discussion, see Couper (2013).

It should be emphasized that big data and web surveys are complementary data sources, not competing ones. The availability of big data to support research provides a new way to approach old questions, as well as the ability to address some new questions that were not considered in the past.

Web surveys may be run as stand‐alone surveys. However, source integration is a major trend for web surveys over the next 10 years.

This is a new area of research aimed at three goals: (1) Minimize the cost associated with surveys. (2) Maximize the information: findings based on big data generate more questions, and some of those questions could be best addressed by web surveys or other traditional survey methods. Moreover, information from one source could be useful for improving the estimates obtained from a survey. (3) Minimize the respondent burden. Integration alleviates the burden of duplicating data gathering efforts and enables the extraction of information that would otherwise be impossible to obtain.

Therefore, it is necessary to work toward using big data and integrating them with survey results. Experimental applications should be approached with the characteristics, nature, and limitations of big data as statistical sources in mind, along with the methodological soundness of the survey results.

At present, market research and private/public businesses have great interest in trying to use big data to investigate markets and individual behavior. The use of these data as an exploratory source is the most plausible application, whereas using them for statistical purposes and integrating them with web surveys still requires a lot of effort on definitions, classifications, and methodological problems of estimation.

Official statistics producers are investigating how to use other big data sources and how to produce estimates in a multisource framework. Some experiments have already been undertaken, with contrasting results. The most successful applications consist of integrating web survey data with administrative data (which many authors consider a type of big data). Administrative data have been used:

To generate a survey frame or to supplement/update an existing frame. When surveys are run on the web, administrative data integration could help in applying adaptive survey design (see Chapter 8) to improve the data collection process. An ultimate task could be the replacement of data collection (e.g., use of taxation data for small businesses instead of seeking survey data from them). In this case, however, data replacement would concern only some basic data; surveys will remain useful for collecting data about specific topics and behaviors. For example, think of all the surveys with a focus on consumption. If big data assets provide insights on consumption via passive observation, primary research via surveys will not have to collect this type of information, and it finally becomes possible to deliver on the vision of shorter surveys, instead of big data simply providing complementary data to the desired information. Surveys can be short and focused on those variables that they are ideally suited for, resulting in better data quality.

In editing and imputation.

In estimation (e.g., as auxiliary information in calibration estimation, benchmarking, or calendarization).

In comparing survey estimates with estimates from a related administrative program, as well as in other forms of survey evaluation.
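The use of administrative data as auxiliary information in calibration can be illustrated with a minimal sketch. It assumes a hypothetical web survey with design weights and a single categorical auxiliary variable (here called `group`) whose population totals are known from an administrative register; all names and figures are invented for illustration. With one categorical auxiliary variable, calibration reduces to post‐stratification: the weights in each group are rescaled so that they sum to the register total for that group.

```python
from collections import defaultdict


def poststratify(records, register_totals):
    """Rescale design weights so that, within each group, the calibrated
    weights sum to the known register total for that group."""
    # Sum of design weights per group in the survey sample.
    weight_sums = defaultdict(float)
    for rec in records:
        weight_sums[rec["group"]] += rec["weight"]

    calibrated = []
    for rec in records:
        factor = register_totals[rec["group"]] / weight_sums[rec["group"]]
        calibrated.append({**rec, "cal_weight": rec["weight"] * factor})
    return calibrated


# Hypothetical respondents: design weight and a 0/1 study variable y.
survey = [
    {"group": "young", "weight": 100.0, "y": 1},
    {"group": "young", "weight": 100.0, "y": 0},
    {"group": "old", "weight": 150.0, "y": 1},
    {"group": "old", "weight": 150.0, "y": 1},
]
# Hypothetical population counts taken from an administrative register.
register_totals = {"young": 300.0, "old": 150.0}

calibrated = poststratify(survey, register_totals)
# Calibrated estimate of the population total of y.
estimated_total = sum(r["cal_weight"] * r["y"] for r in calibrated)
print(estimated_total)
```

With several auxiliary variables, the same idea is usually implemented through raking or generalized regression estimation rather than a single rescaling step.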

When using a multisource approach in web surveys, several aspects should be considered.

The first is the heterogeneous nature of the sources with respect to the following basic characteristics: the aggregation level, the unit, the variables, the coverage, the time, the population, and the data type:

The aggregation level: some data sources consist of only microdata, others consist of a mix of microdata and aggregated data, and still others consist of only aggregated data. In some cases, aggregated data are available in addition to microdata. There is still overlap between the sources, which gives rise to the need to reconcile the statistics at some aggregated level. Of particular interest is the case in which the aggregated data are themselves estimates. Otherwise, the reconciliation can be achieved by means of calibration, a standard approach in survey sampling.

As regards the units, sometimes the data sources have no overlapping units, or only some units overlap. Likewise, as regards the variables, the data sources may have no overlapping variables, or only some variables may overlap.

Whether or not there is under‐coverage has to be considered. Some data sources are cross‐sectional, whereas others are longitudinal; thus, the researcher should take care over what type of data is being integrated. The set of population units may be known from a population register, or the population list may be unknown; this affects the possibility of generating a probability‐based sample. A data source may contain a complete enumeration of its target population, may be selected by means of probability sampling from its target population, or may be selected by non‐probability sampling from its population. The situation may be further split into two subcases depending on whether or not one of the data sources consists of sample data (in which case the sampling aspects play an important role in the estimation process). In the former case, specific methods should be used in the estimation process, for instance, taking the sampling weights into account and considering that sample data may include specific information that is not reported in the register.
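The point about taking sampling weights into account can be made concrete with a small sketch. It assumes a hypothetical probability sample of businesses in which each unit's inclusion probability is known, and it compares the Horvitz-Thompson estimate of a population total, which weights each observation by the inverse of its inclusion probability, with a naive expansion that ignores the design. The function name, the variable names, and all figures are invented for illustration.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total as the sum of y_i / pi_i."""
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have equal length")
    return sum(y / p for y, p in zip(values, inclusion_probs))


# Hypothetical sample: two large businesses sampled with probability 0.5,
# two small businesses sampled with probability 0.25.
turnover = [200.0, 180.0, 10.0, 12.0]
inclusion_probs = [0.5, 0.5, 0.25, 0.25]

ht_total = horvitz_thompson_total(turnover, inclusion_probs)

# Naive alternative: sample mean times population size, which treats all
# units as if they had been sampled with equal probability.
population_size = 12  # assumed: 2 / 0.5 + 2 / 0.25
naive_total = sum(turnover) / len(turnover) * population_size

print(ht_total, naive_total)
```

In this invented example the naive figure is much larger than the weighted one, because the large businesses are over‐represented in the sample relative to the population.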

Another aspect is the configuration of the sources to be integrated. A few basic configurations are most commonly encountered. In practice, however, a given situation may well involve several basic configurations at the same time.

The first and most basic configuration of the integration process of different sources is multiple cross‐sectional data that together provide a complete data set with full coverage of the target population. Provided they are in an ideal error‐free state, the different data sets, or data sources, are complementary to each other and can be simply “added” to each other in order to produce output statistics.

A second type of configuration is when there exists overlap between the different data sources. The overlap can concern the units, the measured variables, or both.

A third situation is when, in addition, the combined data entail under‐coverage of the target population, even when the data are in an ideal error‐free state.

A further configuration is when both microdata and aggregated data are available. There is overlap between the sources, but the statistics need to be reconciled at some aggregated level. The reconciliation can be achieved by means of calibration, which is a standard approach in survey sampling. Of particular interest is the case in which the aggregated data are themselves estimates.

Finally, the multisource approach may involve longitudinal data. More questions then arise; the most important issue is that of reconciling time series of different frequencies and qualities. For example, one source has monthly data and the other source has quarterly data.
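One simple way to reconcile a monthly series with a quarterly one is pro‐rata benchmarking, sketched below. This is an illustrative choice, not a method prescribed here: it assumes that the quarterly source is treated as the more reliable benchmark, and it rescales the three monthly values within each quarter so that they add up to the quarterly figure. The function name and all figures are hypothetical.

```python
def prorata_benchmark(monthly, quarterly):
    """Scale each block of three monthly values so that it sums to the
    corresponding quarterly benchmark."""
    if len(monthly) != 3 * len(quarterly):
        raise ValueError("expected three monthly values per quarter")
    adjusted = []
    for q, benchmark in enumerate(quarterly):
        block = monthly[3 * q: 3 * q + 3]
        block_sum = sum(block)
        if block_sum == 0:
            raise ValueError("cannot rescale a quarter whose monthly sum is zero")
        factor = benchmark / block_sum
        adjusted.extend(value * factor for value in block)
    return adjusted


# Hypothetical monthly survey estimates for two quarters.
monthly_survey = [10.0, 12.0, 18.0, 20.0, 20.0, 10.0]
# Hypothetical quarterly totals from an administrative source.
quarterly_admin = [44.0, 45.0]

benchmarked = prorata_benchmark(monthly_survey, quarterly_admin)
print(benchmarked)
```

Pro‐rata scaling preserves the month‐to‐month pattern within a quarter but can introduce a step between the last month of one quarter and the first month of the next; movement‐preserving methods such as Denton‐type benchmarking are designed to reduce that effect.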

Integration may occur between different types of sources: surveys (mainly web surveys), administrative data, other passively collected data, social network data, and other unstructured data.

Integration of different configurations as well as of different types of data sources implies different methodological problems. For instance, integrating survey and administrative data through unit record linkage requires improving coherence across data collections, using standard classifications and questions, rationalizing content between surveys, and establishing processes for combining separate sample surveys into one survey vehicle (Bycroft, 2010).
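Unit record linkage in its simplest, deterministic form can be sketched as follows. The sketch assumes that the survey file and the administrative file share an exact, unique identifier (here a hypothetical `unit_id`) and that, where both sources report the same variable, the survey value is kept; both are assumptions made for illustration. It returns the linked records together with the units found in only one source, which shows the extent of unit overlap.

```python
def link_records(survey, admin, key="unit_id"):
    """Deterministic linkage on an exact shared identifier.

    Returns (linked, survey_only, admin_only)."""
    # Index the administrative file by identifier (assumed unique).
    admin_by_key = {rec[key]: rec for rec in admin}

    linked, survey_only = [], []
    for rec in survey:
        match = admin_by_key.get(rec[key])
        if match is None:
            survey_only.append(rec)
        else:
            # Assumption: on overlapping variables the survey value wins.
            linked.append({**match, **rec})

    survey_keys = {rec[key] for rec in survey}
    admin_only = [rec for rec in admin if rec[key] not in survey_keys]
    return linked, survey_only, admin_only


# Hypothetical web survey responses and administrative records.
survey_file = [
    {"unit_id": 1, "satisfaction": 4},
    {"unit_id": 2, "satisfaction": 2},
    {"unit_id": 5, "satisfaction": 5},
]
admin_file = [
    {"unit_id": 1, "income": 30000},
    {"unit_id": 2, "income": 42000},
    {"unit_id": 3, "income": 25000},
]

linked, survey_only, admin_only = link_records(survey_file, admin_file)
print(len(linked), len(survey_only), len(admin_only))
```

When no exact identifier is available, linkage typically relies on probabilistic matching over several partially identifying variables, which this sketch does not cover.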

Example 1.7 discusses an integration of web scraped information, administrative data, and surveys, whereas Example 1.8 shows an application of integration between survey data and social network unstructured information.

As a result of the multisource integration, statistical output is based on complex combinations of sources. Its quality depends on the quality of the primary sources and the ways they are combined. Some studies are investigating the appropriateness of the current set of quality measures for multiple source statistics; they explain the need for improvement and outline directions for further work.
