
EXAMPLE 1.7 Web scraping, administrative data, and surveys


Since 2015, Istat has been experimenting with web scraping, text mining, and machine learning techniques in order to obtain a subset of the estimates currently produced by the sampling “Survey on ICT Usage and e‐Commerce in Enterprises,” carried out yearly on the web. Studies by Barcaroli et al. (2015, 2016) and Righi, Barcaroli, and Golini (2017) have focused on implementing the experiment and on evaluating data quality.
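
As a concrete illustration of the scraping step mentioned above, the following Python sketch downloads one enterprise website and extracts its visible text for subsequent text mining. It assumes the site URL is already known; the function name and the placeholder URL are illustrative and not taken from the source, and this is not Istat's production scraper.

```python
# A minimal sketch of the web scraping step, assuming the enterprise
# website URL is already known. The URL below is a placeholder.

import requests
from bs4 import BeautifulSoup

def scrape_text(url: str, timeout: int = 10) -> str:
    """Download one page and return its visible text for later text mining."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop scripts and styles so only human-readable content remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

# Hypothetical usage on one enterprise website:
# text = scrape_text("https://www.example-enterprise.it")
```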

By making optimal use of all available information, from administrative sources to web-scraped data, the estimates produced by the web survey could tentatively be improved. The experiment also aims to evaluate the possibility of using the sample of surveyed data as a training set in order to fit models to be applied to website information.
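
The following sketch illustrates, under simplifying assumptions, how the surveyed enterprises could serve as a training set: scraped website texts are paired with the corresponding survey answers, a text classifier is fitted, and the fitted model can then be applied to websites of non-sampled enterprises. The data, the target variable (online selling) and the model choice (TF-IDF with logistic regression) are hypothetical placeholders, not Istat's actual specification.

```python
# A minimal sketch, not Istat's actual pipeline: survey respondents whose
# websites have been scraped form a labelled training set, and a text
# classifier predicts a survey target variable (here: "sells online").
# All data below are invented placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical scraped texts of surveyed enterprises' websites ...
scraped_texts = [
    "online shop cart checkout secure payment",
    "company profile contact address opening hours",
    "buy now free shipping add to basket",
    "our history mission team careers",
]
# ... and the corresponding survey answers (1 = enterprise sells online).
survey_labels = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)

# Evaluate the predictive model on the labelled survey sample before
# applying it to the websites of non-sampled enterprises.
scores = cross_val_score(model, scraped_texts, survey_labels, cv=2)
print("cross-validated accuracy:", scores.mean())

model.fit(scraped_texts, survey_labels)
# Prediction for the website of an enterprise outside the survey sample.
print(model.predict(["order online and pay by credit card"]))
```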

Recent and ongoing steps aim to further improve the performance of the models by adding explanatory variables consisting not only of single terms but also of sequences of terms relevant to each characteristic of interest. Once a sufficient degree of quality of the resulting predictive models is guaranteed, they will be applied to the whole population of enterprises owning a website. A further crucial task will be the retrieval of the URLs of the websites for the whole population of enterprises. Finally, once the values of the target variables have been predicted for all reachable units in the population, the quality of the resulting estimates will be analyzed and compared with the current sampling estimates obtained from the survey. In a simulation study, Righi, Barcaroli, and Golini (2017) found that using an auxiliary variable from the Internet DB source that is highly correlated with the target variable does not guarantee an improvement in the quality of the estimates if the source is affected by selectivity: bias may occur because some subgroups are absent. Thus, an analysis of the DB variable and a study of the relationship between the populations covered and not covered by the DB source are fundamental steps for deciding how to use the source and which framework to implement in order to assure high-quality output.
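
The selectivity effect reported by Righi, Barcaroli, and Golini (2017) can be illustrated with a toy simulation (not their study): even when the web-scraped auxiliary variable is strongly correlated with the target, an estimate computed from the covered units alone is biased if a subgroup is largely absent from the source. Here the absent subgroup is hypothetically taken to be small enterprises without a findable website, and all figures are invented for illustration.

```python
# Toy simulation of selectivity bias in an Internet DB source.
# All numbers are invented; this is not the cited simulation study.

import numpy as np

rng = np.random.default_rng(0)
N = 100_000                                   # population of enterprises

small = rng.random(N) < 0.7                   # hypothetical subgroup: small firms
# Target variable: share of turnover from e-commerce (higher for large firms).
y = np.where(small, rng.normal(5, 2, N), rng.normal(20, 5, N)).clip(min=0)
# Auxiliary web-scraped indicator, highly correlated with the target.
x = y + rng.normal(0, 1, N)

# Selectivity: suppose small enterprises rarely own a findable website,
# so they are largely absent from the Internet DB source.
covered = np.where(small, rng.random(N) < 0.2, rng.random(N) < 0.9)

print("corr(x, y) in population:", np.corrcoef(x, y)[0, 1].round(3))
print("true population mean of y:", y.mean().round(2))
print("naive mean over covered units:", y[covered].mean().round(2))
# The covered-units mean is biased upward because small firms are missing,
# despite the strong correlation between x and y.
```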

In conclusion, the approach that combines web scraping and administrative data with the web survey looks promising; nevertheless, the quality of the resulting estimates is satisfactory only in some cases. The use of big data has to be carefully evaluated, especially if selectivity affects the source.

Example 1.8 focuses on an experiment integrating social media data and surveys. Even though the study lacks statistical representativeness and indicators, it presents an interesting approach that should be investigated further and statistically formalized.
