An Introduction to Text Mining - Gabe Ignatow
Introduction
While social scientists have for decades made use of data from attitude surveys, today researchers are attempting to leverage the growing volume of naturally occurring unstructured data generated by people, such as text or images. Some of these unstructured data are referred to as “big data,” although that term has become something of a buzzword. Naturally, questions arise about the use of textual data sets as a way to learn about social groups and communities. Surveys and big data each have advantages and disadvantages, and there are also ways to leverage both.
Surveys are the traditional mechanism for gathering information on people, and entire fields have developed around these data collection instruments. Surveys can collect clear, targeted information, and as such, the information obtained from surveys is significantly “cleaner” and easier to process than the information extracted from unstructured data sources. Surveys also have the advantage that they can be run in controlled settings, with complete information on the survey takers. These controlled settings can, however, also be a disadvantage. It has been argued, for instance, that survey research is often biased because of the typical places where surveys are run, for example, large student populations from Introduction to Psychology courses. Another challenge associated with surveys is that they exclude people who do not like to provide information, and there is an entire body of research on methodologies to remove such participation bias. Above all, the main difficulty associated with survey instruments is that they are expensive to run, in terms of both time and financial cost.
The alternative to surveys that has been extensively explored in recent years is the extraction of information from unstructured sources. For instance, rather than surveying a group of people on whether they are optimistic or pessimistic, along with asking for their location, as a way to create maps of “optimism,” one could achieve the same goal by collecting Twitter or blog data, extracting the location of the writers from their profiles, and using automatic text classification tools to infer their level of optimism (Ruan, Wilson, & Mihalcea, 2016). The main advantage of gathering information on people from such data sources is their “always on” property, which allows one to collect information continuously and inexpensively. These digital resources also eliminate some of the biases that come with survey instruments, but they nonetheless introduce other kinds of biases. For instance, most of these data-driven collections of information on people rely on social media or on crowdsourcing platforms such as Amazon Mechanical Turk, but these sources cover only the part of the population that is open to posting on social media or participating in online crowdsourcing experiments. Even more important, another major difficulty associated with the use of unstructured data sources is the lack of exactness in the process of extracting information. This process often relies on automatic tools for text mining and classification, which, even if generally very good, are not perfect. This effect can, however, be counteracted with the use of large data quantities: Whereas the data that one can get from surveys are often limited by the number of participants (which in turn is limited by time and cost), that limit is much higher for the information that one can gather from digital data sources. Thus, if cleverly used, the richness of the information obtained from unstructured data can rival, if not exceed, that obtained with surveys.
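To make the optimism-mapping example above more concrete, here is a minimal sketch in Python of the three steps it names: take posts, read the writer's location from the profile, and classify each text. It is only an illustration and not the method used by Ruan, Wilson, and Mihalcea (2016). The word lists, the `classify` and `optimism_by_location` functions, and the sample posts are all hypothetical, and a simple word-counting rule stands in for a trained text classifier.

```python
from collections import defaultdict

# Hypothetical toy lexicons (an assumption for illustration only);
# a real study would use a trained text classifier instead.
OPTIMISTIC = {"hope", "hopeful", "excited", "great", "better", "confident"}
PESSIMISTIC = {"worse", "hopeless", "afraid", "doomed", "worried", "never"}


def classify(text):
    """Return +1 (optimistic), -1 (pessimistic), or 0 (undecided)."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    score = sum(w in OPTIMISTIC for w in words) - sum(w in PESSIMISTIC for w in words)
    return (score > 0) - (score < 0)  # sign of the score


def optimism_by_location(posts):
    """Average the per-post labels for each location.

    Posts without a profile location are skipped, which is one way
    coverage bias can enter this kind of analysis.
    """
    totals = defaultdict(lambda: [0, 0])  # location -> [label sum, post count]
    for post in posts:
        location = post.get("location")
        if not location:
            continue
        totals[location][0] += classify(post["text"])
        totals[location][1] += 1
    return {loc: label_sum / count for loc, (label_sum, count) in totals.items()}


# Made-up sample posts; the locations and texts are invented for the example.
posts = [
    {"location": "Ann Arbor", "text": "Feeling hopeful, things will get better!"},
    {"location": "Ann Arbor", "text": "Worried this will never work."},
    {"location": "Austin", "text": "So excited about the great weather."},
    {"location": None, "text": "Great day."},
]
scores = optimism_by_location(posts)
print(scores)
```

Each location ends up with a score between -1 and 1 that could be plotted on a map. The post with no location is dropped, a small reminder that the resulting map covers only writers who share that information.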