Social Monitoring for Public Health - Michael J. Paul


CHAPTER 4

Methods of Monitoring

This chapter surveys methodology: the types of information that can be analyzed and how to do so, covering machine learning, statistical modeling, and qualitative methods. We will start by discussing quantitative methods—statistical analysis of data—including large-scale computational approaches to categorizing and extracting trends from social data, both at the level of populations and individuals. We also discuss validation: how to know when to trust your analysis. We then briefly discuss qualitative methods as a potentially richer but smaller-scale methodology. Lastly, we discuss different issues involved in designing a study, including methods for inferring population demographics, an important component of public health research.

This chapter touches on some advanced concepts in machine learning that won’t be taught in depth in this book—though we do provide a few pointers to other tutorials and tools. Our aim is to provide a high-level overview of these methods, introducing important terminology, surveying different ways of approaching a problem, and giving examples of typical pipelines for conducting social monitoring.

4.1 QUANTITATIVE ANALYSIS

We begin by surveying the common quantitative methods for analyzing social data. We will summarize methods for identifying and filtering for relevant data, then analyzing the data, for example by extracting trends, and then validating the extracted information. This pipeline of quantitative methods is illustrated in Figure 4.1.

We will use one of the most common social monitoring uses, influenza surveillance, as our running example of social monitoring (with other tasks mentioned as needed) in order to illustrate the quantitative methodologies, but these methods are applicable to other public health problems as well.

The goal of influenza surveillance (described later in Section 5.1.1) is to measure the prevalence of influenza (flu) infection in a population. Official monitoring by government health agencies is delayed by at least one to two weeks, so social media has been used as a real-time supplementary source of monitoring. If you are familiar with social monitoring of influenza, you may find it strange that we chose to use it as our running example: the most popular system, Google Flu Trends, has been widely criticized for being unreliable. However, keep in mind that Google Flu Trends was one of the earliest systems to do this, using methods that are limited by today’s standards. While the system resulted in substantial errors, they are errors that could have been avoided using more sophisticated techniques, including those implemented by Google Flu Trends itself in later iterations [Santillana, 2017]. The takeaway is not that social monitoring for flu doesn’t work, but that it must be done thoughtfully and validated extensively. We will point out potential pitfalls as we go along, discussing validation in Sections 4.1.4 and 4.2.1, with general limitations discussed extensively later in Chapter 6.


Figure 4.1: A standard pipeline of quantitative methods for inferring trends from social data. The various steps are described in the indicated sections.

4.1.1 CONTENT ANALYSIS AND FILTERING

The first step in any data-driven project is to ensure you have the data! When it comes to social monitoring, where the data come in the form of tweets or messages on a variety of topics, it may be challenging to know whether the available data support your research aims. Before investing time into planning a project, or collecting and processing data, you should determine whether the data support your goals. We typically advise researchers to identify 10 messages (by hand or through keyword search) that exemplify the data needed for the project. Twitter, for example, provides a web search interface that makes these types of explorations easy.1 This process can also help you decide the best method for filtering the data. If you can’t find enough data at this stage, it’s unlikely you’ll be able to automatically mine the needed data.

When you know what you are looking for, you are ready to filter the data down to the subset of data relevant to the public health task at hand. For example, if the task is to conduct disease surveillance, then one must identify content that discusses the target disease (e.g., influenza). Approaches to filtering include searching for messages that match certain phrases, or using more sophisticated machine learning methods to automatically identify relevant content. We now describe these approaches in a bit more detail.

Keyphrase Filtering or Rule-based Approaches

Arguably the simplest method for collecting relevant content is to filter for data (e.g., social media messages or search queries) containing certain keywords or phrases relevant to the task. For example, researchers have experimented with Twitter-based influenza surveillance by filtering for tweets containing words like “flu” or “fever” [Chew and Eysenbach, 2010, Culotta, 2010, 2013, Lampos and Cristianini, 2010]. For Twitter data, tweets matching certain terms can straightforwardly be collected using Twitter’s Search API, described in Section 3.5. We note that there exist clinically validated sets of keywords for measuring certain psychological properties, such as emotions [Pennebaker et al., 2001].

Keyword and phrase-based filtering is thought to be especially effective for search queries, which are typically very short and direct, compared to longer text, like social media messages [Carmel et al., 2014]. Search-driven systems like Google Flu Trends [Ginsberg et al., 2009] rely on the volume of various search phrases. Most research that uses search query volumes is in fact restricted to phrase-based filtering, as data available through services such as Google Trends (described in Section 3.5) come as aggregate statistics about certain search terms, rather than the raw text that is searched, which is private data.

A special type of keyword is a hashtag. Hashtags are user-created labels (denoted with the # symbol) that organize messages by topic, used primarily in status updates (e.g., on Twitter) or photo captions (e.g., on Instagram). Because hashtags are widely used by different users, they can serve as useful filters for health monitoring. For example, if one were interested in understanding physical activities in a population, one might search for hashtags such as #workout or #running. However, additional filtering may be needed to distinguish between messages by ordinary users and by advertisers or media outlets, e.g., “I had a great #workout today!” vs. “Top 10 #Workout Tips.” Rafail [2017] cautions that hashtag-based samples of tweets can be biased in unexpected ways.

Beyond searching for keywords or hashtags, other rules can be applied to filter for data. For example, one might choose to exclude tweets that contain URLs, which are less likely to be relevant for flu surveillance [Lamb et al., 2013]. By using machine learning, described in the next subsection, systems can learn which characteristics to favor or disfavor, rather than defining hard rules by hand.
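As a rough sketch, a keyword-plus-rule filter like the ones described above might look like the following Python. The keyword list and the URL-exclusion rule here are illustrative assumptions for the flu example, not a validated lexicon:

```python
import re

# Hypothetical keyword list for the flu example; a real study would use a
# larger, validated set of terms.
FLU_KEYWORDS = {"flu", "influenza", "fever"}

def is_relevant(tweet_text, keywords=FLU_KEYWORDS, exclude_urls=True):
    """Keyword/rule-based filter: keep a tweet if it mentions any keyword
    and (optionally) contains no URL, since tweets with links are more
    likely to be news or promotion than personal health reports."""
    text = tweet_text.lower()
    if exclude_urls and re.search(r"https?://\S+", text):
        return False
    # Tokenize so that bare words and hashtags both match (e.g., "#fever").
    tokens = re.findall(r"[#\w']+", text)
    return any(tok.lstrip("#") in keywords for tok in tokens)
```

A filter like this is cheap to run over a large stream, which is why rule-based filtering is usually the first stage of a pipeline even when machine learning is applied afterward.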

Machine Learning Classification

Keyword-based filtering is limited because it does not distinguish between different contexts in which words or phrases appear. For example, not all tweets that mention “flu” indicate that the user is sick with the flu; a tweet might also discuss influenza in other contexts (for example, reporting on news of laboratory experiments on influenza) that are not relevant to surveillance.

A more sophisticated approach is to use machine learning to categorize data for relevance based on a larger set of characteristics than words alone. An algorithm that automatically assigns a label to a data instance (e.g., a social media message) is called a classifier. A classifier takes a message as input and outputs a discrete label, such as whether or not a message is relevant. For example, Aramaki et al. [2011] and Achrekar et al. [2012] constructed classifiers to identify tweets that are relevant to flu surveillance. Others have built classifiers to identify tweets that are relevant to health in general [Paul and Dredze, 2011, Prieto et al., 2014, Yin et al., 2015]. Lamb et al. [2013] combined multiple classifiers for a pipeline of filtering steps: first, a classifier identifies if a message is relevant to health, and if so, a second classifier identifies if a message is relevant to flu.

Classifiers learn to distinguish positive and negative instances by analyzing a set of labeled examples, and patterns learned from these “training” examples can then be used to make inferences about new instances in the future. Because training data is provided as examples, this approach is called supervised machine learning.

Common classification models include support vector machines (SVMs) and logistic regression, sometimes called a maximum entropy (MaxEnt) classifier in machine learning [Berger et al., 1996]. Logistic regression is commonly used in public health, though traditionally as a tool for data analysis (see discussion of regression analysis in Section 4.1.3) rather than as a classifier that predicts labels for new data. Recent advances in neural networks—loosely, models that stack and combine classifiers into more complex models—have made this type of model attractive for classification [Goldberg, 2017]. While more computationally intensive, neural networks can give state-of-the-art performance for classification.

Classifiers treat each message as a set of predictors, called features in machine learning, typically consisting of the words in a document, and sometimes longer phrases as well. Phrases of length n are called n-grams, while individual words are called unigrams. One can also use additional linguistic information as features. Natural language processing (NLP) is an area of computer science that involves processing human language, and a number of NLP tools exist to parse linguistic information from text. For example, Lamb et al. [2013] showed that classification performance can be improved by including linguistic features in addition to n-grams, like whether “flu” is used as a noun or adjective, or whether it is the subject or object of a verb.

We won’t get into the technical details of classification in this book, but many of the common toolkits for machine learning (a few of which are described at the end of this section) provide tutorials.
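To make the pieces concrete, here is a minimal sketch of a supervised relevance classifier using scikit-learn (introduced under Tools and Resources below), with unigram and bigram features feeding a logistic regression model. The four labeled tweets are invented toy examples; real systems such as Lamb et al.'s are trained on thousands of annotated messages:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples: 1 = the user is reporting illness, 0 = other
# mentions of flu (news, science, public service content).
train_texts = [
    "i feel awful, down with the flu again",
    "stuck in bed with a fever and chills",
    "new study on flu transmission in ferrets",
    "flu season tips from the health department",
]
train_labels = [1, 1, 0, 0]

# Unigram + bigram (n-gram) features, then a logistic regression classifier.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
clf.fit(train_texts, train_labels)

# Predicted label for an unseen message.
label = clf.predict(["i think i caught the flu, feeling terrible"])[0]
```

The same pipeline object handles feature extraction and prediction together, which makes it easy to apply the trained classifier to a stream of newly filtered tweets.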

Unsupervised Clustering and Topic Modeling

An alternative to classification is clustering. Clustering has the same goal as classification—organizing messages into categories—but the categories are not known in advance; rather, messages are grouped together automatically based on similarities. This is a type of unsupervised machine learning.

A popular method of clustering for text documents is topic modeling. In particular, probabilistic topic models are statistical models that treat text documents as if they are composed of underlying “topics,” where each topic is defined as a probability distribution over words and each document is associated with a distribution over topics. Topics can be interpreted as clusters of related words. In other words, topic models cluster together words into topics, which then allows documents with similar topics to be clustered. Probabilistic topic models have been applied to social media data for various scientific applications [Ramage et al., 2009], including for health [Brody and Elhadad, 2010, Chen et al., 2015b, Ghosh and Guha, 2013, Paul and Dredze, 2011, 2014, Prier et al., 2011, Wang et al., 2014].

The most commonly used topic model is Latent Dirichlet Allocation (LDA) [Blei et al., 2003], a Bayesian topic model. For the domain of health, Paul and Dredze developed the Ailment Topic Aspect Model (ATAM) [2011, 2014], an extension of LDA that explicitly identifies health concepts. ATAM creates two different types of topics: non-health topics, similar to LDA, as well as special “ailment” word distributions with words that are found in dictionaries of disease names, symptom terms, and treatments. Examples of ATAM ailments are shown in Figure 4.2.

An advantage of topic models over simple phrase-based filtering is that they learn many words that are related to concepts. For example, words like “cough” and “fever” are associated with “flu.” When inferring the topic composition of a document, the entire context is taken into account, which can help disambiguate words with multiple meanings (e.g., “dance fever”). A disadvantage is that they are typically less accurate than supervised machine learning methods, but the tradeoff is that topic models can learn without requiring annotated data. Another consideration of topic models is that they discover broad and popular topics, but additional effort may be needed to discover finer-grained issues [Prier et al., 2011].

Another use of topic models, or unsupervised methods in general, is for exploratory analysis. Unsupervised methods can be used to uncover the prominent themes or patterns in a large dataset of interest to a researcher. Once an unsupervised model has revealed the properties of a dataset, then one might use more precise methods such as supervised classification for specific topics of interest.

The technical details of probabilistic topic models are beyond the scope of this book. For an introduction, we recommend reading Blei and Lafferty [2009].

Which Approach to Use?

We have mentioned a variety of approaches to identifying social media content, including keyword filtering, classification, and topic modeling. These approaches have different uses and tradeoffs, so the choice of technique depends on the data and the task.

Most research using a large, general platform like Twitter will require keyword filtering as a first step, since relevant content will be such a small portion of the overall data. The keywords may relate to a particular topic like flu or vaccination, or to health in general: for example, Paul and Dredze [2014] used a few hundred health-related keywords to collect a broad range of health tweets, which is still only a small sample of Twitter. Keyword filtering can be reasonably reliable for obtaining relevant content, although it may miss relevant data that uses terminology not in the keyword list, or include irrelevant data that uses the terms in different ways (e.g., slang usage of “sick”). Classifiers can overcome the limitations of keyword filtering, but they are time-consuming to build, so they are generally considered a next step if keywords are insufficient. Topic models, on the other hand, are most often used for exploratory purposes (understanding what the content looks like at a high level) rather than for finding specific content.


Figure 4.2: Examples of ailment clusters discovered from tweets, learned with the Ailment Topic Aspect Model (ATAM) [Paul and Dredze, 2011]. The word clouds show the most probable words in each ailment, corresponding to (clockwise from top left) allergies, dental health, pain, and influenza-like illness.

These techniques are not mutually exclusive, and it is not unreasonable to combine all three. Let’s illustrate this with an example. Suppose you want to use social media to learn how people are responding to the recent outbreak of Zika, a virus that can cause birth defects and had been rare in recent years until a widespread outbreak in 2015 originating in Brazil. (In fact, several researchers have done just that [Dredze et al., 2016c, Ghenai et al., 2017, Juric et al., 2017, Miller et al., 2017, Muppalla et al., 2017, Stefanidis et al., 2017].)

You decide to study this on Twitter, which captures a large and broad population. The first step is to collect tweets about Zika. There aren’t a lot of ways to refer to Zika without using its name (or perhaps its Portuguese translation, Zica, or its viral abbreviation, ZIKV). You might therefore start with a keyword filter for tweets containing “zika,” “zica,” or “zikv,” which would account for a tiny fraction of Twitter, but probably nearly all tweets about Zika, at least explicitly.

If you don’t already know what people discuss about Zika on Twitter (since it was not widely discussed until recently, after the outbreak), you might use a topic model as a starting point to identify the major themes of discussion in your dataset. After running and analyzing a topic model, you might find that in the context of Zika, people use Twitter to talk about the latest research, vaccine development, political and funding issues, pregnancy and birth issues, and travel bans and advisories.

Suppose you are interested in using social monitoring to learn how people are changing their behavior in response to the virus, so you decide to focus on topics related to pregnancy and travel. To narrow down to tweets on these topics, you could construct a list of additional keywords for filtering, maybe using the word associations learned by the topic model, or using your own ideas about relevant words, perhaps gained by manually reading a sample of tweets. Finally, if you need to identify tweets that can’t be captured with a simple keyword list (for example, you want to identify when someone mentions that they are personally changing travel plans, as opposed to more general discussion of travel advisories), then you should label some of the filtered tweets for relevance to your task and train a classifier to identify more such tweets.

Tools and Resources

A number of free tools exist for the machine learning tasks described above, although most require some programming experience. For a guide aimed at a public health audience rather than computer scientists, see Yoon et al. [2013]. For computationally oriented researchers, we recommend the following machine learning tools.

scikit-learn (http://scikit-learn.org) is a Python library for a variety of general purpose machine learning tasks, including classification and validation.

MALLET (http://mallet.cs.umass.edu) is a Java library for machine learning for text data, supporting document classification and topic modeling.

NLTK (http://www.nltk.org) is a Python library for text processing, supporting tokenization and classification.

Stanford Core NLP (https://stanfordnlp.github.io/CoreNLP/) is a set of natural language processing tools, including named entity recognition and dependency parsing.

HLTCOE Concrete (http://hltcoe.github.io/) is a data serialization standard for NLP data that includes a variety of “concrete compliant” NLP tools.

Twitter NLP (https://github.com/aritter/twitter_nlp) is a Python toolkit that implements some core NLP tools with models specifically trained on Twitter data.

TweetNLP (http://www.cs.cmu.edu/~ark/TweetNLP/) is a toolkit implemented in Java and Python of text processing tools specifically for Twitter.

Weka (http://www.cs.waikato.ac.nz/ml/weka/) is a machine learning software package that supports tasks like classification and clustering. It has a graphical interface, making it more user-friendly than the other tools.

4.1.2 TREND INFERENCE

We will now describe methods for extracting trends—levels of interest or activity across time intervals or geographic locations—from social media. First, we discuss how raw volumes of filtered content can be converted to trends by normalizing the counts. Second, we describe how filtered content can be used as predictors in more sophisticated statistical models to produce trend estimates. Examples of these two approaches, as applied to influenza surveillance, are contrasted in Figure 4.3.

Counting and Normalization

A simple method for extracting trends is to compute the volume of data filtered for relevance (Section 4.1.1) at each point (e.g., time period or location), for example the number of flu tweets per week [Chew and Eysenbach, 2010, Lamb et al., 2013, Lampos and Cristianini, 2010].

It is important to normalize the volume counts to adjust for variation over time and location. For example, the system of Lamb et al. [2013] normalizes influenza counts by dividing the volumes by the counts of a random sample of public tweets for the same location and time period. Normalization is especially important for comparing locations, as volumes are affected by regional differences in population and social media usage, but normalization is also important for comparing values across long time intervals, as usage of a social media platform inevitably changes over time.
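The normalization step amounts to a simple ratio, as in this sketch with invented weekly counts. Here the raw flu-tweet volume rises in week 3, but so does overall platform volume, so the normalized rate shows a smaller increase than the raw counts suggest:

```python
def normalized_rate(flu_counts, background_counts):
    """Per-period rate of relevant (flu) tweets relative to a random
    background sample of tweets from the same periods."""
    return [f / b for f, b in zip(flu_counts, background_counts)]

# Hypothetical weekly counts of flu tweets and of a random tweet sample.
flu = [120, 150, 300]
background = [100_000, 110_000, 150_000]
rates = normalized_rate(flu, background)
```

The same calculation applies across locations: dividing by a background sample from each region adjusts for regional differences in population and platform usage.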

Note that the search volume counts provided by Google Trends are already normalized, although the normalization is plot-dependent, and values cannot be compared between plots without establishing baselines for comparison. See Ayers et al. [2011b] for details.

Statistical Modeling and Regression

A more sophisticated approach to trend inference is to represent trends with statistical models. When a model is used to predict continuous values, it is called regression. Regression models are used to fit data, such as social media volume, to “gold standard” values from an existing surveillance system, such as the influenza-like illness network from the Centers for Disease Control and Prevention (CDC).


Figure 4.3: Estimates of influenza prevalence derived from Twitter (blue) alongside the gold standard CDC rate (black). The dashed Twitter trend is the normalized count of influenza-related tweets, estimated with the method of Lamb et al. [2013]. The solid Twitter trend uses the normalized counts in a regression model to predict the CDC’s rates. The regression approach is based on research by Paul et al. [2014], in which an autoregressive model is trained on the Twitter counts as well as the previous three weeks of CDC data. Predictions for each season (segmented with vertical lines) are based on models trained on the remaining two seasons. The regression predictions, which incorporate lagged CDC data, are a closer fit to the gold standard curve than the counts alone.

The simplest type of regression model is a univariate (one predictor) linear model, which has the form: yi = b + βxi, for each point i, where a point is a time period such as a week. For example, yi could be the CDC’s influenza prevalence at week i and xi could be the volume of flu-related social media activity in the same week [Culotta, 2010, Ginsberg et al., 2009]. The β value is the regression coefficient, interpreted as the slope of the line in a linear model, while b is an intercept. By plugging social media counts into a regression model, one can estimate the CDC’s values.
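A univariate linear fit of this kind takes only a few lines with NumPy. The weekly values below are invented for illustration: x stands in for a normalized flu-tweet rate and y for the corresponding CDC prevalence:

```python
import numpy as np

# Hypothetical weekly data: x = normalized flu-tweet rate, y = CDC ILI rate.
x = np.array([0.8, 1.0, 1.5, 2.1, 2.6, 1.9])
y = np.array([1.1, 1.3, 1.9, 2.6, 3.2, 2.4])

# Fit y_i = b + beta * x_i by least squares (polyfit returns slope first).
beta, b = np.polyfit(x, y, deg=1)

# Regression estimates of the CDC values from the social media signal.
y_hat = b + beta * x
```

Once fit on historical weeks, the same b and beta can be applied to the current week's tweet rate to produce a real-time estimate before the official number is released.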

Other predictors can be included in regression models besides social media volume. A useful predictor is the trend itself: the previous week’s value is a good predictor of the current week, for example. A kth-order autoregressive (AR) model is a regression model whose predictors are the previous k values. For example, a second-order autoregressive model has the form yi = β1yi−1 + β2yi−2. If predictors are included in addition to the time series data itself, such as the social media estimate xi, it is called an autoregressive exogenous (ARX) model. ARX models have been shown to outperform basic regression models for influenza prediction from social media [Achrekar et al., 2012, Paul et al., 2014].
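The ARX idea can be sketched as an ordinary least-squares problem over lagged values plus the exogenous signal. This is a simplified illustration of the model form, not the exact estimation procedure of the papers cited above; the helper names and the inclusion of an intercept are assumptions:

```python
import numpy as np

def fit_arx(y, x, k=2):
    """Least-squares fit of a k-th order ARX model:
    y_i ~ b + beta_1 * y_{i-1} + ... + beta_k * y_{i-k} + gamma * x_i,
    where x is the exogenous predictor (e.g., a social media rate)."""
    rows, targets = [], []
    for i in range(k, len(y)):
        rows.append([1.0] + [y[i - j] for j in range(1, k + 1)] + [x[i]])
        targets.append(y[i])
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return coef  # [intercept, beta_1, ..., beta_k, gamma]

def predict_next(coef, y, x_next, k=2):
    """One-step-ahead forecast from the last k observed values of y
    and the current exogenous value x_next."""
    lags = [y[-j] for j in range(1, k + 1)]
    return coef[0] + sum(c * v for c, v in zip(coef[1:1 + k], lags)) + coef[-1] * x_next

# Hypothetical weekly series: CDC rates (y) and normalized tweet rates (x).
y_hist = [1.2, 1.4, 1.9, 2.6, 3.1, 2.7, 2.0]
x_hist = [0.9, 1.1, 1.6, 2.2, 2.8, 2.3, 1.7]
coef = fit_arx(y_hist, x_hist, k=2)
next_week = predict_next(coef, y_hist, x_next=1.5, k=2)
```

This structure makes explicit why ARX models tend to outperform plain regression: the forecast leans on the recent trajectory of the gold-standard series itself, with the social media signal supplying the real-time correction.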

A commonly used extension to the linear autoregressive model is the autoregressive integrated moving average (ARIMA) model, which assumes an underlying smooth behavior in the time series. These models have also been used for predicting influenza prevalence [Broniatowski et al., 2015, Dugas et al., 2013, Preis and Moat, 2014].
