Social Monitoring for Public Health - Michael J. Paul

CHAPTER 3

Social Data

What constitutes “social data” and how can this type of data be used for public health monitoring? This chapter describes different types of social media, including well-known platforms like Twitter and Facebook, as well as other online platforms that may be less known but still valuable. We take “social data” as a broad term that includes a variety of types of online data.

Before embarking on any social monitoring project, it is important to understand the social media landscape and the options for data sources. For example, it may surprise you to learn that Facebook, despite being the world’s largest social network, is rarely used for social monitoring. This is due to a variety of factors, including how the platform is used by people and the tools available for data collection. In contrast, Twitter dominates the social monitoring community, for reasons that are scientifically motivated (it provides a large and relatively representative sample) and reasons that are not (the data is free and convenient). We’ll compare different platforms, describing the affordances of different data sources and data types, their strengths and weaknesses, and their appropriateness for different health applications.

Finally, we briefly describe how to obtain data from a few popular platforms, with pointers to tools and tutorials.

3.1 WHAT IS SOCIAL DATA?

Social data refers to data that is created by people with the goal of sharing the data with others. For example, when people post messages or photos online to share with others, the text and images of the messages and photos are considered social data. Social media websites are the platforms through which social data is created. Examples of popular social media platforms include Twitter and Facebook. In general, social data is created by ordinary people, rather than professional writers or domain experts (e.g., clinicians).

This book will also use “social data” to refer to data that is created by people on the Web but not necessarily intended for social sharing, including search query data, because this data is prominently used for public health monitoring in addition to standard social media data. We also include data generated by people’s online activities other than intentionally posted messages, like location information—the “digital traces” left behind by people’s online behavior [Welser et al., 2008]. What we don’t generally include is data created specifically for researchers (like survey responses), although we do discuss how new technologies can facilitate collection of that kind of data.

3.2 MONITORING OF SOCIAL DATA

Social monitoring refers to the act of analyzing social data—either by manually reading the data, or automatically using computational tools—to learn about the world. Many people use social media platforms to publicly share information about what they are currently doing and thinking. By analyzing social data, it is possible to infer what is happening around the world and within populations.

Social monitoring is also a form of infodemiology, or information epidemiology, a term introduced by Eysenbach [2002] to describe the study of health determinants and the sharing of health information on the internet. Some of the earliest studies of health on the internet looked at the quality of health information on available websites [Davison, 1996, Impicciatore et al., 1997]. Social monitoring focuses on studying user-generated content to learn about a population.

It is possible to measure and understand all sorts of population opinions and behaviors through social monitoring. For example, social media can be monitored to measure consumer sentiment [Bian et al., 2016, Chamlertwat et al., 2012] and political sentiment [O’Connor et al., 2010]. Social monitoring has been used for forecasting sales [Asur and Huberman, 2010], predicting financial markets [Bollen et al., 2011], forecasting elections [Digrazia et al., 2013, Tumasjan et al., 2010], and estimating crowd sizes [Sinnott and Chen, 2016] and traffic congestion [Tse et al., 2017]. It is also a rich resource for interdisciplinary work, such as combining health and economics [Althouse et al., 2014, Ayers et al., 2012b] or health and politics [Dredze et al., 2017].

Social monitoring can be used to answer scientific questions, often in social science [Cioffi-Revilla, 2010, Lazer et al., 2009], including learning regional dialects [Eisenstein et al., 2010] and learning associations with personality traits [Schwartz et al., 2013].

We mention all these examples from different areas to give a taste of the enormous potential of social data. Of course, this book will focus on applications in public health, which we’ll survey throughout.

3.2.1 ACTIVE VS. PASSIVE MONITORING

Social monitoring can take an active or passive approach. Active monitoring requires explicit participation from users, while passive monitoring makes use of data already published by users, without requiring user interaction. An example of active monitoring is asking a sample of Twitter users which presidential candidate they favor, while a passive monitoring approach might analyze what Twitter users are writing about the candidates and infer sentiment toward candidates from the messages alone. Passive monitoring represents the bulk of research into social monitoring due to its relative ease and low cost.

We will focus on passive monitoring in this book, but will mention active approaches when relevant. See Hill et al. [2013] for a discussion on the utility of active approaches to public health surveillance compared to passive monitoring.

3.2.2 TYPES OF USERS

This book focuses on monitoring of people in a population, and we therefore focus on messages written by individuals. However, large swaths of social data are produced by organizations, bots, and spammers. These messages also have value in public health analyses. Heldman et al. [2013] consider how public health agencies can use social media, and others discuss how the medical profession can use social media to communicate with the population [Moorhead et al., 2013, Thackeray et al., 2008]. McCorriston et al. [2015] introduce automated methods for distinguishing Twitter accounts that belong to individuals from those that belong to organizations. Whether to detect and remove spammers when analyzing health messages in social media is the subject of debate [Allem and Ferrara, 2016, Kim et al., 2016], but the presence of such messages should certainly be considered when designing research studies.

3.3 TYPES OF PLATFORMS

Social data comes in many forms. Different online platforms and websites exist for different audiences and different purposes, and different platforms may be better suited for particular public health goals. This section will describe the different types of social media, and will discuss the types of health applications for which they are appropriate.

3.3.1 GENERAL-PURPOSE SOCIAL MEDIA

Blogs and Microblogs

Blogs (short for weblogs) are websites where individuals post messages and articles. Popular blogging platforms include Tumblr, WordPress, and Blogger.

Microblogs, such as Twitter and its Chinese counterpart, Sina Weibo, are social media platforms where users share brief “status updates.” The defining characteristic of microblogs is the short message length, in contrast to standard blogs. For example, Twitter messages can be no longer than 140 characters, a restriction that has been in place since its inception (though it has been loosened in various ways, first by using URL shortening, and more recently by not counting usernames toward the limit). Other platforms like Facebook have higher length limits, but messages still tend to be short. Smaller specialty platforms often have specific features that can change how they are used, such as the now defunct app YikYak which offered users anonymity [Koratana et al., 2016].

Microblogs are popular avenues for sharing news as well as the current status, beliefs, and activities of users, making them desirable for social monitoring. These platforms are intended for broadcasting information, often to a general, public audience. As such, content on these platforms is most often public, even though private accounts are possible.

Microblog users will often share messages written by others, called “retweets” in Twitter. Retweets are repostings of previously-published messages, rather than original content, and are often handled separately in systems that use social media data, since retweet activity can differ from original tweet activity.
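As an illustration of handling retweets separately, the following is a minimal sketch rather than a definitive implementation. It assumes each message is a dictionary with a `text` field and, for reposts, either a `retweeted_status` field (modeled on the JSON returned by Twitter's API) or text that begins with the conventional "RT @" prefix. The function names and sample messages are hypothetical.

```python
def is_retweet(message):
    """Return True if the message looks like a repost rather than original content."""
    # Assumption: reposts carry a 'retweeted_status' field or start with "RT @".
    if message.get("retweeted_status") is not None:
        return True
    return message.get("text", "").startswith("RT @")


def split_retweets(messages):
    """Partition messages into (originals, retweets), preserving input order."""
    originals, retweets = [], []
    for message in messages:
        (retweets if is_retweet(message) else originals).append(message)
    return originals, retweets


# Invented sample data for illustration only.
sample = [
    {"text": "Stuck in bed with the flu today"},
    {"text": "RT @health_agency: Flu shots are available now"},
    {"text": "Flu shots are available now", "retweeted_status": {"id": 1}},
]
originals, retweets = split_retweets(sample)
print(len(originals), "original,", len(retweets), "retweets")
```

Keeping the two lists separate lets a study count original reports (for example, of symptoms) without inflating them with reposted news, while still allowing retweet volume to be analyzed on its own.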

Social Networks

Social networking platforms, such as Facebook and LinkedIn, are websites where users can connect with one another. In contrast to microblogs, where users typically publicly broadcast information, information published on social networking platforms is typically shared with a limited audience, such as friends and coworkers. Such websites are primarily designed for maintaining relationships and accounts are often private, although there are plenty of public accounts on Facebook that share general news. For these reasons, social networks are used less commonly for public health surveillance. However, social network data can be valuable for research that investigates social factors [Cobb et al., 2011].

Media Sharing Platforms

Some social media websites primarily serve as platforms for sharing visual media, such as videos (e.g., YouTube) and photos (e.g., Instagram, Flickr) [Vance et al., 2009]. Media can reveal population attitudes and behaviors, such as dietary choices revealed through photos [De Choudhury et al., 2016a] and drug use captured in videos [Morgan et al., 2010]. Additionally, the comments on sites like YouTube can be helpful for some health applications [Burton et al., 2012a, Freeman and Chapman, 2007].

General-purpose sharing websites include Reddit and Digg, where users submit links to other websites and articles, in addition to media such as images and videos. These websites are typically organized into different categories of discussion, such as politics and science. For example, Reddit is organized into thousands of topic-specific “subreddits” which are created and moderated by users.

For social monitoring, often the text comments and discussions on these platforms are used as data rather than the media itself.

3.3.2 DOMAIN-SPECIFIC SOCIAL MEDIA

In addition to general-purpose social media, some websites exist for more narrow purposes, including in the domain of health.

Review Websites

Online reviews are a focused type of social media, where users write reviews (usually including numeric scores) of products and services. Some review websites are quite broad, like Yelp, which is most commonly used to review businesses and restaurants. However, many review websites are domain-specific, including in the domain of health. For example, RateMDs.com is a website where people can post reviews of their doctors, and Drugs.com allows users to write reviews of medications.

In the domain of public health, researchers have monitored review websites to detect food poisoning outbreaks (from restaurant reviews) [Harrison et al., 2014] and drug side effects (from medication reviews) [Yates and Goharian, 2013].

Patient Communities

There are many web-based communities designed for patients to share information and experiences with one another. Online communities often use discussion forums—websites where users can create and respond to threads of conversation and discussion—as the mode of communication. Forums can be used to communicate information as well as to provide social support. Some patient forums also function as support groups, such as the websites DailyStrength and MedHelp.

A well-known patient community is PatientsLikeMe, where patients share information, especially regarding treatment options. In a famous experiment, hundreds of PatientsLikeMe members experimented with a novel treatment for amyotrophic lateral sclerosis (ALS) and shared their results, functioning as an informal, grassroots clinical trial [Wicks et al., 2011].

Additionally, some grassroots patient communities have developed in general-purpose platforms. For example, people create “group chats” on Twitter, where interested users agree on a particular hashtag and meeting time, and regularly have a conversation on a topic (e.g., cancer support chat on a weekly basis). Approximately 10% of Twitter group chats are about health [Cook et al., 2013].

3.3.3 SEARCH AND BROWSING ACTIVITY

While most social media data consists of information that is broadcast by users, other useful sources of information are activities performed by users on the Web.

One of the most common types of web activity is search. A query in a search engine suggests an interest in a topic, and thus by analyzing what people are searching for, researchers can infer what people are interested in. In public health, search data was most famously used by the Google Flu Trends system (Section 5.1.1), which estimates flu prevalence based on the number of people who are searching for flu-related information, under the assumption that those who are interested in flu are probably experiencing flu.

Search engines, such as Google, Bing, and Yahoo, log the queries that are searched by users. Raw query logs are private data, but some engines make aggregate statistics about query volumes publicly available through services such as Google Trends, described in Section 3.5.

Search data can also be analyzed from domain-specific websites, such as PubMed [Yoo and Mosa, 2015], often through private services not publicly obtainable, in contrast to Google Trends. For example, researchers from the National Cancer Institute partnered with Ask Jeeves to understand the information needs of cancer patients [Bader and Theofanos, 2003], and Santillana et al. [2014a] obtained search data from UpToDate, a disease database used by clinicians, to infer disease prevalence from clinician activity.

Another useful type of activity is browsing—a trace of the web pages that are visited by a user. Such data can come from detailed logs recorded by browsers such as Google Chrome and Microsoft Internet Explorer, but this data is private and, as such, is typically limited to researchers working at these companies [schraefel et al., 2009]. Outside researchers can obtain browser activity logs directly from the machines of participants, but obtaining such data requires the recruitment of consenting volunteers, and thus such research will typically be small scale [Fourney et al., 2014].

Wikipedia is a public source of browsing data that public health researchers have utilized. Wikipedia publishes timestamped logs of visits to each article, and this data can be used to measure levels of interest in articles such as “Influenza” or “Dengue fever” [McIver and Brownstein, 2014, Tausczik et al., 2012]. A limitation of Wikipedia logs as a data source is that they do not contain information about the locations of the readers, unlike most of these data sources (Section 3.4.2). Instead, researchers have used the language of articles as proxies for location [Generous et al., 2014], such as resolving French-language articles to France. However, this approach is coarse and unreliable, as many languages are widely spoken across multiple countries.

3.3.4 CROWDS AND MARKETS

Crowdsourcing is a method of obtaining feedback and assistance from large numbers of people using online services. For example, Amazon’s Mechanical Turk service is a general-purpose platform where users can post tasks to be completed, and other users are paid to complete the tasks [Buhrmester et al., 2011, Callison-Burch and Dredze, 2010, Goodman et al., 2013, Paolacci et al., 2010, Shapiro et al., 2013]. Crowdsourcing platforms allow for large-scale recruitment of workers to participate in projects.

Domain-specific crowdsourcing systems exist for health. For example, Flu Near You [Baltrusaitis et al., 2017, Crawley et al., 2014, Smolinski et al., 2015] is an application where users are periodically asked to share their health status—whether they are experiencing the flu—and this data can be used to estimate flu prevalence.

Crowd-based systems are a form of active monitoring, as discussed in Section 3.2.1. That is, learning about a population through crowdsourcing requires active involvement of the community, in contrast to the other platforms described above, in which publicly accessible information can be passively monitored.

Prediction markets are another way of harnessing crowds. Prediction markets are markets where future outcomes are traded—essentially, participants bet on what they think will happen—and prices can be used to measure the likelihood of different outcomes, according to the beliefs of the crowd. A few studies have shown prediction markets to be effective for forecasting diseases [Li et al., 2016, Polgreen et al., 2007, Tung et al., 2015].
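As a rough illustration of how prices can be read as likelihoods, the sketch below assumes a hypothetical market with one contract per mutually exclusive outcome, each paying the same fixed amount if its outcome occurs. Under that assumption, normalizing the prices gives the crowd's implied probabilities. The outcome names and prices are invented for the example, and real markets may require further adjustments.

```python
def implied_probabilities(prices):
    """Normalize contract prices into probabilities that sum to 1.

    Assumes one contract per mutually exclusive outcome, all with the same payout.
    """
    total = sum(prices.values())
    if total <= 0:
        raise ValueError("prices must sum to a positive value")
    return {outcome: price / total for outcome, price in prices.items()}


# Hypothetical prices for contracts on next week's flu activity level.
prices = {"low": 0.15, "moderate": 0.55, "high": 0.35}
probs = implied_probabilities(prices)
for outcome, p in probs.items():
    print(f"{outcome}: {p:.2f}")
```

Normalizing is needed because the raw prices in a real market rarely sum to exactly the payout amount.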

3.3.5 COMPARISON OF PLATFORMS

The choice of data source in this diverse landscape is motivated by the type of application. General-purpose social media is a good source for identifying common, real-time trends. Topics such as influenza and vaccines are often discussed in the population at large, and so are well-represented in general-purpose social media. Furthermore, the nature of this type of platform provides real-time data, making it a good resource for studying current trends. Moreover, general-purpose platforms include discussion on a variety of topics outside of health, which allows one to study how people’s habits and behaviors across a variety of domains interact with their health.

There are many general-purpose social media platforms, each with their own characteristics, features, and user populations. See Osborne and Dredze [2014] for a comparison of some of these platforms.

In contrast, domain-specific social media is best suited for an in-depth study of a specific health condition, especially those that are not common in the general population. The communities surrounding specific diseases and health topics provide rich details into the thoughts and behaviors of people engaged with the particular topic. Furthermore, many of these forums go back years, allowing for analysis of trends over a long period of time.

Search activity provides both real-time and historical capabilities. For example, Google Trends1 provides historical data back to 2004, as well as daily updates of search activity (and in some cases, hourly). Additionally, search queries cover a wide range of subjects and so can provide information on low-prevalence health conditions. However, search activity often misses the “why” of health behaviors. While we can sometimes ascertain the reason behind a query based on the keywords in a search, it is often impossible to know the user's intention. In short, search traffic can answer “what,” but not always “why.” Additionally, because search activity in the form publicly available to researchers is aggregated across users, we cannot undertake the type of user analysis, or the linking of multiple queries to a single user, that may be needed for fully understanding the data.

We note that not only are different platforms used in different ways, but they are used for different topics of health discussion. De Choudhury et al. [2014b] compared the prevalence of mentions of health issues in tweets vs. search query logs, finding that more serious and stigmatizing conditions (e.g., sexually transmitted disease) are more prevalent in search logs than tweets, while certain benign conditions (e.g., jet lag) are more prevalent in Twitter. The authors thus suggest using caution when using Twitter to study high-stigma conditions, due to the apparent self-censorship being applied in public social media. However, a study of privacy settings in Facebook did not find large differences in content posted by public accounts vs. private accounts, which suggests that public social media data may not be as biased as previously believed [Fiesler et al., 2017].

Finally, the users of different platforms have different demographic characteristics; see Duggan et al. [2015] for a summary.

3.4 TYPES OF DATA

We will now discuss the various forms of data available from social media, such as text (e.g., from tweets or search queries), locations (e.g., precise coordinates or geographic entities), and social network information (e.g., friends and followers).

3.4.1 CONTENT

The bulk of web content is in the form of text. Text can often be analyzed by searching for messages containing particular words or phrases of interest. More sophisticated analyses of text require natural language processing, described in Section 4.1.1, which is a computational approach to automating linguistic analysis of language. Most social monitoring uses text, and this book will focus on text.
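The simplest form of text analysis, searching for messages that contain particular words or phrases, can be sketched as follows. This is a minimal example using only Python's standard `re` module. The phrase list and sample messages are invented, and whole-word, case-insensitive matching is one reasonable choice among several.

```python
import re


def build_matcher(phrases):
    """Compile a case-insensitive pattern matching any phrase as whole words."""
    # Longer phrases first so "flu shot" is tried before "flu".
    escaped = sorted((re.escape(p) for p in phrases), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_messages(messages, phrases):
    """Return the messages that mention at least one phrase of interest."""
    matcher = build_matcher(phrases)
    return [m for m in messages if matcher.search(m)]


# Invented sample data for illustration only.
messages = [
    "I think I have the flu",
    "Got my FLU SHOT this morning",
    "The chimney flue needs cleaning",
    "Lovely weather today",
]
flu_messages = filter_messages(messages, ["flu", "flu shot", "influenza"])
print(flu_messages)
```

The word-boundary anchors keep "flue" from matching "flu", but keyword matching alone cannot tell a report of illness from a mention of vaccination, which is where the more sophisticated analyses come in.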

Other content may come in the form of images (such as through Instagram) and video (such as through YouTube), which are often also accompanied by text in the form of captions, descriptions, and user comments. Images and video can be automatically analyzed and categorized using computer vision, a computational approach to analyzing imagery. For example, Garimella et al. [2016] found that automatically extracted tags of Instagram images can be useful for some health applications, like detecting excessive drinking. However, these types of tools are limited, so most research using this type of media has relied on manual analysis by people.

3.4.2 METADATA

Metadata, such as the time and location of messages, are crucial for social media analysis, in order to understand variation in populations.

Time

Almost all data on the Web is timestamped, and this information is typically trivial to collect. Often individual messages will come with timestamps, typically at the granularity of seconds. For some types of data, individual messages are unavailable, and only aggregate information over an interval of time, such as a day or month, is available. This is the case with services like Google Trends, which do not share individual search queries, but will provide the number of queries issued within various time intervals.
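When individual timestamps are available, they can be rolled up into the same kind of aggregate counts. The sketch below assumes timestamps arrive as ISO 8601 strings, which is an assumption made for the example (platforms differ in their timestamp formats), and counts messages per calendar day.

```python
from collections import Counter
from datetime import datetime


def daily_counts(timestamps):
    """Count messages per calendar day from ISO 8601 timestamp strings."""
    days = (datetime.fromisoformat(ts).date() for ts in timestamps)
    return Counter(days)


# Invented timestamps at one-second granularity.
timestamps = [
    "2017-01-03T08:15:42",
    "2017-01-03T21:02:10",
    "2017-01-04T00:00:01",
]
counts = daily_counts(timestamps)
for day in sorted(counts):
    print(day.isoformat(), counts[day])
```

The same pattern extends to weekly or monthly intervals by mapping each timestamp to a coarser key before counting.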

Location

The location of a message (that is, the location of the author who wrote it) is often more difficult to obtain than time information, yet is often critical for health applications [Burton et al., 2012b]. Sometimes location information is provided by the social media platform. For example, Twitter allows users to provide detailed location information in the form of latitude and longitude coordinates, which are sometimes available when users post from a GPS-enabled device. Additionally, users can tag a location in their tweet, such as a city, neighborhood, or specific point of interest. Unfortunately, this type of location data is rare: only roughly 1–3% of Twitter messages are geocoded.

To increase the amount of geolocated data, researchers have developed a range of methods for automatically inferring location from available user data [Han et al., 2014]. We summarize these techniques in the next chapter, in Section 4.3.1.

Location stability. Some location data is dynamic, meaning that it is updated to the current location for each message that is sent. GPS-tagged tweets and IP address geolocation are dynamic: they describe the location of the user when the activity was performed. Other information is static and may stay the same as a person moves around. For example, the location field of a user profile typically describes a user’s primary home location, and does not change as a person travels. In general, the location of a user can be difficult to quantify, as locations can change and their accuracy can be subjective. For example, identifying a user as residing in New York City when the user actually resides across the river in New Jersey may be sufficient for many applications, even though the identification has the wrong state. Similarly, for a user who resides in El Paso, Texas, United States and works in Juarez, Chihuahua, Mexico, either city would be an accurate location despite being in different countries. In contrast, confusing a state or country would be a major geolocation error in most cases. See Dredze et al. [2013] for some of the challenges with evaluating geolocation.

3.4.3 SOCIAL NETWORK STRUCTURE

Another useful type of data is the network structure of a social platform, meaning the links or relationships between platform users. Social network structure is important for certain types of public health surveillance, such as predicting the spread of disease [Sadilek et al., 2012a,b] or understanding social support for healthy behaviors such as smoking cessation [Cobb et al., 2011].

Many platforms explicitly encode relationships between users. For example, in Facebook, users become “friends” upon mutual agreement. In Twitter, users “follow” other users, meaning that they subscribe to read the content posted by the users they follow. Following a user on Twitter is an asymmetric act and does not require mutual consent.

It is also possible to implicitly construct a social network. For example, one might infer a relationship between users if they communicate on a social network [Rao et al., 2010]. Even if explicit network information is available, implicit communication networks may also serve as a useful alternative, as these networks imply a different type of relationship. For example, Twitter users who communicate with each other may have a stronger relationship than users who follow each other but do not communicate. An “affiliation network” connects two users who share a common activity, like reading the same article or purchasing the same product [Mishra et al., 2013].
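One way an implicit communication network might be constructed is from user mentions in message text. The sketch below is illustrative only: it assumes messages are available as (author, text) pairs and that mentions follow the "@username" convention, and it weights each directed edge by the number of times one user mentions another. The sample data and function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: mentions are written as "@" followed by letters, digits, or underscores.
MENTION = re.compile(r"@(\w+)")


def mention_edges(messages):
    """Count directed (author, mentioned_user) pairs across all messages."""
    edges = Counter()
    for author, text in messages:
        for mentioned in MENTION.findall(text):
            # Ignore self-mentions; compare usernames case-insensitively.
            if mentioned.lower() != author.lower():
                edges[(author.lower(), mentioned.lower())] += 1
    return edges


# Invented sample data for illustration only.
messages = [
    ("alice", "@bob hope you feel better soon"),
    ("bob", "@alice thanks! fever is finally down"),
    ("alice", "@bob glad to hear it"),
    ("carol", "reading about flu season"),
]
edges = mention_edges(messages)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

Edge weights offer a simple proxy for relationship strength: pairs who exchange many messages can be treated as more closely tied than pairs connected only by a follow.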

Network relationships can be either directed or undirected. Undirected relationships are symmetric, such as “friend” relationships in Facebook. Directed relationships flow from one user to another, such as a “follow” relationship in Twitter, in which one user follows another. Directed relationships can always be treated as undirected, if needed for a task, by removing the directionality.
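Removing directionality can be sketched in a few lines. The example assumes a directed network is stored as a list of (source, target) pairs, which is one representation among many, and the usernames are invented.

```python
def to_undirected(directed_edges):
    """Drop direction: (a, b) and (b, a) collapse into one unordered edge."""
    # frozenset makes the pair order-insensitive; self-loops are skipped.
    return {frozenset((a, b)) for a, b in directed_edges if a != b}


# Hypothetical "follow" relationships: (follower, followed).
follows = [("alice", "bob"), ("bob", "alice"), ("carol", "alice")]
undirected = to_undirected(follows)
print(len(undirected), "undirected edges")
```

Note that this conversion discards information: after it, a one-way follow and a mutual follow are indistinguishable.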

3.5 DATA COLLECTION

We provide a brief summary of some of the most popular data sources in the social media research community and their associated APIs (application programming interfaces) to serve as a starting guide. We encourage readers to visit the developer pages of the platform of interest for more information. Working directly with an API may be beyond the ability of researchers without technical training, although there are some guides written specifically for non-technical researchers (see Denecke et al. [2013], Yoon et al. [2013], Schwartz and Ungar [2015]). Some of the platforms described below make data available in easy-to-use formats, such as comma-separated values (CSV), usually including a rich variety of metadata, and others sell data in formats suitable for non-technical researchers.

Twitter makes it very easy to obtain a wide variety of data using their API.2 The streaming API provides a constant real-time data feed (approximately 1% of all tweets), while the REST API allows for searching through (limited) historical data. This allows researchers to collect targeted datasets based on specific keywords, locations, or users. There are a variety of tutorials and tools available for quickly starting a Twitter data collection.3 Commercial options for larger data collection are available through Gnip, Twitter’s enterprise API platform, which can provide samples larger than 1% and historical data matching specific queries. Gnip also provides data from other platforms, including Instagram and YouTube.4

Facebook also has a robust API that allows for a number of different data queries,5 including the Graph API, which is the primary way to read from the Facebook social graph. However, unlike Twitter, most Facebook data is not publicly available, and so it is not available unless one has explicit permissions from the data author. Additionally, Facebook provides various search methods but not a streaming method, making it difficult to obtain random samples of data. An alternate approach is to develop a Facebook app that obtains explicit sharing permissions from users. While time consuming to develop and promote, investments in Facebook apps can yield valuable datasets [De Choudhury et al., 2014a, Schwartz et al., 2013].

Reddit is a popular online forum and content-sharing service, where users can submit content and leave comments. It is one of the most popular forum sites, and therefore hosts content on a wide range of topics including health. Reddit provides an API that makes it easy to download content.6

Google Trends provides aggregated keyword search data going back to 2004, with the ability to show trends specific to a location, time or category.7 The site also suggests related queries, so that users can expand their search to find other queries relevant to their topic of interest. Google allows data to be exported in CSV format. Bing provides a similar tool, though it is aimed at advertisers.8

Additionally, some health-specific data resources are described in Section 5.1.4 for the purpose of disease surveillance.

1 https://www.google.com/trends/

2 https://dev.twitter.com/

3 For example, see http://socialmedia-class.org/twittertutorial.html and https://github.com/mdredze/twitter_stream_downloader.

4 https://gnip.com/sources/

5 https://developers.facebook.com/

6 https://www.reddit.com/dev/api

7 https://www.google.com/trends/

8 http://www.bing.com/toolbox/keywords
