Natural Language Processing for Social Media - Diana Inkpen



Introduction to Social Media Analysis


Social media is a phenomenon that has recently expanded throughout the world and quickly attracted billions of users. This form of electronic communication through social networking platforms allows users to generate their own content and share it in various forms: information, personal words, pictures, audio, and videos. Social computing has therefore formed as an emerging area of research and development that includes a wide range of topics such as Web semantics, artificial intelligence, natural language processing, network analysis, and Big Data analytics.

Over the past few years, online social networking sites (Facebook, Twitter, YouTube, Flickr, MySpace, LinkedIn, Metacafe, Vimeo, etc.) have revolutionized the way we communicate with individuals, groups, and communities, and have altered everyday practices [Boyd and Ellison, 2007].

The broad categories of social media platforms are: content-sharing sites, forums, blogs, and microblogs. On content-sharing sites (such as Facebook, Instagram, Foursquare, Flickr, YouTube) people exchange information, messages, photos, videos, or other types of content. On Web user forums (such as StackOverflow, CNET forums, Apple Support) people post specialized information, questions, or answers. Blogs (such as Gizmodo, Mashable, Boing Boing, and many more) allow people to post messages and other content and to share information and opinions. Microblogs (such as Twitter, Sina Weibo, Tumblr) are limited to short texts for sharing information and opinions. The modalities of sharing content are, in order: posts; comments to posts; explicit or implicit connections to build social networks (friend connections, followers, etc.); cross-posts and user linking; social tagging; likes/favorites/starring/voting/rating/etc.; author information; and linking to user profile features.1 In Table 1.1, we list more details about social media platforms and their characteristics and types of content shared [Barbier et al., 2013].

Social media statistics for January 2014 have shown that Facebook has grown to more than 1 billion active users, adding more than 200 million users in a single year. Statista,2 the world’s largest statistics portal, announced the ranking for social networks based on the number of active users. As presented in Figure 1.1, the ranking shows that Qzone took second place with more than 600 million users. Google+, LinkedIn, and Twitter completed the top 5 with 300 million, 259 million, and 232 million active users, respectively.

Table 1.1: Social media platforms and their characteristics

Statista also provided the growth trend for both Facebook and LinkedIn, illustrated in Figure 1.2 and Figure 1.3, respectively. Figure 1.2 shows that Facebook, which had 845 million users at the end of 2011, totaled 1,228 million users by the end of 2013. As depicted in Figure 1.3, LinkedIn reached 277 million users by the end of 2013, whereas it had only 145 million users at the end of 2011. Statista also calculated the annual income for both Facebook and LinkedIn, which in 2013 totaled US$7,872 million and US$1,528 million, respectively.
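To make the growth figures above easier to compare, the following minimal sketch computes the total and compound annual growth between the end of 2011 and the end of 2013. The user counts are the ones quoted in the text; the function name `growth` and the choice of a compound annual rate are our own illustrative assumptions, not part of the Statista report.

```python
from typing import Tuple

# User counts in millions, taken from the figures quoted in the text.
USERS = {
    "Facebook": {2011: 845, 2013: 1228},
    "LinkedIn": {2011: 145, 2013: 277},
}


def growth(start: float, end: float, years: int) -> Tuple[float, float]:
    """Return (total growth, compound annual growth rate) as fractions."""
    total = end / start - 1.0
    cagr = (end / start) ** (1.0 / years) - 1.0
    return total, cagr


for name, counts in USERS.items():
    total, cagr = growth(counts[2011], counts[2013], years=2)
    print(f"{name}: {total:.1%} total growth, {cagr:.1%} per year")
```

Under these numbers, LinkedIn's membership nearly doubled over the two years, while Facebook grew by a little under half from a much larger base.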

Figure 1.1: Social networks ranked by the number of active users as of January 2014 (in millions) provided by Statista.

Figure 1.2: Number of monthly active Facebook users from the third quarter of 2008 to the first quarter of 2014 (in millions) provided by Statista.

Social computing is an emerging field that focuses on modeling, analysis, and monitoring of social behavior on different media and platforms to produce intelligent applications. Social media is the use of electronic and Internet tools for the purpose of sharing and discussing information and experiences with other human beings in efficient ways [Moturu, 2009]. Various social media platforms such as social networks, forums, blogs, and micro-blogs have recently evolved to ensure the connectivity, collaboration, and formation of virtual communities. While traditional media such as newspapers, television, and radio provide unidirectional communication from business to consumer, social media services have allowed interactions among users across various platforms. Social media have therefore become a primary source of information for business intelligence.

Figure 1.3: Number of LinkedIn members from the first quarter of 2009 to the first quarter of 2014 (in millions) provided by Statista.

There are several means of interaction in social media platforms. One of the most important is via text posts. The natural language processing (NLP) of traditional media such as written news and articles has been a popular research topic over the past 25 years. NLP typically enables computers to derive meaning from natural language input using the knowledge from computer science, artificial intelligence, and linguistics.

NLP for social media text is a new research area, and it requires adapting the traditional NLP methods to these kinds of texts or developing new methods suitable for information extraction and other tasks in the context of social media.

There are many reasons why “traditional” NLP methods are not good enough for social media texts, such as the informal nature of these texts, the new type of language, abbreviations, etc. Section 1.3 will discuss these aspects in more detail.

A social network is made up of a set of actors (such as individuals or organizations) and a set of binary relations between these actors (such as relationships, connections, or interactions). From a social network perspective, the goal is to model the structure of a social group to identify how this structure influences other variables and how structures change over time. Semantic analysis in social media (SASM) is the semantic processing of the text messages as well as of the meta-data, in order to build intelligent applications based on social media data.
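The actor-and-relation view of a social network described above can be sketched as a small data structure. The actors and relations below are invented purely for illustration, and treating every relation as symmetric is a simplifying assumption (follower links, for instance, are directed).

```python
# Hypothetical actors and binary relations, invented for illustration.
actors = {"alice", "bob", "carol", "acme_corp"}
relations = {
    ("alice", "bob"),        # e.g., a friend connection
    ("bob", "carol"),
    ("carol", "acme_corp"),  # e.g., a user connected to an organization
}


def build_adjacency(actors, relations):
    """Map each actor to the set of actors it is directly related to."""
    adjacency = {actor: set() for actor in actors}
    for a, b in relations:
        adjacency[a].add(b)
        adjacency[b].add(a)  # assumption: the relation is symmetric
    return adjacency


def degree(adjacency):
    """Number of direct connections per actor, a simple structural measure."""
    return {actor: len(neighbors) for actor, neighbors in adjacency.items()}


adjacency = build_adjacency(actors, relations)
print(sorted(degree(adjacency).items()))
```

Structural measures such as degree are one simple way to relate the shape of the network to other variables; tracking how `adjacency` changes between snapshots would address the question of how structures change over time.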

SASM helps develop automated tools and algorithms to monitor, capture, and analyze the large amounts of data collected from social media in order to predict user behavior or extract other kinds of information. If the amount of data is very large, techniques for “big data” processing need to be used, such as online algorithms that do not need to store all the data in order to update the models based on the incoming data.
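As a minimal sketch of the online setting mentioned above, the class below updates simple statistics one message at a time and then discards the message. The class name `StreamingStats`, the whitespace tokenization, and the sample messages are illustrative assumptions; a real system would use a proper tokenizer and would likely bound the size of the term table.

```python
from collections import Counter


class StreamingStats:
    """Keep running statistics over a message stream without storing the messages."""

    def __init__(self):
        self.n_messages = 0
        self.mean_length = 0.0        # running mean of tokens per message
        self.term_counts = Counter()  # grows with the vocabulary, not the stream

    def update(self, message: str) -> None:
        tokens = message.lower().split()
        self.n_messages += 1
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.mean_length += (len(tokens) - self.mean_length) / self.n_messages
        self.term_counts.update(tokens)


stats = StreamingStats()
for msg in ["good morning twitter", "good game tonight", "morning"]:
    stats.update(msg)  # each message can be dropped after this call

print(stats.n_messages, round(stats.mean_length, 2), stats.term_counts.most_common(2))
```

The key property is that each update touches only the current message and the summary state, so memory use does not grow with the number of messages processed.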

In this book, we focus on the analysis of the textual data from social media, via new NLP techniques and applications. Recently, workshops such as the EACL 2014 Workshop on Language Analysis in Social Media [Farzindar et al., 2014], the NAACL/HLT 2013 Workshop on Language Analysis in Social Media [Farzindar et al., 2013], and the EACL 2012 Workshop on Semantic Analysis in Social Media [Farzindar and Inkpen, 2012] have been increasingly focusing on NLP techniques and applications that study the effect of social media messages on our daily lives, both personally and professionally.

Social media textual data is the collection of openly available texts that can be obtained publicly via blogs and micro-blogs, Internet forums, user-generated FAQs, chat, podcasts, online games, tags, ratings, and comments. Social media texts have several properties that make them different from traditional texts, because of the nature of social conversations, which are posted in real time. Detecting groups of topically related conversations is important for applications, as is detecting emotions, rumors, and incentives. Determining the locations mentioned in the messages or the locations of the users can also add valuable information. The texts are unstructured, are presented in many formats, and are written by different people in many languages and styles. Also, typographic errors and chat slang have become increasingly prevalent on social networking sites like Facebook and Twitter. The authors are not professional writers, and their postings are spread across many places on the Web, on various social media platforms.

Monitoring and analyzing this rich and continuous flow of user-generated content can yield unprecedentedly valuable information, which would not have been available from traditional media outlets. Semantic analysis of social media has given rise to the emerging discipline of big data analytics, which draws from social network analysis, machine learning, data mining, information retrieval, and natural language processing [Melville et al., 2009].

Figure 1.4 shows a framework for semantic analysis in social media. The first step is to identify issues and opportunities for collecting data from social networks. The data can be in the form of stored textual information (the big data could be stored in large and complex databases or text files), it could be dynamic online data collection processed in real time, or it could be retrospective data collection for particular needs. The next step is the SASM pipeline, which consists of specific NLP tools for social media analysis and data processing. Social media data is made up of large, noisy, and unstructured datasets. SASM transforms social media data into meaningful and understandable messages through social information and knowledge. Then, SASM analyzes the social media information in order to produce social media intelligence. Social media intelligence can be shared with users or presented to decision-makers to improve awareness, communication, planning, or problem solving. The presentation of the data analyzed by SASM can be complemented by data visualization methods.

Figure 1.4: A framework for semantic analysis in social media, where NLP tools transform the data into intelligence.


The automatic processing of social media data requires the design of appropriate research methods for applications such as information extraction, automatic categorization, clustering, indexing data for information retrieval, and statistical machine translation. The sheer volume of social media data and the incredible rate at which new content is created make monitoring, or any other meaningful manual analysis, unfeasible. In many applications, the amount of data is too large for a decision maker to evaluate and analyze effectively in real time.

Social media monitoring is one of the major applications in SASM. Traditionally, media monitoring is defined as the activity of monitoring and tracking the output of hard copy, online, and broadcast media, which can be performed for a variety of reasons, including political, commercial, and scientific. The huge volume of information provided via social media networks is an important source for open intelligence. Social media make direct contact with the target public possible. Unlike traditional news, the opinion and sentiment of authors provide an additional dimension for the social media data. The different sizes of source documents (such as a combination of multiple tweets and blogs) and content variability also render the task of analyzing social media documents difficult.

In social media, real-time event search and event detection are also important. The search queries consider multiple dimensions, including spatial and temporal. In this case, some NLP methods such as information retrieval and summarization of social data in the form of various documents from multiple sources become important in order to support the event search and the detection of relevant information.

The semantic analysis of the meaning of a day’s or week’s worth of conversations in social networks for a group of topically related discussions or about a specific event presents the challenges of cross-language NLP tasks. Social media-related NLP methods that can extract information of interest to the analyst for preferential inclusion also lead us to domain-based applications in computational linguistics.


The application of existing NLP techniques to social media from different languages and multiple resources faces several additional challenges; the tools for text analysis are typically designed for specific languages. The main research issue therefore lies in assessing whether language-independence or language-specificity is to be preferred. Users publish content not only in English, but in a multitude of languages. This means that due to the language barrier, many users cannot access all available content. The use of machine translation technology can help bridge the language gap in such situations. The integration of machine translation and NLP tools opens opportunities for the semantic analysis of text via cross-language processing.


The huge volume of publicly available information on social networks and on the Web can benefit different areas such as industry, media, healthcare, politics, public safety, and security. Here, we can name a few innovative integrations for social media monitoring, and some model scenarios of government-user applications in coordination and situational awareness. We will show how NLP tools can help governments interpret data in near real-time and provide enhanced command decision at the strategic and operational levels.


There is great interest on the part of industry in social media data monitoring. Social media data can dramatically improve business intelligence (BI). Businesses could achieve several goals by integrating social data into their corporate BI systems, such as branding and awareness, customer/prospect engagement, and improving customer service. Online marketing, stock market prediction, product recommendation, and reputation management are some examples of real-world applications for SASM.

Media and Journalism

The relationship between journalists and the public has become closer thanks to social networking platforms. Recent statistics, published by a 2013 social journalism study, show that 25% of major information sources come from social media data.3 Public relations professionals and journalists use the power of social media to gather public opinion, perform sentiment analysis, implement crisis monitoring, perform issues- or program-based media analysis, and survey social media.


Over time, social media became part of common healthcare. The healthcare industry uses social media tools for building community engagement and fostering better relationships with their clients. The use of Twitter to discuss recommendations for providers and consumers (patients, families, or caregivers), ailments, treatments, and medication is only one example of social media in healthcare. This was initially referred to as social health. Medical forums appeared due to the needs of the patients to discuss their feelings and experiences.

This book will discuss how NLP methods on social media data can help develop innovative tools and integrate appropriate linguistic information in order to allow better health monitoring (such as disease spread) or availability of information and support for patients.


Online monitoring can help keep track of mentions made by citizens across the country and of international, national, or local opinion about political parties. For a political party, organizing an election campaign and gaining followers is crucial. Opinion mining, awareness of comments and public posts, and understanding statements made on discussion forums can give political parties a chance to get a better idea of the reality of a specific event, and to take the necessary steps to improve their positions.

Defense and Security

Defense and security organizations are greatly interested in studying these sources of information and summaries to understand situations and perform sentiment analysis of a group of individuals with common interests, and also to be alerted against potential threats to defense and public safety. In this book, we will discuss the issue of information flow from social networks such as MySpace, Facebook, Skyblog, and Twitter. We will present methods for information extraction in Web 2.0 to find links between data entities, and to analyze the characteristics and dynamism of networks through which organizations and discussions evolve. Social data often contain significant information hidden in the texts and network structure. Aggregate social behavior can provide valuable information for the sake of national security.


The information presented in social media, such as online discussion forums, blogs, and Twitter posts, is highly dynamic and involves interaction among various participants. There is a huge amount of text continuously generated by users in informal environments.

Standard NLP methods applied to social media texts are therefore confronted with difficulties due to non-standard spelling, noise, and limited sets of features for automatic clustering and classification. Social media are important because the use of social networks has made everybody a potential author, so the language is now closer to the user than to any prescribed norms [Beverungen and Kalita, 2011; Zhou and Hovy, 2006]. Blogs, tweets, and status updates are written in an informal, conversational tone, often more of a “stream of consciousness” than the carefully thought out and meticulously edited work that might be expected in traditional print media. This informal nature of social media texts presents new challenges to all levels of automatic language processing.

At the surface level, several issues pose challenges to basic NLP tools developed for traditional data. Inconsistent (or absent) punctuation and capitalization can make detection of sentence boundaries quite difficult—sometimes even for human readers, as in the following tweet: “#qcpoli enjoyed a hearty laugh today with #plq debate audience for @jflisee #notrehome tune was that the intended reaction?” Emoticons, incorrect or non-standard spelling, and rampant abbreviations complicate tokenization and part-of-speech tagging, among other tasks. Traditional tools must be adapted to consider new variations such as letter repetition (“heyyyyyy”), which are different from common spelling errors. Grammaticality, or frequent lack thereof, is another concern for any syntactic analyses of social media texts, where fragments can be as commonplace as actual full sentences, and the choice between “there,” “they are,” “they’re,” and “their” can seem to be made at random.
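The surface-level issues above can be illustrated with a small sketch: a regular-expression tokenizer that keeps hashtags, mentions, and a few emoticons intact, and a helper that shortens letter repetitions. The pattern, the emoticon list, and the choice to collapse runs to two characters are illustrative assumptions rather than an established tool; collapsing does not by itself recover the standard spelling ("heyyyyyy" becomes "heyy", not "hey").

```python
import re

# One token per match: URLs, @mentions, #hashtags, simple emoticons, words, symbols.
TOKEN_RE = re.compile(
    r"""
    https?://\S+              # URLs
    | [@#]\w+                 # mentions and hashtags
    | [:;=][-']?[()DPp/\\]    # a few common emoticons such as :) ;-) :P
    | \w+(?:'\w+)?            # words, with an optional apostrophe part
    | [^\w\s]                 # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split a social media message into tokens, keeping #tags and @names whole."""
    return TOKEN_RE.findall(text)


def squeeze_repeats(token, max_run=2):
    """Collapse runs of the same character that are longer than max_run."""
    return re.sub(r"(.)\1{%d,}" % max_run, lambda m: m.group(1) * max_run, token)


tweet = ("#qcpoli enjoyed a hearty laugh today with #plq debate audience "
         "for @jflisee #notrehome tune was that the intended reaction?")
print(tokenize(tweet))
print(squeeze_repeats("heyyyyyy"))
```

Note that tokenization alone does not solve the sentence-boundary problem in the example tweet: the tokens are recovered cleanly, but nothing in them marks where one sentence ends and the next begins.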

Social media are also much noisier than traditional print media. Like much else on the Internet, social networks are plagued with spam, ads, and all manner of other unsolicited, irrelevant, or distracting content. Even ignoring these forms of noise, much of the genuine, legitimate content on social media can be seen as irrelevant with respect to most information needs. André et al. [2012] demonstrate this in a study that assesses the user-perceived value of tweets. They collected over 40,000 ratings of tweets from followers, in which only 36% of tweets were rated as “worth reading,” while 25% were rated as “not worth reading.” The least valued tweets were so-called presence maintenance posts (e.g., “Hullo twitter!”). Pre-processing to filter out spam and other irrelevant content, or models that are better able to cope with noise, are essential in any language-processing effort targeting social media.

Several characteristics of social media text are of particular concern to NLP approaches. The particularities of a given medium and the way in which that medium is used can have a profound effect on what constitutes a successful summarization approach. For example, the 140-character limit imposed on Twitter posts makes for individual tweets that are rather contextually impoverished compared to more traditional documents. However, redundancy can become a problem over multiple tweets, due in part to the practice of retweeting posts. Sharifi et al. [2010] note the redundancy of information as a major issue with microblog summarization in their experiments with data mining techniques to automatically create summary posts of Twitter trending topics.
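One simple way to reduce the retweet-driven redundancy described above is to compare tweets after stripping retweet markers. The sketch below assumes the manual "RT @user:" prefix convention and exact matching on normalized text; both are simplifying assumptions, the sample tweets are invented, and this is not the method used by Sharifi et al.

```python
import re

# Assumption: retweets are marked with one or more leading "RT @user:" prefixes.
RT_PREFIX = re.compile(r"^(?:rt\s+@\w+:?\s*)+", re.IGNORECASE)


def canonical(tweet: str) -> str:
    """Strip retweet prefixes, collapse whitespace, and lowercase."""
    text = RT_PREFIX.sub("", tweet.strip())
    return re.sub(r"\s+", " ", text).lower()


def drop_redundant(tweets):
    """Keep the first occurrence of each distinct message, in stream order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = canonical(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


stream = [
    "Big quake downtown",
    "RT @user1: Big quake downtown",
    "rt @a: RT @user1: big  quake downtown",
    "Stay safe",
]
print(drop_redundant(stream))
```

Exact matching misses near-duplicates such as retweets with added comments, so a practical system would probably need a fuzzier similarity measure.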

A major challenge facing detection of events of interest from multiple Twitter streams is therefore to separate the mundane and polluted information from interesting real-world events. In practice, highly scalable and efficient approaches are required for handling and processing the increasingly large amount of Twitter data (especially for real-time event detection). Other challenges are inherent to Twitter design and usage. These are mainly due to the shortness of the messages: the frequent use of (dynamically evolving) informal, irregular, and abbreviated words, the large number of spelling and grammatical errors, and the use of improper sentence structure and mixed languages. Such data sparseness, lack of context, and diversity of vocabulary make the traditional text analysis techniques less suitable for tweets [Metzler et al., 2007]. In addition, different events may enjoy different popularity among users, and can differ significantly in content, number of messages and participants, time periods, inherent structure, and causal relationships [Nallapati et al., 2004].

Across all forms of social media, subjectivity is an ever-present trait. While traditional news texts may strive to present an objective, neutral account of factual information, social media texts are much more subjective and opinion-laden. Whether or not the ultimate information need lies directly in opinion mining and sentiment analysis, subjective information plays a much greater role in semantic analysis of social texts.

Topic drift is much more prominent in social media than in other texts, both because of the conversational tone of social texts and the continuously streaming nature of social media. There are also entirely new dimensions to be explored, where new sources of information and types of features need to be assessed and exploited. While traditional texts can be seen as largely static and self-contained, the information presented in social media, such as online discussion forums, blogs, and Twitter posts, is highly dynamic and involves interaction among various participants. This can be seen as an additional source of complexity that may hamper traditional summarization approaches, but it is also an opportunity, making available additional context that can aid in summarization or making possible entirely new forms of summarization. For instance, Hu et al. [2007a] suggest summarizing a blog post by extracting representative sentences using information from user comments. Chua and Asur [2012] exploit temporal correlation in a stream of tweets to extract relevant tweets for event summarization. Lin et al. [2009] address summarization not of the content of posts or messages, but of the social network itself by extracting temporally representative users, actions, and concepts in Flickr data.

As we mentioned, standard NLP approaches applied to social media data are confronted with difficulties due to non-standard spelling, noise, limited sets of features, and errors. Therefore, some NLP techniques, including normalization, term expansion, improved feature selection, and noise reduction, have been proposed to improve clustering performance in Twitter news [Beverungen and Kalita, 2011]. Identifying proper names and language switches in a sentence requires rapid and accurate named entity recognition and language detection techniques. Recent research efforts focus on the analysis of language in social media for understanding social behavior and building socially aware systems. The goal is the analysis of language with implications for fields such as computational linguistics, sociolinguistics, and psycholinguistics. For example, Eisenstein [2013a] studied phonological variation, and the factors behind it, as transcribed into social media text.

Several workshops organized by the Association for Computational Linguistics (ACL) and special issues in scientific journals dedicated to semantic analysis in social media show how active this research field is. We enumerate some of them here (we also mentioned them in the Preface):

• The EACL 2014 Workshop on Language Analysis in Social Media (LASM 2014)4

• The NAACL/HLT 2013 Workshop on Language Analysis in Social Media (LASM 2013)5

• The EACL 2012 Workshop on Semantic Analysis in Social Media (SASM 2012)6

• The NAACL/HLT 2012 Workshop on Language in Social Media (LSM 2012)7

• The ACL/HLT 2011 Workshop on Language in Social Media (LSM 2011)8

• The WWW 2015 Workshop on Making Sense of Microposts9

• The WWW 2014 Workshop on Making Sense of Microposts10

• The WWW 2013 Workshop on Making Sense of Microposts11

• The WWW 2012 Workshop on Making Sense of Microposts12

• The ESWC 2011 Workshop on Making Sense of Microposts13

• The COLING 2014 Workshop on Natural Language Processing for Social Media (SocialNLP)14

• The IJCNLP 2013 Workshop on Natural Language Processing for Social Media (SocialNLP)15

In this book, we will cite many papers from conferences such as ACL, WWW, etc.; many workshop papers from the above-mentioned workshops and more; several books; and many journal papers from various relevant journals.


Our goal is to focus on innovative NLP applications (such as opinion mining, information extraction, summarization, and machine translation), tools, and methods that integrate appropriate linguistic information in various fields such as social media monitoring for healthcare, security and defense, business intelligence, and politics. The book contains four major chapters.

Chapter 1: This chapter highlights the need for applications that use social media messages and meta-data. We also discuss the difficulty of processing social media data vs. traditional texts such as news articles and scientific papers.

Chapter 2: This chapter discusses existing linguistic pre-processing tools such as tokenizers, part-of-speech taggers, parsers, and named entity recognizers, with a focus on their adaptation to social media data. We briefly discuss evaluation measures for these tools.

Chapter 3: This chapter is the heart of the book. It presents the methods used in applications for semantic analysis of social network texts, in conjunction with social media analytics as well as methods for information extraction and text classification. We focus on tasks such as: geo-location detection, entity linking, opinion mining and sentiment analysis, emotion and mood analysis, event and topic detection, summarization, machine translation, and other tasks. They tend to pre-process the messages with some of the tools mentioned in Chapter 2 in order to extract the knowledge needed in the next processing levels. For each task, we discuss the evaluation metrics and any existing test datasets.

Chapter 4: This chapter presents higher-level applications that use some of the methods from Chapter 3. We look at: healthcare applications, financial applications, predicting voting intentions, media monitoring, security and defense applications, NLP-based information visualization for social media, disaster response applications, NLP-based user modeling, and applications for entertainment.

Chapter 5: This chapter discusses complementary aspects such as data collection and annotation in social media, privacy issues in social media, and spam detection in order to avoid spam in the collected datasets. We also describe some of the existing evaluation benchmarks that make available data collected and annotated for various tasks.

Chapter 6: The last chapter summarizes the methods and applications described in the preceding chapters. We conclude with a discussion of the high potential for research, given the social media analysis needs of end-users.

As mentioned in the Preface, the intended audience of this book is researchers who are interested in developing tools and applications for automatic analysis of social media texts. We assume that the readers have basic knowledge in the area of natural language processing and machine learning. Nonetheless, we will try to define as many notions as we can, in order to facilitate understanding for beginners in these two areas. We also assume basic knowledge of computer science in general.


In this chapter, we reviewed the structure of social networks and social media data as the collection of textual information on the Web. We presented semantic analysis in social media as a new opportunity for big data analytics and for intelligent applications. Monitoring and analyzing the continuous flow of user-generated content in social media can provide an additional dimension, containing valuable information that would not have been available from traditional media and newspapers. In addition, we mentioned the challenges with social media data, which are due to their large size, and to their noisy, dynamic, and unstructured nature.

1 http://people.eng.unimelb.edu.au/tbaldwin/pubs/starsem2014.pdf

2 http://www.statista.com/

3 http://www.cision.com/uk/files/2013/10/social-journalism-study-2013.pdf

4 https://aclweb.org/anthology/W/W14/#1300

5 https://aclweb.org/anthology/W/W13/#1100

6 https://aclweb.org/anthology/W/W12/#2100

7 https://aclweb.org/anthology/W/W12/#2100

8 https://aclweb.org/anthology/W/W11/#0700

9 http://www.scc.lancs.ac.uk/microposts2015/

10 http://www.scc.lancs.ac.uk/microposts2014/

11 http://oak.dcs.shef.ac.uk/msm2013/

12 http://ceur-ws.org/Vol-838/

13 http://ceur-ws.org/Vol-718/

14 https://sites.google.com/site/socialnlp/2nd-socialnlp-workshop

15 https://sites.google.com/site/socialnlp/1st-socialnlp-workshop
