Natural Language Processing for Social Media - Diana Inkpen
CHAPTER 2
Linguistic Pre-processing of Social Media Texts
2.1 INTRODUCTION
In this chapter, we discuss current Natural Language Processing (NLP) linguistic pre-processing methods and tools that were adapted for social media texts. We survey the methods used for adaptation to this kind of text. We briefly define the evaluation measures used for each type of tool in order to be able to mention the state-of-the-art results.
In general, evaluation in NLP can be done in several ways:
• manually, by having humans judge the output of each tool;
• automatically, on test data that humans have annotated with the expected solution ahead of time; and
• task-based, by using the tools in a task and evaluating how much they contribute to the success in the task.
We primarily focus on the second approach here. It is the most convenient since it allows the tools to be re-evaluated automatically after each change or improvement to their methods, and it allows different tools to be compared on the same test data. Care should be taken when human judges annotate data. There should be at least two annotators, who are given proper instructions on what and how to annotate (in an annotation manual). There needs to be a reasonable agreement rate between the annotators to ensure the quality of the obtained data. When there are disagreements, the expected solution is obtained either by taking a vote (if there is an odd number of annotators, three or more) or by having the annotators discuss until they reach an agreement (if there are only two annotators, or an even number). When reporting the inter-annotator agreement for a dataset, the kappa statistic also needs to be reported, in order to correct the observed agreement for agreement that could occur by chance [Artstein and Poesio, 2008, Carletta, 1996].
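To make the chance correction concrete, here is a minimal sketch of Cohen's kappa for two annotators, one common form of the kappa statistic. The function name and the six example labels are hypothetical and serve only as an illustration; the chapter does not prescribe a particular implementation.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of six tweets (POS = positive, NEG = negative).
annotator_1 = ["POS", "POS", "NEG", "NEG", "POS", "NEG"]
annotator_2 = ["POS", "NEG", "NEG", "NEG", "POS", "POS"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this made-up example the annotators agree on four of six items (raw agreement of about 0.67), but half of that agreement is expected by chance, so kappa is only about 0.33.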
NLP tools often use supervised machine learning, and the training data are usually annotated by human judges. In such cases, it is convenient to keep aside some of the annotated data for testing and to use the remaining data to train the models. Many of the methods discussed in this book use machine learning algorithms for automatic text classification. That is why we give a very brief introduction here. See, e.g., [Witten and Frank, 2005] for details of the algorithms and [Sebastiani, 2002] for how they can be applied to text data.
A supervised text classification model predicts the label c of an input x, where x is a vector of feature values extracted from document d. The class c can take two or more possible values from a specified set (or even continuous numeric values, in which case the model is called a regression model). The training data contain document vectors for which the classes are provided. The classifier uses the training data to learn which features or combinations of features are strongly associated with one of the classes but not with the other classes. In this way, the trained model can make predictions for unseen test data in the future. There are many classification algorithms. We name the three classifiers most popular in NLP tasks.
Decision trees take one feature at a time, compute its power of discriminating between the classes and build a tree with the most discriminative features in the upper part of the tree; decision trees are useful because the models can be easily understood by humans. Naïve Bayes is a classifier that learns the probabilities of association between features and classes; these models are used because they are known to work well with text data (see a more detailed description in Section 2.8.1). Support Vector Machines (SVM) compute a hyperplane that separates two classes, and they can efficiently perform nonlinear classification using what is called a kernel to map the data into a high-dimensional feature space where they become linearly separable [Cortes and Vapnik, 1995]; SVMs are probably the most often used classifiers due to their high performance on many tasks.
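As a rough illustration of how a classifier learns associations between features and classes from labeled documents, the following is a minimal sketch of a multinomial Naïve Bayes text classifier. The function names, the four toy training documents, and the use of add-one smoothing are all assumptions made for this example; it is not the formulation given in Section 2.8.1.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents):
    """documents: list of (tokens, label) pairs. Returns the model counts."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log-probability for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)        # class prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue                             # ignore unseen words
            # Add-one smoothing avoids zero probabilities.
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
training = [
    ("great game tonight".split(), "pos"),
    ("love this team".split(), "pos"),
    ("awful traffic again".split(), "neg"),
    ("hate this weather".split(), "neg"),
]
model = train_nb(training)
label_1 = predict_nb(model, "love this game".split())
label_2 = predict_nb(model, "awful weather".split())
```

With such a small training set the predictions only reflect which words were seen with which class; real systems use far larger annotated collections and richer features.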
A sequence-tagging model can be seen as a classification model, but fundamentally differs from a conventional one, in the sense that instead of dealing with a single input x and a single label c each time, it predicts a sequence of labels c = (c1, c2,…, cn) based on a sequence of inputs x = (x1, x2,…, xn) and the predictions from the previous steps. It was applied with success in natural language processing (for sequential data such as sequences of part-of-speech tags, discussed in the previous chapter) and in bioinformatics (for DNA sequences). There exist a number of sequence-tagging models, including Hidden Markov Model (HMM) [Baum and Petrie, 1966], Conditional Random Field (CRF) [Lafferty et al., 2001], and Maximum Entropy Markov Model (MEMM) [Berger et al., 1996].
The remainder of this chapter is structured as follows. Section 2.2 discusses generic methods of adapting NLP tools to social media texts. The next four sections discuss NLP tools of interest: tokenizers, part-of-speech taggers, chunkers and parsers, and named entity recognizers, as well as adaptation techniques for each. Section 2.7 enumerates the existing toolkits that were adapted to social media texts in English. Section 2.8 discusses multi-lingual aspects and language identification issues in social media. Section 2.9 summarizes this chapter.
2.2 GENERIC ADAPTATION TECHNIQUES FOR NLP TOOLS
NLP tools are important because they need to be used before we can build any applications that aim to understand texts or extract useful information from texts. Many NLP tools are now available, with acceptable levels of accuracy on texts that are similar to the types of texts used for training the models embedded in these tools. Most of the tools are trained on carefully edited texts, usually newspaper texts, due to the wide availability of these kinds of texts. For example, the Penn TreeBank corpus, consisting of 4.5 million words of American English [Marcus et al., 1993], was manually annotated with part-of-speech tags and parse trees, and it is often the main resource used to train part-of-speech taggers and parsers.
Current NLP tools tend to work poorly on social media texts, because these texts are informal, not carefully edited, and they contain grammatical errors, misspellings, new types of abbreviations, emoticons, etc. They are very different from the types of texts used for training the NLP tools. Therefore, the tools need to be adapted in order to achieve reasonable levels of performance on social media texts.
Table 2.1 shows three examples of Twitter messages, taken from Ritter et al. [2011], just to illustrate how noisy the texts can be.
Table 2.1: Three examples of Twitter texts
There are two ways to adapt NLP tools to social media texts. The first one is to perform text normalization so that the informal language becomes closer to the type of texts on which the tools were trained. The second one is to re-train the models inside the tool on annotated social media texts. Depending on the goal of the NLP application, a combination of the two techniques could be used, since both have their own limitations, as discussed below (see Eisenstein [2013b] for a more detailed discussion).
2.2.1 TEXT NORMALIZATION
Text normalization is a possible solution for overcoming or reducing linguistic noise. The task can be approached in two stages: first, the identification of orthographic errors in an input text, and second, the correction of these errors. Normalization approaches typically include a dictionary of known correctly spelled terms, and detect in-vocabulary and out-of-vocabulary (OOV) terms with respect to this dictionary. The normalization can be basic or more advanced. Basic normalization deals with the errors detected at the POS tagging stage, such as unknown words, misspelled words, etc. Advanced normalization is more flexible, taking a lightly supervised automatic approach trained on an external dataset (annotated with short forms vs. their equivalent long or corrected forms).
For social media texts, the normalization that can be done is rather shallow. Because of its informal and conversational nature, social media text cannot become carefully edited English. Similar issues appear in SMS text messages on phones, where short forms and phonetic abbreviations are often used to save typing time. According to Derczynski et al. [2013b], text normalization in Twitter messages did not help much in the named entity recognition task.
Twitter text normalization into traditional written English [Han and Baldwin, 2011] is not only difficult, but it can be viewed as a “lossy” translation task. For example, many of Twitter’s unique linguistic phenomena are due not only to its informal nature, but also to a set of authors that is heavily skewed toward younger ages and minorities, with heavy usage of dialects that are different from standard English [Eisenstein, 2013a, Eisenstein et al., 2011].
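The two stages described above (detecting OOV terms against a dictionary, then correcting them) can be sketched as follows. The vocabulary, the short-form lexicon, and the function names are hypothetical and kept tiny on purpose; a real system would use a full dictionary and a learned or curated mapping.

```python
# Hypothetical resources: a tiny vocabulary and a short-form lexicon.
VOCABULARY = {"see", "you", "tomorrow", "at", "the", "game", "tonight", "great"}
SHORT_FORMS = {"u": "you", "2moro": "tomorrow", "2nite": "tonight", "gr8": "great"}

def find_oov(tokens, vocabulary=VOCABULARY):
    """Stage 1: return the positions of tokens not found in the dictionary."""
    return [i for i, tok in enumerate(tokens) if tok.lower() not in vocabulary]

def normalize(tokens, vocabulary=VOCABULARY, short_forms=SHORT_FORMS):
    """Stage 2: replace OOV tokens that have a known standard form."""
    result = list(tokens)
    for i in find_oov(tokens, vocabulary):
        key = tokens[i].lower()
        if key in short_forms:
            result[i] = short_forms[key]
        # OOV tokens with no known replacement are left unchanged.
    return result

tweet = "see u 2moro at the game".split()
oov_positions = find_oov(tweet)
normalized = normalize(tweet)
```

Leaving unknown OOV tokens untouched is a deliberate choice in this sketch: names, hashtags, and dialect words are also out of vocabulary, and rewriting them would lose information.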
Demir [2016] describes a method of context-tailored text normalization. The method considers contextual and lexical similarities between standard and non-standard words, in order to reduce noise. The non-standard words in the input context in a given sentence are tailored into a direct match, if there are possible shared contexts. A morphological parser is used to analyze all the words in each sentence. Turkish social media texts were used to evaluate the performance of the system. The dataset contains tweets (~11 GB) and clean Turkish texts (~6 GB). The system achieved state-of-the-art results on the 715 Turkish tweets.
Akhtar et al. [2015] proposed a hybrid approach for text normalization for tweets. Their methodology proceeds in two phases: the first one detects noisy text, and the second one uses various heuristic-based rules for normalization. The researchers trained a supervised learning model, using 3-fold cross validation to determine the best feature set. Figure 2.1 depicts a schematic diagram of the proposed approach. Their system yielded precision, recall, and F-measure values of 0.90, 0.72, and 0.80, respectively, for their test dataset.
Most practical applications leverage the simpler approach of replacing non-standard words with their standard counterparts as a “one size fits all” task. Baldwin and Li [2015] devised a method that uses a taxonomy of normalization edits. The researchers evaluated this method on three different downstream applications: dependency parsing, named entity recognition, and text-to-speech synthesis. The taxonomy of normalization edits is shown in Figure 2.2. The method categorizes edits at three levels of granularity and its results demonstrate that the targeted application of the taxonomy is an efficient approach to normalization.
Figure 2.1: Methodology for tweet normalization. The dotted horizontal line separates the two steps (detecting the text to be normalized and applying normalization rules) [Akhtar et al., 2015].
Figure 2.2: Taxonomy of normalization edits [Baldwin and Li, 2015].
2.2.2 RE-TRAINING NLP TOOLS FOR SOCIAL MEDIA TEXTS
Re-training NLP tools for social media texts is relatively easy if annotated training data are available. In general, adapting a tool to a specific domain or a specific type of text requires producing annotated training data for that kind of text. It is easy to collect text of the required kind, but annotating it can be a difficult and time-consuming process.
Currently, some annotated social media data have become available, but the volume is not high enough. Several NLP tools have been re-trained on newly annotated data, sometimes by also keeping the original annotated training data for newspaper texts, in order to have a large enough training set. Another approach is to use some unannotated social media text in an unsupervised manner in addition to the small amounts of annotated social media text.
Another question is what kinds of social media texts to use for training. It seems that Twitter messages are more difficult to process than blog posts or messages from forums. Because Twitter messages are limited to 140 characters, they contain more abbreviations and shortened forms of words, and their syntax is more simplified. Therefore, training data should include several kinds of social media texts (unless somebody is building a tool designed for a particular kind of social media text).
We define the tasks accomplished by each kind of tool and we discuss techniques for adapting them to social media texts.
2.3 TOKENIZERS
The first step in processing a text is to separate the words from punctuation and other symbols. A tool that does this is called a tokenizer. White space is a good indicator of word separation (except in some languages, e.g., Chinese), but even white space is not sufficient. The question of what is a word is not trivial. When doing corpus analysis, there are strings of characters that are clearly words, but there are strings for which this is not clear. Most of the time, punctuation needs to be separated from words, but some abbreviations might contain punctuation characters as part of the word. Take, for example, the sentence: “We bought apples, oranges, etc.” The commas clearly need to be separated from the word “apples” and from the word “oranges,” but the dot is part of the abbreviation “etc.” In this case, the dot also indicates the end of the sentence (two dots were reduced to one). Other examples among the many issues that appear are: how to treat numbers (if they contain commas or dots, these characters should not be separated), or what to do with contractions such as “don’t” (perhaps to expand them into two words “do” and “not”).
While tokenization usually consists of two subtasks (sentence boundary detection and token boundary detection), the EmpiriST shared task provided sentence boundaries and the participating teams only had to detect token boundaries. Missing whitespace characters present a major challenge to the task of tokenization. Table 2.2 shows a few examples with their correct tokenization.
Methods for Tokenizers
Horsmann and Zesch [2016] evaluated a method for dealing with token boundaries consisting of three steps. First, the researchers split the text at white space characters. Then they employed regular expressions to refine the splitting of alphanumeric text segments from punctuation characters in special character sequences such as smileys. Finally, these sequences of punctuation were reassembled: the researchers merged the most common combinations of characters into a single token using the training data, and used word lists to merge abbreviations with their following dot character. Accuracy in their experiments increased when more in-domain training data were used.
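The three steps can be illustrated with a small sketch. This is not the authors' implementation: the abbreviation list, the emoticon list, and the function name are hypothetical, and the merge step here only handles two-character emoticons and single-word abbreviations, whereas the original method derives frequent character combinations from training data.

```python
import re

# Hypothetical word list of abbreviations that keep their trailing dot.
ABBREVIATIONS = {"etc", "dr", "mr"}
# A few hypothetical two-character emoticons to keep as single tokens.
EMOTICONS = {":)", ":(", ";)", ":D"}

def tokenize(text):
    tokens = []
    # Step 1: split on white space.
    for chunk in text.split():
        # Step 2: separate alphanumeric runs from individual punctuation marks.
        pieces = re.findall(r"\w+|[^\w\s]", chunk)
        # Step 3: reassemble pieces that belong together.
        merged = []
        for piece in pieces:
            prev = merged[-1] if merged else None
            if prev is not None and piece == "." and prev.lower() in ABBREVIATIONS:
                merged[-1] = prev + "."      # abbreviation plus its dot
            elif prev is not None and prev + piece in EMOTICONS:
                merged[-1] = prev + piece    # two-character emoticon
            else:
                merged.append(piece)
        tokens.extend(merged)
    return tokens

tokens_1 = tokenize("We bought apples, oranges, etc. :)")
tokens_2 = tokenize("nice:)")
```

Because the merge step works on the pieces of each chunk, it also recovers an emoticon that was typed without a preceding space, as in the second call.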
Table 2.2: Examples of tokenization
Evaluation Measures for Tokenizers
Accuracy is a simple measure that calculates how many correct decisions a tool makes. When not all the expected tokens are retrieved, precision and recall are the measures to report. The precision of token recognition measures how many tokens are correct out of how many were found. Recall measures the coverage (from the tokens that should have been retrieved, how many were found). F-measure (or F-score) is often reported when one single number is needed, because F-measure is the harmonic mean of the precision and recall, and it is high only when both the precision and the recall are high. Evaluation measures are rarely reported for tokenizers, one exception being the CleanEval shared task which focused on tokenizing text from web pages [Baroni et al., 2008].
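As a worked illustration of these measures, the sketch below scores a tokenizer output against a reference tokenization. Representing each token by its character offsets is an assumption of this example (it makes repeated tokens such as commas distinguishable), and the helper names are hypothetical. The F-measure is computed as 2PR / (P + R).

```python
def token_spans(tokens, text):
    """Map each token to its (start, end) character offsets in text."""
    spans, pos = set(), 0
    for tok in tokens:
        start = text.index(tok, pos)
        end = start + len(tok)
        spans.add((start, end))
        pos = end
    return spans

def precision_recall_f1(predicted, gold):
    """Precision, recall, and their harmonic mean over two sets of items."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

text = "We bought apples, oranges, etc."
gold = token_spans(["We", "bought", "apples", ",", "oranges", ",", "etc."], text)
# A tokenizer that failed to split the first comma from "apples".
pred = token_spans(["We", "bought", "apples,", "oranges", ",", "etc."], text)
p, r, f = precision_recall_f1(pred, gold)
```

Here five of the six predicted tokens are correct (precision 5/6), but the reference has seven tokens (recall 5/7), giving an F-measure of 10/13, about 0.77.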
Many NLP projects tend to not mention what kind of tokenization they used, and focus more on higher-level processing. Tokenization, however, can have a large effect on the results obtained at the next levels. For example, Fokkens et al. [2013] replicated two high-level tasks from previous work and obtained very different results, when using the same settings but different tokenization.
Adapting Tokenizers to Social Media Texts
Tokenizers need to deal with the specifics of social media texts. Emoticons need to be detected as tokens. For Twitter messages, user names (starting with @), hashtags (starting with #), and URLs (links to web pages) should be treated as tokens, without separating punctuation or other symbols that are part of the token. Some shallow normalization can be useful at this stage. Derczynski et al. [2013b] tested a tokenizer on Twitter data, and its F-measure was around 80%. By using regular expressions designed specifically for Twitter messages, they were able to increase the F-measure to 96%. More about such regular expressions can be found in [O’Connor et al., 2010].
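A minimal sketch of such Twitter-aware regular expressions is shown below. The pattern is an illustrative assumption, not the expressions of O’Connor et al. [2010] or Derczynski et al. [2013b]; it covers only URLs, user names, hashtags, a handful of emoticons, and words with straight apostrophes, and the example tweet and its URL are made up.

```python
import re

# Alternatives are tried left to right, so Twitter-specific patterns come first.
TWEET_TOKEN = re.compile(
    r"""
    https?://\S+            # URLs, kept whole
  | @\w+                    # user names
  | \#\w+                   # hashtags
  | [:;=][-o]?[()DPp]       # a few common emoticons
  | \w+(?:'\w+)?            # words, optionally with an apostrophe part
  | [^\w\s]                 # any other single non-space symbol
    """,
    re.VERBOSE,
)

def tokenize_tweet(text):
    """Return the list of tokens matched by the pattern above."""
    return TWEET_TOKEN.findall(text)

tweet_tokens = tokenize_tweet("@user I don't care! #TGIF :) http://t.co/abc")
```

One known limitation of this sketch is that punctuation directly following a URL is absorbed into the URL token; a production tokenizer would need a stricter URL pattern.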
2.4 PART-OF-SPEECH TAGGERS
Part-of-speech (POS) taggers determine the part of speech of each word in a sentence. They label nouns, verbs, adjectives, adverbs, interjections, conjunctions, etc. Often they use finer-grained tagsets, such as singular nouns, plural nouns, proper nouns, etc. Different tagsets exist, one of the most popular being the Penn TreeBank tagset [Marcus et al., 1993]. See Table 2.3 for a commonly used list of its tags. The models embedded in the POS taggers are often complex, based on Hidden Markov Models [Baum and Petrie, 1966], Conditional Random Fields [Lafferty et al., 2001], etc. They need annotated training data in order to learn probabilities and other parameters of the models.
Methods for Part-of-speech Taggers
Horsmann and Zesch [2016] trained a CRF classifier [Lafferty et al., 2001] using the FlexTag tagger [Zesch and Horsmann, 2016]. There are two adaptations involved in this method. The first is a general domain adaptation. The researchers applied a domain adaptation strategy, which they proposed as a competitive model to improve the accuracy of tagging social media texts. To train their model, they used the CMC and Web corpora subsets from the EmpiriST shared task and an additional 100,000 tokens of newswire text from the Tiger corpus. The second adaptation is specific to the EmpiriST shared task. Because some PoS tags are too rare to be learned from training data, the researchers utilized a post-processing step that leveraged heuristics. This step involved the use of regular expressions and word lists from Wikipedia and Wiktionary to improve named entity recognition and case-insensitive matching. Selecting tags from the larger Tiger corpus introduced bias, so the researchers added extra Boolean features to their model.
Evaluation Measures for Part-of-speech Taggers
The accuracy of the tagging is usually measured as the number of tags correctly assigned out of the total number of words/tokens being tagged.
Adapting Part-of-speech Taggers
POS taggers clearly need re-training in order to be usable on social media data. Even the set of POS tags used must be extended in order to adapt to the needs of this kind of text. Ritter et al. [2011] used the Penn TreeBank tagset (Table 2.3) to annotate 800 Twitter messages. They added a few new tags for the Twitter-specific phenomena: retweets, @usernames, #hashtags, and URLs. Words in these categories can be tagged with very high accuracy using simple regular expressions, but they still need to be taken into consideration as features in the re-training of the taggers (for example as tags of the previous word to be tagged). In Ritter et al. [2011], the POS tagging accuracy drops from about 97% on newspaper text to 80% on the 800 tweets. These numbers are reported for the Stanford POS tagger [Toutanova et al., 2003]. Their POS tagger T-POS, based on a Conditional Random Field classifier and on the clustering of out-of-vocabulary (OOV) words, also obtained low performance on Twitter data (81%). By retraining the T-POS tagger on the annotated Twitter data (which is rather small), the accuracy increases to 85%. The best accuracy rises to 88% when the size of the training data is increased by adding to the Twitter data the initial Penn TreeBank training data, plus 40,000 tokens of annotated Internet Relay Chat (IRC) data [Forsyth and Martell, 2007], which is similar in style to Twitter data. Similar numbers are reported by Derczynski et al. [2013b] on a part of the same Twitter dataset.
Table 2.3: Penn TreeBank tagset
A key reason for the drop in accuracy on Twitter data is that the data contains far more OOV words than grammatical text. Many of these OOV words come from spelling variation, e.g., the use of “n” for “in” in Example 3 from Table 2.1. The tag for proper nouns (NNP) is the most frequent tag for OOV words, while in fact only about one third are proper nouns.
Gimpel et al. [2011] developed a new POS tagset for Twitter (see Table 2.4) that is more coarse-grained and pays particular attention to punctuation, emoticons, and Twitter-specific tags (@usernames, #hashtags, URLs). They manually tagged 1,827 tweets with the new tagset; then, they trained a POS tagging model that uses features geared toward Twitter text. The experiments conducted to evaluate the model showed 90% accuracy for the POS tagging task. Owoputi et al. [2013] improved on the model by using word clustering techniques and trained the POS tagger on a better dataset of tweets and chat messages.
2.5 CHUNKERS AND PARSERS
A chunker detects noun phrases, verb phrases, adjectival phrases, and adverbial phrases, by determining the start point and the end point of every such phrase. Chunkers are often referred to as shallow parsers because they do not attempt to connect the phrases in order to detect the syntactic structure of the whole sentence.
A parser performs the syntactic analysis of a sentence, and usually produces a parse tree. The trees are often used in future processing stages, toward semantic analysis or information extraction.
A dependency parser extracts pairs of words that are in a syntactic dependency relation, rather than a parse tree. Relations can be verb-subject, verb-object, noun-modifier, etc.
Evaluation Measures for Chunking and Parsing
The Parseval evaluation campaign [Harrison et al., 1991] proposed measures that compare the phrase-structure bracketings produced by the parser with bracketings in the annotated corpus (treebank). One computes the number of bracketing matches M with respect to the number of bracketings P returned by the parser (expressed as precision M/P) and with respect to the number C of bracketings in the corpus (expressed as recall M/C). Their harmonic mean, the F-measure, is most often reported for parsers. In addition, the mean number of crossing brackets per sentence could be reported, to count the number of cases when a bracketed sequence from the parser overlaps with one from the treebank (i.e., neither is properly contained in the other). For chunking, the accuracy can be reported as the tag correctness for each chunk (labeled accuracy), or separately for each token in each chunk (token-level accuracy). The former is stricter because it does not give credit to a chunk that is partially correct but incomplete, for example one or more words too short or too long.
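The bracketing measures can be sketched as follows, using the symbols M, P, and C from the paragraph above. Representing a bracket as a (label, start, end) triple over token positions with an exclusive end, and the toy parser and treebank analyses of a six-token sentence, are assumptions of this example.

```python
from collections import Counter

def parseval(parser_brackets, gold_brackets):
    """Labeled bracketing precision (M/P), recall (M/C), and F-measure."""
    p_counts, g_counts = Counter(parser_brackets), Counter(gold_brackets)
    matches = sum((p_counts & g_counts).values())                          # M
    precision = matches / len(parser_brackets) if parser_brackets else 0.0
    recall = matches / len(gold_brackets) if gold_brackets else 0.0
    f = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f

def crossing_brackets(parser_brackets, gold_brackets):
    """Count parser brackets that overlap a gold bracket without nesting."""
    def crosses(a, b):
        (s1, e1), (s2, e2) = a, b
        return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1
    return sum(
        any(crosses((s, e), (gs, ge)) for _, gs, ge in gold_brackets)
        for _, s, e in parser_brackets
    )

# Hypothetical analyses of "the cat sat on the mat" (tokens 0..5).
gold = [("S", 0, 6), ("NP", 0, 2), ("VP", 2, 6), ("PP", 3, 6), ("NP", 4, 6)]
parsed = [("S", 0, 6), ("NP", 0, 3), ("VP", 3, 6), ("PP", 3, 6), ("NP", 4, 6)]
precision, recall, f_measure = parseval(parsed, gold)
n_crossing = crossing_brackets(parsed, gold)
```

In this toy case three of five brackets match, so precision, recall, and F-measure are all 0.6, and one parser bracket (the NP over tokens 0 to 2) crosses the treebank VP.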
Table 2.4: POS tagset from Gimpel et al. [2011]
Adapting Parsers
Parsing performance also decreases on social media text. Foster et al. [2011] tested four dependency parsers and showed that their performance decreases from 90% F-score on newspaper text to 70–80% on social media text (70% on Twitter data and 80% on discussion forum texts). After retraining on a small amount of social media training data (1,000 manually corrected parses) plus a large amount of unannotated social media text, the performance increased to 80–83%. Ovrelid and Skjærholt [2012] also show the labeled accuracy of dependency parsers decreasing from newspaper data to Twitter data.
Ritter et al. [2011] also explored shallow parsing and noun phrase chunking for Twitter data. The token-level accuracy for the shallow parsing of tweets was 83% with the OpenNLP chunker and 87% with their shallow parser T-chunk. Both were re-trained on a small amount of annotated Twitter data plus the Conference on Natural Language Learning (CoNLL) 2000 shared task data [Tjong Kim Sang and Buchholz, 2000].
Khan et al. [2013] reported experiments on parser adaptation to social media texts and other kinds of Web texts. They found that text normalization helps increase performance by a few percentage points, and that a tree reviser based on grammar comparison helps to a small degree. A dependency parser named TweeboParser was developed specifically on a recently annotated Twitter treebank for 929 tweets [Kong et al., 2014]. It uses the POS tagset from Gimpel et al. [2011] presented in Table 2.4. Table 2.5 shows an example of output of the parser for the tweet “They say you are what you eat, but it’s Friday and I don’t care! #TGIF (@ Ogalo Crows Nest) http://t.co/l3uLuKGk”.
The columns represent, in order: ID is the token counter, starting at 1 for each new sentence; FORM is the word form or punctuation symbol; CPOSTAG is the coarse-grained part-of-speech tag, where the tagset depends on the language; POSTAG is the fine-grained part-of-speech tag, where the tagset depends on the language, or it is identical to the coarse-grained part-of-speech tag, if not available; HEAD is the head of the current token, given as an ID (–1 indicates that the word is not included in the parse tree; some treebanks also used zero as ID); and finally, DEPREL is the dependency relation to the HEAD. The set of dependency relations depends on the particular language. Depending on the original treebank annotation, the dependency relation may be meaningful or simply “ROOT.” So, for this tweet, the dependency relations are MWE (Multiword expression), CONJ (Conjunct), and many other relations between the word IDs, but they are not named (probably due to the limited training data used when the parser was trained). The dependency relations from the Stanford dependency parser are included, if they can be detected in a tweet. If they cannot be named, they are still in the table, but without a label.
Table 2.5: Example of tweet parsed with the TweeboParser
2.6 NAMED ENTITY RECOGNIZERS
A named entity recognizer (NER) detects names in the texts, as well as dates, currency amounts, and other kinds of entities. NER tools often focus on three types of names: Person, Organization, and Location, by detecting the boundaries of these phrases. There are a few other types of tools that can be useful in the early stages of NLP applications. One example is a co-reference resolution tool that can be used to detect the noun that a pronoun refers to or to detect different noun phrases that refer to the same entity. In fact, NER is a semantic task, not a linguistic pre-processing task, but we introduce it in this chapter because it became part of many of the recent NLP tools discussed in this chapter. We will talk more about specific kinds of entities in Sections 3.2 and 3.3, in the context of integrating more and more semantic knowledge when solving the respective tasks.
Methods for NER
NER is composed of two sub-tasks: detecting entities (the span of text where a name starts and where it ends) and determining/classifying the type of entity. The methods used in NER are based either on linguistic grammars for each type of entity or on statistical methods. Semi-supervised learning techniques were proposed, but supervised learning, especially based on CRFs for sequence learning, is the most prevalent. Hand-crafted grammar-based systems typically obtain good precision, but at the cost of lower recall and months of work by experienced computational linguists. Supervised learning techniques were used more recently due to the availability of annotated training datasets, mostly for newspaper texts, such as data from MUC 6, MUC 7, and ACE, and also the CoNLL 2003 English NER dataset [Tjong Kim Sang and De Meulder, 2003].
Tkachenko et al. [2013] described a supervised learning method for named-entity recognition. Feature engineering and learning algorithm selection are critical factors when designing a NER system. Possible features could include word lemmas, part-of-speech tags, and occurrence in some dictionary that encodes characteristic attributes of words relevant for the classification task. Tkachenko et al. [2013] included morphological, dictionary-based, WordNet-based, and global features. For their learning algorithm, the researchers chose CRFs, which have a sequential nature and the ability to handle a large number of features. As also mentioned above, CRFs are widely used for the task of NER. For Estonian, the researchers produced a gold-standard NER corpus, on which their CRF-based model achieved an overall F1-score of 0.87.
He and Sun [2017] developed a semi-supervised learning model based on deep neural networks (B-LSTM). This system combined transition probabilities with deep learning to train the model directly on F-score and label accuracy. The researchers used a modified, labeled corpus which corrected labeling errors in data developed by Peng and Dredze [2016] for NER in Chinese social media. They evaluated their model on NER and nominal mention tasks. Their result for NER on the dataset of Peng and Dredze [2016] was the state of the art for NER in Chinese social media: their B-LSTM model achieved an F-score of 0.53.
Evaluation Measures for NER
The precision, recall, and F-measure can be calculated at sequence level (whole span of text) or at token level. The former is stricter because each named entity that is longer than one word has to have an exact start and end point. Once entities have been determined, the accuracy of assigning them to tags such as Person, Organization, etc., can be calculated.
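The difference between the two levels of evaluation can be seen in a short sketch. The BIO tag encoding (B- for the first token of an entity, I- for the following tokens, O for all others), the helper names, and the two toy tag sequences are assumptions made for this illustration.

```python
def spans_from_bio(tags):
    """Extract (type, start, end) entity spans from a list of BIO tags."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):    # sentinel closes last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))            # close the open entity
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]               # open a new entity
    return spans

def token_items(tags):
    """Entity tokens as (position, type) pairs, ignoring O tags."""
    return {(i, t[2:]) for i, t in enumerate(tags) if t != "O"}

def prf(predicted, gold):
    """Precision, recall, and F-measure over two sets of items."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if correct else 0.0
    return p, r, f

# Hypothetical gold and predicted tags; the prediction truncates the person name.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
pred_tags = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]
span_scores = prf(spans_from_bio(pred_tags), spans_from_bio(gold_tags))
token_scores = prf(token_items(pred_tags), token_items(gold_tags))
```

At sequence level the truncated person name counts as fully wrong (precision and recall both 0.5), whereas at token level the system still gets credit for the tokens it found (precision 1.0, recall 0.75).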
Adaptation for Named Entity Recognition
Named entity recognition methods typically have 85–90% accuracy on long and carefully edited texts, but their performance decreases to 30–50% on tweets [Li et al., 2012a, Liu et al., 2012b, Ritter et al., 2011].
Ritter et al. [2011] reported that the Stanford NER obtains 44% accuracy on Twitter data. They also presented new NER methods for social media texts based on labeled Latent Dirichlet Allocation (LDA) [Ramage et al., 2009], that allowed their T-Seg NER system to reach an accuracy of 67%.
Derczynski et al. [2013b] reported that NER performance drops from 77% F-score on newspaper text to 60% on Twitter data, and that after adaptation it increases to 80% (with the ANNIE NER system from GATE) [Cunningham et al., 2002]. The performance on newspaper data was computed on the CoNLL 2003 English NER dataset [Tjong Kim Sang and De Meulder, 2003], while the performance on social media data was computed on part of the Ritter dataset [Ritter et al., 2011], which consists of 2,400 tweets comprising 34,000 tokens.
Particular attention is given to microtext normalization, as a way of removing some of the linguistic noise prior to part-of-speech tagging and entity recognition [Derczynski et al., 2013a, Han and Baldwin, 2011]. Some research has focused on named entity recognition algorithms specifically for Twitter messages, training a new CRF model on Twitter data [Ritter et al., 2011].
An NER tool can detect various kinds of named entities, or focus only on one kind. For example, Derczynski and Bontcheva [2014] presented methods for detecting person entities. Chapter 3 will discuss methods for detecting other specific kinds of entities. The NER tools can detect entities, disambiguate them (when more than one entity with the same name exists), or solve co-references (when there are several ways to refer to the same entity).
2.7 EXISTING NLP TOOLKITS FOR ENGLISH AND THEIR ADAPTATION
There are many NLP tools developed for generic English and fewer for other languages. We list here several selected tools that have been adapted for social media text. Other tools may be available but are perhaps not yet useful on social media texts, although new tools are continually being developed or adapted. We also briefly mention several toolkits that offer a collection of tools, also called suites if the tools can be used in a sequence of consecutive steps, from tokenization to named entity recognition or more. Some of them can be re-trained for social media texts.
The Stanford CoreNLP is an integrated suite of NLP tools for English programmed in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and co-reference. A text classifier is also available.9
OpenNLP includes tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and co-reference resolution, implemented in Java. It also includes maximum entropy and perceptron-based machine learning algorithms.10
FreeLing includes tools for English and several other languages: text tokenization, sentence splitting, morphological analysis, phonetic encoding, named entity recognition, POS tagging, chart-based shallow parsing, rule-based dependency parsing, nominal co-reference resolution, etc.11
NLTK is a suite of text processing libraries in Python for classification, tokenization, stemming, POS tagging, parsing, and semantic reasoning.12
GATE includes components for diverse language processing tasks, e.g., parsers, morphology, POS tagging. It also contains information retrieval tools, information extraction components for various languages, and many others. The information extraction system (ANNIE) includes a named entity detector.13
NLPTools is a library for NLP written in PHP, geared toward text classification, clustering, tokenizing, stemming, etc.14
Some components of these toolkits were re-trained for social media texts, such as the Stanford POS tagger by Derczynski et al. [2013b], and the OpenNLP chunker by Ritter et al. [2011], as we noted earlier.
One toolkit that was fully adapted to social media text is GATE. A new module or plugin called TwitIE15 is available [Derczynski et al., 2013a] for tokenization of Twitter texts, as well as POS tagging, named entity recognition, etc.
Two new toolkits were built especially for social media texts: the TweetNLP tools developed at CMU and the Twitter NLP tools developed at the University of Washington (UW).
TweetNLP is a Java-based tokenizer and part-of-speech tagger for Twitter text [Owoputi et al., 2013]. It includes training data of manually labeled POS annotated tweets (that we noted above), a Web-based annotation tool, and hierarchical word clusters from unlabeled tweets.16 It also includes the TweeboParser mentioned above.
The UW Twitter NLP Tools [Ritter et al., 2011] contain the POS tagger and the annotated Twitter data (mentioned above—see adaptation of POS taggers).17
A few other tools for English are in development, and a few tools for other languages have been adapted or can be adapted to social media text. The development of the latter is slower, due to the difficulty in producing annotated training data for many languages, but there is progress. For example, a treebank for French social media texts was developed by Seddah et al. [2012].
2.8 MULTI-LINGUALITY AND ADAPTATION TO SOCIAL MEDIA TEXTS
Social media messages are available in many languages. Some messages could be mixed, for example part in English and part in another language. This is called “code switching.” If tools for multiple languages are available, a language identification tool needs to be run on the texts before using the right language-specific tools for the next processing steps.
2.8.1 LANGUAGE IDENTIFICATION
Language identification can reach very high accuracy for long texts (98–99%), but it needs adaptation to social media texts, especially to short texts such as Twitter messages.
Derczynski et al. [2013b] showed that language identification accuracy decreases to around 90% on Twitter data, and that re-training can lead to 95–97% accuracy levels. This increase is easily achievable for tools that classify into a small number of languages, while tools that classify into a large number of languages (close to 100 languages) cannot be further improved on short informal texts. Lui and Baldwin [2014] tested six language identification tools and obtained the best results on Twitter data by majority voting over three of them, up to an F-score of 0.89.
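The majority-voting idea mentioned above can be sketched as follows. This is only a minimal illustration, not the procedure of Lui and Baldwin [2014]: the three identifier functions are hypothetical stand-ins for real tools, and the strict-majority rule with a None fallback is an assumption.

```python
from collections import Counter
from typing import Callable, List, Optional


def majority_vote(text: str,
                  identifiers: List[Callable[[str], str]]) -> Optional[str]:
    """Return the label chosen by a strict majority of identifiers, else None."""
    votes = Counter(identify(text) for identify in identifiers)
    label, count = votes.most_common(1)[0]
    return label if count > len(identifiers) / 2 else None


# Hypothetical toy identifiers standing in for three off-the-shelf tools.
def tool_a(text: str) -> str:
    return "en" if " the " in f" {text.lower()} " else "nl"


def tool_b(text: str) -> str:
    return "nl" if " de " in f" {text.lower()} " else "en"


def tool_c(text: str) -> str:
    return "nl" if " het " in f" {text.lower()} " else "en"


TOOLS = [tool_a, tool_b, tool_c]
print(majority_vote("the cat sat on the mat", TOOLS))
print(majority_vote("de kat zit op het dak", TOOLS))
```

Returning None when no label wins a strict majority leaves the tie-breaking policy (for example, falling back to the single most reliable tool) to the caller.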
Barman et al. [2014] presented a new dataset containing Facebook posts and comments that exhibit code mixing between Bengali, English, and Hindi. The researchers demonstrated some preliminary word-level language identification experiments using this dataset. The methods surveyed included a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labeling using Conditional Random Fields. The preliminary results demonstrated the superiority of supervised classification and sequence labeling over dictionary-based classification, suggesting that contextual clues are necessary for accurate classifiers. The CRF model achieved the best result with an F-score of 0.95.
There is a lot of work on language identification in social media. Twitter has been a favorite target, and a number of papers deal with language identification of Twitter messages specifically [Bergsma et al., 2012, Carter et al., 2013, Goldszmidt et al., 2013, Mayer, 2012, Tromp and Pechenizkiy, 2011]. Tromp and Pechenizkiy [2011] proposed a graph-based n-gram approach that works well on tweets. Lui and Baldwin [2014] looked specifically at the problem of adapting existing language identification tools to Twitter messages, including challenges in obtaining data for evaluation, as well as the effectiveness of proposed strategies. They tested several tools on Twitter data (including a newly collected corpus for English, Japanese, and Chinese). The tests were done with off-the-shelf tools, before and after a simple cleaning of the Twitter data, such as removing hashtags, mentions, emoticons, etc. The improvement after the cleaning was small. Bergsma et al. [2012] looked at less common languages, in order to collect language-specific corpora. The nine languages they focused on (Arabic, Farsi, Urdu, Hindi, Nepali, Marathi, Russian, Bulgarian, Ukrainian) use three different non-Latin scripts: Arabic, Devanagari, and Cyrillic. Their method for language identification was based on language models.
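A simple cleaning step of the kind described above might look like the sketch below. The regular expressions are illustrative assumptions, not the patterns used by Lui and Baldwin [2014]; in particular, the emoticon pattern covers only a few ASCII emoticons and does not handle Unicode emoji.

```python
import re

# Illustrative patterns (assumptions); real pipelines use richer rules.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
# A few ASCII emoticons such as :) ;-) =D :P, only when standing alone.
EMOTICON = re.compile(r"(?<!\S)[:;=][\-o*']?[)\](\[dDpP/\\|](?!\S)")


def clean_tweet(text: str) -> str:
    """Remove URLs, mentions, hashtags, and simple emoticons from a tweet."""
    for pattern in (URL, MENTION, HASHTAG, EMOTICON):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_tweet("@bob loving the new phone :) #happy http://t.co/abc"))
```

Whether to delete a hashtag entirely or keep its word content is a design choice; this sketch deletes it, which shortens an already short message even further.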
Most of the methods used only the text of the message, but Carter et al. [2013] also looked at the use of metadata, an approach which is unique to social media. They identified five microblog characteristics that can help in language identification: the language profile of the blogger, the content of an attached hyperlink, the language profile of other users mentioned in the post, the language profile of a tag, and the language of the original post, if the post is a reply. Further, they presented methods that combine the prior language class probabilities in a post-dependent and post-independent way. Their test results on 1,000 posts from 5 languages (Dutch, English, French, German, and Spanish) showed improvements in accuracy by 5% over the baseline, and showed that post-dependent combinations of the priors achieved the best performance.
Taking a broader view of social media, Nguyen and Dogruöz [2013] looked at language identification in a mixed Dutch-Turkish Web forum. Mayer [2012] considered language identification of private messages between eBay users.
Here are some of the available tools for language identification.
• langid.py18 [Lui and Baldwin, 2012] works for 97 languages and uses a feature set selected from multiple sources, combined via a multinomial Naïve Bayes classifier.
• CLD2,19 the language identifier embedded in the Chrome Web browser,20 uses a Naïve Bayes classifier and script-specific tokenization strategies.
• LangDetect21 is a Naïve Bayes classifier, using a representation based on character n-grams without feature selection, with a set of normalization heuristics.
• whatlang [Brown, 2013] uses a vector-space model with per-feature weighting over character n-grams.
• YALI22 computes a per-language score using the relative frequency of a set of byte n-grams selected by term frequency.
• TextCat23 is an implementation of the method of Cavnar and Trenkle [1994] and it uses an ad hoc rank-order statistic over character n-grams.
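To make the rank-order idea behind the last tool in this list concrete, here is a minimal sketch of an "out-of-place" distance between ranked character n-gram profiles, in the spirit of Cavnar and Trenkle [1994]. It is not the TextCat code: the function names, the profile size, the tie-breaking rule, and the toy English/Dutch training sentences are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def ngram_profile(text: str, max_n: int = 3, top_k: int = 300) -> List[str]:
    """Rank character n-grams (n = 1..max_n) by descending frequency."""
    padded = f" {' '.join(text.lower().split())} "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Ties are broken alphabetically so that the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile: List[str], lang_profile: List[str]) -> int:
    """Sum of rank differences; n-grams missing from the language profile get a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text: str, profiles: Dict[str, List[str]]) -> str:
    """Pick the language whose profile is closest to the document profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (hypothetical); real profiles are built from large corpora.
PROFILES = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog "
                        "and the cat sleeps in the house"),
    "nl": ngram_profile("de snelle bruine vos springt over de luie hond "
                        "en de kat slaapt in het huis"),
}
print(identify("the cat and the dog", PROFILES))
```

Because only ranks are compared, the method needs no probability estimates, which is one reason it is cheap to train and apply.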
Only some of the available tools were trained directly on social media data.
• LDIG24 is an off-the-shelf Java language identification tool targeted specifically at Twitter messages. It has pre-trained models for 47 languages. It uses a document representation based on data structures named tries.25
• MSR-LID [Goldszmidt et al., 2013] is based on rank-order statistics over character n-grams, and Spearman’s coefficient to measure correlations. Twitter-specific training data was acquired through a bootstrapping approach.
Some datasets of social media texts annotated with language labels are available.
• The dataset of Tromp and Pechenizkiy [2011] contains 9,066 Twitter messages labeled with one of the six languages: German, English, Spanish, French, Italian, and Dutch.26
• The Twituser language identification dataset27 of Lui and Baldwin [2014] covers English, Japanese, and Chinese.
2.8.2 DIALECT IDENTIFICATION
Sometimes it is not enough that a language has been identified correctly. A case in point is Arabic. It is the official language in 22 countries, spoken by more than 350 million people worldwide.28 Modern Standard Arabic (MSA) is the written form of Arabic used in education; it is also the formal communication language. Arabic dialects, or colloquial languages, are the spoken varieties of Arabic that Arab people use daily. There are more than 22 dialects; some countries share the same dialect, while many dialects may exist alongside MSA within the same Arab country. Arabic speakers prefer to use their own local dialect. Recently, more attention has been given to the Arabic dialects and the written varieties of Arabic found on social media such as chats, micro-blogs, blogs, and forums, which are the target of research on sentiment analysis and opinion extraction.
Huang [2015] presented an approach to improving Arabic dialect classification with semi-supervised learning. He trained multiple classifiers and combined weakly supervised, strongly supervised, and unsupervised classifiers. These combinations yielded significant and consistent improvement on two test sets. The dialect classification accuracy improved by 5% over the strongly supervised classifier and 20% over the weakly supervised classifier. Furthermore, when applying the improved dialect classifier to build an MSA language model (LM), the new model size was reduced by 70%, while the English-Arabic translation quality improved by 0.6 BLEU points.
Arabic dialects (AD), the language of daily use, differ from MSA, especially in social media communication. However, most Arabic social media texts mix forms and show many variations, especially between MSA and AD. Figure 2.3 illustrates the AD distribution.
Figure 2.3: Arabic dialects distribution and variation across Asia and Africa [Sadat et al., 2014a].
The dialects can be divided into six regional groups: Egyptian, Levantine, Gulf, Iraqi, Maghrebi, and others, as shown in Figure 2.4.
Dialect identification is closely related to the language identification problem. The dialect identification task attempts to identify the spoken dialect from within a set of texts that use the same character set in a known language.
Due to the similarity of dialects within a language, dialect identification is more difficult than language identification. Machine learning approaches and language models which are used for language identification need to be adapted for dialect identification as well.
Several projects on NLP for MSA have been carried out, but research on Dialectal Arabic NLP is in early stages [Habash, 2010].
When processing Arabic for the purposes of social media analysis, the first step is to identify the dialect and then map the dialect to MSA, because there is a lack of resources and tools for Dialectal Arabic NLP. We can therefore use MSA tools and resources after mapping the dialect to MSA.
Figure 2.4: Division of Arabic dialects in six groups/divisions [Sadat et al., 2014a].
Diab et al. [2010] have run the COLABA project, a major effort to create resources and processing tools for Dialectal Arabic blogs. They used the BAMA and MAGEAD morphological analyzers. This project focused on four dialects: Egyptian, Iraqi, Levantine, and Moroccan.
Several text processing tools for MSA (BAMA, MAGEAD, and MADA) will now be described briefly.
BAMA (Buckwalter Arabic Morphological Analyzer) provides morphological annotation for MSA. The BAMA database contains three tables of Arabic stems, complex prefixes, and complex suffixes and three additional tables used for controlling prefix-stem, stem-suffix, and prefix-suffix combinations [Buckwalter, 2004].
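The table-driven design described above can be illustrated with a toy sketch: three lexicon tables (prefixes, stems, suffixes) and three compatibility tables that license prefix-stem, stem-suffix, and prefix-suffix combinations. All entries, category names, and transliterated forms below are hypothetical and heavily simplified; the real analyzer stores far richer entries and allows several entries per form.

```python
from typing import List, Tuple

# Lexicon tables: surface form -> morphological category (toy, hypothetical).
PREFIXES = {"": "Pref-0", "Al": "Pref-Al"}
STEMS = {"kitAb": "N", "katab": "PV"}
SUFFIXES = {"": "Suff-0", "At": "NSuff-At", "tu": "PVSuff-tu"}

# Compatibility tables: pairs of categories that may combine.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-Al", "N"), ("Pref-0", "PV")}
STEM_SUFFIX = {("N", "Suff-0"), ("N", "NSuff-At"),
               ("PV", "Suff-0"), ("PV", "PVSuff-tu")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-0", "NSuff-At"),
                 ("Pref-0", "PVSuff-tu"), ("Pref-Al", "Suff-0"),
                 ("Pref-Al", "NSuff-At")}


def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return every (prefix, stem, suffix) split licensed by all six tables."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES:
                p, s, x = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
                if ((p, s) in PREFIX_STEM and (s, x) in STEM_SUFFIX
                        and (p, x) in PREFIX_SUFFIX):
                    analyses.append((prefix, stem, suffix))
    return analyses


print(analyze("AlkitAb"))  # definite article + noun stem
print(analyze("Alkatab"))  # rejected: article prefix with a verb stem
```

Keeping the compatibility constraints in separate tables means the lexicon can grow without changing the analysis procedure itself.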
MAGEAD is a morphological analyzer and generator for the Arabic languages including MSA and the spoken dialects of Arabic. MAGEAD is modified to analyze the Levantine dialect [Habash and Rambow, 2006].
MADA+TOKAN is a toolkit for morphological analysis and disambiguation for the Arabic language that includes Arabic tokenization, diacritization, disambiguation, POS tagging, stemming, and lemmatization. MADA selects the best analysis among all possible analyses for each word in the current context by using SVM models classifying into 19 weighted morphological features. The selected analyses carry complete diacritic, lexemic, gloss, and morphological information. TOKAN takes the information provided by MADA to generate tokenized output in a wide variety of customizable formats. MADA depends on three resources: BAMA, the SRILM toolkit, and SVMTools [Habash et al., 2009].
Going back to the problem of AD identification, we give here a detailed example, with results. Sadat et al. [2014c] provided a framework for AD classification using probabilistic models across social media datasets. They incorporated the two popular techniques for language identification: the character n-gram Markov language model and Naïve Bayes classifiers.29
The Markov model calculates the probability that an input text is derived from a given language model built from training data [Dunning, 1994]. This model enables the computation of the probability P(S), or likelihood, of a sentence S, by using the chain rule:

$$P(S) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$
The sequence (w1, w2, …, wn) represents the sequence of characters in a sentence S. P(wi|w1, …wi−1) represents the probability of the character wi given the sequence w1, …wi−1.
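The following minimal sketch shows how such a character-level Markov model can be used for classification. It makes several assumptions that are not taken from the source: the full history is truncated to the single previous character (a bigram model), add-one smoothing handles unseen events, and toy English and Dutch sentences stand in for the dialect corpora. Log-probabilities are summed to avoid numerical underflow.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

BOS = "\x02"  # start-of-sentence marker (an assumption of this sketch)


class CharBigramLM:
    """Character bigram model: P(w_i | w_1..w_{i-1}) is approximated by P(w_i | w_{i-1})."""

    def __init__(self, sentences: Iterable[str], vocab_size: int):
        self.vocab_size = vocab_size
        self.bigrams = Counter()
        self.contexts = Counter()
        for sentence in sentences:
            prev = BOS
            for ch in sentence.lower():
                self.bigrams[(prev, ch)] += 1
                self.contexts[prev] += 1
                prev = ch

    def log_prob(self, sentence: str) -> float:
        """Log P(S) as a sum of add-one smoothed conditional log-probabilities."""
        total, prev = 0.0, BOS
        for ch in sentence.lower():
            numerator = self.bigrams[(prev, ch)] + 1
            denominator = self.contexts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
            prev = ch
        return total


def train(corpus: Dict[str, List[str]]) -> Dict[str, CharBigramLM]:
    """Build one model per label, sharing a vocabulary size across labels."""
    vocab = {ch for sents in corpus.values() for s in sents for ch in s.lower()}
    size = len(vocab) + 1  # one extra slot for characters unseen in training
    return {label: CharBigramLM(sents, size) for label, sents in corpus.items()}


def classify(text: str, models: Dict[str, CharBigramLM]) -> str:
    """Return the label whose model assigns the text the highest likelihood."""
    return max(models, key=lambda label: models[label].log_prob(text))


# Toy corpus (hypothetical); the studies discussed here use dialect corpora.
MODELS = train({
    "en": ["the quick brown fox jumps over the lazy dog",
           "the cat sleeps in the house"],
    "nl": ["de snelle bruine vos springt over de luie hond",
           "de kat slaapt in het huis"],
})
print(classify("the dog and the cat", MODELS))
```

Replacing the single previous character with the previous two characters would give the trigram variant of the same model.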
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naïve) independence assumptions. In text classification, this classifier assigns the most likely category or class to a given document d from a set of N pre-defined classes c1, c2, …, cN. The classification function f maps a document to a category (f : D → C) by maximizing the posterior probability [Peng and Schuurmans, 2003]:

$$f(d) = \arg\max_{c \in \{c_1, \ldots, c_N\}} P(c \mid d) = \arg\max_{c \in \{c_1, \ldots, c_N\}} \frac{P(c)\, P(d \mid c)}{P(d)}$$

where d and c denote the document and the category, respectively. In text classification a document d can be represented by a vector of T attributes d = (t1, t2,…, tT). Assuming that all attributes ti are independent given the category c, we can calculate P(d|c) with the following equation:

$$P(d \mid c) = \prod_{i=1}^{T} P(t_i \mid c)$$
The attribute term ti can be a vocabulary term, local n-gram, word average length, or a global syntactic and semantic property [Peng and Schuurmans, 2003].
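As a rough illustration of this classifier with character n-grams as attributes, the sketch below implements a multinomial Naïve Bayes over character bigrams. It is not the implementation of Sadat et al. [2014c]: the class name, the add-one smoothing, the padding with spaces, and the toy English and Dutch training documents are assumptions. The denominator P(d) is dropped because it is the same for every category.

```python
import math
from collections import Counter, defaultdict
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Character n-grams of the lowercased text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over character bigrams."""

    def fit(self, docs: List[str], labels: List[str]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.feature_counts[label].update(char_ngrams(doc))
        self.vocab = {g for c in self.feature_counts.values() for g in c}
        self.totals = {c: sum(cnt.values())
                       for c, cnt in self.feature_counts.items()}
        return self

    def log_posterior(self, doc: str, label: str) -> float:
        """log P(c) + sum_i log P(t_i | c), i.e., the posterior up to the constant P(d)."""
        n_docs = sum(self.class_counts.values())
        log_p = math.log(self.class_counts[label] / n_docs)
        v = len(self.vocab) + 1  # one extra slot for unseen n-grams
        for gram in char_ngrams(doc):
            count = self.feature_counts[label][gram]
            log_p += math.log((count + 1) / (self.totals[label] + v))
        return log_p

    def predict(self, doc: str) -> str:
        return max(self.class_counts,
                   key=lambda c: self.log_posterior(doc, c))


# Toy training set (hypothetical); the experiments reported here used dialect data.
DOCS = ["the quick brown fox jumps over the lazy dog",
        "the cat sleeps in the house",
        "de snelle bruine vos springt over de luie hond",
        "de kat slaapt in het huis"]
LABELS = ["en", "en", "nl", "nl"]
MODEL = NaiveBayes().fit(DOCS, LABELS)
print(MODEL.predict("the dog and the cat"))
```

Unlike the Markov model, each n-gram here is treated as an independent attribute given the class, and the class prior P(c) enters the score explicitly.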
Sadat et al. [2014c] presented a set of experiments using these techniques with detailed examination of what models perform best under different conditions in a social media context. Experimental results showed that the Naïve Bayes classifier based on character bigrams can identify the 18 different Arabic dialects considered with an overall accuracy of 98%. The dataset used in the experiments was manually collected from forums and blogs, for each of the 18 dialects.
To look at the problem in more detail, Sadat et al. [2014a] applied both the n-gram Markov language model and the Naïve Bayes classifier to classify the eighteen Arabic dialects. The results of this study for the n-gram Markov language model are shown in Figure 2.5. This figure shows that the character-based unigram distribution helps identify two dialects, Mauritanian and Moroccan, with an overall F-measure of 60% and an overall accuracy of 96%. Furthermore, the bigram distribution (two-character affixes) helps recognize four dialects, Mauritanian, Moroccan, Tunisian, and Qatari, with an overall F-measure of 70% and an overall accuracy of 97%. Lastly, the trigram distribution (three-character affixes) helps recognize four dialects, Mauritanian, Tunisian, Qatari, and Kuwaiti, with an overall F-measure of 73% and an overall accuracy of 98%. Overall, for 18 dialects, the bigram model performed better than other models (unigram and trigram models).
Figure 2.5: Accuracies on the character-based n-gram Markov language models for 18 countries [Sadat et al., 2014a].
Since many dialects are tied to a region, and the Arabic dialects within a region are similar to one another, the authors also considered the accuracy for dialect groups. Figure 2.6 shows the results of the three different character n-gram Markov language models for classification into the six groups of divisions that were defined in Figure 2.4. Again, the bigram and trigram character Markov language models performed almost the same as in Figure 2.5, although the F-measure of the bigram model for all dialect groups was higher than for the trigram model, except for the Egyptian dialect. Therefore, on average, for all dialects, the character-based bigram language model performed better than the character-based unigram and trigram models.
Figure 2.6: Accuracies on the character-based n-gram Markov language models for the six divisions/groups [Sadat et al., 2014a].
Figure 2.7 shows the results on the n-gram models using Naïve Bayes classifiers for the different countries, while Figure 2.8 shows the results on the n-gram models using Naïve Bayes classifiers for the six divisions according to Figure 2.4. The results show that the Naïve Bayes classifiers based on character unigrams, bigrams, and trigrams have better results than the previous character-based unigram, bigram, and trigram Markov language models, respectively. An overall F-measure of 72% and an accuracy of 97% were observed for the 18 Arabic dialects. Furthermore, the Naïve Bayes classifier that is based on a bigram model has an overall F-measure of 80% and an accuracy of 98%, except for the Palestinian dialect because of the small size of the data. The Naïve Bayes classifier based on the trigram model showed an overall F-measure of 78% and an accuracy of 98%, except for the Palestinian and Bahraini dialects. This classifier could not distinguish between the Bahraini and the Emirati dialects because of the similarities in their three-character affixes. In addition, the Naïve Bayes classifier based on character bigrams performed better than the classifier based on character trigrams, according to Figure 2.7. Also, as shown in Figure 2.8, the accuracy for dialect groups of the Naïve Bayes classifier based on the character bigram model was better than that of the two other models (unigrams and trigrams).
Recently, Zaidan and Callison-Burch [2014] created a large monolingual data set rich in dialectal Arabic content called the Arabic Online Commentary Dataset. They used crowdsourcing for annotating the texts with the dialect label. They also presented experiments on the automatic classification of the dialects for this dataset, using similar word and character-based language models. The best results were around 85% accuracy for distinguishing MSA from dialectal data and lower accuracies for identifying the correct dialect for the latter case. Then they applied the classifiers to discover new dialectal data from a large Web crawl consisting of 3.5 million pages mined from online Arabic newspapers.
Figure 2.7: Accuracies on the character-based n-gram Naïve Bayes classifiers for 18 countries [Sadat et al., 2014a].
Several other projects focused on Arabic dialects: classification [Tillmann et al., 2014], code switching [Elfardy and Diab, 2013], and collecting a Twitter corpus for several dialects [Mubarak and Darwish, 2014].
2.9 SUMMARY
This chapter discussed the issue of adapting NLP tools to social media texts. One way is to use text normalization techniques, in order to make the text closer to standard carefully edited texts on which the NLP tools are usually trained. The normalization that can be achieved in practice is rather shallow and it does not seem to help much in improving the performance of the tools. The second way of adapting the tools is to re-train them on annotated social media data. This significantly improves the performance, although the amount of annotated data available for retraining is still small. Further development of annotated data sets for social media data is needed in order to reach very high levels of performance.
In the next chapter, we will look at advanced methods for various NLP tasks for social media texts. These tasks use as components some of the tools discussed in this chapter.
Figure 2.8: Accuracies on the character-based n-gram Naïve Bayes classifiers for the six divisions/groups [Sadat et al., 2014a].
1 https://sites.google.com/site/empirist2015/
2The F-score usually gives the same weight to precision and to recall, but it can weight one of them more when needed for an application.
3 http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html
4This data set is available at http://code.google.com/p/ark-tweet-nlp/downloads/list.
5A bracketing is a pair of matching opening and closing brackets in a linearized tree structure.
6 http://www.ark.cs.cmu.edu/TweetNLP/#tweeboparser_tweebank
7 http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html
8LDA is a method that assumes a number of hidden topics for a corpus, and discovers a cluster of words for each topic, with associated probabilities. Then, for each document, LDA can estimate a probability distribution over the topics. The topics—word clusters—do not have names, but names can be given, for example, by choosing the word with the highest probability in each cluster.
9 http://nlp.stanford.edu/downloads/
10 http://opennlp.apache.org/
11 http://nlp.lsi.upc.edu/freeling/
12 http://nltk.org/
13 http://gate.ac.uk/
14 http://php-nlp-tools.com/
15 https://gate.ac.uk/wiki/twitie.html
16 http://www.ark.cs.cmu.edu/TweetNLP/
17 https://github.com/aritter/twitter_nlp
18 https://github.com/saffsd/langid.py
19 http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
20 http://www.google.com/chrome
21 https://code.google.com/p/language-detection/
22 https://github.com/martin-majlis/YALI
23 http://odur.let.rug.nl/~vannoord/TextCat/
24 https://github.com/shuyo/ldig
25 http://en.wikipedia.org/wiki/Trie
26 http://www.win.tue.nl/~mpechen/projects/smm/
27 http://people.eng.unimelb.edu.au/tbaldwin/data/lasm2014-twituser-v1.tgz
28 http://en.wikipedia.org/wiki/Geographic_distribution_of_Arabic#Population
29We will describe the concept of Naïve Bayes classifiers in detail in this section because they tend to work well on textual data and they are fast in terms of training and testing time.