

CHAPTER 3

Named Entity Recognition and Classification

3.1 INTRODUCTION

As discussed in Chapter 1, information extraction is the process of extracting information from unstructured text and turning it into structured data. Central to this is the task of named entity recognition and classification (NERC), which involves the identification of proper names in texts (NER), and their classification into a set of predefined categories of interest (NEC). Unlike the pre-processing tools discussed in the previous chapter, which deal with syntactic analysis, NERC is about automatically deriving semantics from textual content. The traditional core set of named entities, developed for the shared NERC task at MUC-6 [25], comprises Person, Organization, Location, and Date and Time expressions, such as Barack Obama, Microsoft, New York, 4th July 2015, etc.

NERC is generally an annotation task, i.e., the goal is to annotate a text with named entities (NEs), but it can also involve simply producing a list of NEs, which may then be used for other purposes, including creating or extending gazetteers to assist with the NE annotation process in future. It can be subdivided into two tasks: the recognition task, which involves identifying the boundaries of an NE (typically referred to as NER); and named entity classification (NEC), which involves detecting the class or type of the NE. Slightly confusingly, NER is often used to mean the combination of the two tasks, especially in older work; here we stick to using NERC for the combined task and NER for only the recognition element. For more fine-grained NEC than the standard Person, Organization, and Location classification, classes are often taken from an ontology schema and are subclasses of these [26]. The main challenge for NEC is that NEs can be highly ambiguous (e.g., “May” can be a person’s name or a month of the year; “Mark” can be a person’s name or a common noun). Partly for this reason, NER and NEC are typically solved as a single task.

A further task regarding named entities is named entity linking (NEL). The NEL task is to recognize whether a named entity mention in a text corresponds to any NE in a reference knowledge base. A named entity mention is an expression in the text referring to a named entity: it may take different forms, e.g., “Mr. Smith” and “John Smith” are both mentions (textual representations) of the same real-world entity, expressed by slightly different linguistic realizations. The reference knowledge base used is typically Wikipedia. NEL is even more challenging than NEC because distinctions have to be made not only at the class level, but also within classes. For example, there are many persons with the name “John Smith.” The more popular a name is, the more difficult the NEL task becomes. A further problem, shared by all knowledge base–related tasks, is that knowledge bases are incomplete; for example, they will only contain the most famous people named “John Smith.” This is particularly challenging when working on tasks involving recent events, since there is often a time lag between newly emerging entities appearing in the news or on social media and the updating of knowledge bases with their information. More details on named entity linking, along with relevant reference corpora, are given in Chapter 5.

3.2 TYPES OF NAMED ENTITIES

Person, Organization, Location, Date, and Time have become the standard named entity types largely due to the Message Understanding Conference (MUC) series [25], which introduced the Named Entity Recognition and Classification task in 1995 and which drove the initial development of many systems that are still in existence today. Due to the expansion of NERC evaluation efforts (described in more detail in Section 3.3) and the need to use NERC tools in real-life applications, other kinds of proper nouns and expressions gradually also came to be considered named entities, depending on the task: newspapers, monetary amounts, and more fine-grained classifications of the original types, such as authors, music bands, football teams, TV programs, and so on. NERC is the starting point for many more complex applications and tasks such as ontology building, relation extraction, question answering, information extraction, information retrieval, machine translation, and semantic annotation. With the advent of open information extraction scenarios focusing on the whole of the web, analysis of social media where new entities emerge constantly, and named entity linking tasks, the range of entities extracted has widened dramatically, which has brought many new challenges (see for example Section 4.4, where the role of knowledge bases for Named Entity Linking is discussed). Furthermore, the standard 5- or 7-class entity recognition problem is now often less useful, which in turn means that new paradigms are required. In some cases, such as the recognition of Twitter user names, the distinction between traditional classes, such as Organization and Location, has become blurred even for a human, and is no longer always useful (see Chapter 8).

Defining what exactly should constitute each entity type is never easy, and guidelines differ according to the task. Traditionally, people have used the standard guidelines from the evaluations, such as MUC and CoNLL, since these allow methods and tools to be compared with each other easily. However, as tools have been used for practical purposes in real scenarios, and as the types of named entities have consequently changed and evolved, so the ways in which entities are defined have also had to be adapted to the task. Of course, this now makes comparison and performance evaluation more difficult. The ACE evaluation [27], in particular, attempted to solve some of the problems caused by metonymy, where an entity whose name literally denotes one type (e.g., Location) is used figuratively to refer to another (e.g., Organization). Sports teams are an example of this, where we might use the location England or Liverpool to mean the corresponding football team (e.g., England won the World Cup in 1966). Similarly, locations such as The White House or 10 Downing Street can be used to refer to the organization housed there (The White House announced climate pledges from 81 countries.). Other decisions involve determining, for example, whether the category Person should include characters such as God or Santa Claus, and furthermore, if so, whether they should be included in all situations, such as when God and Jesus are used as part of profanities.

3.3 NAMED ENTITY EVALUATIONS AND CORPORA

As mentioned above, the first major evaluation series for NERC was MUC, which first addressed the named entity challenge in 1995. The aim was to recognize named entities in newswire text, and the task led not only to system development but also to the first real production of gold-standard NE-annotated corpora for training and testing. This was followed in 2003 by CoNLL [28], another major evaluation campaign, providing gold-standard data for newswire not only in English but also in Spanish, Dutch, and German. The corpus produced for this evaluation effort is now one of the most popular gold standards for NERC, with NERC software releases typically quoting performance on it.

Other evaluation campaigns later started to address NERC for genres other than newswire, specifically ACE [27] and OntoNotes [29], and introduced new kinds of named entities. Both of those corpora contain subcorpora with the genres newswire, broadcast news, broadcast conversation, weblogs, and conversational telephone speech. ACE additionally contains a subcorpus with usenet newsgroups, and addressed not only English but also Arabic and Chinese in later editions. Both ACE and OntoNotes also involved tasks such as coreference resolution, relation and event extraction, and word sense disambiguation, allowing researchers to study the interaction between these tasks. These tasks are addressed in Section 3.5 and in Chapters 4 and 5.

NERC corpora mostly use the traditional entity types, such as Person, Organization, and Location, which are not motivated by a concrete Semantic Web knowledge base (such as DBpedia, Freebase, or YAGO), but these types are very general. This means that when developing NERC approaches on those corpora for Semantic Web purposes, it is relatively easy to build on top of them and to include links to a knowledge base later. For example, NERD [30] uses an OWL ontology1 containing the set of mappings of all entity categories (e.g., criminal is a sub-class of Person in the NERD ontology).

3.4 CHALLENGES IN NERC

One of the main challenges of NERC is to distinguish between named entities and entities. The difference between these two things is that named entities are instances of types (such as Person, Politician) and refer to real-life entities which have a single unique referent, whereas entities are often groups of NEs which do not refer to unique referents in the real world. For example, “Prime Minister” is an entity, but it is not a named entity because it refers to any one of a group of named entities (anyone who has been or currently is a prime minister). It is worth noting though that the distinction can be very difficult to make, even for humans, and annotation guidelines for tasks differ on this.

Another challenge is to recognize NE boundaries correctly. In Example 3.1, it is important to recognize that Sir is part of the name Sir Robert Walpole. Note that tasks also differ in where they place the boundaries. MUC guidelines define that a Person entity should include titles; however, other evaluations may define their tasks differently. A good discussion of the issues in designing NERC tasks, and the differences between them, can be found in [31]. The entity definitions and boundaries are thus often not consistent between different corpora. Sometimes, boundary recognition is considered as a separate task from detecting the type (Person, Location, etc.) of the named entity. There are several annotation schemes commonly used to recognize where NEs begin and end. One of the most popular ones is the BIO schema, where B signifies the Beginning of an NE, I signifies that the word is Inside an NE, and O signifies that the word is just a regular word Outside of an NE. Another very popular scheme is BILOU [32], which has the additional labels L (Last word of an NE) and U (Unit, signifying that the word is an entire unit, i.e., NE).

Example 3.1 Sir Robert Walpole was a British statesman who is generally regarded as the first Prime Minister of Great Britain. Although the exact dates of his dominance are a matter of scholarly debate, 1721-1742 are often used.2

Politician: Government positions held (Officeholder, Office/position/title, From, To)

Person: Gender

Sir Robert Walpole: Politician, Person

Government positions held (Sir Robert Walpole, Prime Minister of Great Britain, 1721, 1742)

Gender (Sir Robert Walpole, male)
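To make the two encodings concrete, here is a minimal Python sketch (not taken from any of the tools discussed in this book) that converts token-level entity spans into BIO and BILOU label sequences; the tokenization and span offsets are invented for the example, which reuses the entities from Example 3.1.

```python
def encode(tokens, spans, scheme="BIO"):
    """Label each token with a BIO or BILOU tag for the given entity spans.

    tokens: list of words; spans: list of (start, end, type) token offsets,
    with end exclusive. Returns one label per token.
    """
    labels = ["O"] * len(tokens)                  # O = outside any entity
    for start, end, etype in spans:
        if scheme == "BILOU" and end - start == 1:
            labels[start] = "U-" + etype          # single-token (Unit) entity
            continue
        labels[start] = "B-" + etype              # Beginning of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + etype              # Inside the entity
        if scheme == "BILOU":
            labels[end - 1] = "L-" + etype        # Last token of the entity
    return labels

tokens = "Sir Robert Walpole was the first Prime Minister of Great Britain .".split()
spans = [(0, 3, "PER"), (9, 11, "LOC")]           # Sir Robert Walpole; Great Britain
print(encode(tokens, spans, "BIO"))
print(encode(tokens, spans, "BILOU"))
```

Note that the Person span includes the title Sir, following the MUC-style boundary convention discussed above; a corpus using different guidelines would simply supply different span offsets.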

Ambiguities are one of the biggest challenges for NERC systems. These can affect both the recognition and the classification component, and sometimes even both simultaneously. For example, the word May can be a proper noun (named entity) or an ordinary word (not an entity, as in the modal verb use you may go), but even as a proper noun, it can fall into various categories (month of the year, part of a person’s name (and furthermore a first name or surname), or part of an organization name). Very frequent categorization problems occur with the distinction between Person and Organization, since many companies are named after people (e.g., the clothing company Austin Reed). Similarly, many things which may not be named entities, such as names of diseases and laws, are named after people too. While technically one could annotate the person’s name here, it is not usually desirable (we typically do not care about annotating Parkinson as a Person in the term Parkinson’s disease, or Pythagoras in Pythagoras’ Theorem).

3.5 RELATED TASKS

Temporal normalization takes the recognition of temporal expressions (NEs classified as Date or Time) a step further, by mapping them onto a standard date and time format. Temporal normalization, and in particular that of relative dates and times, is critical for event recognition tasks. The task is quite easy if a text already refers to time in an absolute way, e.g., “8am.” It becomes more challenging, however, if a text refers to time in a relative way, e.g., “last week.” In this case we first have to find the date the text was created, so that it can be used as a point of reference for the relative temporal expression. One of the most popular annotation schemes for temporal expressions is TimeML [33]. Most NERC tools do not include temporal normalization as a standard part of the NERC process, but some tools have additional plugins that can be used. GATE, for example, has a Date Normalizer plugin that can be added to ANNIE in order to perform this task. It also has a temporal annotation plugin, GATE-Time, based on the HeidelTime tagger [34], which conforms to TimeML, an ISO standard for temporal semantic annotation of documents [35]. SUTime [36] is another library for recognizing and normalizing time expressions, available as part of the Stanford CoreNLP pipeline. It makes use of a deterministic rule-based system, and thus is easily extendable. It produces a set of annotations with one of four temporal types (DATE, TIME, DURATION, and SET), which correspond to the TIMEX3 standard for type and value. The slightly unusual “SET” type refers to a set of times, such as a recurring event.
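To illustrate the normalization step itself (independently of TimeML, GATE-Time, or SUTime), the following Python sketch resolves a few relative expressions against a document creation date; the supported expressions and the ISO-style output format are simplifying assumptions made only for this example.

```python
from datetime import date, timedelta

def normalize(expression, creation_date):
    """Map a (simplified) relative temporal expression to an ISO date or
    interval, using the document creation date as the reference point."""
    expression = expression.lower().strip()
    if expression == "today":
        return creation_date.isoformat()
    if expression == "yesterday":
        return (creation_date - timedelta(days=1)).isoformat()
    if expression == "last week":
        # interval covering Monday to Sunday of the previous calendar week
        monday = creation_date - timedelta(days=creation_date.weekday() + 7)
        return f"{monday.isoformat()}/{(monday + timedelta(days=6)).isoformat()}"
    return None  # absolute or unsupported expressions would be handled elsewhere

doc_date = date(2015, 7, 4)
for expr in ["yesterday", "last week"]:
    print(expr, "->", normalize(expr, doc_date))
# yesterday -> 2015-07-03
# last week -> 2015-06-22/2015-06-28
```

A real normalizer handles a much larger grammar of expressions (“next Tuesday,” “two weeks ago,” “the 1990s”) as well as underspecified cases, which is why dedicated rule-based libraries such as HeidelTime and SUTime are normally used.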

Co-reference resolution aims to connect different mentions of the same entity. This task is important because it helps with finding relations between entities later, and it also helps with named entity linking. The different mentions may be identical strings, in which case the task is easy, but the task may be more complicated because the same entity can be mentioned in different ways. For example, John Smith, Mr. Smith, John, J. S. Smith, and Smith could all refer to the same person. Similarly, we may have acronyms (U.K. and United Kingdom) or even aliases which bear no surface resemblance to their alternative name (IBM and The Big Blue). With the exception of the latter form, where lists of explicit name pairs are often the best solution, rule-based systems tend to be quite effective for this task. For example, even though acronyms are often highly ambiguous, within the same document it is rare that an acronym and a longer name matching the relevant letters would not refer to the same entity. Of course, explicit lists of pairs can also be used; similarly, lists of exceptions can be added. ANNIE’s Orthomatcher is a good example of a co-reference tool which relies entirely on hand-coded rules, performing on news texts with around 95% accuracy [37]. The Stanford Coref tool is integrated in the Stanford CoreNLP pipeline, and implements the multi-pass sieve co-reference and anaphor resolution system described in [38]. SANAPHOR [39] extends this by adding a semantic layer on top and improving the results. It takes as input the co-reference clusters generated by Stanford Coref, then splits those containing unrelated mentions and merges those which should belong together. It uses the output from an NEL process involving DBpedia/YAGO to disambiguate mentions which are linked to different entities, and merges those which are linked to the same one. It can also be used with other NERC and NEL tools as input.
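The document-internal acronym matching mentioned above can be sketched in a few lines of Python; the heuristic below (initial letters of the content words, ignoring short function words) is a deliberate simplification of what tools such as ANNIE’s Orthomatcher actually do, and the stopword list is an assumption made for the example.

```python
def acronym_of(short, long_name, stopwords=("of", "the", "and", "for")):
    """Return True if `short` matches the initial letters of the content
    words in `long_name` (e.g., IBM / International Business Machines)."""
    initials = "".join(word[0] for word in long_name.split()
                       if word.lower() not in stopwords)
    return short.replace(".", "").upper() == initials.upper()

print(acronym_of("IBM", "International Business Machines"))   # True
print(acronym_of("U.K.", "United Kingdom"))                   # True
print(acronym_of("UN", "United Kingdom"))                     # False
```

Aliases with no surface overlap (IBM / The Big Blue) defeat any such heuristic, which is why explicit alias lists are usually maintained alongside the rules.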

3.6 APPROACHES TO NERC

Approaches to NERC can be roughly divided into rule- or pattern-based methods and machine learning or statistical extraction methods [40], although quite often the two techniques are mixed (see [41][42][43]). Most learning-based techniques rely on some form of human supervision, with the exception of purely structural IE techniques performing unsupervised machine learning on unannotated documents [44]. As we have already seen, language engineering platforms, such as GATE, Stanford CoreNLP, OpenNLP, and NLTK, enable the modular implementation of techniques and algorithms for information extraction, by inserting different pre-processing and NERC modules into the pipeline, thereby allowing repeatable experimentation and evaluation of their results. An example of a typical processing pipeline for NERC is shown in Figure 3.1.

Figure 3.1: Typical NERC pipeline.

3.6.1 RULE-BASED APPROACHES TO NERC

Linguistic rule-based methods for NERC, such as those used in ANNIE, GATE’s information extraction system, typically comprise a combination of gazetteer lists and hand-coded pattern-matching rules. These rules use contextual information to help determine whether candidate entities from the gazetteers are valid, or to extend the set of candidates. The gazetteer lists act as a starting point from which to establish, reject, or refine the final entity to be extracted. A typical NERC processing pipeline consists of linguistic pre-processing (tokenization, sentence splitting, POS tagging) as described in the previous chapter, followed by entity finding using gazetteers and grammars, and then co-reference resolution.

Gazetteer lists are designed for annotating simple, regular features such as known names of companies, locations, days of the week, famous people, etc. A typical set of gazetteers for NERC might contain hundreds or even thousands of entries. However, using gazetteers alone is insufficient for recognizing and classifying entities, because on the one hand many names are too ambiguous (e.g., “London” could be part of an Organization name, a Person name, or just the Location), and on the other hand they cannot specify every named entity (e.g., in English one cannot pre-specify every single possible surname). When gazetteers are combined with other linguistic pre-processing annotations (part-of-speech tags, capitalization, other contextual evidence), however, they can be very powerful.

Using pattern matching for NERC requires the development of patterns over multi-faceted structures that consider many different properties of words, such as orthography (capitalization), morphology, part-of-speech information, and so on. Traditional pattern-matching languages, such as Perl, quickly become unmanageable when used for such tasks, due to the complexity involved. Therefore, attribute-value notations are normally used, which allow conditions to refer to token attributes arising from multiple analysis levels. An example of this is JAPE, the Java-based pattern-matching language used in GATE, based on CPSL [45]. JAPE employs a declarative notation that allows context-sensitive rules to be written and non-deterministic pattern matching to be performed. The rules are divided into phases (subsets) which run sequentially; each phase typically consists of rules for the same entity type (e.g., Person) or rules that share the same specific requirements for being run. A variety of priority mechanisms enable dealing with competing rules, which makes it possible to handle ambiguity: for example, one can prefer patterns occurring in a particular context, or one can prefer a certain entity type over another in a given situation. Other rule-based mechanisms work in a similar way.

A typical simple pattern-matching rule might try to match all university names, e.g., University of Sheffield, University of Bristol, where the pattern consists of the specific words University of followed by the name of a city. From the gazetteer, we can check for the mention of a city name such as Sheffield or Bristol. A more complex rule might try to identify the name of any organization by looking for a keyword from a gazetteer list, such as Company, Organization, Business, School, etc. occurring together with one or more proper nouns (as found by the POS Tagger), and potentially also containing some function words. While these kinds of rules are quite good at matching typical patterns (and work very well for some entity types such as Persons, Locations, and Dates), they can be highly ambiguous. Compare for example the company name General Motors, the person name General Carpenter, and the phrase Major Disaster (which does not denote any entity), and it can easily be seen that such patterns are insufficient. Learning approaches, on the other hand, may be good at recognizing that disaster is not typically part of a person or organization’s name, because it never occurs as such in the training corpus.
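As a minimal illustration of the “University of <City>” pattern just described, the following Python sketch combines a toy gazetteer with a simple token-level rule; in a real system this logic would be written as a JAPE rule over Token and Lookup annotations, and the gazetteer would of course contain far more than three cities.

```python
CITY_GAZETTEER = {"Sheffield", "Bristol", "London"}   # toy gazetteer of city names

def find_universities(tokens):
    """Return (start, end, type) spans matching 'University of <City>'."""
    spans = []
    for i in range(len(tokens) - 2):
        if (tokens[i] == "University" and tokens[i + 1] == "of"
                and tokens[i + 2] in CITY_GAZETTEER):
            spans.append((i, i + 3, "Organization"))
    return spans

sentence = "She studied at the University of Sheffield before moving to London .".split()
print(find_universities(sentence))   # [(4, 7, 'Organization')]
```

Note that London on its own is not annotated as an Organization here: the gazetteer only proposes candidates, and the contextual pattern decides how they are used, which is exactly the division of labor between gazetteers and rules described above.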

As mentioned above, rule-based systems are developed based on linguistic features, such as POS tags or context information. Instead of manually developing such rules, it is possible to label training examples, then automatically learn rules, using rule learning (also known as rule induction) systems. These automatically induce sets of rules from labeled training examples using supervised learning. They were popular among the early NERC learning systems, and include SRV [46], RAPIER [47], WHISK [48], BWI [49], and (LP)2 [50].

3.6.2 SUPERVISED LEARNING METHODS FOR NERC

Rule learning methods were historically followed by supervised learning approaches, which learn weights for features, based on their probability of appearing with negative vs. positive training examples for specific NE types. The general supervised learning approach consists of five stages:

• linguistic pre-processing;

• feature extraction;

• training models on training data;

• applying models to test data;

• post-processing the results to tag the documents.

Linguistic pre-processing at the minimal level includes tokenization and sentence splitting. Depending on the features used, it can also include morphological analysis, part-of-speech tagging, co-reference resolution, and parsing, as described in Chapter 2. Popular features include:

• Morphological features: capitalization, occurrence of special characters (e.g., $, %);

• Part-of-speech features: tags of the occurrence;

• Context features: words and POS of words in a window around the occurrence, usually of 1–3 words;

• Gazetteer features: appearance in NE gazetteers;

• Syntactic features: features based on parse of sentence;

• Word representation features: features based on unsupervised training on unlabeled text using, e.g., Brown clustering or word embeddings.

Statistical NERC approaches use a variety of models, such as Hidden Markov Models (HMMs) [51], Maximum Entropy models [52], Support Vector Machines (SVMs) [53] [54] [55], Perceptrons [56][57], Conditional Random Fields (CRFs) [58, 59], or neural networks [60]. The most successful NERC approaches include those based on CRFs and, more recently, multilayer neural networks. We refer readers interested in learning more about those machine learning algorithms to [61, 62].

CRFs model NERC as a sequence labeling task, i.e., the label for a token is modeled as dependent on the labels of preceding and following tokens in a certain window. Examples of frameworks which are available for CRF-based NERC are Stanford NER3 and CRFSuite.4 Both are distributed with feature extractors and models trained on the CoNLL 2003 data [28].
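To make the feature-based sequence-labeling setup concrete, here is a small sketch using the sklearn-crfsuite package, a Python wrapper around CRFsuite (chosen here purely for illustration; the tools mentioned above work analogously). The features are a subset of those listed earlier in this section, and the two-sentence training set is invented, so the resulting model is only a toy.

```python
import sklearn_crfsuite

def token2features(sent, i):
    """Morphological and context features for token i (cf. the feature list above)."""
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),      # capitalization
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],                # crude morphology
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()   # context window of one word
    else:
        feats["BOS"] = True                  # beginning of sentence
    if i < len(sent) - 1:
        feats["next.lower"] = sent[i + 1].lower()
    else:
        feats["EOS"] = True                  # end of sentence
    return feats

# Toy training data: two tokenized sentences with BIO labels.
sents = [["John", "Smith", "works", "for", "Microsoft", "."],
         ["Microsoft", "hired", "Mary", "Jones", "."]]
labels = [["B-PER", "I-PER", "O", "O", "B-ORG", "O"],
          ["B-ORG", "O", "B-PER", "I-PER", "O"]]

X = [[token2features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)

test = ["Mary", "Smith", "works", "for", "IBM", "."]
print(crf.predict_single([token2features(test, i) for i in range(len(test))]))
```

In practice the training corpus would be something like CoNLL 2003, the feature set would also include POS tags, gazetteer membership, and word representations, and the hyperparameters would be tuned on held-out data.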

Multi-layer neural network approaches have two advantages. First, they learn latent features, meaning they do not require linguistic processing beyond sentence splitting and tokenization. This makes them more robust across domains than architectures based on explicit features, since they do not have to compensate for mistakes made during pre-processing. Second, they can easily incorporate unlabeled text, on which representation feature extraction methods can be trained. The state-of-the-art system for NERC, SENNA [60], uses such a multi-layer neural network architecture with unsupervised pre-training. It is available as a stand-alone distribution5 or as part of the DeepNL framework.6 Like the frameworks above, it is distributed with feature extractors and offers functionality for training models on new data.

There are advantages and disadvantages to a supervised learning approach for NERC compared with a knowledge engineering, rule-based approach. Both require manual effort: rule-based approaches require specialist language engineers to develop hand-coded rules, whereas supervised learning approaches require annotated training data, for which language engineers are not needed. Which kind of approach is better suited to an application scenario depends on the application and the domain. For popular domains, such as newswire, hand-labeled training data is already available, whereas for others, it might need to be created from scratch. If the linguistic variation in the text is very small and quick results are desired, hand-coding rules might be a better starting point.

3.7 TOOLS FOR NERC

GATE’s general purpose named entity recognition and classification system, ANNIE, is a typical example of a rule-based system. It was designed for traditional NERC on news texts but, being easily adaptable, can also form the starting point for new NERC applications in other languages and for other domains. GATE also contains tools for machine learning, so it can be used to train models for NERC as well, based on the pre-processing components described in Chapter 2. Other well-known systems are UIMA,7 developed by IBM, which focuses more on architectural support and processing speed, and offers a number of resources similar to GATE’s; OpenCalais,8 which provides a web service for semantic annotation of text with traditional named entity types; and LingPipe,9 which provides a (limited) set of machine learning models for various tasks and domains. While these models are very accurate, they are not easily adaptable to new applications. Components from all these tools are actually included in GATE, so that a user can mix and match various resources as needed, or compare different algorithms on the same corpus. However, the components provided are mainly in the form of pre-trained models, and do not typically offer the full functionality of the original tools.

The Stanford NER package, included in the Stanford CoreNLP pipeline, is a Java implementation of a Named Entity Recognizer. It comes with well-engineered feature extractors for NERC, and has a number of options for defining these. In addition to the standard 3-class model (Person, Organization, Location), it also comes with models for different languages and models trained on different datasets. The methodology used is a general implementation of linear-chain Conditional Random Field (CRF) sequence models, and thus the user can easily retrain it on any labeled data they have. The Stanford NER package is also used in NLTK, which does not have its own NERC tool.
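For example, at the time of writing, the Stanford NER models can be called from NLTK roughly as follows; the model and jar file paths are placeholders that depend on the local installation, so treat this as a sketch rather than a copy-and-paste recipe.

```python
from nltk.tag import StanfordNERTagger

# Paths are placeholders: point them at a local Stanford NER download.
tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",   # 3-class model: PERSON, ORGANIZATION, LOCATION
    "stanford-ner.jar",
)
tokens = "Barack Obama visited Microsoft in New York".split()
print(tagger.tag(tokens))
# e.g., [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('visited', 'O'), ...]
```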

OpenNLP contains a NameFinder module for English NERC which has separate models for the standard 7-type MUC classification (Person, Organization, Location, Date, Time, Money, Percent), trained on standard freely available datasets. It also has models for Spanish and Dutch trained on CoNLL data. As with the Stanford NER tool, the user can easily retrain the NameFinder on any labeled dataset. Like the other learning-based tools mentioned above, these rely on supervised learning and therefore work well only when large amounts of annotated training data are available, so applying them to new domains and text types can be quite problematic if such data does not exist.

An example of a system that performs fine-grained NERC is FIGER [63],10 which is trained on Wikipedia. The tag set for FIGER is made up of 112 types, which are derived from Freebase by selecting the most frequent types and merging fine-grained types. The goal is to perform multi-class multi-label classification, i.e., each sequence of words is assigned one or several of multiple types, or no type. Training data for FIGER is created by exploiting the anchor text of entity mentions annotated in Wikipedia, i.e., for each sequence of words in a sentence, the sequence is automatically mapped to a set of Freebase types and used as positive training data for those types. The system is trained using a two-step process: training a CRF model for named entity boundary recognition, then an adapted perceptron algorithm for named entity classification. Typically, a CRF model would be used for doing both at once (e.g. [64]), but this is avoided here due to the large set of NE types. As for the other NERC tools, it can easily be retrained on new data.

3.8 NERC ON SOCIAL MEDIA

NERC on tweets is currently a hot research area, since many tasks rely on the analysis of social media, as we will discuss in Chapter 8. Social media is a particular challenge for NERC due to its noisy nature (incorrect spelling, punctuation, capitalization, novel use of words, etc.), which affects both the pre-processing components required (and thus has a knock-on effect on the NERC component performance) and the named entities themselves, which become harder to recognize. Due to the lack of annotated corpora, performing NERC on social media data using a learning approach is generally viewed as a domain adaptation problem from newswire text, often integrating the two kinds of data for training [65] and including a tweet normalization step [66]. One particular challenge is recency: the kinds of NEs that we want to recognize in social media are often newly emerging (recent news stories about people who were not previously famous, for example) and so are not typically found in gazetteers or even in Linked Data sets such as DBpedia. Another challenge is that a diverse context [67] as well as a smaller context window [68] make NERC more difficult: unlike in longer news articles, there is little discourse information per tweet, and threaded structure is fragmented across multiple documents, flowing in multiple directions. NERC on social media will be discussed in more detail in Chapter 8.

3.9 PERFORMANCE

In general, NERC performance is lower than the performance of NLP pre-processing tasks, such as POS tagging, but can still reach F1 scores above 90%. NERC performance depends on a variety of factors, including the type of text (e.g., newswire, social media), the NE type (e.g., PER, LOC, ORG), the size of the available training corpus and, most notably, how different the corpus the NERC system was developed on is from the text it is applied to [69]. In the context of NERC evaluation campaigns, the task is typically to train and test systems on different splits of the same corpus (also called in-domain performance), meaning the test corpus is very similar to the training corpus.
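F1 here is the usual harmonic mean of precision and recall computed over entity mentions. A minimal sketch of the computation, using the common (but not universal) convention that a prediction only counts as correct if both its boundaries and its type match a gold annotation exactly:

```python
def precision_recall_f1(gold_spans, predicted_spans):
    """Exact-match evaluation: a predicted entity is correct only if its
    boundaries and type both match a gold-standard annotation."""
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)                                  # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 3, "PER"), (9, 11, "LOC")}
pred = {(0, 3, "PER"), (9, 10, "LOC")}     # second entity has a boundary error
print(precision_recall_f1(gold, pred))     # (0.5, 0.5, 0.5)
```

Evaluation campaigns differ in how they score partial matches and type-only errors, which is one more reason why scores are not directly comparable across corpora and guidelines.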

To give an indication of such in-domain NERC performance, the current state-of-the-art result on CoNLL 2003, the most popular newswire corpus with NERC annotations, is an F1 of 90.10%. The best-performing system is currently [70].11 On the other hand, the winning tool for NERC in the social media domain in the 2015 shared task WNUT [71, 72] only achieved 56.41% F1, and 70.63% for NER. It is clear that NERC is much more difficult than NER, and that NERC for existing social media corpora is more challenging than for newswire corpora. Notably, the corpora also differ in size, which is fairly typical. Large NERC-annotated corpora exist for the newswire genre, but these are still largely lacking for the social media genre. This is a big part of the reason that performance on social media corpora is so much worse [69].

In real-world or application scenarios, such an in-domain setting as described above typically does not apply. Even if a hand-annotated NERC corpus is created for the specific application at some point, the test data might change. Typically, the greater the time difference between the creation time of a training corpus and test data, the less useful it is for extracting NEs from that test corpus [69]. This is particularly true for the social media genre, where entities change very quickly. In practice this means that after a couple of years, training data can be rendered almost useless.

3.10 SUMMARY

In this chapter, we have described the task of Named Entity Recognition and Classification and its two subtasks of boundary identification and classification into entity types. We have shown why the linguistic techniques described in the previous chapter are required for the task, and how they are used in both rule-based and machine-learning approaches. Like most of the following NLP tasks we describe in the rest of the book, this is the point at which tasks begin to get more complicated. The linguistic pre-processing tasks all essentially have a very similar goal and definition which does not vary according to what they will be used for. NE recognition and other tasks, such as relation extraction, sentiment analysis, etc., vary enormously in their definition, depending on why they are required. For example, the classes of NEs may differ widely from the standard MUC types of Person, Organization, and Location to a much more fine-grained classification involving many more classes and thus making the task very different. From there one can also go a stage further and perform a more semantic form of annotation, linking entities to external data sources such as DBpedia and Freebase, as will be described in Chapter 5. Despite this, methods for NERC are typically reusable (at least to some extent) even when the task itself varies substantially, although for example some kinds of learning methods may work better for different levels of classification. In the following chapter, we look at how named entities can be connected via relations, such as authors and their books, or employees and their organizations.

1 http://nerd.eurecom.fr/ontology

