Natural Language Processing for the Semantic Web - Diana Maynard

CHAPTER 1

Introduction

Natural Language Processing (NLP) is the automatic processing of text written in natural (human) languages (English, French, Chinese, etc.), as opposed to artificial languages such as programming languages, to try to “understand” it. It is also known as Computational Linguistics (CL) or Natural Language Engineering (NLE). NLP encompasses a wide range of tasks, from low-level tasks, such as segmenting text into sentences and words, to high-level complex applications such as semantic annotation and opinion mining. The Semantic Web is about adding semantics, i.e., meaning, to data on the Web, so that web pages can be processed and manipulated by machines more easily. One central aspect of the idea is that resources are described using unique identifiers, called uniform resource identifiers (URIs). Resources can be entities, such as “Barack Obama,” concepts such as “Politician” or relations describing how entities relate to one another, such as “spouse-of.” NLP techniques provide a way to enhance web data with semantics, for example by automatically adding information about entities and relations and by understanding which real-world entities are referenced so that a URI can be assigned to each entity.

The goal of this book is to introduce readers working with, or interested in, Semantic Web technologies to the topic of NLP and its role and importance in the field of the Semantic Web. Although the field of NLP existed long before the advent of the Semantic Web, only in recent years has its importance here really come to the fore, in particular as Semantic Web technologies move toward more application-oriented realizations. The purpose of this book is therefore to explain the role of NLP and to give readers some background understanding of the NLP tasks that are most important for Semantic Web applications, plus some guidance on choosing the methods and tools best suited to a particular scenario. Ultimately, the reader should come away armed with the knowledge to understand the main principles and, if necessary, to choose suitable NLP technologies that can be used to enhance their Semantic Web applications.

The overall structure of the book is as follows. We first describe some of the core low-level components, in particular those which are commonly found in open source NLP toolkits and used widely in the community. We then show how these tools can be combined and used as input for the higher-level tasks such as Information Extraction, semantic annotation, social media analysis, and opinion mining, and finally how applications such as semantically enhanced information retrieval and visualization, and the modeling of online communities, can be built on top of these.

One point we should make clear is that when we talk about NLP in this book, we are referring principally to the subtask of Natural Language Understanding (NLU) and not to the related subtask of Natural Language Generation (NLG). While NLG is a useful task which is also relevant to the Semantic Web, for example in relaying the results of some application back to the user in a way that they can easily understand, and particularly in systems that require voice output of results, it goes outside the remit of this book, as it employs some very different techniques and tools. Similarly, there are a number of other tasks which typically fall under the category of NLP but are not discussed here, in particular those concerned with speech rather than written text. However, many applications for both speech processing and natural language generation make use of the low-level NLP tasks we describe. There are also some high-level NLP-based applications that we do not cover in this book, such as Summarization and Question Answering, although again these make use of the same low-level tools.

Most early NLP tools such as parsers (e.g., Schank’s conceptual dependency parser [¹]) were rule-based, due partly to the predominance of certain linguistic theories (primarily those of Noam Chomsky [²]), but also due to the lack of computational power which made machine learning methods infeasible. In the 1980s, machine learning systems started coming to the fore, but were still mainly used just to automatically create sets of rules similar to existing manually developed rule systems, using techniques such as decision trees. As statistical models became more popular, particularly in fields such as Machine Translation and Part-of-Speech tagging, where hard rule-based systems were often not sufficient to resolve ambiguities, Hidden Markov Models (HMMs) came into wide use, introducing the idea of weighted features and probabilistic decision-making. In the last few years, deep learning and neural networks have also become very popular, following their spectacular success in the field of image recognition and computer vision (for example in the technology behind self-driving cars), although their success for NLP tasks is currently nowhere near as dramatic. Deep learning is essentially a branch of Machine Learning that uses multiple hierarchical levels of features that are learned in an unsupervised fashion. This makes it very suitable for working with big data, because it is fast and efficient, and does not require the manual creation of training data, unlike supervised machine learning systems. However, as will be demonstrated throughout the course of this book, one of the problems of NLP is that tools almost always need adapting to specific domains and tasks, and for real-world applications this is often easier with rule-based systems. In most cases, combinations of different methods are used, depending on the task.

1.1 INFORMATION EXTRACTION

Information extraction is the process of extracting information and turning it into structured data. This may include populating a structured knowledge source with information from an unstructured knowledge source [³]. The information contained in the structured knowledge base can then be used as a resource for other tasks, such as answering natural language queries or improving on standard search engines with deeper or more implicit forms of knowledge than that expressed in the text. By unstructured knowledge sources, we mean free text, such as that found in newspaper articles, blog posts, social media, and other web pages, rather than tables, databases, and ontologies, which constitute structured text. Unless otherwise specified, we use the word text in the rest of this book to mean unstructured text.

When considering information contained in text, there are several types of information that can be of interest. Proper names, also called named entities (NEs), such as persons, locations, and organizations, are often regarded as the key components of text. Along with proper names, temporal expressions, such as dates and times, are also often considered as named entities. Figure 1.1 shows some simple Named Entities in a sentence. Named entities are connected together by means of relations. Furthermore, there can be relations between relations: for example, the relation denoting that someone is CEO of a company is connected to the relation that someone is an employee of a company by means of a sub-property relation, since a CEO is a kind of employee. A more complex type of information is the event, which can be seen as a group of relations grounded in time. Events usually have participants, a start and an end date, and a location, though some of this information may be only implicit. An example of this is the opening of a restaurant. Figure 1.2 shows how entities are connected to form relations, which form events when grounded in time.

Figure 1.1: Examples of named entities.

Figure 1.2: Examples of relations and events.
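The notions of entity, relation, sub-property, and event can be made concrete with a small data model. The following is a minimal sketch only: the class names, the predicates `ceo_of` and `employee_of`, and the example person, company, restaurant, and date are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entity:
    name: str
    type: str  # e.g. "Person", "Organization"


@dataclass(frozen=True)
class Relation:
    predicate: str
    subject: Entity
    obj: Entity


@dataclass(frozen=True)
class Event:
    # An event groups relations and grounds them in time; fields left as
    # None model information that is only implicit in the text.
    relations: Tuple[Relation, ...]
    start: Optional[str] = None
    end: Optional[str] = None
    location: Optional[str] = None


# Hypothetical sub-property table: being CEO of a company implies
# being an employee of it.
SUB_PROPERTY_OF = {"ceo_of": "employee_of"}


def infer(relations):
    """Return the relations plus everything implied by sub-properties."""
    inferred = set(relations)
    frontier = list(relations)
    while frontier:
        rel = frontier.pop()
        parent = SUB_PROPERTY_OF.get(rel.predicate)
        if parent is not None:
            implied = Relation(parent, rel.subject, rel.obj)
            if implied not in inferred:
                inferred.add(implied)
                frontier.append(implied)
    return inferred


# Illustrative (invented) data.
alice = Entity("Alice Example", "Person")
corp = Entity("Example Corp", "Organization")
facts = infer({Relation("ceo_of", alice, corp)})

bistro = Entity("Example Bistro", "Organization")
opening = Event(
    relations=(Relation("opened_by", bistro, alice),),
    start="2016-01-01",  # hypothetical date; the end date stays implicit
    location="London",
)
```

Here `facts` contains both the stated `ceo_of` relation and the implied `employee_of` relation, while `opening` shows an event whose end date is left unspecified.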

Information extraction is difficult because there are many ways of expressing the same facts:

• BNC Holdings Inc. named Ms. G. Torretta as its new chairman.

• Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.

• Ms. Gina Torretta took the helm at BNC Holdings Inc.

Furthermore, information may need to be combined across several sentences, which may additionally not be consecutive.

• After a long boardroom struggle, Mr. Andrews stepped down as chairman of BNC Holdings Inc. He was succeeded by Ms. Torretta.
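To illustrate why this variety is a problem, the sketch below maps the three single-sentence examples above onto one normalized fact using hand-written patterns. It is an illustration under strong assumptions, not a recommended method: the three regular expressions, the function name `extract_appointment`, and the use of the last token as a surname are all our own choices, and each new phrasing would need yet another pattern. The cross-sentence example is deliberately not handled, since resolving "He" requires coreference resolution.

```python
import re

TITLE = r"(?:(?:Mr|Ms|Mrs|Dr)\.\s+)?"  # optional courtesy title

# One hand-written pattern per phrasing (illustrative only).
PATTERNS = [
    re.compile(
        r"^(?P<org>.+?) named " + TITLE
        + r"(?P<person>.+?) as its new (?P<role>\w+)\.$"
    ),
    re.compile(
        r"^.+? was succeeded by " + TITLE
        + r"(?P<person>.+?) as (?P<role>\w+) of (?P<org>.+)$"
    ),
    re.compile(
        r"^" + TITLE + r"(?P<person>.+?) took the helm at (?P<org>.+)$"
    ),
]


def extract_appointment(sentence):
    """Return a dict describing the appointment, or None if nothing matches."""
    for pattern in PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            fields = match.groupdict()
            return {
                "person": fields["person"],
                # Crude normalization: compare people by their last token.
                "surname": fields["person"].split()[-1],
                "org": fields["org"],
                "role": fields.get("role"),  # None when the role is implicit
            }
    return None


sentences = [
    "BNC Holdings Inc. named Ms. G. Torretta as its new chairman.",
    "Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.",
    "Ms. Gina Torretta took the helm at BNC Holdings Inc.",
]
results = [extract_appointment(s) for s in sentences]
```

All three results share the same surname and organization, although the person is written in two different ways and the third sentence leaves the role implicit.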

Information extraction typically consists of a sequence of tasks, comprising:

1. linguistic pre-processing (described in Chapter 2);

2. named entity recognition (described in Chapter 3);

3. relation and/or event extraction (described in Chapter 4).
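The three stages can be sketched as a chain of functions, each consuming the output of the previous one. This is a toy illustration, assuming a tiny hand-made gazetteer and trigger-word list; the function names and the input text are hypothetical, and the naive sentence splitter would break on abbreviations such as "Inc.", which is why the example text avoids them.

```python
import re

# Hypothetical resources for the toy example.
GAZETTEER = {
    ("Gina", "Torretta"): "Person",
    ("Nicholas", "Andrews"): "Person",
    ("BNC", "Holdings"): "Organization",
}
TRIGGERS = {"succeeded": "succeeded", "leads": "leads"}


def preprocess(text):
    """Stage 1: naive sentence splitting and tokenization."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]


def recognize_entities(tokens):
    """Stage 2: longest-match gazetteer lookup.

    Returns (start, end, name, type) spans over the token list.
    """
    spans = []
    max_len = max(len(key) for key in GAZETTEER)
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            key = tuple(tokens[i:i + n])
            if key in GAZETTEER:
                spans.append((i, i + n, " ".join(key), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return spans


def extract_relations(tokens, spans):
    """Stage 3: link adjacent entities separated by a trigger word."""
    relations = []
    for left, right in zip(spans, spans[1:]):
        for token in tokens[left[1]:right[0]]:
            if token in TRIGGERS:
                relations.append((left[2], TRIGGERS[token], right[2]))
    return relations


def run_pipeline(text):
    relations = []
    for tokens in preprocess(text):
        relations.extend(extract_relations(tokens, recognize_entities(tokens)))
    return relations


text = "Gina Torretta succeeded Nicholas Andrews. Gina Torretta leads BNC Holdings."
relations = run_pipeline(text)
```

Because each stage depends on the one before it, an error made early on (for example, a wrongly split sentence) propagates to every later stage.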

Named entity recognition (NER) is the task of recognizing that a word or a sequence of words is a proper name. It is often solved jointly with the task of assigning types to named entities, such as Person, Location, or Organization, which is known as named entity classification (NEC). If the tasks are performed at the same time, this is referred to as NERC. NERC can either be an annotation task, i.e., to annotate a text with NEs, or the task can be to populate a knowledge base with these NEs. When the named entities are not simply a flat structure, but linked to a corresponding entity in an ontology, this is known as semantic annotation or named entity linking (NEL). Semantic annotation is much more powerful than flat NE recognition, because it enables inferences and generalizations to be made, as the linking of information provides access to knowledge not explicit in the text. When semantic annotation is part of the process, the information extraction task is often referred to as Ontology-Based Information Extraction (OBIE) or Ontology Guided Information Extraction (see Chapter 5). Closely associated with this is the process of ontology learning and population (OLP) as described in Chapter 6. Information extraction tasks are also a pre-requisite for many opinion mining tasks, especially where these require the identification of relations between opinions and their targets, and where they are based on ontologies, as explained in Chapter 7.
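The difference between flat recognition and linking can be sketched as follows. This is a minimal illustration, assuming a hypothetical `http://example.org/` namespace, a two-entry knowledge base, and an invented class hierarchy; real systems link to resources such as those discussed in Chapter 5 and must also disambiguate between candidates, which the exact-match lookup here does not attempt.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Toy knowledge base: URI -> label and most specific type.
KNOWLEDGE_BASE = {
    EX + "Barack_Obama": {"label": "Barack Obama", "type": EX + "Politician"},
    EX + "BNC_Holdings": {"label": "BNC Holdings Inc.", "type": EX + "Company"},
}

# Invented class hierarchy: each class maps to its direct superclass.
SUBCLASS_OF = {
    EX + "Politician": EX + "Person",
    EX + "Company": EX + "Organization",
}


def link(mention):
    """Return the URI whose label exactly matches the mention, else None."""
    for uri, record in KNOWLEDGE_BASE.items():
        if record["label"] == mention:
            return uri
    return None


def all_types(uri):
    """Return the entity's type followed by all of its superclasses."""
    types = []
    current = KNOWLEDGE_BASE[uri]["type"]
    while current is not None:
        types.append(current)
        current = SUBCLASS_OF.get(current)
    return types


obama_uri = link("Barack Obama")
obama_types = all_types(obama_uri)
```

A flat annotation would record only that "Barack Obama" is a name; once the mention is linked, the hierarchy lets us conclude that it also denotes a Person, knowledge that is not explicit in the text.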

1.2 AMBIGUITY

It is impossible for computers to analyze language correctly 100% of the time, because language is highly ambiguous. Ambiguous language means that more than one interpretation is possible, either syntactically or semantically. As humans, we can often use world knowledge to resolve these ambiguities and pick the correct interpretation. Computers cannot easily rely on world knowledge and common sense, so they have to use statistical or other techniques to resolve ambiguity. Some kinds of text, such as newspaper headlines and messages on social media, are often designed to be deliberately ambiguous for entertainment value or to make them more memorable. Some classic examples of this are shown below:

• Foot Heads Arms Body.

• Hospitals Sued by 7 Foot Doctors.

• British Left Waffles on Falkland Islands.

• Stolen Painting Found by Tree.

In the first headline, there is syntactic ambiguity between the proper noun (Michael) Foot, a person, and the common noun foot, a body part; between the verb and the plural noun heads, and the same for arms. There is also semantic ambiguity between two meanings of both arms (weapons and body parts), and body (physical structure and a large collection). In the second headline, there is semantic ambiguity between two meanings of foot (the body part and the measurement), and also syntactic ambiguity in the attachment of modifiers (7 [Foot Doctors] or [7 Foot] Doctors). In the third example, there is both syntactic and semantic ambiguity in the word Left (past tense of the verb, or a collective noun referring to left-wing politicians). In the fourth example, there is ambiguity in the role of the preposition by (as agent or location). In each of these examples, for a human, one meaning is possible, and the other is either impossible or extremely unlikely (doctors who are 7 feet tall, for instance). For a machine, understanding without additional context that leaving pastries in the Falkland Islands, though perfectly possible, is an unlikely news item, is almost impossible.
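The scale of the problem becomes clear if we simply enumerate the readings that a lexicon alone allows. The sketch below does this for the first headline, using a hypothetical four-word lexicon whose tags follow the ambiguities just described; the tag names and the function name are our own assumptions.

```python
from itertools import product

# Hypothetical lexicon listing the categories each word can take.
LEXICON = {
    "foot": ["PROPN", "NOUN"],   # Michael Foot, or the body part
    "heads": ["VERB", "NOUN"],
    "arms": ["VERB", "NOUN"],
    "body": ["NOUN"],
}


def candidate_taggings(headline):
    """Return every tag sequence permitted by the lexicon."""
    tokens = headline.lower().rstrip(".").split()
    options = [LEXICON[token] for token in tokens]
    return [tuple(zip(tokens, tags)) for tags in product(*options)]


readings = candidate_taggings("Foot Heads Arms Body.")

# The reading a human picks: a person heads a body concerned with arms.
intended = (
    ("foot", "PROPN"),
    ("heads", "VERB"),
    ("arms", "NOUN"),
    ("body", "NOUN"),
)
```

Even for four words, the lexicon permits 2 × 2 × 2 × 1 = 8 candidate taggings, only one of which is intended, and this counts syntactic category alone, before the semantic ambiguity of arms and body is considered.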

1.3 PERFORMANCE

Owing not only to ambiguity but also to a variety of other issues, as will be discussed throughout the book, performance on NLP tasks varies widely, both between different tasks and between different tools. Reasons for the variable performance of different tools will be discussed in the relevant sections, but in general, some tools are good at some elements of the task but bad at others, and performance often suffers when tools are trained on one kind of data and tested on another. The wide variation in performance between tasks, however, is largely a matter of task complexity.

The influence of domain dependence on the effectiveness of NLP tools is an issue that is all too frequently overlooked. For the technology to be suitable for real-world applications, systems need to be easily customizable to new domains. Some NLP tasks in particular, such as Information Extraction, have largely focused on narrow subdomains, as will be discussed in Chapters 3 and 4. The adaptation of existing systems to new domains is hindered by various bottlenecks such as training data acquisition for machine learning–based systems. For the adaptation of Semantic Web applications, ontology bottlenecks may be one of the causes, as will be discussed in Chapter 6.

An independent, though related, issue concerns the adaptation of existing systems to different text genres. By this we mean not just changes in domain, but different media (e.g., email, spoken text, written text, web pages, social media), text type (e.g., reports, letters, books), and structure (e.g., layout). The genre of a text may be influenced by a number of factors, such as author, intended audience, and degree of formality. For example, less formal texts may not follow standard capitalization, punctuation, or even spelling formats, all of which can be problematic for the intricate mechanisms of IE systems. These issues will be discussed in detail in Chapter 8.

Many natural language processing tasks, especially the more complex ones, only become really accurate and usable when they are tightly focused and restricted to particular applications and domains. Figure 1.3 shows a three-dimensional tradeoff graph between generality vs. specificity of domain, complexity of the task, and performance level. From this we can see that the highest performance levels are achieved in language processing tasks that are focused on a specific domain and that are relatively simple (for example, identifying named entities is much simpler than identifying events).

Figure 1.3: Performance tradeoffs for NLP tasks.

To make the integration of Semantic Web applications feasible, Semantic Web and NLP practitioners must reach some kind of understanding as to what constitutes a reasonable expectation. This is, of course, true for all applications into which NLP is to be integrated. For example, some applications involving NLP may not be realistically usable in the real world as standalone automatic systems without human intervention. This is not necessarily the case, however, for other kinds of Semantic Web applications which do not rely on NLP. Some applications are designed to assist a human user rather than to perform the task completely autonomously. There is often a tradeoff in deciding how much autonomy will most benefit the end user. For example, information extraction systems enable the end user to avoid having to read in detail hundreds or even thousands of documents in order to find the information they want. For humans to search manually through millions of documents is virtually impossible. On the other hand, the user has to bear in mind that a fully automated system will not be 100% accurate, and it is important for the design of the system to be flexible in terms of the tradeoff between precision and recall. For some applications, it may be more important to retrieve everything, although some of the information retrieved may be incorrect; for others, it may be more important that everything retrieved is accurate, even if some things are missed.
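The tradeoff between precision and recall can be made concrete with the standard definitions: precision is the proportion of retrieved items that are correct, and recall is the proportion of correct items that are retrieved. The sketch below computes both for two invented system outputs; the item identifiers and the names `cautious` and `greedy` are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of item identifiers."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented gold standard: the four items that should be found.
relevant = {1, 2, 3, 4}

# A cautious system returns little, but everything it returns is right.
cautious = {1, 2}
# A greedy system finds everything, at the cost of many wrong answers.
greedy = {1, 2, 3, 4, 5, 6, 7, 8}

cautious_scores = precision_recall(cautious, relevant)  # (1.0, 0.5)
greedy_scores = precision_recall(greedy, relevant)      # (0.5, 1.0)
```

Neither system is better in absolute terms: the cautious one suits applications where everything retrieved must be accurate, and the greedy one suits applications where nothing may be missed.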

1.4 STRUCTURE OF THE BOOK

Each chapter in the book is designed to introduce a new concept in the NLP pipeline, and to show how each component builds on the previous components described. In each chapter we outline the concept behind the component and give examples of common methods and tools. While each chapter stands alone to some extent, in that it refers to a specific task, the chapters build on each other. The first five chapters are therefore best read sequentially.

Chapter 2 describes the main approaches used for NLP tasks, and explains the concept of an NLP processing pipeline. The linguistic processing components comprising this pipeline—language identification, tokenization, sentence splitting, part-of-speech tagging, morphological analysis, and parsing and chunking—are then described, and examples are given from some of the major NLP toolkits.

Chapter 3 introduces the task of named entity recognition and classification (NERC), which is a key component of information extraction and semantic annotation systems, and discusses its importance and limitations. The main approaches to the task are summarized, and a typical NERC pipeline is described.

Chapter 4 describes the task of extracting relations between entities, explaining how and why this is useful for automatic knowledge base population. The task can involve either extracting binary relations between named entities, or extracting more complex relations, such as events. It describes a variety of methodologies and a typical extraction pipeline, showing the interaction between the tasks of named entity and relation extraction and discussing the major research challenges.

Chapter 5 explains how to perform entity linking by adding semantics into a standard flat information extraction system, of the kind that has been described in the preceding chapters. It discusses why this flat information extraction is not sufficient for many tasks that require greater richness and reasoning and demonstrates how to link the entities found to an ontology and to Linked Open Data resources such as DBpedia and Freebase. Examples of a typical semantic annotation pipeline and of real-world applications are provided.

Chapter 6 introduces the concept of automated ontology development from unstructured text, which comprises three related components: learning, population, and refinement. Some discussion of these terms and their interaction is given, the relationship between ontology development and semantic annotation is discussed, and some typical approaches are described, again building on the notions introduced in the previous chapters.

Chapter 7 describes methods and tools for the detection and classification of various kinds of opinion, sentiment, and emotion, again showing how the NLP processes described in previous chapters can be applied to this task. In particular, aspect-based sentiment analysis (such as which elements of a product are liked and disliked) can benefit from the integration of product ontologies into the processing. Examples of real applications in various domains are given, showing how sentiment analysis can also be slotted into wider applications for social media analysis. Because sentiment analysis is often performed on social media, this chapter is best read in conjunction with Chapter 8.

Chapter 8 discusses the main problems faced when applying traditional NLP techniques to social media texts, given their unusual and inconsistent usage of spelling, grammar, and punctuation amongst other things. Because traditional tools often do not perform well on such texts, they often need to be adapted to this genre. In particular, the core pre-processing components described in Chapters 2 and 3 can have a serious knock-on effect on other elements in the processing pipeline if errors are introduced in these early stages. This chapter introduces some state-of-the-art approaches for processing social media and gives examples of some real applications.

Chapter 9 brings together all the components described in the previous chapters by defining and describing a number of application areas in which semantic annotations are required, such as semantically enhanced information retrieval and visualization, the construction of social semantic user models, and modeling online communities. Common approaches and open source tools are described for these areas, including evaluation, scalability, and state-of-the-art results.

The concluding chapter summarizes the main concepts described in the book, and gives some discussion of the current state-of-the-art, major problems still to be overcome, and an outlook to the future.
