Linked Lexical Knowledge Bases - Iryna Gurevych


Preface

MOTIVATION

Lexical Knowledge Bases (LKBs) are indispensable in many areas of natural language processing (NLP). They strive to encode human knowledge of language in machine-readable form, and as such they are required as a reference whenever machines are supposed to interpret natural language in accordance with human perception. Examples of such tasks are word sense disambiguation (WSD) and information retrieval (IR). The aim of WSD is to determine the correct meaning of ambiguous words in context; to formalize this task, a so-called sense inventory is required, i.e., a resource encoding the different meanings a word can express. In IR, the goal is to retrieve, given a user query formulating a specific information need, the documents from a collection which fulfill this need best. Here, knowledge is also necessary to correctly interpret short and often ambiguous queries, and to relate them to the set of documents.
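The role of the sense inventory in WSD can be illustrated with a minimal gloss-overlap disambiguator in the spirit of the classic Lesk algorithm. The tiny inventory and glosses below are invented for illustration and do not come from any real resource:

```python
# Minimal gloss-overlap disambiguation (Lesk-style sketch).
# The sense inventory below is a toy example, not a real LKB.
SENSE_INVENTORY = {
    "bank": {
        "bank#1": "financial institution that accepts deposits and lends money",
        "bank#2": "sloping land beside a body of water such as a river",
    }
}

def disambiguate(word, context):
    """Pick the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSE_INVENTORY[word].items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "I deposited money at the bank"))  # bank#1
print(disambiguate("bank", "we sat on the river bank"))       # bank#2
```

Real WSD systems are far more sophisticated, but the sketch makes the dependency explicit: without an inventory enumerating the candidate senses, the task is not even well-defined.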

Nowadays, LKBs exist in many variations. For instance, the META-SHARE repository4 lists over 1,000 different lexical resources, and the LRE Map5 contains more than 3,900 resources which have been proposed as knowledge sources for natural language processing systems. A main distinction, which is also made in this book, is between expert-built and collaboratively constructed resources. While the distinction is not always clean-cut, the former are generally resources created by a limited set of expert editors or professionals using their personal introspection, corpus evidence, or other means to obtain the knowledge. Collaboratively constructed resources, on the other hand, are open for every volunteer to edit, with no or only a few restrictions, such as registration on a website. Intuitively, the quality of the entries should be lower when laypeople are involved in the creation of a resource, but it has been shown that the collaborative process of correcting errors and extending articles (also known as the “wisdom of the crowds”; Surowiecki [2005]) can lead to results of remarkable quality [Giles, 2005]. The most prominent example is Wikipedia, the largest encyclopedia and one of the largest knowledge sources known. Although originally not intended for that purpose, it has also become a major source of knowledge for all kinds of NLP applications, many of which we will discuss in this book [Medelyan et al., 2009].

Apart from the basic distinction with regard to the production process, LKBs exist in many flavors. Some focus on encyclopedic knowledge (Wikipedia), others resemble language dictionaries (Wiktionary) or aim to describe the concepts used in human language and the relationships between them from a psycholinguistic (Princeton WordNet [Fellbaum, 1998a]) or a semantic (FrameNet [Ruppenhofer et al., 2010]) perspective. Another important distinction is between monolingual resources, i.e., those covering only one language, and multilingual ones, which not only feature entries in different languages but usually also provide translations. However, despite the large number of existing LKBs, the growing demand for large-scale LKBs in different languages is still not met. While Princeton WordNet has emerged as a de facto standard for English NLP, for most languages corresponding resources are either considerably smaller or missing altogether. For instance, the Open Multilingual Wordnet project lists only 25 wordnets in languages other than English, and only a few of them (such as the Finnish and Polish versions) match or surpass Princeton WordNet’s size [Bond and Foster, 2013]. Multilingual efforts such as Wiktionary or OmegaWiki provide a viable option for such cases and seem especially suitable for smaller languages due to their open construction paradigm and low entry requirements [Matuschek et al., 2013], but there are still considerable gaps in coverage which the corresponding language communities are struggling to fill.

A closely related problem is that, even if comprehensive resources are available for a specific language, there usually does not exist a single resource which works best for all application scenarios or purposes, as different LKBs cover not only different words and senses, but sometimes even completely different information types. For instance, the knowledge about verb classes (i.e., groups of verbs which share certain properties) contained in VerbNet is not covered by WordNet, although it might be useful depending on the task, for example to provide subcategorization information when parsing low frequency verbs.

These considerations have led to the insight that, to make the best possible use of the available knowledge, the orchestrated exploitation of different LKBs is necessary. This lets us not only extend the range of covered words and senses, but more importantly, gives us the opportunity to obtain a richer knowledge representation when a particular meaning of a word is covered in more than one resource.

Examples where such a joint usage of LKBs proved beneficial include WSD using aligned WordNet and Wikipedia in BabelNet [Navigli and Ponzetto, 2012a], semantic role labeling (SRL) using a mapping between PropBank, VerbNet, and FrameNet [Palmer, 2009], and the construction of a semantic parser using a combination of FrameNet, WordNet, and VerbNet [Shi and Mihalcea, 2005]. These combined resources, known as Linked Lexical Knowledge Bases (LLKBs), are the focus of this book, and we shed light on their different aspects from various angles.

TARGET AUDIENCE AND FOCUS

This book is intended to convey a fundamental understanding of Linked Lexical Knowledge Bases, in particular their construction and use, in the context of NLP. Our target audience are students and researchers from NLP and related fields who are interested in knowledge-based approaches. We assume only basic familiarity with NLP methods and thus this book can be used both for self-study and for teaching at an introductory level.

Note that the focus of this book is mostly on sense linking between general-purpose LKBs, which are most commonly used in NLP. While we acknowledge that there are many efforts to link LKBs, for instance, to ontologies or domain-specific resources, we only discuss them briefly where appropriate and provide references for readers interested in these more specific linking scenarios. The same is true for the recent efforts in creating ontologies from LKBs and formalizing the relationships between them—while we give an introduction to this topic in Section 1.3, we realize that this diverse area of research deserves a book of its own, which indeed has been published recently [Chiarcos et al., 2012]. Our attention is rather on the actual algorithmic linking process and the benefits it brings for applications. Furthermore, we put an emphasis on monolingual linking efforts (i.e., between resources in the same language), as the vast majority of algorithms have covered this scenario in the past, and cross-lingual approaches were mostly direct derivatives thereof, for instance by introducing machine translation as an intermediate component (cf. Chapter 3). Nevertheless, we recognize the increasing importance of multilingual NLP and thus provide a dedicated chapter covering applications in this area (Chapter 6).

OUTLINE

After providing a brief description of the typographic conventions which we applied throughout this book, we start by introducing and comparatively analyzing a selection of LKBs which have been widely used in NLP (Chapter 1). Our description of these LKBs provides a foundation for the main part of this book, where their integration into LLKBs is considered from various different angles. We include expert-built LKBs, such as WordNet, as well as collaboratively constructed resources, such as Wikipedia and Wiktionary, and also cover established standards and representation formats which are relevant in this context.

Then, in Chapter 2, we give a more formal definition of LLKBs, and also of word sense linking, which is crucial for combining different resources semantically and thus of utmost importance. We then describe various LLKBs which have been proposed, focusing on current large-scale projects which dominate the field, but also considering smaller, more specialized initiatives which have yielded important insights and paved the way for large-scale resource integration.

In Chapter 3, we approach the core issue of automatic word sense linking. While the notion of similar or even equivalent word senses in different resources is intuitively understandable and often (but not always) quite easily grasped by humans, it poses a complex challenge for automatic processing due to word ambiguities, different sense granularities, and different information types [Navigli, 2006]. First, to contextualize the challenge, we describe some related tasks in NLP and other fields, and outline how word sense linking relates to them. Then, we discuss in detail different ways to automatically create sense links between LKBs, based on textual descriptions of senses (i.e., glosses), the structure of the resources, or a combination thereof. The broader value of LLKBs lies of course not in the mere linking of resources for its own sake, but in the potential it holds for NLP applications.
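The gloss-based linking idea can be made concrete with a small sketch: senses of the same lemma in two resources are linked when the Jaccard similarity of their gloss tokens exceeds a threshold. The two toy inventories, their glosses, and the 0.2 threshold below are invented for illustration and do not reflect any particular system discussed in this book:

```python
# Gloss-based word sense linking sketch: a sense pair from two
# different resources is linked if their glosses are sufficiently
# similar. Inventories, glosses, and threshold are illustrative only.
def jaccard(a, b):
    """Jaccard similarity of the word sets of two glosses."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def link_senses(resource_a, resource_b, threshold=0.2):
    """Return pairs of sense IDs whose gloss similarity meets the threshold."""
    links = []
    for id_a, gloss_a in resource_a.items():
        for id_b, gloss_b in resource_b.items():
            if jaccard(gloss_a, gloss_b) >= threshold:
                links.append((id_a, id_b))
    return links

# Toy sense inventories for the lemma "bass" in two resources.
wordnet_like = {
    "bass%1": "the lowest part of the musical range",
    "bass%2": "a type of fish of the perch family",
}
wiki_like = {
    "Bass_(music)": "lowest part in the musical range of an instrument",
    "Bass_(fish)": "common name for a fish of the perch family",
}
print(link_senses(wordnet_like, wiki_like))
# [('bass%1', 'Bass_(music)'), ('bass%2', 'Bass_(fish)')]
```

Real linking algorithms replace the naive token overlap with more robust similarity measures and additionally exploit the graph structure of the resources, which is exactly what Chapter 3 examines in detail.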

Thus, in the following chapters, we present a selection of methods and applications where the use of LLKBs leads to particular benefits for NLP. In Chapter 4, we describe how the disambiguation of textual units benefits from the richer structure and combined knowledge, and also how the clustering of fine-grained word senses by exploiting 1:n links improves WSD accuracy. Building on that, we present more advanced disambiguation techniques in Chapter 5, including a discussion of using LLKBs for distant supervision and in neural vector space models, which are two recent and especially promising topics in machine learning for NLP. In Chapter 6 we briefly present multilingual applications, and computer-aided translation in particular, and show how they benefit from linked multilingual resources. Finally, in Chapter 7, we supplement our considerations of LLKB applications by discussing the enabling technologies, i.e., how LLKBs can be accessed via user interfaces and application programming interfaces. Based on the discussion of access paths for single resources, we describe how interfaces for current complex linked resources have evolved to cater to the needs of researchers and end users.

Chapter 8 concludes this book and points out directions for future work.

TYPOGRAPHIC CONVENTIONS

• Newly introduced terms and example lemmas are typed in italics.

• Synsets (groups of synonymous words) are enclosed by curly brackets, e.g., {car, automobile}.

• Concepts are typed in small caps, e.g., STREET VEHICLE WITH FOUR WHEELS.

• Relations between senses are written as pairs in parentheses, e.g., (car, vehicle).

• Classes of the Lexical Markup Framework (LMF) standard are printed in a monospace font starting with an upper case letter (e.g., LexicalEntry).

• LMF data categories are printed in a monospace font starting with a lower case letter (e.g., partOfSpeech).

We acknowledge support by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, by the German Institute for Educational Research (DIPF), and by the German Research Foundation under grant No. GU 798/17-1. We also thank our colleagues and students for their contributions to this book.

Iryna Gurevych, Judith Eckle-Kohler, and Michael Matuschek

July 2016

4 http://www.meta-share.eu

5 http://www.resourcebook.eu

