Linked Lexical Knowledge Bases, by Iryna Gurevych

Contents

Foreword

Lexical semantic knowledge is vital for most tasks in natural language processing (NLP). Such knowledge has been captured through two main approaches. The first is the knowledge-based approach, in which human linguistic knowledge is encoded directly in a structured form, resulting in various types of lexical knowledge bases. The second is the corpus-based approach, in which lexical semantic knowledge is learned from corpora and then represented either explicitly or implicitly.

Historically, the knowledge-based approach preceded the corpus-based one, while the latter has dominated center stage in NLP research in recent decades. Yet the development and use of lexical knowledge bases (LKBs) has continued to be a major thread. An illustration of this fact may be found in the number of citations for the fundamental 1998 WordNet book [Fellbaum, 1998a], over 12,000 at the time of writing (according to Google Scholar), which somewhat exceeds the number of citations for the primary textbook on statistical NLP from about the same period [Manning and Schütze, 1999]. Despite the overwhelming success of corpus-based methods, whether supervised or unsupervised, their output may be quite noisy, particularly when it comes to modeling fine-grained lexical knowledge such as distinct word senses or concrete lexical semantic relationships. Human encoding, on the other hand, provides more precise knowledge at the fine-grained level. The continued popular use of LKBs, and particularly of WordNet, seems to indicate that they still provide substantial complementary information relative to corpus-based methods (see Shwartz et al. [2015] for a concrete evaluation showing the complementary behavior of corpus-based word embeddings and information from multiple LKBs).

While WordNet has been by far the most widely-used lexical resource, it does not provide the full spectrum of needed lexical knowledge, which brings us to the theme of the current book. As reviewed in Chapter 2, additional lexical information has been encoded in quite a few LKBs, either by experts or by web communities through collaborative efforts. In particular, collaborative resources provide the opportunity to obtain much larger and more frequently updated resources than is possible with expert work. Knowledge resources like Wikipedia1 or Wikidata2 include vast lexical information about individual entities and domain-specific terminology across many domains, which falls beyond the scope of WordNet. Hence, it would be ideal for NLP technology to utilize, in an integrated manner, the union of information available in a multitude of lexical resources. As an illustrative example, consider an application setting, such as a question answering scenario, which requires knowing that Deep Purple was a group of people. We may find in Wikipedia that it was a “band,” map this term to its correct sense in WordNet, and then follow a hypernymy chain to “organization,” whose definition includes “a group of people.”
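The Deep Purple example above can be sketched as a tiny program. The mini knowledge base below is entirely hypothetical and hand-built for illustration (a real system would query Wikipedia and WordNet, and the cross-resource alignment would itself be produced by the linking algorithms this book surveys); it links an entity's Wikipedia description to a WordNet-style sense and walks hypernym edges until it finds a gloss mentioning "group of people."

```python
# Toy illustration of cross-resource sense linking (hypothetical data,
# not real Wikipedia or WordNet content).

# Wikipedia-style entity description: Deep Purple is described as a "band".
wikipedia = {"Deep Purple": "band"}

# WordNet-style inventory: sense id -> (gloss, hypernym sense id or None).
wordnet = {
    "band.n.02": ("a group of musicians playing popular music",
                  "organization.n.01"),
    "organization.n.01": ("a group of people who work together", None),
}

# Linking step: map the Wikipedia term to a WordNet sense. Here it is a
# fixed alignment; computing such alignments is the linking problem itself.
alignment = {"band": "band.n.02"}

def is_group_of_people(entity):
    """Follow the hypernym chain from the entity's linked sense and check
    whether any gloss along the chain mentions 'group of people'."""
    sense = alignment[wikipedia[entity]]
    while sense is not None:
        gloss, hypernym = wordnet[sense]
        if "group of people" in gloss:
            return True
        sense = hypernym
    return False

print(is_group_of_people("Deep Purple"))  # True
```

The point of the sketch is that no single resource answers the question: Wikipedia supplies the entity-to-term link, WordNet supplies the taxonomy, and the alignment between them is what makes the inference possible.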

As hinted in the above example, to allow such resource integration we need effective methods for linking, or aligning, the word senses or concepts encoded in various resources. Accordingly, the main technical focus of this book is existing resource integration efforts, resource linking algorithms, and the utility of such algorithms within disambiguation tasks. Hence, this book will be of particular value for researchers interested in creating or linking LKBs, as well as for developers of NLP algorithms and applications who would like to leverage linked lexical resources. An important aspect is the development and use of linked lexical resources in multiple languages, addressed in Chapter 7.

Looking forward, perhaps the most interesting research prospect for linked lexical knowledge bases is their integration with corpus-based machine learning approaches. A relatively simple form of combining the information in LKBs with corpus-based information is to use the former, via distant supervision, to create training data for the latter (discussed in Section 6.2). A more fundamental research direction is to create a unified knowledge representation framework, which directly integrates the human-encoded information in LKBs with information obtained by corpus-based methods. A promising framework for such integrated representation has emerged recently under the “embedding” paradigm, in which dense continuous vectors are used to represent linguistic objects, as reviewed in Section 6.3. Such representations, i.e., embeddings, were initially created separately: from corpus data, based on co-occurrence statistics, and from knowledge bases, leveraging their rich internal structure. Further research suggested methods for creating unified representations, based on hybrid objective functions that consider both corpus co-occurrences and knowledge base structure. While this research line is still in its initial phases, it has the potential to truly integrate corpus-based and human-encoded knowledge, and thus unify these two research endeavors, which have mostly been pursued separately in the past. From this perspective, and assuming that human-encoded lexical knowledge can provide useful additional information on top of corpus-based information, the current book should be useful for any researcher who aims to advance the state of the art in lexical semantics.
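One widely used instance of such a hybrid objective is often called retrofitting: corpus-derived word vectors are iteratively nudged toward the vectors of their neighbors in an LKB graph, balancing fidelity to the corpus vector against agreement with the knowledge base. The sketch below is a minimal, self-contained illustration; the two-dimensional vectors and the synonym graph are made up for this example, and real systems operate on hundreds of dimensions and full WordNet-scale graphs.

```python
# Retrofitting-style sketch: blend corpus-derived word vectors with an LKB
# neighborhood graph. All vectors and edges below are hypothetical.

corpus_vecs = {
    "band":     [0.9, 0.1],
    "group":    [0.2, 0.8],
    "ensemble": [0.5, 0.5],
}

# LKB edges (e.g., synonym or hypernym links), hypothetical:
neighbors = {
    "band":     ["group", "ensemble"],
    "group":    ["band"],
    "ensemble": ["band"],
}

def retrofit(vecs, graph, alpha=1.0, beta=1.0, iters=10):
    """Iteratively set each word's vector to a weighted average of its
    original corpus vector (weight alpha) and its current LKB neighbors
    (weight beta each), i.e., a simple hybrid corpus+LKB objective."""
    new = {w: list(v) for w, v in vecs.items()}
    for _ in range(iters):
        for w, nbrs in graph.items():
            if not nbrs:
                continue
            denom = alpha + beta * len(nbrs)
            new[w] = [
                (alpha * vecs[w][d] + beta * sum(new[n][d] for n in nbrs))
                / denom
                for d in range(len(vecs[w]))
            ]
    return new

fitted = retrofit(corpus_vecs, neighbors)
# "band" moves toward its LKB neighbors while remaining anchored to its
# original corpus vector.
print(fitted["band"])
```

Because each update is a convex combination, the result stays close to the corpus evidence while respecting the knowledge base structure, which is exactly the trade-off the hybrid objectives mentioned above formalize.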

While considering the integration of implicit corpus-based and explicit human-encoded information, we may notice that the joint embedding approach goes the “implicit way.” While joint embeddings do encode information coming from both types of resources, this information is encoded in opaque continuous vectors, which are not immediately interpretable, thus losing the transparency of the original symbolically-encoded human knowledge. Indeed, developing methods for interpreting embedding-based representations is an actively pursued theme, but it remains to be seen whether such attempts will succeed in preserving the interpretability of LKB information. Alternatively, one might imagine developing integrated corpus-based and knowledge-based representations that would inherently involve explicit symbolic representations, even though, currently, this might be seen as wishful thinking.

Finally, one would hope that the current book, and work on new lexical representations in general, would encourage researchers to better connect the development of knowledge resources with generic aspects of their utility for NLP tasks. Consider, for example, the common use of the lexical semantic relationships in WordNet for lexical inference. Typically, WordNet relations are utilized in an application to infer the meaning of one word from another in order to bridge lexical gaps, such as when different words are used in a question and in an answer passage. While this type of inference has been applied in numerous works, there are surprisingly no well-defined methods that indicate how to optimally exploit WordNet for lexical inference. Instead, each work applies its own heuristics with respect to the types of WordNet links that should be followed, the length of link chains, the senses to be considered, and so on. Given this state of affairs, it is hard for LKB developers to assess which components of the knowledge and representations that they create are truly useful. Similar challenges are faced when trying to assess the utility of vector-based representations.3

Eventually, one might expect that generic methods for utilizing and assessing lexical knowledge representations would guide their development and reveal their optimal form, based on either implicit or explicit representations, or both.

Ido Dagan

Department of Computer Science

Bar-Ilan University, Israel

1 https://www.wikipedia.org

2 https://www.wikidata.org

3One effort to address these challenges is the ACL 2016 workshop on Evaluating Vector Space Representations for NLP, whose mission statement is “To develop new and improved ways of measuring the quality or understanding the properties of vector-space representations in NLP.” https://sites.google.com/site/repevalacl16/.

