Linked Lexical Knowledge Bases - Iryna Gurevych


CHAPTER 1

Lexical Knowledge Bases

In this chapter we give an overview of different types of lexical knowledge bases that are used in natural language processing (NLP). We cover widely known expert-built Lexical Knowledge Bases (LKBs), and also collaborative LKBs, i.e., those created by a large community of layman collaborators. First we define our terminology, then we give a broad overview of various kinds of LKBs that play an important role in NLP. For particular resource-specific details, we refer the reader to the respective reference publications.

Definition Lexical Knowledge Base: Lexical knowledge bases (LKBs) are digital knowledge bases that provide lexical information on words (including multi-word expressions) of a particular language.1 By word, we mean word form, or more specifically, the canonical base word form which is called lemma. For example, write is the lemma of wrote. Most LKBs provide lexical information for lemmas. A lexeme is a word in combination with a part of speech (POS), such as noun, verb or adjective. The majority of LKBs specify the part of speech of the lemmas listed, i.e., provide lexical information on lexemes.

The pairings of lemma and meaning are called word senses or just senses. We use the terms meaning and concept synonymously in this book to refer to the possibly language-independent part of a sense. Each sense is typically identified by a unique sense identifier. For example, there are two meanings of the verb write which give rise to two different senses:2 (write, “to communicate with someone in writing”) and (write, “to produce a literary work”). Accordingly, a LKB might use identifiers, such as write01 and write02 to distinguish between the former and the latter sense. The set of all senses listed in a LKB is called its sense inventory.
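As a minimal illustration, the sense inventory described above can be sketched as a small data structure (the identifiers and field names are invented for this example, not taken from any particular LKB):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sense:
    """A pairing of a lexeme (lemma + POS) and a meaning."""
    sense_id: str     # unique sense identifier, e.g., "write01"
    lemma: str        # canonical base form, e.g., "write"
    pos: str          # part of speech, e.g., "verb"
    definition: str   # natural-language gloss

# A minimal sense inventory for the verb "write"
inventory = {
    "write01": Sense("write01", "write", "verb",
                     "to communicate with someone in writing"),
    "write02": Sense("write02", "write", "verb",
                     "to produce a literary work"),
}

def senses_of(lemma: str, pos: str) -> list[Sense]:
    """Look up all senses listed for a given lexeme."""
    return [s for s in inventory.values()
            if s.lemma == lemma and s.pos == pos]

print([s.sense_id for s in senses_of("write", "verb")])
```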

Depending on their particular focus, LKBs can contain a variety of lexical information, including morphological, phonetic, syntactic, semantic, and pragmatic information. This book focuses on LKBs that provide lexical information on the word sense level, i.e., information that is sensitive to the meaning of a word and is therefore attached to a pairing of lemma and meaning rather than to the lemma itself. Not included in our definition are LKBs that only provide morphological information about the inflectional and derivational properties of words.

The following list provides an overview of the main lexical information types distinguished at the level of word senses.

Sense definition—A definition of the sense in natural language (also called gloss) meant for human interpretation; for example, “to communicate with someone in writing” is a sense definition for the sense write01 given above.

Sense examples—Example sentences which illustrate the sense in context; for example, He wrote her an email. is a sense example of the sense write01.

Sense relations—Lexical-semantic relations to other senses. We list the most salient ones.

Synonymy connects senses which are lexically different but share the same meaning. Synonymy is reflexive, symmetrical, and transitive. For example, the verbs change and modify are synonyms3 as they share the meaning “cause to change.”

Some resources such as WordNet subsume synonymous senses into synsets. However, for the linking algorithms presented in this book, we will usually not distinguish between sense and synset, as for most discussions and experiments in this particular context they can be used interchangeably.
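Because synonymy is reflexive, symmetric, and transitive, synsets can be computed as the equivalence classes of the synonymy relation. A small sketch using union-find, with invented sense identifiers:

```python
def build_synsets(synonym_pairs):
    """Group senses into synsets, treating synonymy as an
    equivalence relation (reflexive, symmetric, transitive)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in synonym_pairs:
        union(a, b)

    synsets = {}
    for sense in parent:
        synsets.setdefault(find(sense), set()).add(sense)
    return list(synsets.values())

# change/modify/alter share the meaning "cause to change"
pairs = [("change#v#1", "modify#v#1"), ("alter#v#1", "change#v#1")]
print(build_synsets(pairs))  # one synset containing all three senses
```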

Antonymy is a relation in which the source and target sense have opposite meanings (e.g., tall and short).

Hyponymy denotes a semantic relation where the target sense has a more specific meaning than the source sense (e.g., from limb to arm).

Hypernymy is the inverse relation of hyponymy and thus denotes a semantic relation in which the target sense has a more general meaning than the source sense.
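A toy sketch of how hyponymy edges can be stored and hypernymy derived by inverting them (the sense identifiers are invented for illustration):

```python
# Directed hyponymy edges: source sense -> more specific target senses
hyponyms = {
    "limb#n#1": {"arm#n#1", "leg#n#1"},
    "arm#n#1": {"forearm#n#1"},
}

def hypernyms_of(sense):
    """Hypernymy is the inverse of hyponymy: invert the edges."""
    return {src for src, targets in hyponyms.items() if sense in targets}

def hypernym_chain(sense):
    """Follow hypernymy transitively up to the most general sense."""
    chain = []
    current = hypernyms_of(sense)
    while current:
        parent = next(iter(current))
        chain.append(parent)
        current = hypernyms_of(parent)
    return chain

print(hypernym_chain("forearm#n#1"))  # ['arm#n#1', 'limb#n#1']
```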

Syntactic behavior—Lexical-syntactic properties, such as the valency of verbs, i.e., the number and type of syntactic arguments a verb takes; for example, the verb change (“cause to change”) can take a noun phrase subject and a noun phrase object as syntactic arguments, as in: She[subject] changed the rules[object].

In LKBs, valency is represented by subcategorization frames (short: subcat frames). They specify syntactic arguments of verbs, but also of other predicate-like lexemes that can take syntactic arguments, e.g., nouns able to take a that-clause (announcement, fact) or adjectives taking a prepositional argument (proud of, happy about). For syntactic arguments, subcat frames typically specify the syntactic category (e.g., noun phrase, verb phrase) and grammatical function (e.g., subject, object).

Predicate argument structure information—For predicate-like words, such as verbs, this refers to a definition of the semantic predicate and information on the semantic arguments, including:

their semantic role according to an inventory of semantic roles given in the context of a particular linguistic theory. There is no standard inventory of semantic roles, i.e., there are linguistic theories assuming small sets of about 40 roles, and others specifying very large sets of several hundred roles. Examples of typical semantic roles are Agent or Patient; and

selectional preference information, which specifies the preferred semantic category of an argument, e.g., whether it is a human or an artifact.

For example, the sense change (“cause to change”) corresponds to a semantic predicate which can be described in natural language as “an Agent causes an Entity to change;” Agent and Entity are semantic roles of this predicate: She[Agent] changed the rules[Entity]; the preferred semantic category of Agent is human.

Related forms—Word forms that are morphologically related, such as compounds or verbs derived from nouns; for example, the verb buy (“purchase”) is derivationally related to the noun buy, while on the other hand buy (“accept as true” e.g., I can’t buy this story) is not derivationally related to the noun buy.

Equivalents—Translations of the sense in other languages; for example, kaufen is the German translation of buy (“purchase”), while abkaufen is the German translation of buy (“accept as true”).

Sense links—Mappings of senses to equivalent senses in other LKBs; for example, the sense change (Cause_change) in FrameNet can be linked to the equivalent sense change (“cause to change”) in WordNet.

There are different ways to organize a LKB, for example, by grouping synonymous senses, or by grouping senses with the same lemma. The latter organization is the traditional head-word based organization used in dictionaries [Atkins and Rundell, 2008] where a LKB consists of lexical entries which group senses under a common headword (the lemma).

There is a large number of so-called Machine-readable Dictionaries (MRDs), mostly digitized versions of traditional print dictionaries [Lew, 2011, Soanes and Stevenson, 2003], though some MRDs, such as DANTE [Kilgarriff, 2010] or DWDS4 for German [Klein and Geyken, 2010], are available only in digital form. We will not include them in our overview for the following reasons: MRDs have traditionally been built by lexicographers and are targeted toward human use rather than toward use by automatic processing components in NLP. While MRDs provide information useful in NLP, such as sense definitions, sense examples, and grammatical information (e.g., about syntactic behavior), the representation of this information in MRDs usually lacks a strict, formal structure, and the information therefore suffers from ambiguities. Although such ambiguities can easily be resolved by humans, they are a source of noise when dictionary entries are processed fully automatically.

Our definition of LKBs also covers domain-specific terminology resources (e.g., the Unified Medical Language System (UMLS) metathesaurus of medical terms [Bodenreider, 2004]) that provide domain-specific terms and sense relations between them. However, we do not include these domain-specific resources in our overview, because we used general language LKBs to develop and evaluate the linking algorithms presented in Chapter 3.

1.1 EXPERT-BUILT LEXICAL KNOWLEDGE BASES

Expert-built LKBs, in our definition of this term, are resources which are designed, created, and edited by a group of designated experts, e.g., (computational) lexicographers, (computational) linguists, or psycholinguists. While the editorial process may be influenced from the outside (e.g., via suggestions provided by users or readers), there is usually no direct means of public participation. This form of resource creation has been predominant since the earliest days of lexicography (or, more broadly, of language resource creation), and while the reliance on expert knowledge produces high-quality resources, an obvious disadvantage is the slow production cycle—for all of the resources discussed in this section, it usually takes months (if not years) until a new version is published, while at the same time most of the information remains unchanged. This is due to the extensive effort needed to create a resource of considerable size, in most cases provided by a very small group of people. Nevertheless, these resources play a major role in NLP. One reason is that until recent years there were no real alternatives available; moreover, some of these LKBs cover aspects of language which are rather specific and not easily accessible to layman editors. We will present the most pertinent examples in this section.

1.1.1 WORDNETS

Wordnets define senses primarily by their relations to other senses, most notably the synonymy relation that is used to group synonymous senses into so-called synsets. Accordingly, synsets are the main organizational units in wordnets. In addition to synonymy, wordnets provide a large variety of additional sense relations. Most of the sense relations are defined on the synset level, i.e., between synsets, such as hypernymy or meronymy. Other sense relations, such as antonymy, are defined between individual senses, rather than between synsets. For example, while evil and unworthy are synonymous (“morally reprehensible” according to WordNet), their antonyms are different; good is the antonym of evil and worthy is the antonym of unworthy.
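The difference between synset-level and sense-level relations can be made concrete in a small sketch (the sense and synset identifiers are invented; the definitions follow the WordNet example above):

```python
# Synsets group synonymous senses; most relations (e.g., hypernymy)
# hold between whole synsets, while antonymy holds between senses.
synsets = {
    "syn1": {"evil#a#1", "unworthy#a#1"},   # "morally reprehensible"
}

# Sense-level antonymy does not generalize to the whole synset:
antonym = {
    "evil#a#1": "good#a#1",
    "unworthy#a#1": "worthy#a#1",
}

# Two synonymous senses, yet each has a distinct antonym
print({s: antonym[s] for s in sorted(synsets["syn1"])})
```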

The Princeton WordNet for English [Fellbaum, 1998a] was the first such wordnet. It became the most popular wordnet and is the most widely used LKB today. The creation of the Princeton WordNet is psycholinguistically motivated, i.e., it aims to represent real-world concepts and relations between them as they are commonly perceived. Version 3.0 contains 117,659 synsets. Apart from its richness in sense relations, WordNet also contains coarse information about the syntactic behavior of verbs in the form of sentence frames (e.g., Somebody ----s something).

There are various works based on the Princeton WordNet, such as the eXtended WordNet [Mihalcea and Moldovan, 2001a], where all open-class words in the sense definitions have been annotated with their WordNet sense to capture further relations between senses, WordNet Domains [Bentivogli et al., 2004], which provides domain labels for senses, or SentiWordNet [Baccianella et al., 2010], which assigns sentiment scores to each synset of WordNet.

Wordnets in Other Languages The Princeton WordNet for English inspired the creation of wordnets in many other languages worldwide and many of them also provide a linking of their senses to the Princeton WordNet. Examples include the Italian wordnet [Toral et al., 2010a], the Japanese wordnet [Isahara et al.], or the German wordnet GermaNet [Hamp and Feldweg, 1997].5

Often, wordnets in other languages have particular characteristics that distinguish them from the Princeton WordNet. GermaNet, for example, containing around 70,000 synsets in version 7.0, originally contained very few sense definitions, but unlike most other wordnets, provides detailed information on the syntactic behavior of verbs. For each verb sense, it lists possible subcat frames, distinguishing more than 200 different types.

It is important to point out, however, that in general, the Princeton WordNet provides richer information than the other wordnets. For example, it includes not only derivational morphological information, but also inflectional morphology analysis within its associated tools. It also provides an ordering of the senses based on the frequency information from the sense-annotated SemCor corpus—which is very useful for word sense disambiguation as many systems using WordNet rely on the sense ordering; see also examples in Chapter 4.

Information Types The lexical information types prevailing in wordnets can be summarized as follows.

Sense definition—Wordnets provide sense definitions at the synset level, i.e., all senses in a synset share the same sense definition.

Sense examples—These are provided for individual senses.

Sense relations—Most sense relations in wordnets are given at the synset level, i.e., all senses in a synset participate in such a relation.

– A special case in wordnets is synonymy, because it is represented via synsets, rather than via relations between senses.

– Most other sense relations are given on the synset level, e.g., hyponymy.

– Few sense relations are defined between senses, e.g., antonymy, which does not always generalize to all members of a synset.

Syntactic behavior—The degree of detail regarding the syntactic behavior varies from wordnet to wordnet. While the Princeton WordNet only distinguishes between a few subcat frames, the German wordnet GermaNet distinguishes between about 200 very detailed subcat frames.

Related forms—The Princeton WordNet is rich in information about senses that are related via morphological derivation. Not all wordnets provide this information type.

1.1.2 FRAMENETS

LKBs modeled according to the theory of frame semantics [Fillmore, 1982] focus on word senses that evoke certain scenes or situations, so-called frames, which are schematic representations of these scenes. For instance, the “Killing” frame specifies a scene where “A Killer or Cause causes the death of the Victim.” It can be evoked by verbs such as assassinate, behead, and terminate, or nouns such as liquidation or massacre.

The participants of these scenes (e.g., “Killer” and “Victim” in the “Killing” frame example), as well as other important elements (e.g., “Instrument” as “The device used by the Killer to bring about the death of the Victim” or “Place” as “The location where the death took place”) constitute the semantic roles of the frame (called frame elements in frame semantics), and are typically realized in a sentence along with the frame-evoking element, as in: Someone[Killer] tried to KILL him[Victim] with a parcel bomb[Instrument].

The inventory of semantic roles used in FrameNet is very large and subject to further extension as FrameNet grows. Many semantic roles have frame-specific names, such as the “Killer” semantic role defined in the “Killing” frame.

Frames are the main organizational unit in framenets: they contain senses (represented by their lemma) that evoke the same frame. The majority of the frame-evoking words are verbs and other predicate-like lexemes: they can naturally be represented by frames, since predicates take arguments which can be characterized both syntactically (e.g., subject, direct object) and semantically via their semantic role.

There are semantic relations between frames (e.g., the “Is_Causative_of” relation between “Killing” and “Death” or the “Precedes” relation between “Being_born” and “Death” or “Dying”), and also between frame elements.

The English FrameNet [Baker et al., 1998, Ruppenhofer et al., 2010] was the first frame-semantic LKB and it is the most well-known one. Version 1.6 of FrameNet contains 1,205 frames. In FrameNet, senses are called lexical units. FrameNet does not provide explicit information about the syntactic behavior of word senses. However, the sense examples are annotated with syntactic information (FrameNet annotation sets) and from these annotations, subcat frames can be induced.

FrameNet is particularly rich in sense examples, which are selected based on lexicographic criteria, i.e., the sense examples are chosen to illustrate typical syntactic realizations of the frame elements. The sense examples are enriched with annotations of the frame and its elements, and thus provide information about the relative frequencies of the syntactic realizations of a particular frame element. For example, for the verb kill, a noun phrase with the grammatical function object is the most frequently used syntactic realization of the “Victim” role.

Framenets in Other Languages The English FrameNet has spawned the construction of framenets in multiple other languages. For example, there are framenets for Spanish6 [Subirats and Sato, 2004], Swedish7 [Friberg Heppin and Toporowska Gronostaj, 2012], and Japanese8 [Ohara, 2012]. For Danish, there is an ongoing effort to build a framenet based on a large-scale valency LKB that is manually being extended by frame-semantic information [Bick, 2011]. For German, there is a corpus annotated with FrameNet frames called SALSA [Burchardt et al., 2006].

Information Types The following information types in the English FrameNet are most salient.

Sense definition—For individual senses, FrameNet provides sense definitions, either taken from the Concise Oxford Dictionary or created by lexicographers. Furthermore, there is a sense definition for each frame, which is given by a textual description and shared by all senses in a frame.

Sense examples—FrameNet is particularly rich in sense examples which are selected based on lexicographic criteria.

Sense relations—FrameNet specifies sense relations on the frame level, i.e., all senses in a frame participate in the relation.

Predicate argument structure information—Semantic roles often have frame-specific names and are specified via a textual description. Some frame elements are further characterized via their semantic type, thus selectional preference information is provided as well.

1.1.3 VALENCY LEXICONS

Most of the early work on LKBs for NLP considered valency as a central information type, because it was essential for deep syntactic and semantic parsing with broad-coverage hand-written grammars (e.g., Head-Driven Phrase Structure Grammar [Copestake and Flickinger], or Lexical Functional Grammar as in the ParGram project [Sulger et al., 2013]). Valency is the lexical property of a word to require certain syntactic arguments in order to be used in well-formed phrases or clauses. For example, the verb assassinate requires not only a subject, but also an object: *He assassinated. vs. He assassinated his colleague. Valency information is also included in MRDs, but it is often represented ambiguously and is thus hard to process automatically. Therefore, a number of valency LKBs have been built specifically for NLP applications. These LKBs use subcat frames to represent valency information.

It is important to note that subcat frames are a lexical property of senses, rather than words. Consider the following example of the two senses of see and their sense-specific subcat frames (1) and (2): subcat frame (1) is only valid for the see—“interpret in a particular way” sense, but not for the see—“perceive with the eyes” sense:

see—“interpret in a particular way:”

subcat frame (1): (arg1:subject(nounPhrase),arg2:prepositionalObject(asPhrase))

sense example: Some historians see his usurpation as a panic response to growing insecurity.

see—“perceive with the eyes:”

subcat frame (2): (arg1:subject(nounPhrase),arg2:object(nounPhrase))

sense example: Can you see the bird in that tree?
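A sketch of how such sense-specific subcat frames might be represented, using the two see senses above (the sense identifiers are invented; the frame notation mirrors the examples, not any particular LKB format):

```python
# Subcat frames are attached to senses, not to lemmas.
subcat_frames = {
    "see#v#interpret": [
        ("arg1:subject(nounPhrase)", "arg2:prepositionalObject(asPhrase)"),
    ],
    "see#v#perceive": [
        ("arg1:subject(nounPhrase)", "arg2:object(nounPhrase)"),
    ],
}

def senses_licensing(frame):
    """Return the senses that allow a given subcat frame."""
    return [s for s, frames in subcat_frames.items() if frame in frames]

as_phrase_frame = ("arg1:subject(nounPhrase)",
                   "arg2:prepositionalObject(asPhrase)")
print(senses_licensing(as_phrase_frame))  # only the "interpret" sense
```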

Subcat frames contain language-specific elements, even though some of their elements may be valid cross-lingually. For example, there are certain properties of syntactic arguments in English and German that correspond (both English and German are Germanic languages and hence closely related), while other properties, mainly morphosyntactic ones, diverge [Eckle-Kohler and Gurevych, 2012]. Examples of such divergences include the overt case marking in German (e.g., for the dative case) or the fact that the ing-form in English verb phrase complements is sometimes realized as zu-infinitive in German.

According to many researchers in linguistics, different subcat frames of a lexeme are associated with different but related meanings, an analysis which is called the “multiple meaning approach” by Hovav and Levin [2008].9 The multiple meaning approach gives rise to different senses, i.e., pairs of lexeme and subcat frame. Hence, valency LKBs provide an implicit characterization of senses via subcat frames, which can be considered as abstractions of sense examples. Sense examples illustrating a lexeme in a particular subcat frame (e.g., extracted from corpora) might be provided in addition. However, valency LKBs do not necessarily assign unique identifiers to senses, or group (nearly) synonymous senses into entries (as MRDs do).

Examples of Valency Lexicons COMLEX Syntax is an English valency LKB providing detailed subcat frames for about 38,000 headwords [Grishman et al., 1994]. Another well-known valency LKB is CELEX, which covers English as well as Dutch and German. The PAROLE project (Preparatory Action for Linguistic Resources Organization for Language Engineering) initiated the creation of valency LKBs in 12 European languages (Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish), which have all been built on the basis of corpora. However, the resulting LKBs are much smaller than COMLEX or CELEX. For example, the Spanish PAROLE lexicon contains syntactic information for only about 325 verbs [Villegas and Bel, 2015].

There are many valency LKBs in languages other than English. For German, an example of a large-scale valency LKB is IMSLex-Subcat, a broad-coverage subcategorization lexicon for German verbs, nouns and adjectives, covering about 10,000 verbs, 4,000 nouns, and 200 adjectives [Eckle-Kohler, 1999, Fitschen, 2004]. For verbs, about 350 different subcat frames are distinguished. IMSLex-Subcat was semi-automatically created: the subcat frames were automatically extracted from large newspaper corpora, and manually filtered afterward.

Information Types In summary, the following lexical information types are salient for valency LKBs.

Syntactic behavior—Valency LKBs provide lexical-syntactic information on predicate-like words by specifying their syntactic behavior via subcat frames.

Sense examples—For individual pairs of lexeme and subcat frame, sense examples might be given as well.

1.1.4 VERBNETS

According to Levin [1993], verbs that share common syntactic argument alternation patterns also have particular meaning components in common, thus they can be grouped into semantic verb classes. Consider as an example verbs participating in the dative alternation, e.g., give and sell. These verbs can realize one of their arguments syntactically either as a noun phrase or as a prepositional phrase with to, i.e., they can be used with two different subcat frames:

Martha gives (sells) an apple to Myrna.

Martha gives (sells) Myrna an apple.

Verbs having this alternation behavior in common can be grouped into a semantic class of verbs sharing the particular meaning component “change of possession,” thus this shared meaning component characterizes the semantic class.
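The grouping of verbs into classes by shared subcat frames can be sketched as follows (the frame labels are invented shorthand, not VerbNet notation):

```python
# Each verb is described by its set of subcat frames; verbs sharing
# the same set (i.e., the same alternation behavior) form one class.
frames = {
    "give": {"NP-V-NP-PPto", "NP-V-NP-NP"},   # dative alternation
    "sell": {"NP-V-NP-PPto", "NP-V-NP-NP"},
    "see":  {"NP-V-NP"},
}

def verb_classes(frames):
    """Partition verbs into Levin-style classes by frame set."""
    classes = {}
    for verb, fs in frames.items():
        classes.setdefault(frozenset(fs), []).append(verb)
    return sorted(classes.values())

print(verb_classes(frames))  # [['give', 'sell'], ['see']]
```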

The most well-known verb classification based on the correspondence between verb syntax and verb meaning is Levin’s classification of English verbs [Levin, 1993]. Recent work on verb semantics provides additional evidence for this correspondence of verb syntax and meaning [Hartshorne et al., 2014, Levin, 2015].

The English VerbNet [Kipper et al., 2008] is a broad-coverage verb lexicon based on Levin’s classification covering about 3,800 verb lemmas. VerbNet is organized in about 270 verb classes based on syntactic alternations. Verbs with common subcat frames and syntactic alternation behavior that also share common semantic roles are grouped into VerbNet classes, which are hierarchically structured to represent information about related subcat frames.

VerbNet not only includes the verbs from the original verb classification by Levin, but also more than 50 additional verb classes [Kipper et al., 2006] automatically acquired from corpora [Korhonen and Briscoe, 2004]. These classes cover many verbs taking non-finite verb phrases and subordinate clauses as complements, which were not included in Levin’s original classification. VerbNet (version 3.1) lists 568 subcat frames specifying syntactic types and semantic roles of the arguments, as well as selectional preferences, and syntactic and morpho-syntactic restrictions on the arguments.

Although it might often be hard to pin down what the shared meaning components of VerbNet classes really are, VerbNet has successfully been used in various NLP tasks, many of them including the subtask of mapping syntactic chunks of a sentence to semantic roles [Pradet et al., 2014]; see also Chapter 6.1 for an example.

Verbnets in Other Languages While the importance of having a verbnet-like LKB in less-resourced languages has been widely recognized, there have been few efforts to build verbnets of a quality comparable to the English one. Most previous work explored fully automatic approaches to transferring the English VerbNet to another language, thus introducing noise. Semi-automatic approaches are also often based on translating the English VerbNet into another language.

Most importantly, many of the detailed subcat frames available for English, as well as the syntactic alternations, cannot be carried over to other languages, since valency is largely language-specific (e.g., [Scarton and Aluísio, 2012]). Therefore, the development of high-quality verbnets in languages other than English requires the existence of a broad-coverage valency lexicon as a prerequisite. For this reason, valency lexicons, especially tools for their (semi-)automatic construction, are still receiving considerable attention.

A recent example of a high-quality verbnet in another language is the French verbnet (covering about 2,000 verb lemmas) [Pradet et al., 2014] which has been built semi-automatically from existing French resources (thus also including subcat frames) combined with a translation of the English VerbNet verbs.

Information Types We summarize the main lexical information types for senses present in the English VerbNet.

Sense definition—Verbnets do not provide textual sense definitions. A verb sense is defined extensionally by the set of verbs forming a VerbNet class; the verbs share common subcat frames, as well as semantic roles and selectional preferences of their arguments.

Sense relations—The verb classes in verbnets are organized hierarchically and the subclass relation is therefore defined on the verb class level.

Syntactic behavior—VerbNet lists detailed subcat frames for verb senses.

Predicate argument structure information—In the English VerbNet, each individual verb sense is characterized by a semi-formal semantic predicate based on the event decomposition of Moens and Steedman [1988]. Furthermore, the semantic arguments of a verb are characterized by their semantic role and linked to their syntactic counterparts in the subcat frame. Most semantic arguments are additionally characterized by their semantic type (i.e., selectional preference information).

1.2 COLLABORATIVELY CONSTRUCTED KNOWLEDGE BASES

More recently, the rapid development of Web technologies and especially collaborative participation channels (often labeled “Web 2.0”) has offered new possibilities for the construction of language resources. The basic idea is that, instead of a small group of experts, a community of users (“crowd”) collaboratively gathers and edits the lexical information in an open and equitable process. The resulting knowledge is in turn also free to use, adapt and extend for everyone. This open approach has turned out to be very promising to handle the enormous effort of building language resources, as a large community can quickly adapt to new language phenomena like neologisms while at the same time maintaining a high quality by continuous revision—a phenomenon which has become known as the “wisdom of crowds” [Surowiecki, 2005]. The approach also seems to be suitable for multilingual resources, as users speaking any language and from any culture can easily contribute. This is very helpful for minor, usually resource-poor languages where expert-built resources are small or not available at all.

1.2.1 WIKIPEDIA

Wikipedia10 is a collaboratively constructed online encyclopedia and one of the largest freely available knowledge sources. It has long surpassed traditional printed encyclopedias in size, while maintaining comparable quality [Giles, 2005]. The current English version contains around 4,700,000 articles and is by far the largest one, while there are many language editions of significant size. Some, like the German or French editions, also contain more than 1,000,000 articles, each of which usually describes a particular concept.

Although Wikipedia has not been designed as a sense inventory, we can interpret the pairing of an article title and the concept described in the article text as a sense. This interpretation is in accordance with the disambiguation provided in Wikipedia, either as part of the title or on separate disambiguation pages. An example of the former are some articles for Java where its different meanings are marked by “bracketed disambiguations” in the article title such as Java (programming language) and Java (town). An example of the latter is the dedicated disambiguation page for Java which explicitly lists all Java senses contained in Wikipedia.
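Extracting the bracketed disambiguation from an article title is straightforward; a minimal sketch:

```python
import re

def split_title(title):
    """Split a Wikipedia article title into lemma and bracketed
    disambiguation, if present."""
    m = re.fullmatch(r"(.+?) \((.+)\)", title)
    if m:
        return m.group(1), m.group(2)
    return title, None

print(split_title("Java (programming language)"))  # ('Java', 'programming language')
print(split_title("Java"))                         # ('Java', None)
```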

Due to its focus on encyclopedic knowledge, Wikipedia almost exclusively contains nouns. As with word senses, the interpretation of Wikipedia as a LKB gives rise to the induction of further lexical information types, such as sense relations or translations. Since the original purpose of Wikipedia is not to serve as a LKB, this induction process might also lead to inaccurate lexical information. For instance, the links to corresponding articles in other languages provided for Wikipedia articles can be used to derive translations (i.e., equivalents) of an article “sense” into other languages. An example where this leads to an inaccurate translation is the English article Vanilla extract, which links to a subsection titled Vanilleextrakt within the German article Vanille (Gewürz); according to our lexical interpretation of Wikipedia, this leads to the inaccurate German equivalent Vanille (Gewürz) for Vanilla extract.
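A toy sketch of deriving equivalents from interlanguage links, flagging targets that point at a subsection (as in the Vanilla extract example) as potentially inaccurate; the data representation is invented for illustration:

```python
# Interlanguage links as (source article, language) -> target title;
# a '#' in the target means the link points at a subsection, so the
# derived equivalent may be inaccurate.
langlinks = {
    ("Vanilla extract", "de"): "Vanille (Gewürz)#Vanilleextrakt",
    ("Vanilla", "de"): "Vanille (Gewürz)",
}

def equivalent(article, lang):
    """Derive a translation equivalent, flagging subsection targets."""
    target = langlinks.get((article, lang))
    if target is None:
        return None
    title, _, _ = target.partition("#")
    exact = "#" not in target  # False: article-level match is doubtful
    return title, exact

print(equivalent("Vanilla extract", "de"))  # ('Vanille (Gewürz)', False)
```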

Nevertheless, Wikipedia is commonly used as a lexical resource in computational linguistics where it was introduced as such by Zesch et al. [2007], and has subsequently been used for knowledge mining [Erdmann et al., 2009, Medelyan et al., 2009] and various other tasks [Gurevych and Kim, 2012].

Information Types We can derive the following lexical information types from Wikipedia.

Sense definition—While by design one article describes one particular concept, the first paragraph of an article usually gives a concise summary of the concept, which can therefore fulfill the role of a sense definition for NLP purposes.

Sense examples—While usage examples are not explicitly encoded in Wikipedia, they are also inferable by considering the Wikipedia link structure. If a term is linked within an article, the surrounding sentence can be considered as a usage example for the target concept of the link.

Sense relations—Related articles, i.e., senses, are connected via hyperlinks within the article text. However, since the type of the relation is usually missing, these hyperlinks cannot be considered full-fledged sense relations. Nevertheless, they express a certain degree of semantic relatedness. The same observation holds for the Wikipedia category structure which links articles belonging to particular domains.

Equivalents—The different language editions of Wikipedia are interlinked at the article level—the article titles in other languages can thus be used as translation equivalents.
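The link-based derivations described above can be sketched as follows. This is a minimal illustration over a simplified, invented article record; it is not the interface of any real Wikipedia parser, and real wiki markup requires far more robust processing.

```python
import re

# Hypothetical, simplified article record: plain wiki markup plus the
# article's interlanguage links (language code -> title in that edition).
ARTICLE = {
    "title": "Vanilla extract",
    "text": ("'''Vanilla extract''' is a solution made by macerating "
             "[[vanilla|vanilla pods]] in a solution of [[ethanol]] and water. "
             "It is a common [[flavoring]] in Western baking."),
    "langlinks": {"de": "Vanille (Gewürz)"},  # the inaccurate case from the text
}

# [[Target]] or [[Target|anchor text]]
LINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def usage_examples(article):
    """Treat each sentence containing a link as a usage example
    for the link's target concept."""
    examples = {}
    # naive sentence split; real text needs a proper sentence tokenizer
    for sentence in re.split(r"(?<=[.!?])\s+", article["text"]):
        plain = LINK.sub(lambda m: m.group(2) or m.group(1), sentence)
        for target, _anchor in LINK.findall(sentence):
            examples.setdefault(target, []).append(plain)
    return examples

def equivalents(article):
    """Interlanguage links interpreted as translation equivalents.
    As noted above, these may be inaccurate, e.g., when the link
    points to a broader article in the other language."""
    return dict(article["langlinks"])
```

A call to `usage_examples(ARTICLE)` collects one example sentence each for the concepts vanilla, ethanol, and flavoring, while `equivalents(ARTICLE)` reproduces the problematic Vanille (Gewürz) equivalent discussed in the text.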

Related Projects As Wikipedia has nowadays become one of the largest and most widely used knowledge sources, there have been numerous efforts to make it more accessible for automatic processing. These include projects such as YAGO [Suchanek et al., 2007], DBPedia [Bizer et al., 2009], WikiNet [Nastase et al., 2010], and MENTA [de Melo and Weikum, 2010]. Most of them aim at deriving a concept network from Wikipedia ("ontologizing" it) and making it available for Semantic Web applications. WikiData,11 a project directly rooted in Wikimedia, has similar goals, but operates within the framework given by Wikipedia. The goal here is to provide a language-independent repository of structured world knowledge which all language editions can easily integrate.

These related projects basically contain the same knowledge as Wikipedia, only in a different representation format (e.g., suitable for Semantic Web applications), hence we will not discuss them further in this chapter. However, some of the Wikipedia derivatives have reached a wide audience in different communities, including NLP (e.g., DBPedia), and have also been used in different linking efforts, especially in the domain of ontology construction. We will describe corresponding efforts in Chapter 2.

1.2.2 WIKTIONARY

Wiktionary12 is a dictionary “side project” of Wikipedia that was created in order to better cater for the need to represent specific lexicographic knowledge, which is not well suited for an encyclopedia, e.g., lexical knowledge about verbs and adjectives. Wiktionary is available in over 500 languages, and currently the English edition of Wiktionary contains almost 4,000,000 lexical entry pages, while many other language editions achieve a considerable size of over 100,000 entries. Meyer and Gurevych [2012b] found that the collaborative construction approach of Wiktionary yields language versions covering the majority of language families and regions of the world, and that it especially covers a vast amount of domain-specific descriptions not found in wordnets for these languages.

For each lexeme, multiple senses can be encoded, and these are usually described by glosses. Wiktionary contains hyperlinks which lead to semantically related lexemes, such as synonyms, hypernyms, or meronyms, and provides a variety of other information types such as etymology or translations to other languages. However, the link targets are not disambiguated in all language editions, e.g., in the English edition, the links merely lead to pages for the lexical entries, which is problematic for NLP applications as we will see later on. The ambiguity of the links is due to the fact that Wiktionary has been primarily designed to be used by humans rather than machines. The entries are thus formatted for easy perception using appropriate font sizes and bold, italic, or colored text styles. In contrast, for machines, data needs to be available in a structured and unambiguous manner in order to become directly accessible. For instance, an easily accessible data structure for machines would be a list of all translations of a given sense, and encoding the translations by their corresponding sense identifiers in the target language LKBs would make the representation unambiguous.
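The machine-accessible translation structure described above can be sketched as a simple data model. All sense identifiers below are invented for illustration; they do not correspond to identifiers in any actual LKB.

```python
from dataclasses import dataclass, field

# Sketch of the unambiguous representation described in the text:
# translations are stored per sense and point to sense identifiers in
# the target-language LKB, rather than to ambiguous page names.

@dataclass
class SenseEntry:
    sense_id: str                        # hypothetical id, e.g., "en:write01"
    gloss: str
    # language code -> list of sense identifiers in that language's LKB
    translations: dict = field(default_factory=dict)

write01 = SenseEntry(
    sense_id="en:write01",
    gloss="to communicate with someone in writing",
    translations={"de": ["de:schreiben01"], "fr": ["fr:écrire01"]},
)

def translations_of(entry, lang):
    """All translations of a given sense into one language --
    directly accessible, with no disambiguation step needed."""
    return entry.translations.get(lang, [])
```

With this structure, `translations_of(write01, "de")` returns the German sense identifiers directly; in actual Wiktionary markup, by contrast, the same information must be inferred from ambiguous page links.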

This kind of explicit and unambiguous structure does not exist in Wiktionary, but needs to be inferred from the wiki markup.13 Although there are guidelines on how to properly structure a Wiktionary entry, Wiktionary editors are permitted to choose from multiple variants or to deviate from the standards if this can enhance the entry. This presents a major challenge for the automatic processing of Wiktionary data. Another hurdle is the openness of Wiktionary—that is, the possibility to perform structural changes at any time, which raises the need for constant revision of the extraction software.

Wiktionary as a resource for NLP was introduced by Zesch et al. [2008b], and has been considered in many different contexts in subsequent work [Gurevych and Wolf, 2010, Krizhanovsky, 2012, Meyer, 2013, Meyer and Gurevych, 2010, 2012b]. While much work on Wiktionary specifically focuses on a few selected language editions, the multilingual LKB DBnary by Sérasset and Tchechmedjiev [2014] has taken a much broader approach and derived a LKB from Wiktionary editions in 12 languages. A major goal of DBnary is to make Wiktionary easily accessible for automatic processing, especially in Semantic Web applications [Sérasset, 2015].

Particularly interesting for this book are the recent efforts to ontologize Wiktionary and transform it into a standard-compliant, machine-readable format [Meyer and Gurevych, 2012a]. These efforts address issues which are also relevant for the construction of Linked Lexical Knowledge Bases (LLKBs) we will discuss later on. We refer the interested reader to Meyer [2013] for an in-depth survey of Wiktionary from a lexicographic perspective and as a resource for NLP.

Information Types In summary, the main information types contained in Wiktionary are as follows.

Sense definition—Glosses are given for the majority of senses, but due to the open editing approach gaps or “stub” definitions are explicitly allowed. This is especially the case for smaller language editions.

Sense examples—Example sentences which illustrate the usage of a sense are given for a subset of senses.

Sense relations—As mentioned above, semantic relations are generally available, but depending on the language edition, these might be ambiguously encoded. Moreover, different language editions vary greatly in the number of relations relative to the number of senses. For instance, the German edition is six times more densely linked than the English one.

Syntactic behavior—Lexical-syntactic properties are given for a small set of senses. These include subcat frame labels, such as “transitive” or “intransitive.”

Related forms—Related forms are available via links.

Equivalents—As for Wikipedia, translations of senses to other languages are available by links to other language editions. An interesting peculiarity of Wiktionary is that distinct language editions may also contain entries for foreign-language words, for instance, the English edition also contains German lexemes, complete with definitions etc. in English. This is meant as an aid for language learners and is frequently used.

Sense links—Many Wiktionary entries contain links to the corresponding Wikipedia page, thus providing an easy means to supply additional knowledge about a particular concept without overburdening Wiktionary with non-essential (i.e., encyclopedic) information.

In general, it has to be noted that the flexibility of Wiktionary enables the encoding of all kinds of linguistic knowledge, at least in theory. In practice, the information types listed here are those which are commonly used, and thus interesting for our subsequent considerations.

1.2.3 OMEGAWIKI

OmegaWiki,14 like Wiktionary, is freely editable via its web frontend. The current version of OmegaWiki contains over 46,000 concepts and lexicalizations in almost 500 languages. One of OmegaWiki’s discriminating features, in comparison to other collaboratively constructed resources, is that it is based on a fixed database structure which users have to comply with [Matuschek and Gurevych, 2011]. It was initiated in 2006 and explicitly designed with the goal of offering structured and consistent access to lexical information, i.e., avoiding the shortcomings of Wiktionary described above.

To this end, the creators of OmegaWiki decided to limit the degrees of freedom for contributors by providing a “scaffold” of elements which interact in well-defined ways. The central elements of OmegaWiki’s organizational structure are language-independent concepts (so-called defined meanings) to which lexicalizations of the concepts are attached. Defined meanings can thus be considered as multilingual synsets, comparable to resources such as WordNet (cf. Section 1.1.1). Consequently, no specific language editions exist for OmegaWiki as they do for Wiktionary. Rather, all multilingual information is encoded in a single resource.

As an example, defined meaning no. 5616 (representing the concept HAND) carries the lexicalizations hand, main, mano, etc., and also definitions in different languages which describe this concept, for example, “That part of the fore limb below the forearm or wrist.” The multilingual synsets directly yield correct translations as these are merely different lexicalizations of the same concept. It is also possible to have multiple lexicalizations in the same language, i.e., synonyms. An interesting consequence of this design, especially for multilingual applications, is that semantic relations are defined between concepts regardless of existing lexicalizations. Consider, for example, the Spanish noun dedo: it is marked as hypernym of finger and toe, although there exists no corresponding lexicalization for the defined meaning FINGER OR TOE in English. This is, for instance, immediately helpful in translation tasks, since concepts for which no lexicalization in the target language exists can be described or replaced by closely related concepts. Using this kind of information is much more straightforward than in other multilingual resources such as Wiktionary, where the links are not necessarily unambiguous.
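The organization around defined meanings can be sketched as follows. Defined meaning no. 5616 is taken from the text; the other identifiers, the relation inventory, and the `translate` fallback are invented for illustration and do not reflect OmegaWiki's actual database schema.

```python
from dataclasses import dataclass, field

# Sketch: language-independent defined meanings carry lexicalizations
# per language, and relations hold between concepts regardless of
# whether a lexicalization exists in a given language.

@dataclass
class DefinedMeaning:
    dm_id: int
    lexicalizations: dict = field(default_factory=dict)  # lang -> [words]
    hyponyms: list = field(default_factory=list)         # related dm_ids

HAND = DefinedMeaning(5616, {"en": ["hand"], "fr": ["main"], "es": ["mano"]})
# FINGER OR TOE: lexicalized in Spanish (dedo) but not in English
FINGER_OR_TOE = DefinedMeaning(1001, {"es": ["dedo"]}, hyponyms=[1002, 1003])
FINGER = DefinedMeaning(1002, {"en": ["finger"], "es": ["dedo"]})

def translate(dm, lang, all_dms):
    """Direct translation via the multilingual synset; if the concept
    has no lexicalization in the target language, fall back to a
    closely related (here: hyponym) concept, as described in the text."""
    if lang in dm.lexicalizations:
        return dm.lexicalizations[lang]
    for rel_id in dm.hyponyms:
        rel = all_dms.get(rel_id)
        if rel and lang in rel.lexicalizations:
            return rel.lexicalizations[lang]
    return []
```

Here `translate(HAND, "fr", ...)` succeeds directly via the synset, while `translate(FINGER_OR_TOE, "en", ...)` has no English lexicalization and falls back to the related concept FINGER.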

The fixed structure of OmegaWiki ensures easy extraction of the information due to the consistency enforced by the definition of database tables and relations between them. However, it has the drawback of limited expressiveness; for instance, the coding of grammatical properties is only possible to a small extent. In OmegaWiki, the users are not allowed to extend this structure and thus are tied to what has been already defined. Consequently, OmegaWiki’s lack of flexibility and extensibility, in combination with the fact that Wiktionary was already quite popular at its creation time, has caused the OmegaWiki community to remain rather small. While OmegaWiki had 6,746 users at the time of writing, only 19 of them had actively been editing in the past month, i.e., the community is considerably smaller than for Wikipedia or Wiktionary [Meyer, 2013]. Despite the above-mentioned issues, we still believe that OmegaWiki is not only interesting for usage in NLP applications (and thereby for integration into LLKBs), but also as a case study, since it exemplifies how the process of collaboratively creating a large-scale lexical-semantic resource can be guided by means of a structural “skeleton.”

Information Types The most salient information types in OmegaWiki, i.e., those encoded in a relevant portion of entries are as follows.

Sense definitions—Glosses are provided on the concept level, usually in multiple languages.

Sense examples—Examples are given for individual lexicalizations, but only for a few of them.

Sense relations—Semantic as well as ontological relations (e.g., “Germany” borders on “France”) are given, and these are entirely disambiguated.

Equivalents—Translations are encoded by the multilingual synsets which group lexicalizations of a concept in different languages.

Sense links—As for Wiktionary, mostly links to related Wikipedia articles are given to provide more background knowledge about particular concepts.

1.3 STANDARDS

Since LKBs play an important role in many NLP tasks and are expensive to build, the capability to exchange, reuse, and also merge them has become a major requirement. Standardization of LKBs plays an important role in this context, because it allows uniform APIs to be built, and thus facilitates exchange and reuse, as well as integration and merging of LKBs. Moreover, applications can easily switch between different standardized LKBs.

1.3.1 ISO LEXICAL MARKUP FRAMEWORK

The ISO standard Lexical Markup Framework (LMF) [Calzolari et al., 2013, Francopoulo and George, 2013, ISO24613, 2008] was developed to address these issues. LMF is an abstract standard: it defines a meta-model of lexical resources, covering both NLP lexicons and machine-readable dictionaries. The standard specifies this meta-model in the Unified Modeling Language (UML) by providing a set of UML diagrams. UML packages are used to organize the meta-model, and each diagram given in the standard corresponds to a UML package. LMF defines a mandatory core package and a number of extension packages for different types of resources, e.g., morphological resources or wordnets. The core package models a lexicon in the traditional headword-based fashion, i.e., organized by lexical entries. Each lexical entry is defined as the pairing of one to many forms and zero to many senses.
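The headword-based organization of the LMF core package can be sketched in a few classes. The class names follow the standard's terminology, but the attribute names are illustrative only: as discussed below, LMF deliberately leaves the specification of attributes to the lexicon model.

```python
from dataclasses import dataclass, field

# Minimal sketch of the LMF core package: a Lexicon holds LexicalEntry
# objects, each pairing one-to-many Forms with zero-to-many Senses.

@dataclass
class Form:
    written_form: str            # e.g., the lemma "write"

@dataclass
class Sense:
    sense_id: str
    definition: str = ""

@dataclass
class LexicalEntry:
    part_of_speech: str          # illustrative attribute, not prescribed by LMF
    forms: list = field(default_factory=list)    # one to many required
    senses: list = field(default_factory=list)   # zero to many allowed

    def __post_init__(self):
        if not self.forms:
            raise ValueError("a lexical entry requires one to many forms")

@dataclass
class Lexicon:
    language: str
    entries: list = field(default_factory=list)

entry = LexicalEntry(
    part_of_speech="verb",
    forms=[Form("write")],
    senses=[Sense("write01", "to communicate with someone in writing"),
            Sense("write02", "to produce a literary work")],
)
lexicon = Lexicon("en", [entry])
```

The `__post_init__` check enforces the "one to many forms" cardinality from the core package, while an entry with an empty sense list remains valid.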

The abstract meta-model given by the LMF standard is not immediately usable as a format for encoding (i.e., converting) an existing LKB [Tokunaga et al., 2009]. It has to be instantiated first, i.e., a full-fledged lexicon model has to be developed by choosing LMF classes and by specifying suitable attributes for these LMF classes.

According to the standard, developing a lexicon model involves

1. selecting LMF extension packages (the usage of the core package is mandatory),

2. defining attributes for the classes in the core package and in the extension packages (as they are not prescribed by the standard), and

3. explicating the linguistic terminology, i.e., linking the attributes and other linguistic terms introduced (e.g., attribute values) to standardized descriptions of their meaning.

Selecting a combination of LMF classes and their relationships from the LMF core package and from the extension packages establishes the structure of a lexicon model. While the LMF core package models a lexicon in terms of lexical entries, the LMF extensions provide classes for different types of lexicon organization, e.g., covering the synset-based organization of wordnets or the semantic frame-based organization of FrameNet.

Fixing the structure of a lexicon model by choosing a set of classes contributes to the interoperability of LKBs, as it determines the high-level organization of lexical knowledge in a resource, e.g., whether synonymy is encoded by grouping senses into synsets (using the Synset class) or by specifying sense relations (using the SenseRelation class), which connect synonymous senses (i.e., synonyms). Defining attributes for the LMF classes and specifying the attribute values is far more challenging than choosing from a given set of classes, because the standard gives only a few examples of attributes and leaves the specification of attributes to the user in order to allow maximum flexibility.
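The two encodings of synonymy mentioned above are structurally interchangeable, which a small sketch makes concrete. The sense identifiers are invented, and the flattening function is an illustration rather than part of the LMF standard.

```python
from itertools import permutations

# Synonymy encoded by grouping senses into a synset (Synset class) ...
synset = {"car01", "auto01", "automobile01"}

def synset_to_sense_relations(synset):
    """... or by explicit pairwise sense relations (SenseRelation class):
    a synset of n senses flattens into n*(n-1) directed 'synonym' relations."""
    return {(a, "synonym", b) for a, b in permutations(sorted(synset), 2)}

relations = synset_to_sense_relations(synset)
```

A three-element synset thus yields six directed synonym relations; the choice between the two encodings determines the high-level organization of the lexicon model, as described above.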

Finally, the attributes and values have to be linked to a description of their meaning in an ISO compliant Data Category Registry [ISO12620, 2009, Windhouwer and Wright, 2013]. For example, ISOcat15 was the first implementation of the ISO Data Category Registry standard [ISO12620, 2009].16 The data model defined by the Data Category Registry specifies some mandatory information types for its entries, including a unique administrative identifier (e.g., partOfSpeech) and a unique and persistent identifier (PID, e.g., http://www.isocat.org/datcat/DC-396) which can be used in automatic processing and annotation, in order to link to the entries. From a practical point of view, a Data Category Registry can be considered as a repository of mostly linguistic terminology which provides human-readable descriptions of the meaning of terms used in language resources. For instance, the meaning of many terms used for linguistic annotation is given in ISOcat, such as grammaticalNumber, gender, case. Accordingly, a Data Category Registry can be used as a glossary: users can look up the meaning of a term occurring in a language resource by consulting its entry in the Data Category Registry.
