The following list provides an overview of the main lexical information types distinguished at the level of word senses.
• Sense definition—A definition of the sense in natural language (also called gloss) meant for human interpretation; for example, “to communicate with someone in writing” is a sense definition for the sense write01 given above.
• Sense examples—Example sentences which illustrate the sense in context; for example, “He wrote her an email” is a sense example of the sense write01.
• Sense relations—Lexical-semantic relations to other senses. We list the most salient ones.
– Synonymy connects senses which are lexically different but share the same meaning. Synonymy is reflexive, symmetrical, and transitive. For example, the verbs change and modify are synonyms, as they share the meaning “cause to change.”
Some resources such as WordNet subsume synonymous senses into synsets. However, for the linking algorithms presented in this book, we will usually not distinguish between sense and synset, as for most discussions and experiments in this particular context they can be used interchangeably.
– Antonymy is a relation in which the source and target sense have opposite meanings (e.g., tall and short).
– Hyponymy denotes a semantic relation where the target sense has a more specific meaning than the source sense (e.g., from limb to arm).
– Hypernymy is the inverse relation of hyponymy and thus denotes a semantic relation in which the target sense has a more general meaning than the source sense.
• Syntactic behavior—Lexical-syntactic properties, such as the valency of verbs, i.e., the number and type of syntactic arguments a verb takes; for example, the verb change (“cause to change”) can take a noun phrase subject and a noun phrase object as syntactic arguments, as in: She[subject] changed the rules[object].
In LKBs, valency is represented by subcategorization frames (short: subcat frames). They specify syntactic arguments of verbs, but also of other predicate-like lexemes that can take syntactic arguments, e.g., nouns able to take a that-clause (announcement, fact) or adjectives taking a prepositional argument (proud of, happy about). For syntactic arguments, subcat frames typically specify the syntactic category (e.g., noun phrase, verb phrase) and grammatical function (e.g., subject, object).
• Predicate argument structure information—For predicate-like words, such as verbs, this refers to a definition of the semantic predicate and information on the semantic arguments, including:
– their semantic role according to an inventory of semantic roles given in the context of a particular linguistic theory. There is no standard inventory of semantic roles: some linguistic theories assume small sets of about 40 roles, while others specify very large sets of several hundred roles. Examples of typical semantic roles are Agent and Patient; and
– selectional preference information, which specifies the preferred semantic category of an argument, e.g., whether it is a human or an artifact.
For example, the sense change (“cause to change”) corresponds to a semantic predicate which can be described in natural language as “an Agent causes an Entity to change;” Agent and Entity are semantic roles of this predicate: She[Agent] changed the rules[Entity]; the preferred semantic category of Agent is human.
• Related forms—Word forms that are morphologically related, such as compounds or verbs derived from nouns; for example, the verb buy (“purchase”) is derivationally related to the noun buy, whereas buy (“accept as true,” e.g., I can’t buy this story) is not derivationally related to the noun buy.
• Equivalents—Translations of the sense in other languages; for example, kaufen is the German translation of buy (“purchase”), while abkaufen is the German translation of buy (“accept as true”).
• Sense links—Mappings of senses to equivalent senses in other LKBs; for example, the sense change (Cause_change) in FrameNet can be linked to the equivalent sense change (“cause to change”) in WordNet.
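The information types listed above can be pictured as fields of a single sense record. The following Python sketch is only an illustration under stated assumptions, not the data model of any particular LKB: the class names `Sense` and `Argument`, the helper `add_relation`, and the sense identifiers `limb01`, `arm01`, and `change01` are hypothetical and were chosen to mirror the examples in the list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Inverse relation names (hypothetical labels). Following the text, hypernymy
# is the inverse of hyponymy; synonymy and antonymy are their own inverses.
INVERSE = {
    "hyponym": "hypernym",
    "hypernym": "hyponym",
    "synonym": "synonym",
    "antonym": "antonym",
}


@dataclass
class Argument:
    """One argument of a predicate-like lexeme."""
    category: str                      # syntactic category, e.g. "NP"
    function: str                      # grammatical function, e.g. "subject"
    role: str                          # semantic role, e.g. "Agent"
    preference: Optional[str] = None   # selectional preference, e.g. "human"


@dataclass
class Sense:
    """A word sense with the information types from the list above."""
    sense_id: str
    lemma: str
    definition: str = ""
    examples: List[str] = field(default_factory=list)
    relations: Dict[str, Set[str]] = field(default_factory=dict)
    arguments: List[Argument] = field(default_factory=list)
    related_forms: List[str] = field(default_factory=list)
    equivalents: Dict[str, str] = field(default_factory=dict)   # language -> translation
    sense_links: Dict[str, str] = field(default_factory=dict)   # other LKB -> sense there


def add_relation(senses: Dict[str, Sense], source: str, relation: str, target: str) -> None:
    """Store source --relation--> target and the inverse edge on the target."""
    senses[source].relations.setdefault(relation, set()).add(target)
    senses[target].relations.setdefault(INVERSE[relation], set()).add(source)


senses = {
    "limb01": Sense("limb01", "limb"),
    "arm01": Sense("arm01", "arm"),
    "change01": Sense(
        "change01",
        "change",
        definition="cause to change",
        examples=["She changed the rules."],
        arguments=[
            Argument("NP", "subject", "Agent", preference="human"),
            Argument("NP", "object", "Entity"),
        ],
        sense_links={"FrameNet": "Cause_change"},
    ),
}

# Hyponymy points from the more general source to the more specific target.
add_relation(senses, "limb01", "hyponym", "arm01")
```

Storing the inverse edge together with the forward edge keeps hyponymy and hypernymy consistent by construction. A real resource may instead derive one direction from the other at query time.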
There are different ways to organize an LKB, for example, by grouping synonymous senses, or by grouping senses with the same lemma. The latter organization is the traditional headword-based organization used in dictionaries [Atkins and Rundell, 2008], where an LKB consists of lexical entries which group senses under a common headword (the lemma).
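The two organizations can be contrasted on a handful of senses. The sketch below is a minimal illustration that reuses the change/modify and buy examples from the list above; the synset labels (such as `syn_cause_change`) and the helper `group_by` are hypothetical names introduced here, not identifiers from any real resource.

```python
from collections import defaultdict

# (lemma, synset label, gloss) triples; the synset labels are made up.
SENSES = [
    ("change", "syn_cause_change", "cause to change"),
    ("modify", "syn_cause_change", "cause to change"),
    ("buy", "syn_purchase", "purchase"),
    ("buy", "syn_accept", "accept as true"),
]


def group_by(senses, key_index):
    """Group sense triples by the field at position key_index."""
    groups = defaultdict(list)
    for sense in senses:
        groups[sense[key_index]].append(sense)
    return dict(groups)


by_lemma = group_by(SENSES, 0)    # headword-based organization (dictionary style)
by_synset = group_by(SENSES, 1)   # synonymy-based organization (synset style)
```

Under the headword-based view, both senses of buy end up in one entry. Under the synonymy-based view, change and modify share a group, while the two senses of buy fall into different groups.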
There are many so-called machine-readable dictionaries (MRDs). Most are digitized versions of traditional print dictionaries [Lew, 2011, Soanes and Stevenson, 2003], but some are available only in digital form, such as DANTE [Kilgarriff, 2010] or DWDS for German [Klein and Geyken, 2010]. We will not include them in our overview for the following reasons: MRDs have traditionally been built by lexicographers and are targeted toward human use, rather than toward use by automatic processing components in NLP. While MRDs provide information useful in NLP, such as sense definitions, sense examples, and grammatical information (e.g., about syntactic behavior), the representation of this information in MRDs usually lacks a strict, formal structure, and thus the information is often ambiguous. Although such ambiguities can easily be resolved by humans, they are a source of noise when the dictionary entries are processed fully automatically.
Our definition of LKBs also covers domain-specific terminology resources (e.g., the Unified Medical Language System (UMLS) metathesaurus of medical terms [Bodenreider, 2004]) that provide domain-specific terms and sense relations between them. However, we do not include these domain-specific resources in our overview, because we used general language LKBs to develop and evaluate the linking algorithms presented in Chapter 3.
1.1 EXPERT-BUILT LEXICAL KNOWLEDGE BASES
Expert-built LKBs, in our definition of this term, are resources which are designed, created, and edited by a group of designated experts, e.g., (computational) lexicographers, (computational) linguists, or psycho-linguists. While the editorial process may be influenced from the outside (e.g., via suggestions provided by users or readers), there is usually no direct means of public participation. This form of resource creation has been predominant since the earliest days of lexicography (or, more broadly, creation of language resources). While the reliance on expert knowledge produces high-quality resources, an obvious disadvantage is the slow production cycle: for all of the resources discussed in this section, it usually takes months (if not years) until a new version is published, while at the same time most of the information remains unchanged. This is due to the extensive effort needed to create a resource of considerable size, in most cases provided by a very small group of people. Nevertheless, these resources play a major role in NLP. One reason is that until recent years no real alternatives were available; another is that some of these LKBs cover aspects of language which are rather specific and not easily accessible to lay editors. We will present the most pertinent examples in this section.
1.1.1 WORDNETS
Wordnets define senses primarily by their relations to other senses, most notably the synonymy relation that is used to group synonymous senses into so-called synsets. Accordingly, synsets are the main organizational units in wordnets. In