1.1.3 VALENCY LEXICONS
Most of the early work on LKBs for NLP treated valency as a central information type, because it was essential for deep syntactic and semantic parsing with broad-coverage hand-written grammars (e.g., Head-Driven Phrase Structure Grammar [Copestake and Flickinger], or Lexical Functional Grammar as in the ParGram project [Sulger et al., 2013]). Valency is the lexical property of a word that requires it to combine with certain syntactic arguments in order to form well-formed phrases or clauses. For example, the verb assassinate requires not only a subject but also an object: compare the ill-formed *He assassinated. with the well-formed He assassinated his colleague. Valency information is also included in MRDs, but it is often represented ambiguously and is therefore hard to process automatically. For this reason, a number of valency LKBs have been built specifically for NLP applications. These LKBs use subcat frames to represent valency information.
It is important to note that subcat frames are a lexical property of senses, rather than words. Consider the following example of the two senses of see and their sense-specific subcat frames (1) and (2): subcat frame (1) is only valid for the see—“interpret in a particular way” sense, but not for the see—“perceive with the eyes” sense:
see—“interpret in a particular way:”
subcat frame (1): (arg1:subject(nounPhrase),arg2:prepositionalObject(asPhrase))
sense example: Some historians see his usurpation as a panic response to growing insecurity.
see—“perceive with the eyes:”
subcat frame (2): (arg1:subject(nounPhrase),arg2:object(nounPhrase))
sense example: Can you see the bird in that tree?
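The two sense-specific frames above can be written down as a small data structure. The following is a minimal sketch only: the class names `Argument` and `Sense`, the `LEXICON` list, and the helper `senses_licensing` are hypothetical and do not correspond to the format of any particular valency LKB. It illustrates that a lookup by lemma alone is ambiguous, whereas a lookup by lemma and subcat frame selects a single sense.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Argument:
    """One syntactic argument slot of a subcat frame (hypothetical encoding)."""
    function: str  # grammatical function, e.g. "subject"
    category: str  # phrase type, e.g. "nounPhrase"


@dataclass(frozen=True)
class Sense:
    """A sense, modeled as a lexeme paired with one subcat frame."""
    lemma: str
    gloss: str
    frame: Tuple[Argument, ...]
    example: Optional[str] = None


NP_SUBJECT = Argument("subject", "nounPhrase")

# The two senses of "see" from the text, with subcat frames (1) and (2).
LEXICON: List[Sense] = [
    Sense(
        lemma="see",
        gloss="interpret in a particular way",
        frame=(NP_SUBJECT, Argument("prepositionalObject", "asPhrase")),
        example="Some historians see his usurpation as a panic response "
                "to growing insecurity.",
    ),
    Sense(
        lemma="see",
        gloss="perceive with the eyes",
        frame=(NP_SUBJECT, Argument("object", "nounPhrase")),
        example="Can you see the bird in that tree?",
    ),
]


def senses_licensing(lemma: str, frame: Tuple[Argument, ...],
                     lexicon: List[Sense] = LEXICON) -> List[str]:
    """Return the glosses of all senses of `lemma` that carry exactly `frame`."""
    return [s.gloss for s in lexicon if s.lemma == lemma and s.frame == frame]
```

Because the frame is part of the sense, a parser that has identified an as-phrase complement of see can narrow the candidate senses accordingly.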
Subcat frames contain language-specific elements, even though some of their elements may be valid cross-lingually. For example, there are certain properties of syntactic arguments in English and German that correspond (both English and German are Germanic languages and hence closely related), while other properties, mainly morphosyntactic ones, diverge [Eckle-Kohler and Gurevych, 2012]. Examples of such divergences include the overt case marking in German (e.g., for the dative case) or the fact that the ing-form in English verb phrase complements is sometimes realized as zu-infinitive in German.
According to many researchers in linguistics, different subcat frames of a lexeme are associated with different but related meanings, an analysis which is called the “multiple meaning approach” by Hovav and Levin [2008]. The multiple meaning approach gives rise to different senses, i.e., pairs of lexeme and subcat frame. Hence, valency LKBs provide an implicit characterization of senses via subcat frames, which can be considered as abstractions of sense examples. Sense examples illustrating a lexeme in a particular subcat frame (e.g., extracted from corpora) might be provided in addition. However, valency LKBs do not necessarily assign unique identifiers to senses, or group (nearly) synonymous senses into entries (as MRDs do).
Examples of Valency Lexicons
COMLEX Syntax is an English valency LKB providing detailed subcat frames for about 38,000 headwords [Grishman et al., 1994]. Another well-known valency LKB is CELEX, which covers English as well as Dutch and German. The PAROLE project (Preparatory Action for Linguistic Resources Organization for Language Engineering) initiated the creation of valency LKBs in 12 European languages (Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish), all of which were built on the basis of corpora. However, the resulting LKBs are comparatively small. For example, the Spanish PAROLE lexicon contains syntactic information for only about 325 verbs [Villegas and Bel, 2015].
There are many valency LKBs in languages other than English. For German, an example of a large-scale valency LKB is IMSLex-Subcat, a broad-coverage subcategorization lexicon for German verbs, nouns and adjectives, covering about 10,000 verbs, 4,000 nouns, and 200 adjectives [Eckle-Kohler, 1999, Fitschen, 2004]. For verbs, about 350 different subcat frames are distinguished. IMSLex-Subcat was semi-automatically created: the subcat frames were automatically extracted from large newspaper corpora, and manually filtered afterward.
Information Types
In summary, the following lexical information types are salient for valency LKBs.
• Syntactic behavior—Valency LKBs provide lexical-syntactic information on predicate-like words by specifying their syntactic behavior via subcat frames.
• Sense examples—For individual pairs of lexeme and subcat frame, sense examples might be given as well.
1.1.4 VERBNETS
According to Levin [1993], verbs that share common syntactic argument alternation patterns also have particular meaning components in common, thus they can be grouped into semantic verb classes. Consider as an example verbs participating in the dative alternation, e.g., give and sell. These verbs can realize one of their arguments syntactically either as a noun phrase or as a prepositional phrase with to, i.e., they can be used with two different subcat frames:
Martha gives (sells) an apple to Myrna.
Martha gives (sells) Myrna an apple.
Verbs having this alternation behavior in common can be grouped into a semantic class of verbs sharing the particular meaning component “change of possession,” thus this shared meaning component characterizes the semantic class.
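The grouping idea can be illustrated with a toy computation. The sketch below is an assumption-laden simplification, not Levin's procedure: the frame labels (`"NP V NP PP(to)"` and so on), the three-verb inventory, and both helper functions are invented for illustration. It treats an alternation as a set of subcat frames and collects the verbs that license all of them.

```python
from collections import defaultdict

# Hypothetical frame labels; real valency LKBs use richer descriptions.
NP_V_NP_PP_TO = "NP V NP PP(to)"  # Martha gives an apple to Myrna.
NP_V_NP_NP = "NP V NP NP"         # Martha gives Myrna an apple.
NP_V_NP = "NP V NP"               # He assassinated his colleague.

# Toy inventory: each verb is mapped to the subcat frames it licenses.
VERB_FRAMES = {
    "give": {NP_V_NP_PP_TO, NP_V_NP_NP},
    "sell": {NP_V_NP_PP_TO, NP_V_NP_NP},
    "assassinate": {NP_V_NP},
}

# The dative alternation, modeled as a pair of frames a verb must license.
DATIVE_ALTERNATION = frozenset({NP_V_NP_PP_TO, NP_V_NP_NP})


def verbs_with_alternation(verb_frames, alternation):
    """Return, sorted, the verbs whose frame set contains every frame of `alternation`."""
    return sorted(v for v, frames in verb_frames.items() if alternation <= frames)


def group_by_frame_set(verb_frames):
    """Group verbs that license exactly the same set of frames."""
    groups = defaultdict(list)
    for verb, frames in sorted(verb_frames.items()):
        groups[frozenset(frames)].append(verb)
    return dict(groups)
```

In this toy inventory, give and sell end up in the same group, while assassinate does not participate in the alternation. Shared syntactic behavior alone is only a starting point for candidate classes; the shared meaning component still has to be established separately.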
The best-known verb classification based on the correspondence between verb syntax and verb meaning is Levin’s classification of English verbs [Levin, 1993]. Recent work on verb semantics provides additional evidence for this correspondence between verb syntax and meaning [Hartshorne et al., 2014, Levin, 2015].
The English VerbNet [Kipper et al., 2008] is a broad-coverage verb lexicon that is based on Levin’s classification and covers about 3,800 verb lemmas. It is organized into about 270 verb classes based on syntactic alternations. Verbs with common subcat frames and syntactic alternation behavior that also share common semantic roles are grouped into VerbNet classes, which are hierarchically structured to represent information about related subcat frames.
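One way to picture such a hierarchy is a class tree in which a subclass adds frames to those of its parent. The sketch below assumes exactly this inheritance reading. The class `VerbClass`, the class names `transfer` and `transfer-1`, their members, and the frame labels are all invented for illustration; they are not actual VerbNet classes, and this is not VerbNet's file format or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VerbClass:
    """A node in a toy verb class hierarchy (hypothetical, not VerbNet's schema)."""
    name: str
    members: List[str] = field(default_factory=list)
    frames: List[str] = field(default_factory=list)
    parent: Optional["VerbClass"] = None

    def all_frames(self) -> List[str]:
        """Frames of this class: those inherited from ancestors, then its own."""
        inherited = self.parent.all_frames() if self.parent else []
        return inherited + [f for f in self.frames if f not in inherited]


# Invented parent class: its members take a to-phrase recipient only.
TRANSFER = VerbClass(
    name="transfer",
    members=["donate"],
    frames=["NP V NP PP(to)"],
)

# Invented subclass: its members additionally allow the double-object frame.
TRANSFER_1 = VerbClass(
    name="transfer-1",
    members=["give", "sell"],
    frames=["NP V NP NP"],
    parent=TRANSFER,
)
```

Under this assumption, `TRANSFER_1.all_frames()` yields both frames, whereas `TRANSFER.all_frames()` yields only the to-phrase frame, so the frames shared by related classes are stated once, at the parent.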
VerbNet not only includes the verbs from the original verb classification by Levin, but also more than 50 additional verb classes [Kipper et al., 2006] automatically acquired from corpora [Korhonen and Briscoe, 2004]. These classes cover many verbs taking non-finite verb phrases and subordinate clauses as complements, which were not included in Levin’s original classification. VerbNet (version 3.1) lists 568 subcat frames specifying syntactic types and semantic roles of the arguments, as well as selectional preferences, and syntactic and morpho-syntactic restrictions on the arguments.
Although it might often be hard to pin down what the shared meaning components of VerbNet classes really are, VerbNet has successfully been used in various NLP tasks, many of them including the subtask of mapping syntactic chunks of a sentence to semantic roles [Pradet et al., 2014]; see also Chapter 6.1 for an example.
Verbnets in Other Languages
While the importance of verbnet-like LKBs for less-resourced languages has been widely recognized, there have been few efforts to build verbnets of the same quality as the English one. Most previous work explored fully automatic approaches to transferring the English VerbNet to another language, which introduces noise. Semi-automatic approaches, too, are often based on translating the English VerbNet into another language.
Most importantly, many of the detailed subcat frames available for English, as well as the syntactic alternations, cannot be carried over to other languages, since valency is largely language-specific (e.g., [Scarton and Aluísio, 2012]). Therefore, the development of high-quality verbnets in languages other than English requires the existence of a broad-coverage valency lexicon as a prerequisite. For this reason, valency lexicons, especially tools for their (semi-)automatic construction, are still receiving considerable attention.
A recent example of a high-quality verbnet in another language is the French verbnet (covering