Co-reference resolution links different mentions of the same entity. The task matters because it supports the later discovery of relations between entities, and it also helps with named entity linking. When the mentions are identical strings the task is easy, but the same entity is often mentioned in different ways. For example, John Smith, Mr. Smith, John, J. S. Smith, and Smith could all refer to the same person. Similarly, we may have acronyms (U.K. and United Kingdom) or even aliases which bear no surface resemblance to the alternative name (IBM and The Big Blue). With the exception of this last form, where lists of explicit name pairs are often the best solution, rule-based systems tend to be quite effective for this task. For example, even though acronyms are often highly ambiguous, within a single document it is rare for an acronym and a longer name matching its letters not to co-refer. Explicit lists of pairs, and lists of exceptions, can also be added to such rules.
ANNIE’s Orthomatcher is a good example of a co-reference tool which relies entirely on hand-coded rules, performing on news texts with around 95% accuracy [37]. The Stanford Coref tool is integrated in the Stanford CoreNLP pipeline, and implements the multi-pass sieve co-reference and anaphora resolution system described in [38]. SANAPHOR [39] extends this by adding a semantic layer on top, which improves the results. It takes as input the co-reference clusters generated by Stanford Coref, splits those containing unrelated mentions, and merges those which should belong together. To do so, it uses the output of an NEL process involving DBpedia/YAGO: mentions linked to different entities are separated, and mentions linked to the same entity are merged. It can also be used with other NERC and NEL tools as input.
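The acronym and alias rules described above can be illustrated with a minimal Python sketch. It is not how Orthomatcher or Stanford Coref are implemented; the function names and the contents of the alias list are assumptions made purely for illustration.

```python
import re

# Hypothetical explicit alias list, for pairs with no surface resemblance.
ALIASES = {frozenset({"IBM", "The Big Blue"})}

def initials(name):
    """Return the upper-case initials of the capitalized words in a name."""
    words = re.findall(r"[A-Za-z]+", name)
    return "".join(w[0] for w in words if w[0].isupper())

def is_acronym_of(short, long_name):
    """True if the letters of `short` spell the initials of `long_name`."""
    letters = re.sub(r"[^A-Za-z]", "", short).upper()
    return len(letters) > 1 and letters == initials(long_name)

def corefer(a, b):
    """Very small rule set: identical strings, listed aliases, or acronyms."""
    if a == b:
        return True
    if frozenset({a, b}) in ALIASES:
        return True
    return is_acronym_of(a, b) or is_acronym_of(b, a)

print(corefer("U.K.", "United Kingdom"))
print(corefer("IBM", "The Big Blue"))
print(corefer("U.K.", "United States"))
```

A real system would add rules for titles, initials, and surnames (Mr. Smith, J. S. Smith), together with an exception list for known false matches.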
3.6 APPROACHES TO NERC
Approaches to NERC can be roughly divided into rule- or pattern-based methods and machine learning or statistical extraction methods [40], although quite often the two techniques are mixed (see [41][42][43]). Most learning-based techniques rely on some form of human supervision, with the exception of purely structural IE techniques performing unsupervised machine learning on unannotated documents [44]. As we have already seen, language engineering platforms, such as GATE, Stanford CoreNLP, OpenNLP, and NLTK, enable the modular implementation of techniques and algorithms for information extraction, by inserting different pre-processing and NERC modules into the pipeline, thereby allowing repeatable experimentation and evaluation of their results. An example of a typical processing pipeline for NERC is shown in Figure 3.1.
Figure 3.1: Typical NERC pipeline.
3.6.1 RULE-BASED APPROACHES TO NERC
Linguistic rule-based methods for NERC, such as those used in ANNIE, GATE’s information extraction system, typically comprise a combination of gazetteer lists and hand-coded pattern-matching rules. These rules use contextual information to help determine whether candidate entities from the gazetteers are valid, or to extend the set of candidates. The gazetteer lists act as a starting point from which to establish, reject, or refine the final entity to be extracted. A typical NERC processing pipeline consists of linguistic pre-processing (tokenization, sentence splitting, POS tagging) as described in the previous chapter, followed by entity finding using gazetteers and grammars, and then co-reference resolution.
Gazetteer lists are designed for annotating simple, regular features such as known names of companies, locations, days of the week, famous people, etc. A typical set of gazetteers for NERC might contain hundreds or even thousands of entries. However, using gazetteers alone is insufficient for recognizing and classifying entities, because on the one hand many names are too ambiguous (e.g., “London” could be part of an Organization name, a Person name, or just the Location), and on the other hand they cannot specify every named entity (e.g., in English one cannot pre-specify every single possible surname). When gazetteers are combined with other linguistic pre-processing annotations (part-of-speech tags, capitalization, other contextual evidence), however, they can be very powerful.
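The limits of gazetteer lookup on its own can be seen in a minimal Python sketch. The lists, the type names, and the example sentence below are assumptions chosen for illustration, not the contents of any real gazetteer.

```python
# Hypothetical, tiny gazetteer: entity type -> set of known names.
GAZETTEER = {
    "Location": {"London", "Sheffield", "Bristol"},
    "Organization": {"IBM"},
    "Day": {"Monday", "Tuesday"},
}

def lookup(tokens):
    """Return (index, token, type) for every token found in a gazetteer list."""
    hits = []
    for i, tok in enumerate(tokens):
        for etype, entries in GAZETTEER.items():
            if tok in entries:
                hits.append((i, tok, etype))
    return hits

tokens = "Jack London visited the London School of Economics in London".split()
for hit in lookup(tokens):
    print(hit)
```

All three occurrences of London are marked as Location, although the first is part of a Person name and the second is part of an Organization name. Only contextual evidence, such as the preceding first name or the following keyword School, can separate the three readings.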
Using pattern matching for NERC requires the development of patterns over multi-faceted structures that consider many different properties of words, such as orthography (capitalization), morphology, part-of-speech information, and so on. Patterns written in traditional pattern-matching languages, such as Perl, quickly become unmanageably complex when used for such tasks. Therefore, attribute-value notations are normally used, which allow conditions to refer to token attributes arising from multiple levels of analysis. An example of this is JAPE, the Java-based pattern matching language used in GATE, based on CPSL [45]. JAPE employs a declarative notation that allows for context-sensitive rules to be written and for non-deterministic pattern matching to be performed. The rules are divided into phases (subsets) which run sequentially; each phase typically consists of rules for the same entity type (e.g., Person) or rules that share the same specific requirements for being run. A variety of priority mechanisms deal with competing rules, which makes it possible to handle ambiguity: for example, one can prefer patterns occurring in a particular context, or one can prefer a certain entity type over another in a given situation. Other rule-based mechanisms work in a similar way.
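The idea of attribute-value patterns with rule priorities can be sketched in Python as follows. This is not JAPE syntax and does not reproduce JAPE's matching semantics; the token attributes, rule names, and priority values are all assumptions made for illustration.

```python
def token_matches(token, constraint):
    """A token satisfies a constraint if every listed attribute has the required value."""
    return all(token.get(attr) == value for attr, value in constraint.items())

def match_at(tokens, start, pattern):
    """Return the end offset if `pattern` matches at `start`, else None."""
    end = start + len(pattern)
    if end > len(tokens):
        return None
    window = tokens[start:end]
    if all(token_matches(t, c) for t, c in zip(window, pattern)):
        return end
    return None

# Hypothetical rules: each pattern is a sequence of attribute-value constraints.
RULES = [
    {"name": "TitlePerson", "type": "Person", "priority": 20,
     "pattern": [{"lookup": "title"}, {"pos": "NNP"}]},
    {"name": "ProperNounPair", "type": "Unknown", "priority": 10,
     "pattern": [{"pos": "NNP"}, {"pos": "NNP"}]},
]

def best_annotation(tokens, start):
    """Apply all rules at `start` and keep the match with the highest priority."""
    matches = []
    for rule in RULES:
        end = match_at(tokens, start, rule["pattern"])
        if end is not None:
            matches.append((rule, end))
    if not matches:
        return None
    rule, end = max(matches, key=lambda m: m[0]["priority"])
    return (start, end, rule["type"])

# Tokens carry attributes from several analysis levels (POS tagger, gazetteer).
tokens = [
    {"string": "General", "pos": "NNP", "lookup": "title"},
    {"string": "Carpenter", "pos": "NNP"},
]
print(best_annotation(tokens, 0))  # (0, 2, 'Person')
```

Both rules match the two tokens, and the priority value decides which annotation is kept. Each constraint combines evidence from different analysis levels without encoding it in a single string pattern.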
A typical simple pattern-matching rule might try to match all university names, e.g., University of Sheffield, University of Bristol, where the pattern consists of the specific words University of followed by the name of a city. From the gazetteer, we can check for the mention of a city name such as Sheffield or Bristol. A more complex rule might try to identify the name of any organization by looking for a keyword from a gazetteer list, such as Company, Organization, Business, School, etc. occurring together with one or more proper nouns (as found by the POS Tagger), and potentially also containing some function words. While these kinds of rules are quite good at matching typical patterns (and work very well for some entity types such as Persons, Locations, and Dates), they can be highly ambiguous. Compare for example the company name General Motors, the person name General Carpenter, and the phrase Major Disaster (which does not denote any entity), and it can easily be seen that such patterns are insufficient. Learning approaches, on the other hand, may be good at recognizing that disaster is not typically part of a person or organization’s name, because it never occurs as such in the training corpus.
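The two kinds of rule discussed above can be sketched in a few lines of Python. The city and title lists, the function names, and the whitespace tokenization are simplifying assumptions; a real grammar would operate on POS-tagged tokens and gazetteer annotations.

```python
CITIES = {"Sheffield", "Bristol"}   # hypothetical city gazetteer
TITLES = {"General", "Major"}       # hypothetical title gazetteer

def find_universities(text):
    """Match the words 'University of' followed by a city from the gazetteer."""
    tokens = text.split()
    found = []
    for i in range(len(tokens) - 2):
        city = tokens[i + 2].strip(".,;")
        if tokens[i] == "University" and tokens[i + 1] == "of" and city in CITIES:
            found.append(f"University of {city}")
    return found

def naive_person(phrase):
    """Over-general rule: a title followed by a capitalized word is a Person."""
    first, second = phrase.split()
    return first in TITLES and second[0].isupper()

print(find_universities(
    "She left the University of Sheffield for the University of Bristol."))
for phrase in ["General Carpenter", "General Motors", "Major Disaster"]:
    print(phrase, naive_person(phrase))
```

The university rule is precise because it is anchored on both fixed words and a gazetteer entry. The title rule accepts all three phrases, although only General Carpenter is a person, which is the kind of ambiguity that additional context rules or learned models are needed to resolve.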
As mentioned above, rule-based systems are developed based on linguistic features, such as POS tags or context information. Instead of developing such rules manually, it is possible to label training examples and then learn the rules automatically, using rule learning (also known as rule induction) systems. These induce sets of rules from labeled training examples using supervised learning. They were popular among the early NERC learning systems, and include SRV [46], RAPIER [47], WHISK [48], BWI [49], and (LP)² [50].
3.6.2 SUPERVISED LEARNING METHODS FOR NERC
Rule learning methods were historically