1.2 AMBIGUITY
It is impossible for computers to analyze language correctly 100% of the time, because language is highly ambiguous: more than one interpretation is often possible, either syntactically or semantically. As humans, we can often use world knowledge to resolve these ambiguities and pick the correct interpretation. Computers cannot easily rely on world knowledge and common sense, so they have to use statistical or other techniques to resolve ambiguity instead. Some kinds of text, such as newspaper headlines and messages on social media, are often deliberately ambiguous, for entertainment value or to make them more memorable. Some classic examples are shown below:
• Foot Heads Arms Body.
• Hospitals Sued by 7 Foot Doctors.
• British Left Waffles on Falkland Islands.
• Stolen Painting Found by Tree.
In the first headline, there is syntactic ambiguity between the proper noun (Michael) Foot, a person, and the common noun foot, a body part; between the verb and the plural noun heads; and the same for arms. There is also semantic ambiguity between two meanings each of arms (weapons and body parts) and body (a physical structure and a large collection). In the second headline, there is semantic ambiguity between two meanings of foot (the body part and the unit of measurement), and also syntactic ambiguity in the attachment of modifiers (7 [Foot Doctors] or [7 Foot] Doctors). In the third example, there is both syntactic and semantic ambiguity in the word Left (the past tense of the verb leave, or a collective noun referring to left-wing politicians). In the fourth example, there is ambiguity in the role of the preposition by (agent or location). In each of these examples, one meaning is obvious to a human, while the other is impossible or extremely unlikely (doctors who are 7 feet tall, for instance). For a machine, however, recognizing without additional context that leaving pastries in the Falkland Islands, though perfectly possible, is an unlikely news item is almost impossible.
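To see how a statistical tool must commit to a single analysis, the short Python sketch below runs the third headline through a part-of-speech tagger. This is a minimal sketch, assuming the spaCy toolkit and its en_core_web_sm model are installed (neither is prescribed by this chapter), and the tags actually produced will vary between models and versions.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("British Left Waffles on Falkland Islands")

# The tagger must pick one part of speech per token: it cannot
# hedge between Left as a verb and Left as a noun.
for token in doc:
    print(token.text, token.pos_)

Whichever analysis the model chooses, the alternative reading is silently discarded; a downstream component that needed the other interpretation has no way to recover it.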
1.3 PERFORMANCE
Performance on NLP tasks varies widely, both between different tasks and between different tools, due not only to ambiguity but to a variety of other issues discussed throughout the book. The reasons for the variable performance of different tools will be examined in the relevant sections, but in general, some tools are good at certain elements of a task and bad at others, and many performance problems arise when tools are trained on one kind of data and tested on another. The variation in performance between tasks, however, is largely a matter of task complexity.
The influence of domain dependence on the effectiveness of NLP tools is an issue that is all too frequently overlooked. For the technology to be suitable for real-world applications, systems need to be easily customizable to new domains. Some NLP tasks in particular, such as Information Extraction (IE), have largely focused on narrow subdomains, as will be discussed in Chapters 3 and 4. The adaptation of existing systems to new domains is hindered by various bottlenecks, such as the acquisition of training data for machine learning–based systems. For the adaptation of Semantic Web applications, ontology development may be a further bottleneck, as will be discussed in Chapter 6.
An independent, though related, issue concerns the adaptation of existing systems to different text genres. By this we mean not just changes in domain, but different media (e.g., email, spoken text, written text, web pages, social media), text types (e.g., reports, letters, books), and structure (e.g., layout). The genre of a text may be influenced by a number of factors, such as author, intended audience, and degree of formality. For example, less formal texts may not follow standard capitalization, punctuation, or even spelling conventions, all of which can be problematic for IE systems. These issues will be discussed in detail in Chapter 8.
Many natural language processing tasks, especially the more complex ones, only become really accurate and usable when they are tightly focused on particular applications and domains. Figure 1.3 shows the tradeoff among three dimensions: generality vs. specificity of domain, complexity of the task, and performance level. From this we can see that the highest performance levels are achieved on language processing tasks that are focused on a specific domain and that are relatively simple (for example, identifying named entities is much simpler than identifying events).
Figure 1.3: Performance tradeoffs for NLP tasks.
For the integration of NLP into semantic web applications to be feasible, semantic web and NLP practitioners must reach some shared understanding of what constitutes a reasonable expectation; the same holds for any application into which NLP is to be integrated. For example, some applications involving NLP may not be realistically usable in the real world as standalone automatic systems without human intervention. This is not necessarily the case for other kinds of semantic web applications that do not rely on NLP. Some applications are designed to assist a human user rather than to perform the task completely autonomously, and a judgment must be made about the degree of autonomy that will most benefit the end user. For example, information extraction systems save the end user from having to read hundreds or even thousands of documents in detail in order to find the information they want; for a human, searching manually through millions of documents is virtually impossible. On the other hand, the user has to bear in mind that a fully automated system will not be 100% accurate, so it is important that the system be designed flexibly with respect to the tradeoff between precision and recall. For some applications, it may be more important to retrieve everything, even if some of the information retrieved is incorrect; for others, it may be more important that everything retrieved is accurate, even if some things are missed.
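The precision/recall tradeoff mentioned above can be made concrete with the standard evaluation measures. The following Python sketch uses invented counts purely for illustration; it computes precision, recall, and the F-beta score, where the beta parameter encodes which side of the tradeoff the application cares about more.

def precision(tp, fp):
    # Of everything the system retrieved, what fraction was correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything that should have been retrieved, what fraction was found?
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # beta > 1 favors recall ("retrieve everything, tolerate errors");
    # beta < 1 favors precision ("everything retrieved must be accurate").
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical counts: 80 correct answers, 20 spurious, 40 missed.
p, r = precision(80, 20), recall(80, 40)
print(round(p, 2), round(r, 2))        # 0.8 0.67
print(round(f_beta(p, r, beta=2), 2))  # recall-oriented score
print(round(f_beta(p, r, 0.5), 2))     # precision-oriented score

A system exposing a tunable threshold lets the end user move along this curve: lowering it retrieves more (higher recall, lower precision), raising it retrieves less but more reliably.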
1.4 STRUCTURE OF THE BOOK
Each chapter in the book is designed to introduce a new concept in the NLP pipeline, and to show how each component builds on the previous components described. In each chapter we outline the concept behind the component and give examples of common methods and tools. While each chapter stands alone to some extent, in that it refers to a specific task, the chapters build on each other. The first five chapters are therefore best read sequentially.
Chapter 2 describes the main approaches used for NLP tasks, and explains the concept of an NLP processing pipeline. The linguistic processing components comprising this pipeline—language identification, tokenization, sentence splitting, part-of-speech tagging, morphological analysis, and parsing and chunking—are then described, and examples are given from some of the major NLP toolkits.
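As a brief illustration of what such a pipeline looks like in practice, the Python sketch below runs several of these components over a short text. It is a minimal sketch using the spaCy toolkit and its en_core_web_sm model as assumptions for illustration; the chapter itself covers several toolkits, and any of them could be substituted here.

import spacy

# A single load gives a pre-built pipeline: tokenizer, tagger,
# parser, and other components run in sequence over the text.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Michael Foot headed the party. He resigned in 1983.")

# Sentence splitting
for sent in doc.sents:
    print(sent.text)

# Tokenization, part-of-speech tagging, morphological analysis (lemma),
# and dependency parsing are all available per token after one call.
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_, token.head.text)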
Chapter 3 introduces the task of named entity recognition and classification (NERC), which is a key component of information extraction and semantic annotation systems, and discusses its importance and limitations. The main approaches to the task are summarized, and a typical NERC pipeline is described.
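As a concrete example of NERC output, the sketch below extracts and classifies named entities, again using spaCy and the en_core_web_sm model as an assumed toolkit rather than one prescribed by the chapter; the entity types and boundaries produced depend on the model used.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Michael Foot led the Labour Party in Britain from 1980 to 1983.")

# Each recognized entity is a span with a type label,
# e.g., PERSON, ORG, GPE (geo-political entity), or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)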
Chapter 4 describes the task of extracting relations between entities, explaining how and why this is useful for automatic knowledge base population. The task can involve either extracting binary relations between named entities, or extracting more complex relations, such as events. It describes a variety of methodologies and a typical extraction pipeline, showing the