In contrast to Chapter 4, which shows how to improve the decoding part of the encoder-decoder framework, Chapter 5 focuses on encoding and shows how encoders can be modified to better take into account the structure of the input. Indeed, relying on the impressive ability of sequence-to-sequence models to generate text, most of the earlier work on neural approaches to text production systematically encoded the input as a sequence. However, the input to text production often has a non-sequential structure. In particular, knowledge-base fragments are generally viewed as graphs, while the documents which make up the input to summarisation systems are hierarchical structures consisting of paragraphs which themselves consist of sentences and words. Chapter 5 starts by outlining the shortcomings arising from modelling these complex structures as sequences. It then goes on to introduce different ways in which the structure of the input can be better modelled. For document structure, we discuss hierarchical long short-term memories (LSTMs), ensemble encoders, and hybrid convolutional sequence-to-sequence document encoders. We then review the use of graph-to-sequence models, graph-based triple encoders, and graph convolutional networks as means to capture the graph structure of, e.g., knowledge-base data and Abstract Meaning Representations (AMRs).
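To give a concrete flavour of what "encoding document structure" can mean, the sketch below shows a hierarchical encoder in which a word-level LSTM encodes each sentence and a sentence-level LSTM encodes the resulting sentence vectors. This is our own minimal illustration, assuming PyTorch; the class name, dimensions, and shapes are hypothetical and much simpler than the models discussed in Chapter 5.

```python
import torch
import torch.nn as nn


class HierarchicalDocumentEncoder(nn.Module):
    """Word-level LSTM over each sentence, then a sentence-level LSTM over the document."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.sent_lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)

    def forward(self, doc):
        # doc: (num_sentences, max_words) tensor of word ids for one document
        emb = self.embed(doc)                    # (S, W, emb_dim)
        _, (h_word, _) = self.word_lstm(emb)     # h_word: (1, S, hid_dim), one vector per sentence
        sent_vecs = h_word                       # treated as one "sequence" of S sentence vectors
        sent_states, (h_doc, _) = self.sent_lstm(sent_vecs)
        # sent_states: per-sentence representations; h_doc: whole-document vector
        return sent_states.squeeze(0), h_doc.view(-1)


encoder = HierarchicalDocumentEncoder(vocab_size=10000)
doc = torch.randint(0, 10000, (3, 7))   # a toy document: 3 sentences of 7 word ids each
sent_reps, doc_rep = encoder(doc)       # shapes (3, 256) and (256,)
```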
Chapter 6 focuses on ways of guiding the learning process so that constraints stemming from the communication goals are better captured. While the standard encoder-decoder framework assumes learning based on the ground truth, usually using cross-entropy, more recent approaches to text production have started investigating alternative methods such as reinforcement learning and multi-task learning (the latter allows learning signals to be drawn from other, complementary tasks). In Chapter 6, we review some of these approaches, showing, for instance, how a simplification system can be learned using deep reinforcement learning with a reward capturing key features of simplified text: whether it is fluent (perplexity), whether it is different from the source (SARI), and whether it is similar to the reference (BLEU).
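As a rough illustration of how such a composite reward could be put together, the sketch below combines three weighted components mirroring the fluency, dissimilarity-to-source, and similarity-to-reference signals mentioned above. This is not the system discussed in Chapter 6: the component scorers are deliberately simplistic placeholders, and a real implementation would plug in a language-model perplexity, the SARI metric, and BLEU.

```python
def fluency_score(output_tokens):
    # Placeholder: a real system would use a score derived from language-model perplexity.
    return 1.0 / (1.0 + len(output_tokens))  # hypothetical stand-in

def dissimilarity_to_source(output_tokens, source_tokens):
    # Placeholder for SARI: fraction of output tokens that are not copied from the source.
    overlap = len(set(output_tokens) & set(source_tokens))
    return 1.0 - overlap / max(len(set(output_tokens)), 1)

def similarity_to_reference(output_tokens, reference_tokens):
    # Placeholder for BLEU: crude unigram overlap with the reference.
    overlap = len(set(output_tokens) & set(reference_tokens))
    return overlap / max(len(set(output_tokens)), 1)

def reward(output, source, reference, weights=(1.0, 1.0, 1.0)):
    out, src, ref = output.split(), source.split(), reference.split()
    w1, w2, w3 = weights
    return (w1 * fluency_score(out)
            + w2 * dissimilarity_to_source(out, src)
            + w3 * similarity_to_reference(out, ref))

print(reward("the cat sat", "the feline was seated", "the cat sat down"))
```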
Finally, Part III reviews some of the most prominent data sets used in neural approaches to text production (Chapter 7) and mentions a number of open challenges (Chapter 8).
1.3 WHAT’S NOT COVERED?
While natural language generation has long been the “parent pauvre” (poor relation) of NLP, with few researchers, small workshops, and relatively few publications, “the effectiveness of neural networks and their impressive ability to generate fluent text”3 has spurred massive interest in the field over the last few years. As a result, research in that domain is progressing at high speed, covering an increasingly large number of topics. In this book, we focus on introducing the basics of neural text production, illustrating its workings with some examples from data-to-text, text-to-text, and meaning-representations-to-text generation. We do not provide an exhaustive description of the state of the art for these applications, however, nor do we cover all areas of text production. In particular, paraphrasing and sentence compression are only briefly mentioned. Generation from videos and images, in particular caption generation, is not discussed,4 nor is the whole field of automated creative writing, including poetry generation and storytelling. Evaluation is only briefly discussed. Finally, novel models and techniques (e.g., transformer models, contextual embeddings, and generative models for generation) which have recently appeared are only briefly discussed in the conclusion.
1.4 OUR NOTATIONS
We represent words, sentences, documents, graphs, word counts, and other types of observations with Roman letters (e.g., x, w, s, d, W, S, and D) and parameters with Greek letters (e.g., α, β, and θ). We use bold uppercase letters to represent matrices (e.g., X, Y, and Z), and bold lowercase letters to represent vectors (e.g., a, b, and c) for both random variables x and parameters θ. We use [a; b] to denote vector concatenation. All other notations are introduced when they are used.
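As a small worked illustration of the concatenation notation (our own example, not from the text): for a = (1, 2) and b = (3), [a; b] = (1, 2, 3). In general:

```latex
% Concatenation stacks a vector in R^m on top of a vector in R^n,
% yielding a vector in R^{m+n}.
\[
\mathbf{a} \in \mathbb{R}^{m}, \quad
\mathbf{b} \in \mathbb{R}^{n}, \qquad
[\mathbf{a};\mathbf{b}] =
\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix}
\in \mathbb{R}^{m+n}.
\]
```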
1In this case, the speech act is omitted as it is the same for all inputs, namely to recommend the restaurant described by the set of attribute-value pairs.
2Other types of input data have also been considered in NLG such as numerical, graphical and sensor data. We omit them here as, so far, these have been less often considered in neural NLG.
3http://karpathy.github.io/2015/05/21/rnn-effectiveness/
4See Bernardi et al. [2016], Gatt and Krahmer [2018] for surveys of automatic description generation from images and of the Vision-Language Interface.
PART I
Basics
CHAPTER 2
Pre-Neural Approaches
In this chapter, we briefly review the main architectures that were prevalent in pre-neural text generation. These architectures focus on modelling multiple, interacting factors and differ depending on the NLG task they address. More specifically, three main types of pre-neural NLG architectures can be distinguished depending on whether the task is to generate from data, from meaning representations, or from text.
This chapter sets the stage for the following chapters, where neural approaches will be shown to mostly model text production as a single end-to-end process, independently of the specific NLG task being considered and of the number of factors involved. As we shall see, contrary to pre-neural approaches, which use distinct architectures for different NLG tasks, early neural NLG models mostly consist of two sub-modules: an encoder, which maps the input into a continuous representation, and a decoder, which incrementally generates text based on this continuous representation and on the representation of previously generated words.
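For readers who prefer to see the two sub-modules spelled out, the following is a minimal sketch of such an encoder-decoder (our own illustration, assuming PyTorch; the class name, GRU choice, and dimensions are hypothetical simplifications of the models introduced in later chapters).

```python
import torch
import torch.nn as nn


class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=128, hid_dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt):
        # Encode: the final hidden state is the continuous representation of the input.
        _, h = self.encoder(self.src_embed(src))
        # Decode: predict each output word given the representation and the previous words.
        dec_states, _ = self.decoder(self.tgt_embed(tgt), h)
        return self.out(dec_states)  # (batch, tgt_len, tgt_vocab) word scores


model = Seq2Seq(src_vocab=5000, tgt_vocab=5000)
src = torch.randint(0, 5000, (2, 10))   # a toy batch of 2 source sequences
tgt = torch.randint(0, 5000, (2, 8))    # the corresponding target prefixes
logits = model(src, tgt)
```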
2.1 DATA-TO-TEXT GENERATION
Generating text from data is a multiple-choice problem. Consider, for instance, the example input data and the associated text shown in Figure 2.1.
Generating this text involves making the following choices.
• Content Selection: deciding which parts of the input data should be realised by the text. For instance, in our example, the pass from “purple6” to “purple3” is not mentioned. More generally, a generation system needs to decide which parts of the input should be realised in the output text and which information can be omitted.
• Document Planning: finding an appropriate text structure. In our example, the text structure is simple and consists of a single sentence. When the input data is larger however, document planning requires making decisions regarding the number of sentences to be generated, their respective ordering, and the discourse relations holding between them.
• Lexicalisation: determining which words should be chosen to realise the input symbols. For instance, the input symbol “badPass” is lexicalised by the verbal phrase “to make a bad pass”.
• Referring Expression Generation: determining which referring expression (pronoun, proper name, definite description, etc.) should be used to describe input entities. In our example, all entities are referred to by identifiers which directly translate into proper names. In more complex settings, however, particularly when the generated