3.1.2 RECURRENT NEURAL NETWORKS
Feed-forward networks and CNNs fail to adequately represent the sequential nature of natural languages. In contrast, RNNs provide a natural model for them.
An RNN updates its state for every element of an input sequence. Figure 3.4 presents an RNN on the left and its application to the natural language text “How are you doing?” on the right. At each time step t, it takes as input the previous state s_{t-1} and the current input element x_t, and updates its current state as:
where U and V are model parameters. At the end of an input sequence, the final state is a learned representation that encodes information from the whole sequence.
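The state update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the common form s_t = tanh(U x_t + V s_{t-1}), with U applied to the input and V to the previous state (the roles of U and V and the choice of tanh are assumptions), and it uses random vectors as hypothetical stand-ins for the word representations.

```python
import numpy as np

def rnn_encode(inputs, U, V, s0=None):
    """Run a vanilla RNN over a sequence and return the state at every step.

    inputs: array of shape (T, d_in), one row per input element x_t
    U:      array of shape (d_state, d_in), input-to-state weights (assumed role)
    V:      array of shape (d_state, d_state), state-to-state weights (assumed role)
    """
    s = np.zeros(V.shape[0]) if s0 is None else s0
    states = []
    for x_t in inputs:
        # Assumed update: s_t = tanh(U x_t + V s_{t-1})
        s = np.tanh(U @ x_t + V @ s)
        states.append(s)
    return np.stack(states)

rng = np.random.default_rng(0)
tokens = ["How", "are", "you", "doing", "?"]
d_in, d_state = 4, 3

# Hypothetical random vectors standing in for the inputs x_t
X = rng.normal(size=(len(tokens), d_in))
U = rng.normal(scale=0.5, size=(d_state, d_in))
V = rng.normal(scale=0.5, size=(d_state, d_state))

states = rnn_encode(X, U, V)
sentence_repr = states[-1]  # final state: depends on every element of the sequence
print(states.shape)         # (5, 3)
```

The same function handles sequences of any length, since the parameters U and V are shared across time steps.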
Figure 3.4: RNNs applied to a sentence.
Figure 3.5: Long-range dependencies. The shown dependency tree is generated using the Stanford CoreNLP toolkit [Manning et al., 2014].
Most work on neural text production has used RNNs due to their ability to naturally capture the sequential nature of the text and to process inputs and outputs of arbitrary length.
3.1.3 LSTMS AND GRUS
RNNs naturally permit taking arbitrarily long context into account, and so implicitly capture long-range dependencies, a phenomenon frequently observed in natural languages. Figure 3.5 shows an example of a long-range dependency in the sentence “The yogi, who gives yoga lessons every morning at the beach, is meditating.” A good representation learning method should capture that “the yogi” is the subject of the verb “meditating” in the sentence.
In practice, however, as the length of the input sequence grows, RNNs are prone to losing information from the beginning of the sequence due to the vanishing and exploding gradient problems [Bengio et al., 1994, Pascanu et al., 2013]. This is because, in the case of RNNs, back propagation applies through a large number of layers (one layer for each time step). Since back propagation updates the weights in proportion to the partial derivatives (the gradients) of the loss, and because of the sequential multiplication of matrices as the RNN is unrolled, the gradient may become either very large or (more commonly) very small, effectively causing the weights to either explode or never change at the lower/earlier layers. Consequently, RNNs fail to adequately model the long-range dependencies of natural languages.
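The effect of the repeated matrix multiplication can be illustrated numerically. The sketch below makes a simplifying assumption: the derivative of the non-linearity is treated as 1 (its largest value for tanh), so each time step contributes one factor of the transposed recurrent matrix to the gradient. The recurrent matrix is a hypothetical scaled orthogonal matrix, chosen so that all its singular values are equal and the behaviour is easy to read off.

```python
import numpy as np

def gradient_norm_through_time(V, steps):
    """Spectral norm of the product of 1, 2, ..., `steps` copies of V^T.

    Simplifying assumption: the non-linearity's derivative is taken to be 1,
    so the gradient flowing from the last state back by k time steps is
    scaled by the product of k copies of V^T.
    """
    grad = np.eye(V.shape[0])
    norms = []
    for _ in range(steps):
        grad = grad @ V.T
        norms.append(np.linalg.norm(grad, 2))
    return norms

rng = np.random.default_rng(0)
# Random orthogonal matrix Q (all singular values equal to 1), via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

shrinking = gradient_norm_through_time(0.9 * Q, steps=50)  # singular values 0.9
growing = gradient_norm_through_time(1.1 * Q, steps=50)    # singular values 1.1

print(f"after 50 steps: {shrinking[-1]:.2e} vs {growing[-1]:.2e}")
```

Under these assumptions the norm after k steps is exactly 0.9^k or 1.1^k, so after 50 steps the gradient has shrunk to roughly 0.005 of its size in one case and grown more than a hundredfold in the other. With a saturating non-linearity the derivative factors are at most 1, which pushes further towards vanishing.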
Figure 3.6: Sketches of LSTM and GRU cells. On the left, i, f, and o are the input, forget, and output gates, respectively. c and
Long short-term memory (LSTM, [Hochreiter and Schmidhuber, 1997]) and gated recurrent unit (GRU, [Cho et al., 2014]) networks have been proposed as alternative recurrent networks that are better suited to learning long-distance dependencies. These units are better at learning to memorise only the part of the past that is relevant for the future. At each time step, they dynamically update their states, deciding what to memorise and what to forget from the previous input.
The LSTM cell (shown in Figure 3.6, left) achieves this using input (i), forget (f), and output (o) gates with the following operations:
where W* and b* are LSTM cell parameters. The input gate (Eq. (3.3)) regulates how much of the new cell state to retain, the forget gate (Eq. (3.2)) regulates how much of the existing memory to forget, and the output gate (Eq. (3.4)) regulates how much of the cell state should be passed forward to the next time step. The GRU cell (shown in Figure 3.6, right), on the other hand, achieves this using update (z) and reset (r) gates with the following operations:
where W* are GRU cell parameters. The update gate (Eq. (3.8)) regulates how much of the candidate activation to use in updating the cell state, and the reset gate (Eq. (3.9)) regulates how much of the cell state to forget. The LSTM cell has separate input and forget gates, while the GRU cell performs both of these operations together using its update gate.
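The gate descriptions above can be made concrete with a single step of each cell. The sketch below follows a widely used formulation of the LSTM and GRU; it is an assumption that this matches the equations referenced in the text, and the dictionary keys, shapes, and the convention of concatenating the previous state with the input are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W and b are dicts keyed by 'f', 'i', 'o', 'c'
    (hypothetical names); each W[k] has shape (d, d + d_in)."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ hx + b["f"])        # forget gate
    i = sigmoid(W["i"] @ hx + b["i"])        # input gate
    o = sigmoid(W["o"] @ hx + b["o"])        # output gate
    c_cand = np.tanh(W["c"] @ hx + b["c"])   # candidate cell state
    c = f * c_prev + i * c_cand              # forget part of the old memory, admit part of the new
    h = o * np.tanh(c)                       # pass part of the cell state forward
    return h, c

def gru_step(x_t, h_prev, W):
    """One GRU step. W is a dict keyed by 'z', 'r', 'h' (hypothetical names)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W["z"] @ hx)                 # update gate
    r = sigmoid(W["r"] @ hx)                 # reset gate
    # Candidate activation, computed from the reset-scaled previous state
    h_cand = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]))
    return (1.0 - z) * h_prev + z * h_cand   # interpolate old state and candidate

rng = np.random.default_rng(0)
d, d_in = 3, 4
x = rng.normal(size=d_in)
h0, c0 = np.zeros(d), np.zeros(d)
W_lstm = {k: rng.normal(scale=0.5, size=(d, d + d_in)) for k in "fioc"}
b_lstm = {k: np.zeros(d) for k in "fioc"}
W_gru = {k: rng.normal(scale=0.5, size=(d, d + d_in)) for k in "zrh"}

h1, c1 = lstm_step(x, h0, c0, W_lstm, b_lstm)
g1 = gru_step(x, h0, W_gru)
print(h1.shape, c1.shape, g1.shape)
```

In this formulation, a forget gate close to 1 combined with an input gate close to 0 carries the previous cell state forward almost unchanged, which is how the cell can retain information over many time steps.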
In a vanilla RNN, the entire cell state is updated with the current activation, whereas both LSTMs and GRUs have a mechanism to keep memory from previous activations. This allows recurrent networks with LSTM or GRU cells to remember features for a long time and reduces the vanishing gradient problem that arises as the gradient back propagates through multiple bounded non-linearities.
LSTMs and GRUs have been very successful in modelling natural languages in recent years. They have practically replaced the vanilla RNN cell in recurrent networks.
3.1.4 WORD EMBEDDINGS
One of the key strengths of neural networks is that representation learning happens in a continuous space. For example, an RNN learns a continuous dense representation of an input text by encoding the sequence of words making up that text. At each time step, it takes a word represented as a continuous vector (often called a word embedding). In sharp contrast to pre-neural approaches, where words were often treated as symbolic features, word embeddings provide a more robust and enriched representation of words, capturing their meaning, semantic relationships, and distributional similarities (similarity of the contexts they appear in).
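As a rough illustration of what "a word represented as a continuous vector" means in practice, the sketch below looks words up in an embedding matrix and compares them with cosine similarity. The vocabulary and the three-dimensional vectors are hypothetical, hand-picked values; real embeddings are learned from data and usually have far more dimensions.

```python
import numpy as np

# Hypothetical toy vocabulary and hand-picked vectors (not learned embeddings)
vocab = {"battery": 0, "charger": 1, "sink": 2}
E = np.array([
    [0.9, 0.8, 0.1],   # battery
    [0.8, 0.9, 0.2],   # charger
    [0.1, 0.2, 0.9],   # sink
])

def embed(words):
    """Map a list of words to their embedding vectors (one row per word)."""
    return E[[vocab[w] for w in words]]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

battery, charger, sink = embed(["battery", "charger", "sink"])
print(cosine(battery, charger) > cosine(battery, sink))   # True
```

A symbolic feature representation would treat all three words as equally distinct, whereas the continuous vectors allow graded comparisons between them.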
Figure 3.7 shows a two-dimensional representation of word embeddings. As can be seen, words that often occur in a similar context (e.g., “battery” and “charger”) are mapped closer to each other than words that do not occur in a similar context (e.g., “battery” and “sink”). Word embeddings give a notion of similarity among words that look very different from each other in their surface forms. Due to this continuous representation,