with the initial conditions
(3.70)
With this notation, the gradient updating equation regarding the recurrent neuron can be symbolically expressed as
(3.71)
where Wα denotes the set of those entries in W that correspond to the feedback connections.
3.4.3 Advanced RNN Architectures
The most popular RNN architectures for sequence learning evolved from long short‐term memory (LSTM) [8] and bidirectional recurrent neural networks (BRNNs) [9] schemes. The former introduces the memory cell, a unit of computation that replaces traditional nodes in the hidden layer of a network. With these memory cells, networks are able to overcome difficulties with training encountered by earlier recurrent networks. The latter introduces an architecture in which information from both the future and the past are used to determine the output at any point in the sequence. This is in contrast to previous networks, in which only past input can affect the output, and has been used successfully for sequence labeling tasks in natural language processing, among others. The two schemes are not mutually exclusive, and have been successfully combined for phoneme classification [10] and handwriting recognition [11]. In this section, we explain the LSTM and BRNN, and we describe the neural Turing machine (NTM), which extends RNNs with an addressable external memory [12].
LSTM scheme: This was introduced primarily in order to overcome the problem of vanishing gradients. This model resembles a standard RNN with a hidden layer, but each ordinary node in the hidden layer is replaced by a memory cell (Figure 3.19). Each memory cell contains a node with a self‐connected recurrent edge of fixed weight one, ensuring that the gradient can pass across many time steps without vanishing or exploding. To distinguish references to a memory cell and not an ordinary node, we use the subscript c.
Simple RNNs have long‐term memory in the form of weights. The weights change slowly during training, encoding general knowledge about the data. They also have short‐term memory in the form of ephemeral activations, which pass from each node to successive nodes. The LSTM model introduces an intermediate type of storage via the memory cell. A memory cell is a composite unit, built from simpler nodes in a specific connectivity pattern, with the novel inclusion of multiplicative nodes, represented in diagrams by the letter X. All elements of the LSTM cell are enumerated and described below.
Figure 3.19 A long short‐term memory (LSTM) memory cell.
Note that when we use vector notation, we are referring to the values of the nodes in an entire layer of cells. For example, s is a vector containing the value of sc at each memory cell c in a layer. When the subscript c is used, it is to index an individual memory cell.
Input node: This unit, labeled gc , is a node that takes activation in the standard way from the input layer x(t) at the current time step and (along recurrent edges) from the hidden layer at the previous time step h(t − 1). Typically, the summed weighted input is run through a tanh activation function, although in the original LSTM paper, the activation function is a sigmoid.
Input gate: Gates are a distinctive feature of the LSTM approach. A gate is a sigmoidal unit that, like the input node, takes activation from the current data point x(t) as well as from the hidden layer at the previous time step. A gate is so called because its value is used to multiply the value of another node. It is a gate in the sense that if its value is 0, then flow from the other node is cut off. If the value of the gate is 1, all flow is passed through. The value of the input gate ic multiplies the value of the input node.
Internal state: At the heart of each memory cell is a node sc with linear activation, which is referred to in the original work as the “internal state” of the cell. The internal state sc has a self‐connected recurrent edge with fixed unit weight. Because this edge spans adjacent time steps with constant weight, error can flow across time steps without vanishing or exploding. This edge is often called the constant error carousel. In vector notation, the update for the internal state is s(t) = g(t) ⊙ i(t) + s(t − 1) where ⊙ is pointwise multiplication.
Forget gate: These gates fc were introduced to provide a method by which the network can learn to flush the contents of the internal state. This is especially useful in continuously running networks. With forget gates, the equation to calculate the internal state on the forward pass is s(t) = g(t) ⊙ i(t) + f (t) ⊙ s(t − 1).
Output gate: The value vc ultimately produced by a memory cell is the value of the internal state sc multiplied by the value of the output gate oc . It is customary that the internal state first be run through a tanh activation function, as this gives the output of each cell the same dynamic range as an ordinary tanh hidden unit. However, in other works, rectified linear units, which have a greater dynamic range, are easier to train. So it seems plausible that the nonlinear function on the internal state might be omitted.
In the original paper and in most subsequent work, the input node is labeled g. We adhere to this convention but note that it may be confusing as g does not stand for gate. In the original paper, the gates