(4.25)
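As initial values, we define c0 = h0 = 0. After processing the full sequence, a probability distribution over C classes is specified by p, with
$$ p_i = \mathrm{SoftMax}(W h_T)_i = \frac{\exp(W_i h_T)}{\sum_{j=1}^{C} \exp(W_j h_T)}, \tag{4.26} $$
where Wi is the i‐th row of the matrix W.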
Decomposing the output of an LSTM: We now decompose the numerator of pi in Eq. (4.26) into a product of factors and show that we can interpret those factors as the contribution of individual words to the predicted probability of class i. Define
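$$ \beta_{i,j} = \exp\Big( W_i \big( o_T \odot ( \tanh(c_j) - \tanh(c_{j-1}) ) \big) \Big), \tag{4.27} $$
so that
$$ \exp(W_i h_T) = \exp\Big( \sum_{j=1}^{T} W_i \big( o_T \odot ( \tanh(c_j) - \tanh(c_{j-1}) ) \big) \Big) = \prod_{j=1}^{T} \beta_{i,j}. $$
The sum telescopes to oT ⊙ tanh(cT) = hT because c0 = 0. Since tanh(cj) − tanh(cj−1) can be viewed as the update resulting from word j, βi,j can be interpreted as the multiplicative contribution to pi by word j.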
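The factorization can be checked numerically. The following is a minimal sketch, assuming βi,j = exp(Wi (oT ⊙ (tanh(cj) − tanh(cj−1)))) and hT = oT ⊙ tanh(cT); the cell states, output gate, and weights are random stand-ins rather than values from a trained network, and the names `beta`, `products`, and `targets` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, C = 5, 4, 3  # toy sequence length, hidden size, number of classes

# Hypothetical stand-ins for the internals of a trained LSTM.
c = np.vstack([np.zeros(H), rng.normal(size=(T, H))])  # c[0] = c_0 = 0, c[j] = c_j
o_T = rng.uniform(size=H)                               # output gate at the final step
W = rng.normal(size=(C, H))                             # output weight matrix

h_T = o_T * np.tanh(c[T])                               # final hidden state


def beta(i, j):
    """Multiplicative contribution of word j (1-based) to class i."""
    return np.exp(W[i] @ (o_T * (np.tanh(c[j]) - np.tanh(c[j - 1]))))


# The product of the per-word factors should recover the softmax numerator.
products = np.array([np.prod([beta(i, j) for j in range(1, T + 1)]) for i in range(C)])
targets = np.exp(W @ h_T)
assert np.allclose(products, targets)
```

The identity relies only on the telescoping sum of the tanh differences and on c0 = 0, so it holds for any cell-state sequence, which is why random values suffice for the check.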
An additive decomposition of the LSTM Cell: We will show below that βi,j captures some notion of the importance of a word to the LSTM’s output. However, these terms fail to account for how the information contributed by word j is affected by the LSTM’s forget gates between words j and T. Consequently, it was found empirically [93] that the importance scores from this approach often yield a considerable number of false positives. A more nuanced approach is obtained by considering the additive decomposition of cT in Eq. (4.28), where each term ej can be interpreted as the contribution to the cell state cT by word j. By iterating the cell-state update ct = ft ⊙ ct−1 + it ⊙ c̃t, with ft the forget gate, it the input gate, and c̃t the candidate cell state, and using c0 = 0, we obtain
$$ c_T = \sum_{j=1}^{T} \Big( \prod_{k=j+1}^{T} f_k \Big) \odot i_j \odot \tilde{c}_j = \sum_{j=1}^{T} e_j. \tag{4.28} $$
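This suggests a natural definition of an alternative score to βi,j, corresponding to augmenting the cj terms with the products of the forget gates to reflect the upstream changes made to cj after initially processing word j:
$$ \gamma_{i,j} = \exp\Big( W_i \big( o_T \odot ( \tanh( ( \textstyle\prod_{k=j+1}^{T} f_k ) \odot c_j ) - \tanh( ( \textstyle\prod_{k=j}^{T} f_k ) \odot c_{j-1} ) ) \big) \Big). \tag{4.29} $$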
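Both the additive decomposition of cT and the forget-gate-adjusted score can be checked on a toy example. This is a sketch under stated assumptions: the cell update is taken to be ct = ft ⊙ ct−1 + it ⊙ c̃t with c0 = 0, the adjusted score (called `gamma` here) replaces cj by (fj+1 ⊙ ... ⊙ fT) ⊙ cj inside the tanh terms, and the gate values are drawn at random instead of being computed from inputs, since the identities hold for any gate values. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, C = 6, 4, 2  # toy sequence length, hidden size, number of classes

# Hypothetical gate activations; index 0 is unused so that index t means step t.
f = rng.uniform(size=(T + 1, H))                # forget gates
i_gate = rng.uniform(size=(T + 1, H))           # input gates
c_tilde = np.tanh(rng.normal(size=(T + 1, H)))  # candidate cell states
o_T = rng.uniform(size=H)                       # output gate at the final step
W = rng.normal(size=(C, H))                     # output weight matrix

# Run the cell-state recurrence from c_0 = 0.
c = np.zeros((T + 1, H))
for t in range(1, T + 1):
    c[t] = f[t] * c[t - 1] + i_gate[t] * c_tilde[t]
h_T = o_T * np.tanh(c[T])


def forget_product(start):
    """Elementwise product f_start * ... * f_T (all ones when start > T)."""
    return np.prod(f[start:T + 1], axis=0)


# Additive decomposition: e[j-1] is the contribution of word j to c_T.
e = np.array([forget_product(j + 1) * i_gate[j] * c_tilde[j] for j in range(1, T + 1)])
assert np.allclose(e.sum(axis=0), c[T])


def gamma(i, j):
    """Forget-gate-adjusted contribution of word j (1-based) to class i."""
    after = np.tanh(forget_product(j + 1) * c[j])
    before = np.tanh(forget_product(j) * c[j - 1])
    return np.exp(W[i] @ (o_T * (after - before)))


# The adjusted scores still multiply out to the softmax numerator.
gamma_products = np.array(
    [np.prod([gamma(i, j) for j in range(1, T + 1)]) for i in range(C)]
)
assert np.allclose(gamma_products, np.exp(W @ h_T))
```

In this sketch the rescaled cell state (fj+1 ⊙ ... ⊙ fT) ⊙ cj equals the partial sum e1 + ... + ej, so the adjusted score for word j measures the change in tanh when ej is added to the contributions of the earlier words.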
We now introduce a technique for using our variable importance scores to extract phrases from a trained LSTM. To do so, we search for phrases that consistently provide a large contribution to the prediction of a particular class relative to other classes. The utility of these patterns is validated by using them as input for a rules‐based classifier. For simplicity, we focus on the binary classification case.
Phrase extraction: A phrase can be reasonably described as predictive if, whenever it occurs, it causes a document to both be labeled as a particular class and not be labeled as any other. As our importance scores introduced above correspond to the contribution of particular words to class predictions, they can be used to score potential patterns by looking at a pattern’s average contribution to the prediction of a given class relative to other classes. In other words, given a collection of D documents