where Θ ∈ ℝC × F is a matrix of filter parameters, and Z ∈ ℝN × F is the convolved signal matrix. The first convolutional network for DGCN is the same as Eq. (5.15). The second network replaces the adjacency matrix with a positive pointwise mutual information (PPMI) matrix:
(5.16)
where XP is the PPMI matrix, and DP is the diagonal degree matrix of XP.
When dealing with large‐scale networks, low‐dimensional vector embeddings of nodes in large graphs have proved extremely useful as feature inputs for a wide variety of prediction and graph analysis tasks [18–22]. The basic idea behind node‐embedding approaches is to use dimensionality reduction techniques to distill the high‐dimensional information about a node’s graph neighborhood into a dense vector embedding. These node embeddings can then be fed to downstream machine learning systems and aid in tasks such as node classification, clustering, and link prediction [19–21].
GraphSAGE (SAmple and aggreGatE) [23] is a general inductive framework. The framework generates embeddings by sampling and aggregating features from a node’s local neighborhood.
However, the original work utilizes in Eq. (5.17) only a fixed‐size set of neighbors by uniformly sampling. Three aggregator functions are used.
The mean aggregator could be viewed as an approximation of the convolutional operation from the transductive GCN framework [14], so that the inductive version of the GCN variant could be derived by
(5.18)
The mean aggregator is different from other aggregators because it does not perform the concatenation operation that concatenates
The long short‐term memory (LSTM) aggregator, which has a larger expressive capability, is also used. However, LSTMs process inputs in a sequential manner so that they are not permutation invariant. Reference [23] adapts LSTMs to operate on an unordered set by permutating the node’s neighbors.
Pooling aggregator: In the pooling aggregator, each neighbor’s hidden state is fed through a fully connected layer, after which a max ‐pooling operation is applied to the set of the node’s neighbors:
(5.19)
Note that any symmetric function could be used in place of the max ‐pooling operation here. The operation of GraphSAGE is illustrated in Figure 5.1 [23].
Gate: Several works have attempted to use a gate mechanism such as gate recurrent units (GRUs) [25] or LSTM [26] in the propagation step to mitigate the restrictions in the former GNN models and improve the long‐term propagation of information across the graph structure.
Gated graph neural network (GGNN) [27] uses GRUs in the propagation step, unrolls the recurrence for a fixed number of steps T, and uses backpropagation through time in order to compute gradients. So, the propagation model can be presented as
(5.20)
Figure 5.1 Operation of GraphSAGE: (a) sample neighborhood, (b) aggregate feature information from neighbors, (c) predict graph context and label using aggregated information.
Source: Hamilton et al. [23].
The node v first aggregates message from its neighbors, where Av is the submatrix of the graph adjacency matrix A and denotes the connection of node v with its neighbors. The GRU‐like update functions incorporate information from the other nodes and from the previous time step to update each node’s hidden state. Matrix a gathers the neighborhood information of node v, and z and r are the update and reset gates.
LSTM architecture extensions, referred to as the Child‐Sum Tree‐LSTM and the N‐ary Tree‐LSTM, are presented in [28]. As in standard LSTM units, each Tree‐LSTM unit (indexed by v) contains input and output gates iv and ov, a memory cell cv , and a hidden state hv . Instead of a single forget gate, the Tree‐LSTM unit contains one forget gate fvk for each child k, allowing the unit to selectively incorporate information from each child. The Child‐Sum Tree‐LSTM transition equations are given as
(5.21)
(5.22)
The introduction of separate parameter matrices for each child k allows the model to learn more fine‐grained representations conditioning on the states of a unit’s children than the Child‐Sum Tree‐LSTM.
The two types of Tree‐LSTMs can be easily adapted to the graph. The graph‐structured LSTM in [29] is an example of the N‐ary Tree‐LSTM applied to the graph. However, it is a simplified version since each node in the graph has at most two incoming edges (from its parent and sibling predecessor). Reference [30] proposed another variant of the Graph LSTM based on the relation extraction task. The main difference between graphs and trees is that edges of graphs have labels. Work in [30] utilizes different weight matrices to represent different labels:
(5.23)
where m(v, k) denotes the edge label between node v and k.
The attention mechanism has been successfully used in many sequence‐based tasks such as machine translation [31–33] and machine reading [34]. Work in [35]