(5.86)
holds. Thus, (vec(Ru,v))′ · ∂vec(An,u)/∂w is the sum of four contributions. In order to derive a method for computing those terms, let Ia denote the a × a identity matrix, let ⊗ denote the Kronecker product, and let Pa be the a² × a matrix such that vec(diag(v)) = Pa·v holds for any vector v ∈ R^a. By the properties of the Kronecker product, vec(AB) = (B′ ⊗ Ia) · vec(A) holds for matrices A, B, and Ia having compatible dimensions [67]. Thus, we have
which implies
Similarly, using the properties vec(ABC) = (C′ ⊗ A) · vec(B) and vec(AB) = (Ia ⊗ A) · vec(B), it follows that
where dh is the number of hidden neurons. Then, we have
(5.89)
where the aforementioned Kronecker product properties have been used.
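As a sanity check on the Kronecker product identities invoked above, the following minimal NumPy sketch (not part of the original derivation) verifies vec(AB) = (B′ ⊗ I)·vec(A), vec(ABC) = (C′ ⊗ A)·vec(B), and vec(diag(v)) = Pa·v on random matrices. The helper names vec and P are illustrative, vec is taken column-major, and each identity matrix is assumed to have the compatible dimension.

```python
import numpy as np

def vec(M):
    # Column-major (Fortran-order) vectorization: stack the columns of M.
    return M.reshape(-1, order="F")

def P(a):
    # The a^2 x a matrix with vec(diag(v)) = P(a) @ v for every v in R^a.
    # Only a of its entries are nonzero, so a sparse representation is cheap.
    Pa = np.zeros((a * a, a))
    Pa[np.arange(a) * a + np.arange(a), np.arange(a)] = 1.0
    return Pa

rng = np.random.default_rng(0)
a, b, c = 3, 4, 5
A = rng.standard_normal((a, b))
B = rng.standard_normal((b, c))
C = rng.standard_normal((c, 2))
v = rng.standard_normal(a)

# vec(AB) = (B' (x) I_a) vec(A), I_a matching the number of rows of A
assert np.allclose(vec(A @ B), np.kron(B.T, np.eye(a)) @ vec(A))
# vec(ABC) = (C' (x) A) vec(B)
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))
# vec(AB) = (I (x) A) vec(B), I matching the number of columns of B
assert np.allclose(vec(A @ B), np.kron(np.eye(c), A) @ vec(B))
# vec(diag(v)) = P_a v
assert np.allclose(vec(np.diag(v)), P(a) @ v)
```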
It follows that (vec(Ru,v))′ · ∂vec(An,u)/∂w can be written as the sum of the four contributions represented by Eqs. (5.87)–(5.90). The second and the fourth terms – Eqs. (5.88) and (5.90) – can be computed directly using the corresponding formulas. The first one can be calculated by observing that
holds, where
Required number of operations: The above method involves two tasks: the matrix multiplications of Eqs. (5.87)–(5.90) and the backpropagation defined by Eq. (5.91). The former consists of several matrix multiplications; by inspection of Eqs. (5.87)–(5.90), the number of floating point operations is approximately 2s² + 12s·hi_h + 10s²·hi_h, where hi_h denotes the number of hidden-layer neurons implementing the function h. The second task has approximately the same cost as a backpropagation phase through the original function hw. These estimates are obtained from the following observations: for an a × b matrix C and a b × c matrix D, the multiplication CD requires approximately 2abc operations, more precisely abc multiplications and ac(b − 1) sums. If D is a diagonal b × b matrix, then CD requires 2ab operations. Similarly, if C is an a × b matrix, D is a b × a matrix, and Pa is the a² × a matrix defined above and used in Eqs. (5.87)–(5.90), then computing (vec(CD))′·Pa costs only 2ab operations, provided that a sparse representation is used for Pa: since Pa selects just the diagonal entries of CD, only those a entries have to be computed. Finally, a1, a2, a3 are already available, since they are computed during the forward phase of the learning algorithm. Thus, the complexity of computing ∂pw/∂w is
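The 2ab count for the term involving Pa can be checked numerically. The following small NumPy sketch (illustrative variable names; Pa is built densely only for the comparison) confirms that (vec(CD))′·Pa coincides with the diagonal of CD computed directly with about 2ab operations:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 4, 6
C = rng.standard_normal((a, b))
D = rng.standard_normal((b, a))

# P_a, stored densely here only for the check; in practice it is kept sparse.
Pa = np.zeros((a * a, a))
Pa[np.arange(a) * a + np.arange(a), np.arange(a)] = 1.0

# Dense route: form CD (about 2*a*a*b flops), vectorize, project with P_a.
dense = (C @ D).reshape(-1, order="F") @ Pa

# Cheap route: P_a only picks the diagonal of CD, so the a diagonal entries
# suffice -- a inner products of length b, i.e. about 2*a*b flops.
cheap = np.einsum("ij,ji->i", C, D)

assert np.allclose(dense, cheap)
```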
1 Instructions b = (∂ew/∂o)(∂Gw/∂x)(x, lN) and c = (∂ew/∂o)(∂Gw/∂w)(x, lN): The terms b and c can be calculated by backpropagating ∂ew/∂o through the network that implements gw. Since this operation must be repeated for each node, the time complexity of the two instructions is the same for all the GNN models.
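For concreteness, here is a minimal NumPy sketch of this per-node backpropagation, under the assumption that gw is a one-hidden-layer tanh network; the names g_forward and g_backprop, the placeholder error signals, and all dimensions are illustrative rather than taken from the text.

```python
import numpy as np

def g_forward(xn, W1, W2):
    # Illustrative one-hidden-layer output network g_w: o_n = W2 tanh(W1 x_n).
    h = np.tanh(W1 @ xn)
    return h, W2 @ h

def g_backprop(de_do, xn, h, W1, W2):
    # Backpropagate de/do through g_w for one node: returns de/dx_n
    # (a piece of b) and the weight gradients (a piece of c).
    dpre = (W2.T @ de_do) * (1.0 - h ** 2)       # through the tanh layer
    de_dx = W1.T @ dpre                          # contribution to b
    return de_dx, np.outer(dpre, xn), np.outer(de_do, h)

rng = np.random.default_rng(2)
s, hid, m, n_nodes = 5, 8, 3, 10                 # state size, hidden units, outputs, nodes
W1 = rng.standard_normal((hid, s))
W2 = rng.standard_normal((m, hid))
b = np.zeros((n_nodes, s))
cW1, cW2 = np.zeros_like(W1), np.zeros_like(W2)

# One backward pass through g_w per node, accumulating the weight gradients.
for n in range(n_nodes):
    xn = rng.standard_normal(s)                  # state of node n (placeholder)
    h, o = g_forward(xn, W1, W2)
    de_do = o - rng.standard_normal(m)           # placeholder error signal de/do_n
    b[n], dW1, dW2 = g_backprop(de_do, xn, h, W1, W2)
    cW1 += dW1
    cW2 += dW2
```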
2 Instruction d = z(t)(∂Fw/∂w)(x, l): By the definition of Fw, fw, and BP, we have
(5.92)
where y = [ln, xu, l(n, u), lu] and BP1 indicates that we are considering only the first part of the output of BP. Similarly
(5.93)
where y = [ln, xu, l(n, u), lu]. These two equations provide a direct method to compute d in positional and nonlinear GNNs, respectively.
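For the nonlinear case of Eq. (5.93), a minimal NumPy sketch of this computation is given below, under the assumption that hw is a one-hidden-layer tanh network; here BP returns the gradients with respect to the weights and to the input, BP1 keeps only the weight part, and the contributions are summed over the arcs. All names, dimensions, and placeholder data are illustrative.

```python
import numpy as np

def h_forward(y, V1, V2):
    # Illustrative transition network h_w with one tanh hidden layer.
    a1 = np.tanh(V1 @ y)
    return a1, V2 @ a1

def BP(delta, y, a1, V1, V2):
    # Backpropagation of the vector delta through h_w: returns the gradients
    # with respect to the weights and with respect to the input y.
    da1 = (V2.T @ delta) * (1.0 - a1 ** 2)
    return (np.outer(da1, y), np.outer(delta, a1)), V1.T @ da1

def BP1(*args):
    # Keep only the first part of the output of BP (the weight gradients).
    return BP(*args)[0]

# d = sum over the arcs (n, u) of BP1(z_n, y) with y = [l_n, x_u, l_(n,u), l_u].
rng = np.random.default_rng(3)
s, d_in, hid = 4, 12, 6                      # state size, input size of h_w, hidden units
V1 = rng.standard_normal((hid, d_in))
V2 = rng.standard_normal((s, hid))
dV1, dV2 = np.zeros_like(V1), np.zeros_like(V2)

arcs = [(0, 1), (1, 2), (2, 0)]              # toy edge list
z = rng.standard_normal((3, s))              # z_n(t) for each node n
for n, u in arcs:
    y = rng.standard_normal(d_in)            # placeholder for [l_n, x_u, l_(n,u), l_u]
    a1, _ = h_forward(y, V1, V2)
    gV1, gV2 = BP1(z[n], y, a1, V1, V2)
    dV1 += gV1
    dV2 += gV2
```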
For linear GNNs, let