The last equation can be applied in both directions. Now we can write down

$$\operatorname{vec}(y) = \operatorname{vec}\bigl(\phi(x^l)\,F\bigr) = \bigl(I \otimes \phi(x^l)\bigr)\operatorname{vec}(F), \tag{3.89}$$

$$\operatorname{vec}(y) = \operatorname{vec}\bigl(\phi(x^l)\,F\bigr) = \bigl(F^T \otimes I\bigr)\operatorname{vec}\bigl(\phi(x^l)\bigr), \tag{3.90}$$

where I is an identity matrix of appropriate size. In Eq. (3.89), the size of I is determined by the number of columns in F, and hence I ∈ ℝ^{D × D}. In Eq. (3.90), the size of I is determined by the number of rows in ϕ(x^l), and hence I ∈ ℝ^{H^{l+1}W^{l+1} × H^{l+1}W^{l+1}}.
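Both identities can be verified numerically. The following is a minimal sketch (not from the original text) using NumPy with hypothetical small sizes; vec(·) stacks columns (column-major order), which is the convention under which Eqs. (3.89) and (3.90) hold.

```python
import numpy as np

# Hypothetical sizes: H^{l+1}W^{l+1} = 6, HWD^l = 12, D = 4.
rng = np.random.default_rng(0)
HW1, HWD, D = 6, 12, 4
phi = rng.standard_normal((HW1, HWD))  # stands in for the im2row matrix phi(x^l)
F = rng.standard_normal((HWD, D))      # D kernels, one per column

vec = lambda A: A.reshape(-1, order="F")  # column-stacking vectorization

y = phi @ F                                    # y = phi(x^l) F
lhs89 = np.kron(np.eye(D), phi) @ vec(F)       # Eq. (3.89): (I kron phi) vec(F)
lhs90 = np.kron(F.T, np.eye(HW1)) @ vec(phi)   # Eq. (3.90): (F^T kron I) vec(phi)
assert np.allclose(lhs89, vec(y)) and np.allclose(lhs90, vec(y))
```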
Update the parameters – backward propagation: First, we need to compute ∂z/∂vec(x^l) and ∂z/∂vec(F), where the first term will be used for backward propagation to the previous ((l − 1)-th) layer, and the second term will determine how the parameters of the current (l-th) layer will be updated. Keep in mind that f, F, and w^l refer to the same thing (modulo reshaping of the vector or matrix or tensor). Similarly, we can reshape y into a matrix Y ∈ ℝ^{H^{l+1}W^{l+1} × D^{l+1}}. The variables used in this derivation are summarized in Table 3.3.
Table 3.3 Variables used in the derivation of the gradients (the glyphs ϕ and φ denote the same im2row mapping).

| Symbol | Alias | Size and meaning |
|---|---|---|
| X | x^l | H^l W^l × D^l, the input tensor |
| F | f, w^l | HW D^l × D, D kernels, each H × W with D^l channels |
| Y | y, x^{l+1} | H^{l+1} W^{l+1} × D^{l+1}, the output, with D^{l+1} = D |
| ϕ(x^l) | | H^{l+1} W^{l+1} × HW D^l, the im2row expansion of x^l |
| M | | H^{l+1} W^{l+1} HW D^l × H^l W^l D^l, the indicator matrix for ϕ(x^l) |
| ∂z/∂Y | | H^{l+1} W^{l+1} × D^{l+1}, gradient for y |
| ∂z/∂F | | HW D^l × D, gradient to update the convolution kernels |
| ∂z/∂X | | H^l W^l × D^l, gradient for x^l, useful for back propagation |
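To make ϕ(x^l) concrete, here is a minimal im2row sketch; this is our own illustrative code, assuming stride 1 and no padding, and the name `im2row` and the patch-flattening order are conventions chosen for this example (the flattening order only has to match the ordering of the rows of F).

```python
import numpy as np

def im2row(x, H, W):
    # x: input tensor of shape (H_l, W_l, D_l); stride 1, no padding assumed.
    # Each output row is one flattened H x W x D_l receptive field, so the
    # result has shape (H^{l+1} W^{l+1}, HW D^l) as listed in Table 3.3.
    H_l, W_l, D_l = x.shape
    H_out, W_out = H_l - H + 1, W_l - W + 1
    rows = np.empty((H_out * W_out, H * W * D_l))
    for i in range(H_out):
        for j in range(W_out):
            rows[i * W_out + j] = x[i:i + H, j:j + W, :].reshape(-1)
    return rows

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 3))       # hypothetical H^l = W^l = 5, D^l = 3
F = rng.standard_normal((2 * 2 * 3, 4))  # D = 4 kernels, each 2 x 2 with 3 channels
Y = im2row(x, 2, 2) @ F                  # Y = phi(x^l) F, shape (16, 4)
```

With this expansion, the forward pass of the convolution layer is the single matrix product Y = ϕ(x^l) F underlying Eqs. (3.89) and (3.90).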
From the chain rule, it is easy to compute ∂z/∂vec(F) as

$$\frac{\partial z}{\partial (\operatorname{vec}(F))^T} = \frac{\partial z}{\partial (\operatorname{vec}(y))^T}\,\frac{\partial \operatorname{vec}(y)}{\partial (\operatorname{vec}(F))^T}. \tag{3.91}$$
The first term on the right-hand side of Eq. (3.91) has already been computed in the (l + 1)-th layer as ∂z/∂(vec(x^{l+1}))^T. Based on Eq. (3.89), we have
$$\frac{\partial \operatorname{vec}(y)}{\partial (\operatorname{vec}(F))^T} = I \otimes \phi(x^l). \tag{3.92}$$
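A practical consequence of Eq. (3.92) is that the Kronecker product never needs to be formed explicitly: substituting it into Eq. (3.91) gives ∂z/∂(vec(F))^T = ∂z/∂(vec(y))^T (I ⊗ ϕ(x^l)), which reshapes to a plain matrix product ϕ(x^l)^T (∂z/∂Y) of exactly the HW D^l × D size listed in Table 3.3. The sketch below checks this collapse numerically, under the same column-major vec convention and hypothetical sizes as before.

```python
import numpy as np

# Check: vec(dz/dY)^T (I kron phi) equals vec(phi^T dz/dY)^T,
# i.e. Eqs. (3.91)-(3.92) reduce to an ordinary matrix product.
rng = np.random.default_rng(1)
HW1, HWD, D = 6, 12, 4
phi = rng.standard_normal((HW1, HWD))  # im2row matrix phi(x^l)
dz_dY = rng.standard_normal((HW1, D))  # gradient for y, from the (l+1)-th layer

vec = lambda A: A.reshape(-1, order="F")

kron_form = vec(dz_dY) @ np.kron(np.eye(D), phi)  # Eq. (3.91) with (3.92) plugged in
matrix_form = vec(phi.T @ dz_dY)                  # same result, no Kronecker product
assert np.allclose(kron_form, matrix_form)
```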