Table 3.1 Multi-layer network notation.

w_{ji}^l | Weight connecting neuron i in layer l − 1 to neuron j in layer l
b_j^l    | Bias weight for neuron j in layer l
s_j^l    | Summing junction for neuron j in layer l
a_j^l    | Activation (output) value for neuron j in layer l
x_i      | i-th external input to the network
y_i      | i-th output of the network
Define an input vector x = [x0, x1, x2, … xN] and an output vector y = [y0, y1, y2, … yM]. The network maps the input x to the output y through y = N(w, x), using the weights w. Since fixed weights are used, this mapping is static; there are no internal dynamics. Still, the network is a powerful computational tool.
It has been shown that, with two or more layers and a sufficient number of internal neurons, such a network can approximate any uniformly continuous function to acceptable accuracy. Performance rests on how this "universal function approximator" is utilized.
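As an illustration of the static mapping y = N(w, x), the forward pass of a small two-layer network (one tanh hidden layer, linear output) can be sketched as below; the function names, layer sizes, and weight values are all illustrative, not taken from the text:

```python
import math

def forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer network.
    x: input list; W1, b1: hidden-layer weights and biases (one row per
    hidden neuron); W2, b2: output-layer weights and biases."""
    # Hidden layer: a_j = tanh( sum_i w_ji * x_i + b_j )
    a = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Linear output layer: y_k = sum_j w_kj * a_j + b_k
    return [sum(w * aj for w, aj in zip(row, a)) + b
            for row, b in zip(W2, b2)]

# With fixed weights the mapping is static: the same x always gives the same y.
x = [0.5, -1.0]
W1 = [[0.2, -0.4], [0.7, 0.1]]; b1 = [0.0, -0.3]
W2 = [[1.0, -2.0]];             b2 = [0.5]
y = forward(x, W1, b1, W2, b2)
```

Widening the hidden layer (more rows in W1) is what the universal-approximation result refers to: enough hidden neurons allow the mapping to approximate any uniformly continuous target function.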
3.1.2 Weight Optimization
The specific mapping realized by a network is obtained through an appropriate choice of weight values. Optimizing the set of weights is referred to as network training. An example of a supervised learning scheme is shown in Figure 3.3. A training set of input vectors paired with their desired output vectors, {(x1, d1), … (xP, dP)}, is provided. The difference between the desired output and the actual output of the network, for a given input vector x, is defined as the error
e = d − y = d − N(w, x)    (3.3)
The overall objective function to be minimized over the training set is the summed squared error

J(w) = Σ_{p=1}^{P} e_p^T e_p,  with e_p = d_p − N(w, x_p)    (3.4)
Training should find the set of weights w that minimizes the cost J subject to the constraint of the network topology. Training a neural network is therefore a standard optimization problem.
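A minimal sketch of evaluating the cost J over a training set follows, with a generic callable standing in for N(w, x); the single-neuron example and its data are illustrative:

```python
def total_cost(net, w, training_set):
    """Summed squared error J(w) = sum_p e_p^T e_p over the training set,
    where e_p = d_p - net(w, x_p).  `net` plays the role of N(w, x)."""
    J = 0.0
    for x, d in training_set:
        y = net(w, x)
        e = [di - yi for di, yi in zip(d, y)]   # error for this sample
        J += sum(ei * ei for ei in e)           # accumulate e^T e
    return J

# Toy network: one linear neuron, w = [w0, w1], output w0 + w1*x1.
lin = lambda w, x: [w[0] + w[1] * x[0]]
data = [([0.0], [1.0]), ([1.0], [3.0])]   # targets generated by d = 1 + 2*x1
```

With the exact weights [1.0, 2.0] the cost is zero; any other choice gives J > 0, which is the quantity the training procedure drives down.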
A stochastic gradient descent (SGD) algorithm is one option as the optimization method. For each sample from the training set, the weights are adapted as

Δw = −μ ∂(e^T e)/∂w    (3.5)

where μ > 0 is the learning rate (step size).
Backpropagation: This is a standard way to find the gradient ∂(e^T e)/∂w in a multi-layer network, by propagating error terms backward from the output layer.
Single neuron case – Consider first a single linear neuron, which we may describe compactly as
y = w^T x    (3.6)
where w = [w0, w1, … wN] and x = [1, x1, … xN]. In this simple setup
Figure 3.3 Schematic representation of supervised learning.
∂e²/∂w = 2e ∂e/∂w = −2ex    (3.7)
so that Δw = 2μex. From this, we have Δw_i = 2μe x_i, which is the least mean square (LMS) algorithm.
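The LMS update Δw_i = 2μe x_i can be sketched as follows for a single linear neuron; the training data, learning rate, and epoch count are illustrative:

```python
def lms_train(samples, n_inputs, mu=0.05, epochs=200):
    """LMS: for each sample, form y = w^T x with x augmented by a leading 1,
    compute e = d - y, and update w_i <- w_i + 2*mu*e*x_i."""
    w = [0.0] * (n_inputs + 1)                        # w = [w0, w1, ..., wN]
    for _ in range(epochs):
        for x, d in samples:
            xa = [1.0] + list(x)                      # x = [1, x1, ..., xN]
            y = sum(wi * xi for wi, xi in zip(w, xa)) # single linear neuron
            e = d - y                                 # error, as in (3.3)
            w = [wi + 2 * mu * e * xi for wi, xi in zip(w, xa)]
    return w

# Noiseless samples from d = 0.5 + 1.5*x1 (illustrative data).
samples = [([k * 0.1], 0.5 + 1.5 * k * 0.1) for k in range(-10, 11)]
w = lms_train(samples, n_inputs=1)
```

Because the data are consistent with a linear model, the per-sample updates drive the error to zero and w converges to the generating weights [0.5, 1.5].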
In a multi-layer network, this procedure extends formally through repeated application of the chain rule. Writing the summing junction of neuron j in layer l as s_j^l = Σ_i w_{ji}^l a_i^{l−1}, where a_i^{l−1} is the activation of neuron i in layer l − 1, the gradient with respect to an individual weight factors as

∂(e^T e)/∂w_{ji}^l = [∂(e^T e)/∂s_j^l] · [∂s_j^l/∂w_{ji}^l] = [∂(e^T e)/∂s_j^l] · a_i^{l−1}    (3.8)
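The chain rule above can be sketched for a small network with one tanh hidden layer and a single linear output neuron; the function name, layer sizes, and numeric values are illustrative, and the analytic gradient is checked against a finite-difference estimate:

```python
import math

def backprop_grad(w1, b1, w2, x, d):
    """Gradient of e^2 for: s_j = sum_i w1[j][i]*x[i] + b1[j], a_j = tanh(s_j),
    y = sum_j w2[j]*a_j, e = d - y.  Chain rule as in (3.8):
    d(e^2)/dw1[j][i] = [d(e^2)/ds_j] * x[i]."""
    s = [sum(wji * xi for wji, xi in zip(row, x)) + bj
         for row, bj in zip(w1, b1)]
    a = [math.tanh(sj) for sj in s]
    y = sum(w2j * aj for w2j, aj in zip(w2, a))
    e = d - y
    dE_dy = -2.0 * e                                   # d(e^2)/dy
    g_w2 = [dE_dy * aj for aj in a]                    # output-layer gradient
    delta = [dE_dy * w2j * (1.0 - aj * aj)             # d(e^2)/ds_j
             for w2j, aj in zip(w2, a)]
    g_w1 = [[dj * xi for xi in x] for dj in delta]     # hidden-layer gradient
    g_b1 = list(delta)                                 # bias input is 1
    return g_w1, g_b1, g_w2

# Illustrative values, plus a finite-difference check on one weight.
w1 = [[0.3, -0.2], [0.1, 0.4]]; b1 = [0.05, -0.1]; w2 = [0.7, -0.5]
x = [0.9, -0.4]; d = 0.3
g_w1, g_b1, g_w2 = backprop_grad(w1, b1, w2, x, d)

def sq_err(w1):
    s = [sum(wji * xi for wji, xi in zip(row, x)) + bj
         for row, bj in zip(w1, b1)]
    y = sum(w2j * math.tanh(sj) for w2j, sj in zip(w2, s))
    return (d - y) ** 2

eps = 1e-6
bumped = [row[:] for row in w1]; bumped[0][1] += eps
numeric = (sq_err(bumped) - sq_err(w1)) / eps          # numerical d(e^2)/dw1[0][1]
```

The agreement between `numeric` and `g_w1[0][1]` is the usual sanity check that the chain-rule factorization has been applied correctly.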