3.2 MULTI-LAYER PERCEPTRON
3.2.1 ARCHITECTURE BASICS
Figure 3.1 shows an example of an MLP network architecture that consists of three hidden layers sandwiched between an input and an output layer. In the simplest terms, the network can be treated as a black box that operates on a set of inputs and generates some outputs. We highlight some of the interesting aspects of this architecture in more detail below.
Figure 3.1: A simple feed-forward neural network with dense connections.
Layered Architecture: Neural networks comprise a hierarchy of processing levels. Each level is called a “network layer” and consists of a number of processing “nodes” (also called “neurons” or “units”). Typically, the input is fed through an input layer, and the final layer is the output layer, which makes predictions. The intermediate layers perform the processing and are referred to as the hidden layers. Due to this layered architecture, this neural network is called a multi-layer perceptron (MLP).
Nodes: The individual processing units in each layer are called the nodes of the neural network. Each node implements an “activation function” which, given an input, decides whether the node will fire or not.
Dense Connections: The nodes in a neural network are interconnected and can communicate with each other. Each connection has a weight which specifies the strength of the connection between two nodes. For the simple case of feed-forward neural networks, information flows sequentially in one direction, from the input layer to the output layer. With dense connections, each node in a layer is directly connected to all nodes in the immediately preceding layer.
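The layered, densely connected structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the layer widths, the random weight range, the sigmoid activation, and the helper names (`dense_layer`, `init_weights`, `forward`) are hypothetical choices that do not come from Figure 3.1, and biases are omitted for brevity.

```python
import math
import random


def sigmoid(a):
    """Smooth activation that squashes a into (0, 1); an assumed choice."""
    return 1.0 / (1.0 + math.exp(-a))


def dense_layer(inputs, weights, activation):
    """Dense connection: every output node sees every input.

    `weights` holds one row per output node and one column per input.
    """
    return [activation(sum(w * x for w, x in zip(row, inputs)))
            for row in weights]


def init_weights(n_out, n_in, rng):
    """Random weights in [-1, 1]; the range is arbitrary for this sketch."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
            for _ in range(n_out)]


def forward(x, layers, activation=sigmoid):
    """Feed-forward pass: information moves in one direction, layer by layer."""
    for weights in layers:
        x = dense_layer(x, weights, activation)
    return x


rng = random.Random(0)
sizes = [4, 5, 5, 5, 3]  # input, three hidden layers, output (assumed widths)
layers = [init_weights(n_out, n_in, rng)
          for n_in, n_out in zip(sizes, sizes[1:])]

outputs = forward([0.2, -0.1, 0.7, 1.0], layers)
n_weights = sum(len(row) for weights in layers for row in weights)
```

Even in this tiny network, `n_weights` counts one weight for every pair of nodes in adjacent layers, which hints at why the number of parameters grows quickly in practical architectures.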
3.2.2 PARAMETER LEARNING
As we described in Section 3.2.1, the weights of a neural network define the connections between neurons. These weights need to be set appropriately so that the desired output can be obtained from the network. The weights encode the “model” generated from the training data, which allows the network to perform a designated task (e.g., object detection, recognition, and/or classification). In practical settings, the number of weights is huge, so an automatic procedure is required to update their values appropriately for a given task. The process of automatically tuning the network parameters is called “learning,” and it takes place during the training stage (in contrast to the test stage, where inference/prediction is made on “unseen data,” i.e., data that the network has not “seen” during training). This process involves showing examples of the desired task to the network so that it can learn the right set of relationships between the inputs and the required outputs. For example, in the paradigm of supervised learning, the inputs can be media (speech, images) and the outputs are the desired set of “labels” (e.g., the identity of a person), which are used to tune the neural network parameters.
We now describe a basic form of learning algorithm, which is called the Delta Rule.
Delta Rule
The basic idea behind the delta rule is to learn from the mistakes of the neural network during the training phase. The delta rule, proposed by Widrow et al. [1960], updates the network parameters (i.e., the weights denoted by θ, assuming zero biases) based on the difference between the target output and the predicted output. This difference is calculated in terms of the Least Mean Square (LMS) error, which is why the delta learning rule is also referred to as the LMS rule. The output units are a “linear function” of the inputs denoted by x, i.e.,
$$p_i = \sum_j \theta_{ij} x_j$$
If $p_n$ and $y_n$ denote the predicted and target outputs, respectively, the error can be calculated as:
$$E = \frac{1}{2} \sum_n (y_n - p_n)^2 \tag{3.1}$$
where n indexes the categories in the dataset (equivalently, the neurons in the output layer). The delta rule calculates the gradient of this error function (Eq. (3.1)) with respect to the parameters of the network: $\partial E / \partial \theta_{ij}$. Given the gradient, the weights are updated iteratively according to the following learning rule:
$$\theta_{ij}^{t+1} = \theta_{ij}^{t} - \eta \frac{\partial E}{\partial \theta_{ij}}$$
where t denotes the previous iteration of the learning process. The hyper-parameter η denotes the step size of the parameter update along the direction given by the calculated gradient. No learning happens when the gradient or the step size is zero. In other cases, the parameters are updated such that the predicted outputs get closer to the target outputs. After a number of iterations, the network training process is said to converge when the parameters no longer change as a result of the update.
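To make the update concrete, the following is a minimal sketch of the delta rule in Python. It assumes linear output units with zero biases, the squared error of Eq. (3.1) with the conventional factor of 1/2, and a gradient-descent sign convention (the weights move against the gradient of the error). The function names and the single toy training example are hypothetical and serve only to illustrate the idea.

```python
def predict(theta, x):
    """Linear outputs: p_i = sum_j theta[i][j] * x[j] (zero biases)."""
    return [sum(t * xj for t, xj in zip(row, x)) for row in theta]


def lms_error(p, y):
    """Squared error: E = 1/2 * sum_n (y_n - p_n)^2."""
    return 0.5 * sum((yn - pn) ** 2 for pn, yn in zip(p, y))


def delta_rule_step(theta, x, y, eta):
    """One update: theta_ij <- theta_ij - eta * dE/dtheta_ij.

    For linear outputs, dE/dtheta_ij = -(y_i - p_i) * x_j, so the
    update adds eta * (y_i - p_i) * x_j to each weight.
    """
    p = predict(theta, x)
    return [[t + eta * (yi - pi) * xj for t, xj in zip(row, x)]
            for row, pi, yi in zip(theta, p, y)]


# Toy example (made up): two inputs, two output units.
x = [1.0, 2.0]
y = [1.0, 0.0]
theta = [[0.0, 0.0], [0.0, 0.0]]

errors = []
for _ in range(50):
    errors.append(lms_error(predict(theta, x), y))
    theta = delta_rule_step(theta, x, y, eta=0.1)

final_error = lms_error(predict(theta, x), y)
```

In this sketch the recorded `errors` shrink from one iteration to the next, and the weights stop changing appreciably once the prediction matches the target, which is the convergence behavior described above.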
If the step size is too small, the network will take longer to converge and the learning process will be very slow. On the other hand, taking very large steps can result in unstable, erratic behavior during training, as a result of which the network may not converge at all. Therefore, setting the step size to an appropriate value is important for network training. We will discuss different approaches to setting the step size in Section 5.3 for CNN training, which are equally applicable to MLPs.
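The effect of the step size can be illustrated with a single linear unit trained on one example. The sketch below is an assumption-laden toy: the input, target, tolerance, divergence threshold, and the three step sizes are arbitrary values chosen for illustration, and `train` is a hypothetical helper rather than a method from the text.

```python
def train(eta, x=2.0, y=1.0, max_iters=10000, tol=1e-6, blowup=1e6):
    """Run the delta rule on one linear unit p = theta * x.

    Returns a (status, iterations) pair, where status is one of
    "converged", "diverged", or "not converged".
    """
    theta = 0.0
    for it in range(max_iters):
        residual = y - theta * x
        if abs(residual) < tol:
            return "converged", it
        if abs(residual) > blowup:
            return "diverged", it
        # Gradient-descent update for E = 1/2 * (y - theta * x)^2.
        theta += eta * residual * x
    return "not converged", max_iters


small = train(eta=0.001)  # tiny steps: converges, but slowly
good = train(eta=0.2)     # moderate steps: converges quickly
large = train(eta=0.6)    # oversized steps: the error grows every update
```

In this setting each update multiplies the residual by (1 - η x²), so the residual shrinks slowly when η is very small, shrinks quickly for a moderate η, and grows in magnitude once η x² exceeds 2.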
Generalized Delta Rule
The generalized delta rule, proposed by Rumelhart et al. [1985], is an extension of the delta rule. The delta rule only computes linear combinations between the input and the output pairs. This limits us to a single-layered network, because a stack of many linear layers is no more expressive than a single linear transformation. To overcome this limitation, the generalized delta rule makes use of nonlinear activation functions at each processing unit to model nonlinear relationships between the input and output domains. It also allows us to make use of multiple hidden layers in the neural network architecture, a concept which forms the heart of deep learning.
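The claim that stacked linear layers collapse into a single linear transformation can be checked numerically. The sketch below uses two small, made-up weight matrices and compares applying them one after the other with applying their product once. The ReLU nonlinearity at the end is an assumed example of an activation function, used here only to show that the collapse no longer holds once a nonlinearity is inserted.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix product A @ B for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]


def relu(v):
    """Element-wise max(0, a); an assumed nonlinearity for illustration."""
    return [max(0.0, a) for a in v]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]    # 3 hidden units -> 2 outputs
x = [0.5, -2.0]

stacked = matvec(W2, matvec(W1, x))           # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)         # one equivalent linear layer
nonlinear = matvec(W2, relu(matvec(W1, x)))   # nonlinearity in between
```

Here `stacked` and `collapsed` agree, whereas `nonlinear` differs from both, which is why nonlinear activations are needed for additional layers to add representational power.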
The parameters of a multi-layered neural network are updated in the same manner as in the delta rule, i.e.,
$$\theta_{ij}^{t+1} = \theta_{ij}^{t} - \eta \frac{\partial E}{\partial \theta_{ij}}$$
Unlike the delta rule, however, the errors are recursively sent backward through the multi-layered network. For this reason, the generalized delta rule is also called the “back-propagation” algorithm. Because a neural network trained with the generalized delta rule has intermediate hidden layers in addition to an output layer, we calculate the error term (the differential with respect to the desired output) separately for the output and hidden layers. Since the case of the output layer is simpler, we discuss the error computation for this layer first.
Given the error function in Eq. (3.1), its gradient with respect to the parameters in the output layer L for each node i can be computed as follows: