L = -\frac{1}{n}\sum_{i=1}^{n}\left[ y^{(i)}\log\hat{y}^{(i)} + \left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right) \right]. \qquad (3.22)
The cross entropy measures the divergence between the predicted and real probability distributions. A large cross entropy value means that the two distributions are clearly distinct, while a small value suggests that the two distributions are close to each other. Compared to the quadratic cost function, the cross entropy often enjoys faster convergence and is more likely to lead the optimization toward a global optimum.
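As a quick numerical illustration, equation (3.22) can be sketched in a few lines of NumPy; the clipping constant `eps` is our own guard against taking log(0) and is not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross entropy, equation (3.22); eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])
close = binary_cross_entropy(y, np.array([0.9, 0.1, 0.8]))  # predictions near labels
far = binary_cross_entropy(y, np.array([0.2, 0.9, 0.3]))    # predictions far from labels
assert close < far  # close distributions give a small cross entropy
```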
The idea of cross entropy can be extended to the cases of multi-classification tasks, leading to the concept of the multi-class cross entropy (see appendix A for more details).
Poisson
The Poisson loss function offers a measure of how the predicted distribution diverges from the expected distribution. The Poisson distribution is widely used for modeling photon-count data, which is highly relevant to medical imaging signals. The Poisson loss function is computed as follows:
L = \frac{1}{n}\sum_{i=1}^{n}\left( \hat{y}^{(i)} - y^{(i)}\log\hat{y}^{(i)} \right). \qquad (3.23)
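A minimal NumPy sketch of equation (3.23); the `eps` guard against log(0) is our own addition:

```python
import numpy as np

def poisson_loss(y, y_hat, eps=1e-12):
    """Poisson loss, equation (3.23); eps avoids log(0)."""
    return np.mean(y_hat - y * np.log(y_hat + eps))

# the loss is minimized when the predicted counts match the measured counts
counts = np.array([3.0, 7.0, 5.0])
assert poisson_loss(counts, counts) < poisson_loss(counts, counts + 1.0)
```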
Cosine proximity
The cosine proximity function calculates the angular distance between the predicted and real vectors in terms of their inner product. It is well known that the inner product of the two unit vectors gives the cosine value of the angle between them. The formula is as follows:
L = -\frac{y \cdot \hat{y}}{\|y\|_2 \, \|\hat{y}\|_2} = -\frac{\sum_{i=1}^{n} y^{(i)}\hat{y}^{(i)}}{\sqrt{\sum_{i=1}^{n}\left(y^{(i)}\right)^2} \cdot \sqrt{\sum_{i=1}^{n}\left(\hat{y}^{(i)}\right)^2}}. \qquad (3.24)
It is also called cosine similarity. In this context, the vectors are maximally ‘similar’ if they are parallel and maximally ‘dissimilar’ if they are orthogonal.
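Equation (3.24) can be sketched directly with NumPy, confirming the two extreme cases:

```python
import numpy as np

def cosine_proximity(y, y_hat):
    """Negative cosine similarity, equation (3.24)."""
    return -np.dot(y, y_hat) / (np.linalg.norm(y) * np.linalg.norm(y_hat))

v = np.array([1.0, 2.0, 3.0])
# parallel vectors: maximally 'similar', loss reaches its minimum of -1
assert np.isclose(cosine_proximity(v, 2 * v), -1.0)
# orthogonal vectors: maximally 'dissimilar', loss is 0
assert np.isclose(cosine_proximity(np.array([1.0, 0.0]), np.array([0.0, 1.0])), 0.0)
```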
3.1.7 Backpropagation algorithm
For a multi-layer neural network, the initial weights are randomly initialized. As such, the network does not immediately perform well on data. We need to iteratively adjust these weights based on the error between training labels and predicted values by the network so as to obtain a converged/trained neural network with an appropriate internal representation, which can then map from an input to a desirable output. Such a training process is almost exclusively done using the backpropagation algorithm (Rumelhart et al 1986).
Backpropagation is short for 'error backward propagation', and essentially implements gradient descent optimization. In backpropagation, the error or loss is calculated at the output of the network, and then distributed backwards throughout the network layers. How does backpropagation work exactly? Let us explain it now, illustrating the backpropagation process with a simple neural network.
Backpropagation is achieved by iteratively calculating the gradient of the loss function layer by layer using the chain rule. In order to illustrate the principle of backpropagation, we rewrite the loss function equation (3.14), as follows:
J(W,b) = \frac{1}{n}\sum_{i=1}^{n} L\left(W,b;\, x^{(i)}, y^{(i)}\right). \qquad (3.25)
Compared to equation (3.14), the loss function equation (3.25) is a more specific formula that contains all parameters to be trained in the neural network, as well as training data and labels. Each gradient descent iteration updates the parameters W, b in the matrix/vector form as follows:
w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha \frac{\partial}{\partial w_{ij}^{(l)}} J(W,b) \qquad (3.26)

b_{i}^{(l)} \leftarrow b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b), \qquad (3.27)
where α is the learning rate, w_{ij}^{(l)} denotes the weight between neuron i in layer l and neuron j in layer (l−1), and b_i^{(l)} denotes the bias of neuron i in layer l. The key step is to compute the above partial derivatives. The following paragraphs describe the backpropagation algorithm, which is an efficient way to compute these partial derivatives.
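One iteration of equations (3.26)–(3.27) can be sketched as below; the list-of-arrays layout for the parameters and gradients is our own choice for illustration, with the gradients assumed to come from backpropagation:

```python
import numpy as np

def gradient_descent_step(W, b, grad_W, grad_b, alpha):
    """One update of equations (3.26)-(3.27) over all layers.

    W, b: per-layer weight matrices and bias vectors.
    grad_W, grad_b: the corresponding partial derivatives of J.
    alpha: the learning rate.
    """
    for l in range(len(W)):
        W[l] = W[l] - alpha * grad_W[l]
        b[l] = b[l] - alpha * grad_b[l]
    return W, b
```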
As an example, consider a neural network which consists of two input units, one output unit, and no hidden unit, and each neuron only outputs the weighted sum of its input data, as shown in figure 3.15.
Figure 3.15. Toy neural network with two input units and one output unit.
For such a simple neural network, the loss function is as follows:
L = \frac{1}{2}\left(y - \hat{y}\right)^2, \qquad (3.28)
where L is the squared error, y is the target output for a training sample, and \hat{y} is the actual output.
With the sigmoid function as the activation function, for each neuron j, its output oj is equal to
o_j = \sigma\left(s_j\right) \qquad (3.29)

s_j = \sum_{k=1}^{n} w_{kj} o_k, \qquad (3.30)
where s_j inside a neuron is generally the weighted sum of the outputs o_k of the neurons in the previous layer; when the neuron is in the first layer after the input layer, s_j is the weighted sum of the input data x_k, k = 1, 2. The number of inputs to the neuron is denoted by n, and σ is the sigmoid function defined by equation (3.3).
The activation function σ is nonlinear and differentiable, whose derivative is as follows:
\frac{d\sigma}{dx}(x) = \sigma(x)\left(1 - \sigma(x)\right). \qquad (3.31)
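This identity in equation (3.31) is easy to verify numerically against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# compare the analytic derivative sigma(x)(1 - sigma(x)) with a
# central finite difference at an arbitrary test point
x, h = 0.7, 1e-5
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert abs(analytic - numeric) < 1e-8
```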
Then, the partial derivative of the error with respect to the weight w_{ij} is calculated using the chain rule:
\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial o_j} \frac{\partial o_j}{\partial s_j} \frac{\partial s_j}{\partial w_{ij}}. \qquad (3.32)
Since in the last factor on the right-hand side only one term in the sum sj depends on wij, we have
\frac{\partial s_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_{k=1}^{n} w_{kj} o_k = \frac{\partial}{\partial w_{ij}} \left( w_{ij} o_i \right) = o_i. \qquad (3.33)
As just mentioned, if the neuron is in the first layer after the input layer, o_i is the same as x_i. The derivative of the output of neuron j with respect to its input is simply the partial derivative of the activation function:
\frac{\partial o_j}{\partial s_j} = \frac{\partial}{\partial s_j} \sigma\left(s_j\right) = \sigma\left(s_j\right)\left(1 - \sigma\left(s_j\right)\right). \qquad (3.34)
In equation (3.32), the first factor is easy to evaluate when the neuron is in the output layer, because in that case o_j = \hat{y} and
\frac{\partial L}{\partial o_j} = \frac{\partial L}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} \frac{1}{2}\left(y - \hat{y}\right)^2 = \hat{y} - y. \qquad (3.35)
However, if j is in a hidden layer of the network, finding the derivative of L with respect to o_j is less obvious. Let M = \{u, v, \ldots, w\} be the set of indices of all neurons that receive input from neuron j; then we have
\frac{\partial L\left(o_j\right)}{\partial o_j} = \frac{\partial L\left(s_u, s_v, \ldots, s_w\right)}{\partial o_j}. \qquad (3.36)
Then, let us take the total derivative with respect to oj; a recursive expression for the derivative is obtained as follows:
\frac{\partial L\left(o_j\right)}{\partial o_j} = \sum_{m \in M} \frac{\partial L}{\partial s_m} \frac{\partial s_m}{\partial o_j} = \sum_{m \in M} \frac{\partial L}{\partial o_m} \frac{\partial o_m}{\partial s_m} w_{jm}. \qquad (3.37)
Therefore, the derivative with respect to oj can be calculated if all the derivatives with respect to the outputs om of the next layer, and those for the neurons in the output layer are known. The overall computational process can be expressed as follows:
\frac{\partial L}{\partial w_{ij}} = \delta_j o_i, \qquad \delta_j = \frac{\partial L}{\partial o_j}\frac{\partial o_j}{\partial s_j} = \begin{cases} \left(o_j - y_j\right) o_j \left(1 - o_j\right) & \text{if } j \text{ is an output neuron} \\ \left(\sum_{m \in M} w_{jm} \delta_m\right) o_j \left(1 - o_j\right) & \text{if } j \text{ is a hidden neuron.} \end{cases} \qquad (3.38)
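The whole forward/backward process of equation (3.38) can be sketched for a fully connected sigmoid network with the squared-error loss; the variable names and the list-of-matrices layout are our own choices for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop(Ws, x, y):
    """Gradients of L = 0.5 * ||y_hat - y||^2 w.r.t. each weight matrix.

    Ws[l] has shape (units in layer l+1, units in layer l); biases are
    omitted here to keep the sketch short.
    """
    # forward pass: store the output o of every layer
    os = [x]
    for W in Ws:
        os.append(sigmoid(W @ os[-1]))
    # backward pass: delta for the output layer, then the recursion of eq. (3.38)
    delta = (os[-1] - y) * os[-1] * (1 - os[-1])
    grads = []
    for l in range(len(Ws) - 1, -1, -1):
        grads.insert(0, np.outer(delta, os[l]))          # dL/dw_ij = delta_j * o_i
        if l > 0:
            delta = (Ws[l].T @ delta) * os[l] * (1 - os[l])  # hidden-neuron case
    return grads
```

A standard sanity check is to compare the returned gradients against finite differences of the loss, which confirms the recursion layer by layer.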