Stochastic gradient descent is very effective, particularly when combined with a momentum term, δ(t-1):
Because stochastic gradient descent does not need to consider the entire training data set when calculating each descent step’s gradient, it is usually faster than batch gradient descent. However, because each iteration tries to better fit a single observation, some of the gradients might actually point away from the minimum. This means that, although stochastic gradient descent generally moves the parameters in the direction of an error minimum, it might not do so on each iteration. The result is a more circuitous path. In fact, stochastic gradient descent does not actually converge in the same sense as batch gradient descent does. Instead, it wanders around continuously in some region that is close to the minimum (Ng, 2013).
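The behavior described above can be illustrated with a minimal sketch. It is not SAS code and not the book's formulation: it assumes the classical momentum form, in which the current step equals the negative learning rate times the single-observation gradient plus a fraction of the previous step δ(t-1). The function name sgd_momentum, the toy data, and the hyperparameter values are hypothetical and chosen only for illustration.

```python
import random


def sgd_momentum(data, lr=0.01, momentum=0.9, epochs=50, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent with a momentum term.

    Assumption: classical momentum, step(t) = -lr * grad + momentum * step(t-1).
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    step_w, step_b = 0.0, 0.0  # previous steps, playing the role of delta(t-1)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the observations in a random order
        for x, y in data:
            err = (w * x + b) - y
            # Gradient of 0.5 * err**2 for this single observation only
            grad_w, grad_b = err * x, err
            step_w = -lr * grad_w + momentum * step_w
            step_b = -lr * grad_b + momentum * step_b
            w += step_w
            b += step_b
    return w, b


# Noise-free toy data generated from y = 2x + 1 (illustrative only)
points = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = sgd_momentum(points)
```

Each update uses one observation, so individual steps can point away from the minimum even though the estimates drift toward the generating slope and intercept over many iterations.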
Introduction to ADAM Optimization
The ADAM method applies adjustments to the learned gradients for each individual model parameter in an adaptive manner by approximating second-order information about the objective function based on previously observed mini-batch gradients. The name ADAM is derived from adaptive moment estimation (Kingma and Ba, 2014).
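The ADAM method introduces two new hyperparameters to the mix, β1 and β2, which control the exponential decay rates of the running averages of the gradient and of the squared gradient (Kingma and Ba, 2014).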
When the approximated signal-to-noise ratio is small, the step size is near zero. This is a nice feature because a lower signal-to-noise ratio is an indication of higher uncertainty. Thus, more cautious steps should be taken in the parameter space (Kingma and Ba 2014).
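The link between the signal-to-noise ratio and the step size can be sketched for a single scalar parameter. This is a minimal illustration that follows the update rule published by Kingma and Ba (2014), not the SAS implementation; the helper names adam_step and run, and the two artificial gradient sequences, are hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a single scalar parameter (sketch)."""
    m = beta1 * m + (1.0 - beta1) * grad         # running average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad  # running average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    snr = m_hat / (math.sqrt(v_hat) + eps)       # approximate signal-to-noise ratio
    theta = theta - lr * snr                     # the step shrinks as |snr| shrinks
    return theta, m, v, snr


def run(grads):
    """Apply ADAM to a sequence of gradients, starting from theta = 0."""
    theta, m, v, snr = 0.0, 0.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        theta, m, v, snr = adam_step(theta, g, m, v, t)
    return theta, snr


# Gradients that agree in sign versus gradients that flip sign on every step
theta_consistent, snr_consistent = run([1.0] * 200)
theta_noisy, snr_noisy = run([1.0, -1.0] * 100)
```

With consistent gradients the ratio stays near 1 and the parameter moves by roughly the learning rate on every step. With alternating gradients the ratio is close to zero, so the parameter barely moves.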
To use ADAM, specify ‘ADAM’ in the METHOD= suboption of the ALGORITHM= option in the OPTIMIZER parameter. The suboptions for β1 and β2, as well as the α and other options, also need to be specified. In the example code below, β1 = .9, β2 = .999 and α = .001.
optimizer={algorithm={method='ADAM', beta1=0.9, beta2=0.999, learningrate=.001, lrpolicy='Step', gamma=0.5}, minibatchsize=100, maxepochs=200}
Note: The authors of ADAM recommend a β1 value of .9, a β2 value of .999, and an α (learning rate) of .001.
Weight Initialization
Deep learning uses different methods of weight initialization than traditional neural networks do. In neural networks, the hidden unit weights are randomly initialized so that each hidden unit approximates a different aspect of the relationship between the inputs and the output. If the weights across hidden units were identical, or even symmetric, each hidden unit would approximate the same relational variations. The hidden unit weights are usually drawn at random from a specified distribution, commonly Gaussian or Uniform.
Traditional neural networks use a standard variance for the randomly initialized hidden unit weights. This can become problematic when there is a large amount of incoming information (that is, a large number of incoming connections) because the variance of the hidden unit will likely increase as the number of incoming connections increases. This means that the output of the combination function could be more extreme, resulting in a saturated hidden unit (Daniely et al. 2017).
Deep learning methods use a normalized initialization in which the variance of the hidden weights is a function of the amount of incoming information and outgoing information. SAS offers several methods for reducing the variance of the hidden weights. Xavier initialization is one of the most common weight initialization methods used in deep learning. The initialization method is random uniform with variance 2 / (m + n), where m is the number of input connections (fan-in) and n is the number of output connections (fan-out; hidden units in current layer).
One potential flaw of the Xavier initialization is that the initialization method assumes a linear activation function, which is typically not the case in hidden units. MSRA was designed with the ReLU activation function in mind because MSRA operates under the assumption of a nonzero mean output by the activation, which is exhibited by ReLU (He et al. 2015). The MSRA initialization method is a random Gaussian distribution with a standard deviation of √(2 / m), where m is the number of input connections (fan-in).
SAS includes a second variant of MSRA, called MSRA2. Similar to the MSRA initialization, the MSRA2 method is a random Gaussian distribution with a standard deviation of √(2 / n), where n is the number of output connections (fan-out). That is, it penalizes only for outgoing (fan-out) information.
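The three initialization schemes can be compared side by side in a short sketch. It assumes the commonly cited forms: a variance of 2 / (fan-in + fan-out) for Xavier, a standard deviation of √(2 / fan-in) for MSRA, and a standard deviation of √(2 / fan-out) for MSRA2. The function names init_std and init_weights are hypothetical and are not SAS options.

```python
import math
import random


def init_std(method, fan_in, fan_out):
    """Standard deviation of the initial weights under each scheme (assumed forms)."""
    method = method.upper()
    if method == "XAVIER":
        return math.sqrt(2.0 / (fan_in + fan_out))
    if method == "MSRA":
        return math.sqrt(2.0 / fan_in)   # penalizes incoming connections only
    if method == "MSRA2":
        return math.sqrt(2.0 / fan_out)  # penalizes outgoing connections only
    raise ValueError("unknown initialization method: " + method)


def init_weights(method, fan_in, fan_out, rng):
    """Draw fan_in * fan_out weights: uniform for Xavier, Gaussian otherwise."""
    std = init_std(method, fan_in, fan_out)
    count = fan_in * fan_out
    if method.upper() == "XAVIER":
        # A uniform(-a, a) variable has variance a**2 / 3, so a = sqrt(3) * std
        limit = math.sqrt(3.0) * std
        return [rng.uniform(-limit, limit) for _ in range(count)]
    return [rng.gauss(0.0, std) for _ in range(count)]


rng = random.Random(42)
xavier_weights = init_weights("xavier", 25, 25, rng)
msra_weights = init_weights("msra", 25, 25, rng)
```

With 25 incoming and 25 outgoing connections, the Xavier standard deviation under these assumptions is √(2 / 50) = 0.2, while MSRA and MSRA2 both give √(2 / 25), or about 0.28.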
Note: Weight initializations have less impact on model performance if batch normalization is used because batch normalization standardizes information passed between hidden layers. Batch normalization is discussed later in this chapter.
Consider the following simple example where unit y is being derived from 25 randomly initialized weights. The variance of unit y is larger when the standard deviation is held constant at 1. This means that the values for y are more likely to venture into a saturation region when a nonlinear activation function is incorporated. On the other hand, Xavier’s initialization penalizes the variance for the incoming and outgoing connections, constraining the value of y to less treacherous regions of the activation. See Figures 1.7 and 1.8, noting that these examples assume that there are 25 incoming and outgoing connections.
Figure 1.7: Constant Variance (Standard Deviation = 1)
Figure 1.8: Constant Variance (Standard Deviation = √(2 / (m + n)))
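The comparison in the two figures can be approximated with a small simulation. It is a sketch under stated assumptions rather than a reproduction of the figures: the 25 inputs are fixed at 1 so that y is simply the sum of the weights, the weights are drawn from a Gaussian distribution in both cases, and the Xavier standard deviation is taken to be √(2 / (25 + 25)) = 0.2. The function names and the saturation threshold of 2 are hypothetical.

```python
import math
import random
import statistics


def simulate_unit(weight_std, n_inputs=25, n_trials=5000, seed=1):
    """Simulate y as the sum of n_inputs random weights (inputs assumed to be 1)."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, weight_std) for _ in range(n_inputs))
        for _ in range(n_trials)
    ]


def saturated_share(ys, threshold=2.0):
    """Share of y values beyond the threshold; |tanh(y)| exceeds about 0.96 there."""
    return sum(1 for y in ys if abs(y) > threshold) / len(ys)


constant_ys = simulate_unit(1.0)                         # standard deviation held at 1
xavier_ys = simulate_unit(math.sqrt(2.0 / (25 + 25)))    # assumed Xavier scaling

var_constant = statistics.pvariance(constant_ys)
var_xavier = statistics.pvariance(xavier_ys)
share_constant = saturated_share(constant_ys)
share_xavier = saturated_share(xavier_ys)
```

Under these assumptions the variance of y is about 25 when the standard deviation is held at 1 and about 1 with the scaled weights, so far fewer simulated values land in the region where a tanh activation would be nearly flat.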
Regularization
Regularization is a process of introducing or removing information to stabilize an algorithm’s understanding of data. Regularizations such as early stopping, L1, and L2 have been used extensively in neural networks for many years. These regularizations are still widely used in deep learning, as well. However, there have been advancements in the area of regularization that work particularly well when combined with multi-hidden layer neural networks. Two of these advancements, dropout and batch normalization, have shown significant promise in deep learning models. Let’s begin with a discussion of dropout and then examine batch normalization.
Training an ensemble of deep neural networks with several hundred thousand parameters each might be infeasible. As seen in Figure 1.9, dropout instead adds noise to the learning process so that the model is more generalizable.
Figure 1.9: Regularization Techniques
The goal of dropout is to approximate an ensemble of many possible model structures through a process that perturbs the learning in an attempt to prevent weights from co-adapting. For example, imagine we are training a neural network to identify human faces, and one of the hidden units used in the model sufficiently captures the mouth. All other hidden units are now relying, at least in some part, on this hidden unit to help identify a face through the presence of the mouth. Removing the hidden unit that captures the mouth forces the remaining hidden units to adjust and compensate. This process pushes each hidden unit to be more of a “generalist” than a “specialist” because each hidden unit