The architecture starts from a plain network inspired by VGGNet (built mostly from 3 × 3 filters), into which shortcut connections are inserted to form a residual network, as shown in the figure. Figure 1.7(b) shows the 34-layer plain network converted into a residual network; the 34-layer residual network attains lower training error than the 18-layer residual network. As in GoogLeNet, the network ends with a global average pooling layer followed by the classification layer. ResNets were trained up to a maximum depth of 152 layers. ResNet is more accurate than GoogLeNet and VGGNet and is computationally more efficient than VGGNet. ResNet-152 achieves 95.51% top-5 accuracy. Figure 1.7(a) shows a residual block, Figure 1.7(b) shows the architecture of ResNet, and Table 1.7 shows the parameters of ResNet.
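To make the residual block of Figure 1.7(a) concrete, a minimal PyTorch sketch is given below. It assumes equal input and output channel counts so the shortcut is a plain identity; the full ResNet additionally uses 1 × 1 projection shortcuts when the spatial size or channel count changes.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                  # shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                          # element-wise addition of the skip path
        return self.relu(out)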
1.2.7 ResNeXt
The ResNeXt [7] architecture builds on the advantages of ResNet (residual networks) and GoogLeNet (multi-branch architecture) and requires fewer hyperparameters than the traditional ResNet. ResNeXt introduces a new dimension called cardinality, in addition to the depth and width of ResNet. The input is split channel-wise into groups, and the standard residual block is replaced with a "split-transform-merge" procedure, as sketched in the code below. The architecture stacks a series of residual blocks and follows two rules: (1) blocks producing spatial maps of the same size share the same hyperparameters; (2) each time the spatial map is downsampled by a factor of 2, the width of the blocks is doubled. ResNeXt was the first runner-up of the ILSVRC 2016 classification task and produces better results than ResNet. Figure 1.8 shows the architecture of ResNeXt, and its comparison with ResNet is shown in Table 1.8.
Figure 1.7 (a) A residual block.
Table 1.7 Various parameters of ResNet.
Figure 1.8 Architecture of ResNeXt.
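The split-transform-merge procedure can be implemented compactly with a grouped convolution whose number of groups equals the cardinality. The following is a minimal PyTorch sketch of one ResNeXt bottleneck block; the channel widths are illustrative (e.g. a 256-d block with bottleneck width 4 and cardinality 32), not the exact configuration of every stage in [7].

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Bottleneck block with a grouped 3x3 convolution; groups = cardinality."""
    def __init__(self, in_channels, bottleneck_width=4, cardinality=32):
        super().__init__()
        inner = bottleneck_width * cardinality        # e.g. 4 * 32 = 128
        self.reduce = nn.Conv2d(in_channels, inner, kernel_size=1, bias=False)
        self.group_conv = nn.Conv2d(inner, inner, kernel_size=3, padding=1,
                                    groups=cardinality, bias=False)   # split + transform
        self.expand = nn.Conv2d(inner, in_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(inner)
        self.bn2 = nn.BatchNorm2d(inner)
        self.bn3 = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.group_conv(out)))   # merge is implicit in the grouped conv
        out = self.bn3(self.expand(out))
        return self.relu(out + x)                          # residual addition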
1.2.8 SE-ResNet
Hu et al. [8] proposed the Squeeze-and-Excitation Network (SENet), which took first place in the ILSVRC 2017 classification task, using a lightweight gating mechanism. The architecture explicitly models interdependencies between the channels of convolutional features to achieve dynamic channel-wise feature recalibration. In the squeeze phase, the SE block applies a global average pooling operation, and in the excitation phase it applies channel-wise scaling. For an input image of size 224 × 224, the running time of ResNet-50 is 164 ms, whereas it is 167 ms for SE-ResNet-50. Also, SE-ResNet-50 requires ∼3.87 GFLOPs, a 0.26% relative increase over the original ResNet-50. The top-5 error is reduced to 2.251%. Figure 1.9 shows the architecture of SE-ResNet, and Table 1.9 compares ResNet with SE-ResNet-50 and SE-ResNeXt-50.
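A minimal PyTorch sketch of the squeeze-and-excitation operation is given below: global average pooling (squeeze), a two-layer gating network (excitation), and channel-wise rescaling of the input feature map. The reduction ratio of 16 follows the common setting in [8]; the surrounding residual block is omitted.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: recalibrates the channel responses of a feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.excite = nn.Sequential(                      # excitation: lightweight gating
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)        # (B, C) channel descriptors
        w = self.excite(w).view(b, c, 1, 1)   # per-channel scaling weights in [0, 1]
        return x * w                          # channel-wise feature recalibration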
1.2.9 DenseNet
The architecture proposed in [9] connects every layer directly to every other layer so as to ensure maximum information (and gradient) flow; thus, a model with L layers has L(L+1)/2 direct connections. A number of dense blocks (groups of layers connected to all preceding layers) and transition layers control the complexity of the model. Each layer within a dense block adds a fixed number of channels (the growth rate) to the model. A transition layer reduces the number of channels with a 1 × 1 convolutional layer and halves the width and height with an average pooling layer of stride 2. Within a dense block, each layer concatenates the output feature maps of all preceding layers along the channel dimension.
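A minimal PyTorch sketch of a dense block and a transition layer is given below, assuming a simple BN-ReLU-Conv ordering and a growth rate k; the full DenseNet in [9] additionally uses 1 × 1 bottleneck layers and a compression factor in the transition layers.

import torch
import torch.nn as nn

def conv_block(in_channels, growth_rate):
    """BN -> ReLU -> 3x3 conv producing growth_rate new feature maps."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False))

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps."""
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList(
            conv_block(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers))

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # concatenate along the channel axis
        return x

def transition_layer(in_channels, out_channels):
    """1x1 conv reduces the channel count; stride-2 average pooling halves H and W."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2))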