Figure 1.5 Inception module.
Figure 1.6 Architecture of GoogleNet.
First layer: The input image is 224 × 224 × 3. The convolutional layer uses 64 kernels of size 7 × 7 × 3 with stride 2, producing an output feature map of 112 × 112 × 64. This is followed by ReLU and 3 × 3 max pooling with stride 2, which reduces the feature map to 56 × 56 × 64. Local response normalization is then applied.
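The feature-map sizes quoted for the first layer follow from the usual sliding-window size formula. The sketch below is only illustrative: the helper `conv_out` is a hypothetical name, and the padding of 3 for the convolution and the round-up (ceil) rule for the pooling are assumptions chosen so that the arithmetic reproduces 112 and 56; the text does not state them.

```python
import math

def conv_out(size, kernel, stride, padding=0, ceil_mode=False):
    """Output length along one axis for a convolution or pooling window."""
    span = (size + 2 * padding - kernel) / stride
    steps = math.ceil(span) if ceil_mode else math.floor(span)
    return steps + 1

# 7 x 7 convolution with stride 2 (padding 3 is an assumption)
after_conv = conv_out(224, kernel=7, stride=2, padding=3)
# 3 x 3 max pooling with stride 2 (ceil rounding is an assumption)
after_pool = conv_out(after_conv, kernel=3, stride=2, ceil_mode=True)

print(after_conv, after_pool)  # 112 56
```

A padding of 2 also yields 112 for the convolution if the division is rounded up instead of down, so the exact padding depends on the rounding convention of the framework in use.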
Second layer: This is a simplified inception module. A 1 × 1 convolution with 64 filters first generates feature maps from the previous layer's output, and a 3 × 3 convolution with 192 filters (stride 1) is then applied. ReLU and local response normalization follow. Finally, 3 × 3 max pooling with stride 2 produces 192 feature maps of size 28 × 28.
Third layer: This is a complete inception module. The previous layer's output is 28 × 28 with 192 channels, and four branches originate from it. The first branch uses a 1 × 1 convolution with 64 filters and ReLU, generating a 28 × 28 × 64 feature map. The second branch uses a 1 × 1 convolution with 96 filters (with ReLU) before a 3 × 3 convolution with 128 filters, generating a 28 × 28 × 128 feature map. The third branch uses a 1 × 1 convolution with 16 filters (with ReLU) before a 5 × 5 convolution with 32 filters, generating a 28 × 28 × 32 feature map. The fourth branch contains a 3 × 3 max pooling layer followed by a 1 × 1 convolution with 32 filters, generating a 28 × 28 × 32 feature map. The feature maps of the four branches are then concatenated, giving an output of 28 × 28 with 256 channels (64 + 128 + 32 + 32).
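The channel bookkeeping of this module can be checked with a few lines of Python. This is a minimal sketch: the class name `InceptionSpec` and its field names are hypothetical, and the weight count ignores biases and assumes plain (ungrouped) convolutions.

```python
from dataclasses import dataclass

@dataclass
class InceptionSpec:
    in_channels: int
    n1x1: int         # filters in the 1 x 1 branch
    n3x3_reduce: int  # 1 x 1 filters placed before the 3 x 3 convolution
    n3x3: int
    n5x5_reduce: int  # 1 x 1 filters placed before the 5 x 5 convolution
    n5x5: int
    pool_proj: int    # 1 x 1 filters placed after the max pooling

    def out_channels(self):
        # Concatenation along the channel axis adds the branch widths.
        return self.n1x1 + self.n3x3 + self.n5x5 + self.pool_proj

    def weights(self):
        # Weight count without biases (an assumption of this sketch).
        c = self.in_channels
        return (c * self.n1x1
                + c * self.n3x3_reduce + 9 * self.n3x3_reduce * self.n3x3
                + c * self.n5x5_reduce + 25 * self.n5x5_reduce * self.n5x5
                + c * self.pool_proj)

third_layer = InceptionSpec(192, 64, 96, 128, 16, 32, 32)
print(third_layer.out_channels())  # 256
print(third_layer.weights())       # 163328
```

The spatial size stays 28 × 28 in every branch, so only the channel counts change at the concatenation.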
Fourth layer: This is also an inception module. The input feature map is 28 × 28 × 256. The branches are: a 1 × 1 convolution with 128 filters and ReLU; a 1 × 1 convolution with 128 filters as a reduction before a 3 × 3 convolution with 192 filters; a 1 × 1 convolution with 32 filters as a reduction before a 5 × 5 convolution with 96 filters; and 3 × 3 max pooling with padding 1 before a 1 × 1 convolution with 64 filters. The branch outputs are 28 × 28 × 128, 28 × 28 × 192, 28 × 28 × 96, and 28 × 28 × 64, respectively, so the concatenated output is 28 × 28 × 480. Table 1.6 shows the parameters of GoogleNet.
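To see why the 1 × 1 "reduce" convolutions are worthwhile, one can compare the multiply-add count of the 5 × 5 branch with and without the reduction. The sketch below is a rough estimate under stated assumptions: the helper `mult_adds` is a hypothetical name, the output stays at 28 × 28, and biases and activations are ignored.

```python
def mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds of a convolution producing a height x width output."""
    return height * width * kernel * kernel * c_in * c_out

# 5 x 5 convolution applied directly to the 256 input channels
direct = mult_adds(28, 28, 5, 256, 96)

# 1 x 1 reduction to 32 channels, then the 5 x 5 convolution
reduced = mult_adds(28, 28, 1, 256, 32) + mult_adds(28, 28, 5, 32, 96)

print(direct, reduced)  # 481689600 66633728
print(round(direct / reduced, 1))

# The concatenated width of the four branches
print(128 + 192 + 96 + 64)  # 480
```

Under these assumptions the reduced branch needs roughly one seventh of the multiply-adds of the direct 5 × 5 convolution.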
1.2.6 ResNet
Usually, the input feature map is fed through a series of convolutional layers, a non-linear activation function (ReLU), and a pooling layer to produce the output for the next layer. Training is done by the back-propagation algorithm. The accuracy of a network can be improved by increasing its depth, but only up to a point: as the network converges, its accuracy saturates, and adding further layers degrades performance rapidly, which results in higher training error. To address this degradation, along with the problem of vanishing/exploding gradients, ResNet [6] introduced a residual learning framework in which the new layers fit a residual mapping rather than the desired mapping directly. When the identity mapping is already close to optimal, it is easier to push the residual to zero than to fit that mapping with a stack of non-linear layers. The principles of ResNet are residual learning, identity mapping, and skip connections. The idea behind residual learning is that the input is passed through the convolutional layers and is also carried forward by a skip connection; the two are added together, and the non-linear activation (ReLU) and pooling are then applied.
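The residual idea can be illustrated with a toy block that computes relu(F(x) + x). This is a minimal sketch, not the ResNet architecture itself: the function names are hypothetical, and small fully connected layers written in plain Python stand in for the convolutional layers (an assumption made to keep the example self-contained).

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F(x) = w2 * relu(w1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))         # residual mapping F(x)
    return relu([f + xi for f, xi in zip(fx, x)])  # skip connection adds x

x = [1.0, 2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]

# With all weights at zero the residual vanishes and the block passes x through.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 0.5]
```

Because the input here is non-negative (as it would be after a preceding ReLU), a zero residual makes the block an exact identity mapping, which is the case the text describes as easy to reach.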
Table 1.6 Various parameters of GoogleNet.
Layer name | Input size | Filter size | Window size | # Filters | Stride | Depth | # 1 × 1 | # 3 × 3 reduce | # 3 × 3 | # 5 × 5 reduce | # 5 × 5 | Pool proj | Padding | Output size | Params | Ops |
Convolution | 224 × 224 | 7 × 7 | - | 64 | 2 | 1 | | | | | | | 2 | 112 × 112 × 64 | 2.7M | 34M |
Max pool | 112 × 112 | - | 3 × 3 | - | 2 | 0 | | | | | | | 0 | 56 × 56 × 64 | | |
Convolution | 56 × 56 | 3 × 3 | - |
|