Table 1.4 Various parameters of ZFNet.
Layer name | Input size | Filter size | Window size | # Filters | Stride | Padding | Output size | # Feature maps | # Connections |
Conv 1 | 224 × 224 | 7 × 7 | - | 96 | 2 | 0 | 110 × 110 | 96 | 14,208 |
Max-pooling 1 | 110 × 110 | - | 3 × 3 | - | 2 | 0 | 55 × 55 | 96 | 0 |
Conv 2 | 55 × 55 | 5 × 5 | - | 256 | 2 | 0 | 26 × 26 | 256 | 614,656 |
Max-pooling 2 | 26 × 26 | - | 3 × 3 | - | 2 | 0 | 13 × 13 | 256 | 0 |
Conv 3 | 13 × 13 | 3 × 3 | - | 384 | 1 | 1 | 13 × 13 | 384 | 885,120 |
Conv 4 | 13 × 13 | 3 × 3 | - | 384 | 1 | 1 | 13 × 13 | 384 | 1,327,488 |
Conv 5 | 13 × 13 | 3 × 3 | - | 256 | 1 | 1 | 13 × 13 | 256 | 884,992 |
Max-pooling 3 | 13 × 13 | - | 3 × 3 | - | 2 | 0 | 6 × 6 | 256 | 0 |
Fully connected 1 | 4,096 neurons | 37,752,832 |
Fully connected 2 | 4,096 neurons | 16,781,312 |
Fully connected 3 | 1,000 neurons | 4,097,000 |
Softmax | 1,000 classes | 62,357,608 (Total) |
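The "# Connections" column of Table 1.4 follows directly from the layer shapes. As a quick check, the sketch below (plain Python, not tied to any framework) applies the standard weights-plus-biases parameter formulas to each trainable layer and reproduces the per-layer and total figures; pooling and softmax layers contribute no trainable parameters.

```python
# Reproduce the "# Connections" column of Table 1.4.
# Convolution: kh * kw * c_in * c_out + c_out (biases).
# Fully connected: n_in * n_out + n_out (biases).

def conv_params(kh, kw, c_in, c_out):
    return kh * kw * c_in * c_out + c_out

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

layers = {
    "Conv 1": conv_params(7, 7, 3, 96),      # 14,208
    "Conv 2": conv_params(5, 5, 96, 256),    # 614,656
    "Conv 3": conv_params(3, 3, 256, 384),   # 885,120
    "Conv 4": conv_params(3, 3, 384, 384),   # 1,327,488
    "Conv 5": conv_params(3, 3, 384, 256),   # 884,992
    "FC 1":   fc_params(6 * 6 * 256, 4096),  # 37,752,832
    "FC 2":   fc_params(4096, 4096),         # 16,781,312
    "FC 3":   fc_params(4096, 1000),         # 4,097,000
}

for name, n in layers.items():
    print(f"{name}: {n:,}")
print(f"Total: {sum(layers.values()):,}")    # 62,357,608
```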
Figure 1.4 Architecture of VGG-16.
1.2.5 GoogLeNet
In 2014, Google [5] proposed the Inception network for the ImageNet detection and classification challenges. The basic unit of this model is called the "Inception cell": a set of parallel convolutional layers with different filter sizes, which performs a series of convolutions at different scales and concatenates the results; the different filter sizes extract feature maps at different scales. To reduce the computational cost and the input channel depth, 1 × 1 convolutions are used. Max pooling with "same" padding is used so that the branch outputs can be concatenated properly, since it preserves the spatial dimensions. Beyond the original design, later versions of Inception, namely Inception v2, v3, and v4, as well as Inception-ResNet, have been defined. Figure 1.5 shows the Inception module and Figure 1.6 shows the architecture of GoogLeNet.
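The following sketch, written in PyTorch purely as an illustration (the framework and the class name InceptionCell are assumptions, not the original GoogLeNet implementation), shows how such a cell can be assembled: four parallel branches use 1 × 1 reductions and "same" padding so their outputs keep the same spatial size and can be concatenated along the channel axis. The branch widths in the usage line follow the inception (3a) configuration reported in the GoogLeNet paper; nonlinearities are omitted for brevity.

```python
import torch
import torch.nn as nn

class InceptionCell(nn.Module):
    def __init__(self, c_in, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Conv2d(c_in, c1, kernel_size=1)
        # Branch 2: 1x1 reduction, then 3x3 convolution ("same" padding = 1)
        self.b2 = nn.Sequential(
            nn.Conv2d(c_in, c3_red, kernel_size=1),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 reduction, then 5x5 convolution ("same" padding = 2)
        self.b3 = nn.Sequential(
            nn.Conv2d(c_in, c5_red, kernel_size=1),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        # Branch 4: 3x3 max pooling with stride 1 and padding 1 preserves the
        # spatial dimensions, followed by a 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(c_in, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate all branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Usage: the inception (3a) configuration (64 + 128 + 32 + 32 = 256 channels)
cell = InceptionCell(192, 64, 96, 128, 16, 32, 32)
y = cell(torch.randn(1, 192, 28, 28))   # -> shape (1, 256, 28, 28)
```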
Each image is resized so that the input to the network is 224 × 224 × 3, and the mean is subtracted before the training images are fed to the network. The dataset contains 1,000 categories, with 1.2 million images for training, 100,000 for testing, and 50,000 for validation. GoogLeNet is 22 layers deep, uses nine Inception modules, and replaces the fully connected layers with global average pooling, which reduces the 7 × 7 × 1,024 feature map to 1 × 1 × 1,024 and thereby saves a huge number of parameters. It also includes auxiliary softmax output units that act as a regularizer. The network was trained on high-end GPUs within a week and achieved a top-5 error rate of 6.67%. GoogLeNet trains faster
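To make the saving concrete, the sketch below (again PyTorch, assumed purely for illustration) builds the global-average-pooling head described above and counts its parameters; a fully connected layer of the same width in its place would need roughly 50 times more.

```python
import torch
import torch.nn as nn

# Global-average-pooling head: 7 x 7 x 1,024 -> 1 x 1 x 1,024 with zero
# trainable parameters; only the final 1,024 -> 1,000 classifier adds weights.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # average each 7x7 feature map to a single value
    nn.Flatten(),
    nn.Linear(1024, 1000),     # 1024 * 1000 + 1000 = 1,025,000 parameters
)

x = torch.randn(1, 1024, 7, 7)   # output of the last Inception stage
logits = head(x)                 # shape (1, 1000)

print(sum(p.numel() for p in head.parameters()))   # 1,025,000

# For comparison, a fully connected layer mapping the flattened 7*7*1024
# feature map to 1,024 units would alone require
# 7*7*1024*1024 + 1024 = 51,381,248 parameters.
```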