• Simplicity: Instead of problem specific tweaks and tailored feature detectors, deep networks offer basic architectural blocks, network layers, which are repeated several times to generate large networks.
• Scalability: Deep learning models are easily scalable to huge datasets. Other competing methods, e.g., kernel machines, encounter serious computational problems if the datasets are huge.
• Domain transfer: A model learned on one task is applicable to other related tasks and the learned features are general enough to work on a variety of tasks which may have scarce data available.
Due to the tremendous success in learning these deep neural networks, deep learning techniques are currently state-of-the-art for the detection, segmentation, classification and recognition (i.e., identification and verification) of objects in images. Researchers are now working to apply these successes in pattern recognition to more complex tasks such as medical diagnoses and automatic language translation. Convolutional Neural Networks (ConvNets or CNNs) are a category of deep neural networks which have proven to be very effective in areas such as image recognition and classification (see Chapter 7 for more details). Due to the impressive results of CNNs in these areas, this book is mainly focused on CNNs for computer vision tasks. Figure 1.3 illustrates the relation between computer vision, machine learning, human vision, deep learning, and CNNs.
1.3 BOOK OVERVIEW
CHAPTER 2
The book begins in Chapter 2 with a review of the traditional feature representation and classification methods. Computer vision tasks, such as image classification and object detection, have traditionally been approached using hand-engineered features, which fall into two main categories: global features and local features. Due to the popularity of low-level representations, this chapter first reviews three widely used low-level hand-engineered descriptors, namely Histogram of Oriented Gradients (HOG) [Triggs and Dalal, 2005], Scale-Invariant Feature Transform (SIFT) [Lowe, 2004], and Speeded-Up Robust Features (SURF) [Bay et al., 2008]. A typical computer vision system feeds these hand-engineered features to machine learning algorithms to classify images/videos. Two widely used machine learning algorithms, namely Support Vector Machines (SVM) [Cortes, 1995] and Random Decision Forests (RDF) [Breiman, 2001, Quinlan, 1986], are also introduced in detail.
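To make the idea of a hand-engineered descriptor concrete, the following is a minimal sketch of a gradient-orientation histogram. It is only inspired by HOG: the real descriptor also uses cells, overlapping blocks, and block normalization, none of which appear here. The function name `orientation_histogram` and the choice of nine bins are assumptions made for illustration.

```python
import math

def orientation_histogram(image, num_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `image` is a list of equal-length rows of grayscale intensities.
    This is a simplified, HOG-inspired feature, not the full HOG descriptor.
    """
    height, width = len(image), len(image[0])
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    # Central differences on interior pixels only (border pixels are skipped).
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % num_bins] += magnitude
    total = sum(hist)
    # Normalize so the histogram sums to 1 (left as zeros for a flat image).
    return [h / total for h in hist] if total > 0 else hist

# A vertical edge: intensity changes only along x, so every gradient is horizontal.
vertical_edge = [[0, 0, 1, 1] for _ in range(4)]
features = orientation_histogram(vertical_edge)
```

In a traditional pipeline, a fixed-length vector such as `features` would then be passed to a classifier such as an SVM.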
Figure 1.3: The relation between human vision, computer vision, machine learning, deep learning, and CNNs.
CHAPTER 3
The performance of a computer vision system is highly dependent on the features used. Therefore, current progress in computer vision has been based on the design of feature learners which minimize the gap between high-level representations (interpreted by humans) and low-level features (detected by the HOG [Triggs and Dalal, 2005] and SIFT [Lowe, 2004] algorithms). Deep neural networks are among the most popular feature learners, and they remove the need for complicated and problematic hand-engineered features. Unlike the standard feature extraction algorithms (e.g., SIFT and HOG), deep neural networks use several hidden layers to hierarchically learn a high-level representation of an image. For instance, the first layer might detect edges and curves in the image, the second layer might detect object body parts (e.g., hands, paws, or ears), the third layer might detect the whole object, etc. In this chapter, we provide an introduction to deep neural networks, their computational mechanism, and their historical background. Two generic categories of deep neural networks, namely feed-forward and feed-back networks, are explained in detail along with their corresponding learning algorithms.
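The layer-by-layer computation described above can be sketched as a forward pass through a tiny feed-forward network. This is an illustrative sketch only: the 2-3-1 layout, the hand-picked weights, and the helper names `dense`, `relu`, and `forward` are assumptions, and nothing here is learned from data.

```python
def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the weights of output unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def forward(inputs, layers):
    """Apply each (weights, biases) pair in turn, with ReLU between layers."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # no nonlinearity on the output layer
            activations = relu(activations)
    return activations

# Hypothetical hand-picked weights for a 2-3-1 network (not learned).
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]),  # hidden layer
    ([[1.0, 2.0, 1.0]], [0.5]),                                  # output layer
]
output = forward([2.0, 1.0], layers)
```

Each hidden layer re-represents the output of the layer before it, which is the mechanism behind the edge, part, and object hierarchy described in the text.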
CHAPTER 4
CNNs are a prime example of deep learning methods and have been studied extensively. Due to the lack of training data and computing power in the early days, it was hard to train a large, high-capacity CNN without overfitting. With the rapid growth in the amount of annotated data and the recent improvements in the power of Graphics Processing Units (GPUs), research on CNNs has advanced rapidly and achieved state-of-the-art results on various computer vision tasks. In this chapter, we provide a broad survey of the recent advances in CNNs, including state-of-the-art layers (e.g., convolution, pooling, nonlinearity, fully connected, transposed convolution, ROI pooling, spatial pyramid pooling, VLAD, spatial transformer layers), weight initialization approaches (e.g., Gaussian, uniform and orthogonal random initialization, unsupervised pre-training, Xavier, Rectified Linear Unit (ReLU) aware scaled initialization, and supervised pre-training), regularization approaches (e.g., data augmentation, dropout, drop-connect, batch normalization, ensemble averaging, ℓ1 and ℓ2 regularization, elastic net, max-norm constraint, early stopping), and several loss functions (e.g., soft-max, SVM hinge, squared hinge, Euclidean, contrastive, and expectation loss).
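As a small illustration of three of the layer types listed above (convolution, nonlinearity, and pooling), the sketch below chains them on a toy image. It assumes a single channel, stride 1, no padding, and non-overlapping 2×2 max pooling; the function names and the edge-detecting kernel are chosen for illustration and are not taken from the book.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation, the operation CNN layers usually call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

def relu(feature_map):
    """Element-wise rectified linear unit on a 2D feature map."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; an odd trailing row or column is dropped."""
    return [[max(feature_map[y][x], feature_map[y][x + 1],
                 feature_map[y + 1][x], feature_map[y + 1][x + 1])
             for x in range(0, len(feature_map[0]) - 1, 2)]
            for y in range(0, len(feature_map) - 1, 2)]

# Toy 5x5 image with a dark-to-bright vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to a left-to-right increase in intensity.
kernel = [[-1, 1],
          [-1, 1]]

conv = conv2d_valid(image, kernel)   # 4x4 feature map
pooled = max_pool_2x2(relu(conv))    # 2x2 feature map
```

In a trained CNN the kernel values are learned rather than fixed by hand, but the data flow through the three layers is the same.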
CHAPTER 5
The CNN training process involves optimizing the network parameters such that the loss function is minimized. This chapter reviews well-known gradient-based training algorithms (e.g., batch gradient descent, stochastic gradient descent, mini-batch gradient descent), followed by state-of-the-art optimizers (e.g., Momentum, Nesterov momentum, AdaGrad, AdaDelta, RMSprop, Adam) which address the limitations of the basic gradient descent algorithms. To make this book a self-contained guide, this chapter also discusses the different approaches used to compute the derivatives of the most popular CNN layers, which are needed to train CNNs with the error back-propagation algorithm.
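The relationship between these gradient descent variants can be sketched on a one-dimensional linear regression problem. The example below is a minimal illustration under stated assumptions: the noiseless data, the learning rate, the momentum coefficient, and the function name `minibatch_sgd_momentum` are all hypothetical choices, and the classical momentum update is used rather than any of the adaptive optimizers.

```python
import random

def minibatch_sgd_momentum(data, lr=0.1, momentum=0.9, batch_size=5,
                           epochs=500, seed=0):
    """Fit y = w * x + b by minimizing the mean squared error."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    w, b = 0.0, 0.0             # parameters to learn
    vw, vb = 0.0, 0.0           # velocity (running update) for each parameter
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Momentum: blend the previous update with the new gradient step.
            vw = momentum * vw - lr * grad_w
            vb = momentum * vb - lr * grad_b
            w, b = w + vw, b + vb
    return w, b

# Hypothetical noiseless data drawn from y = 3x + 1.
data = [(i / 10, 3 * (i / 10) + 1) for i in range(20)]
w, b = minibatch_sgd_momentum(data)
```

In this sketch, setting `batch_size=len(data)` gives batch gradient descent, `batch_size=1` gives stochastic gradient descent, and `momentum=0.0` removes the momentum term.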
CHAPTER 6
This chapter introduces the most popular CNN architectures which are formed using the basic building blocks studied in Chapter 4 and Chapter 7. Both the early CNN architectures, which are easier to understand (e.g., LeNet, NiN, AlexNet, VGGNet), and the more recent ones (e.g., GoogLeNet, ResNet, ResNeXt, FractalNet, DenseNet), which are relatively complex, are presented in detail.
CHAPTER 7
This chapter reviews various applications of CNNs in computer vision, including image classification, object detection, semantic segmentation, scene labeling, and image generation. For each application, the popular CNN-based models are explained in detail.
CHAPTER 8
Deep learning methods have resulted in significant performance improvements in computer vision applications and, thus, several software frameworks have been developed to facilitate their implementation. This chapter presents a comparative study of nine widely used deep learning frameworks, namely Caffe, TensorFlow, MatConvNet, Torch7, Theano, Keras, Lasagne, Marvin, and Chainer, across different aspects. It helps readers understand the main features of these frameworks (e.g., the interface and platforms provided by each framework) so that they can choose the one which best suits their needs.
CHAPTER 2
Features and Classifiers
Feature extraction and classification are two key stages of a typical computer vision system. In this chapter, we provide an introduction to these two steps: their importance and their design challenges for computer vision tasks.
Feature extraction methods can be divided into two different categories, namely hand-engineering-based methods and feature learning-based methods. Before going into the details of the feature learning algorithms in the subsequent chapters (i.e., Chapter 3, Chapter 4,