2.5 Discussion
2.5.1 Study Contributions
This chapter has presented a novel architecture for image verification that combines the Siamese network structure with capsule networks. We have improved the energy function used in the Siamese network to exploit the rich details output by the capsules, and obtained performance on par with Siamese networks based on convolutional units [7], while using a significantly smaller number of parameters.
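The idea behind the capsule-aware energy function can be illustrated with a small sketch. Here the twin networks are assumed to output a set of capsule vectors per image; per-capsule distances are combined into a scalar energy and mapped to a similarity score. The function name, weighting scheme, and sigmoid mapping are illustrative assumptions, not the chapter's exact formulation:

```python
import numpy as np

def capsule_energy(caps_a, caps_b, weights=None):
    """Energy between two sets of capsule output vectors of shape (n_caps, dim).

    Unlike a flat feature distance, the comparison is made per capsule,
    so the pose information encoded in each vector contributes separately.
    (Illustrative sketch; the exact energy used in the chapter may differ.)
    """
    # Per-capsule L2 distance between the twin networks' outputs
    d = np.linalg.norm(caps_a - caps_b, axis=-1)   # shape (n_caps,)
    if weights is None:
        weights = np.ones_like(d) / d.size         # uniform weighting
    # A weighted sum collapses per-capsule distances into a scalar energy;
    # a sigmoid maps it to a (0, 1) similarity score (identical inputs -> 0.5+).
    energy = float(weights @ d)
    similarity = 1.0 / (1.0 + np.exp(energy))
    return energy, similarity
```

Identical capsule sets yield zero energy; the larger the per-capsule disagreement, the lower the similarity score.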
Another major objective of this study is replicating the human ability to understand completely new visual concepts using previously learnt knowledge. Capsule-based Siamese networks can learn a well-generalized function that extends effectively to previously unseen data. We have evaluated this capability using n-way classification under one-shot learning. The results have shown more than 80.5% classification accuracy across 20 different characters with which the model had no previous experience.
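The n-way one-shot evaluation follows a simple protocol, sketched below. Here `similarity_fn` is a hypothetical stand-in for the trained Siamese network's verification score; each query is assigned to the class of its most similar support image:

```python
import numpy as np

def one_shot_accuracy(similarity_fn, support, queries, labels):
    """N-way one-shot evaluation.

    support: list of N reference images, one per previously unseen class.
    queries: list of test images; labels[i] is the true class index.
    Each query is assigned to the support image with the highest similarity.
    """
    correct = 0
    for q, y in zip(queries, labels):
        scores = [similarity_fn(q, s) for s in support]
        if int(np.argmax(scores)) == y:
            correct += 1
    return correct / len(queries)
```

Because only one reference per class is needed, the same trained network can be evaluated on alphabets it has never seen.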
Moreover, the model was evaluated on the MNIST dataset, which is considered a de facto benchmark for image classification models [30]. The proposed capsule layers-based Siamese network has shown 51% classification accuracy using only one image per digit. The latest deep learning models achieve more than 90% accuracy [39], but only by using all 60K images available in the MNIST dataset. The solution proposed by this study improves the one-shot learning accuracy through an n-shot learning method, i.e., using n samples from each image class to perform the classification. This improved accuracy by 23.5% when using 20 samples. As depicted in Figure 2.5, even 28-way learning showed a classification accuracy of 90% on the Omniglot dataset, while the MNIST dataset achieved 74.5% accuracy, as shown in Table 2.4.
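The n-shot extension replaces the single reference per class with n references and aggregates the per-reference scores. One simple aggregation, assumed here purely for illustration, is the mean similarity per class (`similarity_fn` again stands in for the network's verification score):

```python
import numpy as np

def n_shot_accuracy(similarity_fn, support_sets, queries, labels):
    """N-way, n-shot evaluation: each class supplies several reference
    images, and a query's class score is the mean similarity to that
    class's references. More references smooth out unusual handwriting."""
    correct = 0
    for q, y in zip(queries, labels):
        scores = [np.mean([similarity_fn(q, s) for s in refs])
                  for refs in support_sets]
        if int(np.argmax(scores)) == y:
            correct += 1
    return correct / len(queries)
```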
Further, we have extended the Omniglot dataset by adding a new set of characters for the Sinhala language. This extension contains 600 new handwritten samples covering the 60 characters of the alphabet. The proposed model has given 49% accuracy for Sinhala without any training stage, and it has shown a classification accuracy of 56.22% after training with only one reference image per character, as shown in Table 2.3.
Comparing with related studies, Koch et al. [7] used a convolutional layer-based Siamese network to solve the one-shot problem on the Omniglot dataset [6]. They have shown an accuracy of 94% for class-independent classification. This is similar to the performance of the proposed capsule layers-based Siamese network model; in contrast, the capsule layers achieve this accuracy with 40% fewer parameters. In an experiment on the MNIST dataset using one-shot learning, Koch et al. achieved 70% accuracy [7] and Vinyals et al. [27] have shown 72% accuracy, while the proposed capsule layers-based Siamese network model has given 76% accuracy. The approach in Vinyals et al. [27] is based on memory-augmented neural networks (MANN) and has a structure similar to recurrent neural networks with external memory.
2.5.2 Challenges and Future Research Directions
Although the proposed solution has surpassed 50% accuracy, the general threshold for the tested languages, for most of the alphabet types in the Omniglot dataset, it has done so using only a small set of images. The remaining accuracy gap could be closed with handcrafted features, but designing such features is time-consuming.
In the proposed capsule layers-based Siamese network model, the accuracy of within-language classification depends on two factors: the number of characters in the alphabet and the visual difference between characters. Some alphabets have visually similar characters; in such cases, the classification accuracy becomes low even though the number of characters in the alphabet is small. Thus, the system architecture can be improved by representing the image features using transfer learning. Here, features can be extracted from each character image using a pre-trained deep neural network, and those features can then be passed to the Siamese network.
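The suggested transfer-learning variant can be sketched as a thin wrapper: a frozen, pre-trained extractor embeds each character image, and the Siamese comparison operates on the embeddings rather than raw pixels. Both `pretrained_extractor` and `similarity_fn` below are placeholders for whichever concrete models would actually be used:

```python
import numpy as np

def build_feature_pipeline(pretrained_extractor, similarity_fn):
    """Wrap a pre-trained feature extractor in front of the Siamese
    comparison: characters are embedded first, and the network compares
    embeddings instead of raw pixels. `pretrained_extractor` is a
    stand-in for e.g. a frozen CNN with its classifier head removed."""
    def compare(img_a, img_b):
        f_a = pretrained_extractor(img_a)
        f_b = pretrained_extractor(img_b)
        return similarity_fn(f_a, f_b)
    return compare
```

The pipeline keeps the Siamese part small, since it only has to learn a metric over already-discriminative features.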
This study can be extended by integrating the model into a complete OCR pipeline incorporating character segmentation and reconstruction algorithms. It is also possible to analyse the applicability of the proposed model to complex datasets such as ImageNet [40] and COCO [41] by deepening the Siamese network. Additionally, the knowledge learnt from printed character classification can be used to classify handwritten characters. Further, the model's classification accuracy can be improved by training the network on printed characters at the initial stages and then on handwritten characters. This will allow the network to understand the defining attributes of each character, and such a printed-character dataset can be generated easily.
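The printed-then-handwritten idea amounts to a two-stage training schedule. A minimal sketch, where `model_fit` is a hypothetical stand-in for the Siamese network's actual training routine:

```python
def staged_training(model_fit, printed_pairs, handwritten_pairs,
                    printed_epochs=20, handwritten_epochs=10):
    """Two-stage curriculum: first learn general character structure from
    easily generated printed pairs, then fine-tune on the scarcer
    handwritten pairs. `model_fit(pairs, epochs)` is a hypothetical
    stand-in for the network's training call; its return values are
    collected as a simple history. Epoch counts are arbitrary examples."""
    history = []
    # Stage 1: pre-train on printed characters (cheap to generate).
    history.append(model_fit(printed_pairs, printed_epochs))
    # Stage 2: fine-tune on handwritten characters (scarce data).
    history.append(model_fit(handwritten_pairs, handwritten_epochs))
    return history
```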
2.5.3 Conclusion
Character recognition is a critical module in applications such as document scanning and optical character recognition. With the emergence of deep learning techniques, languages like English have achieved high classification accuracies. However, the applicability of those deep learning methods is constrained in low-resource languages because of the lack of well-developed datasets. This study has focused on implementing a viable method for classifying handwritten characters in low-resource languages. Due to the restricted size of the available datasets, this problem is modelled as a one-shot learning problem and solved using Siamese networks based on capsule networks. The Siamese network is the de facto network type used in one-shot learning, but for image-related tasks it still requires a large amount of training data. However, the use of a capsule layers-based Siamese network, which mitigates the information losses of convolutional neural networks, allowed a Siamese network to be trained with a small number of parameters and a small dataset while achieving performance on par with a convolutional network. The model was tested on the Omniglot dataset and achieved 30–85% accuracy for different alphabets. Further, the model has shown a classification accuracy of 74.5% on the MNIST dataset.
References
1. Vorugunti, C.S., Gorthi, R.K.S., Pulabaigari, V., Online Signature Verification by Few-Shot Separable Convolution Based Deep Learning. International Conference on Document Analysis and Recognition (ICDAR), IEEE, pp. 1125–1130, 2019.
2. Wu, Y., Liu, H., Fu, Y., Low-shot face recognition with hybrid classifiers, in: IEEE International Conference on Computer Vision Workshops, pp. 1933–1939, 2017.
3. Gui, L.-Y., Wang, Y.-X., Ramanan, D., Moura, J.M., Few-shot human motion prediction via meta-learning, in: European Conference on Computer Vision (ECCV), pp. 432–450, 2018.
4. Fe-Fei, L., A Bayesian approach to unsupervised one-shot learning of object categories, in: 9th IEEE International Conference on Computer Vision, IEEE, pp. 1134–1141, 2003.
5. Arica, N. and Yarman-Vural, F.T., Optical character recognition for cursive handwriting. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 801–813, 2002.
6. Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J., One shot learning of simple visual concepts, in: Annual Meeting of the Cognitive Science Society, 2011.
7. Koch, G., Zemel, R., Salakhutdinov, R., Siamese neural networks for one-shot image recognition, in: 32nd International Conference on Machine Learning, Lille, France, pp. 1–8, 2015.
8. Chopra, S., Hadsell, R., Lecun, Y., Learning a similarity metric discriminatively, with application to face verification, in: