Keywords: Garment detection, pose estimation, object detection, customer analytics, deep learning, computer vision
3.1 Introduction
Garment recognition is a complex image processing task that benefits a wide array of applications, such as customer behavior analysis, sales forecasting, market segmentation [1], and computer-aided fashion design [2–4]. The task has attracted considerable research interest in the image processing community [5–8]. Identifying the garments that a store’s customers are interested in can further aid the development of the aforementioned use cases.
For a retailer, understanding the needs of the customer base and anticipating its future needs helps in stocking appealing products, which can prove vital to a business strategy in today’s increasingly competitive market. These prospects motivated us to propose an approach that detects garments in surveillance videos and also indicates the extent to which a person is interested in a particular garment. We believe that analyzing customer behavior with machine learning shows promise for a wide range of applications, especially those where stakeholders can benefit directly from the analysis. Although garment detection has attracted research attention in the field of image processing, developing a system that recognizes the garments appealing to customers from surveillance videos remains a formidable challenge for several reasons, as explained below.
Firstly, the task involves numerous sub-tasks, including customer detection, tracking, and clothing segmentation. Secondly, identifying complex garments in indoor surveillance footage is difficult because such garments usually comprise different textures, fabric types, and a multitude of colors, in addition to being deformable. Finally, because CCTV cameras are installed in stores at angles suited to monitoring human behavior, customers themselves frequently act as occluders, blocking regions of the garments to be detected and further complicating the task.
In this chapter, we propose a novel framework for identifying the garments that appeal to customers; we refer to these as garments of interest. After identifying garments in a video frame using background subtraction, we employ the Mask R-CNN object detection model [9] to detect customers in the store. We use the OpenPose pose estimation framework [10] to obtain feature points on the human body, enabling us to correlate a customer’s wrist coordinates with the garment that the customer most recently interacted with. This correlation allows us to derive a confidence score between a customer and the garment under consideration.
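The wrist-to-garment correlation described above can be sketched as a simple distance-based scoring rule. The following is a minimal illustration, not the chapter’s actual implementation: the function name, the use of the garment box center, and the exponential decay constant are all assumptions made for the sketch.

```python
import numpy as np

def interaction_confidence(wrist_xy, garment_box, scale=100.0):
    """Map the distance between a wrist keypoint and a garment bounding
    box to a confidence score in (0, 1].

    wrist_xy    -- (x, y) wrist coordinate from the pose estimator
    garment_box -- (x1, y1, x2, y2) garment bounding box
    scale       -- decay constant in pixels (hypothetical choice)
    """
    # Use the box center as the garment's reference point (an assumption).
    cx = (garment_box[0] + garment_box[2]) / 2.0
    cy = (garment_box[1] + garment_box[3]) / 2.0
    dist = np.hypot(wrist_xy[0] - cx, wrist_xy[1] - cy)
    # Closer wrists yield scores nearer to 1; distant wrists decay toward 0.
    return float(np.exp(-dist / scale))
```

A wrist at the box center yields a score of 1.0, and the score decays smoothly with distance, so per-frame scores can be accumulated over time to rank garments of interest.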
The rest of the chapter is organized as follows: Section 3.2 reviews related work, Section 3.3 elucidates the proposed approach, Section 3.4 presents the results obtained by the proposed approach, Section 3.5 summarizes the major findings of this study, and Section 3.6 delineates the conclusion and the scope for future work in this field.
3.2 Related Work
Bu et al. [11] proposed a Multi-Depth Dilated Network (MDDNet) for identifying landmarks on fashion items. Since garments and fashion items are often occluded in the detection environment, the authors identify fashion landmarks by introducing a Multi-Depth Dilated (MDD) block. Each MDD block is composed of a different number of dilated convolutions applied in parallel, and these blocks form the backbone of MDDNet. A Batch-level Online Hard Keypoint Mining (B-OHKM) method is also proposed to mine hard-to-identify fashion landmarks during training, enabling the network to be trained in a manner that improves performance on such landmarks. Although this approach achieves state-of-the-art performance on fashion dataset benchmarks, it is only effective in identifying generic clothing items such as shirts, pants, and skirts, and cannot guarantee good results on complex garments with varied textures and color overlays.
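The core idea of an MDD block, running several convolutions with different dilation rates in parallel so that one block sees multiple receptive-field sizes, can be illustrated in one dimension. This is a schematic numpy sketch under my own simplifying assumptions (1-D signals, summation as the merge step), not the authors’ architecture:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid-mode 1-D dilated convolution (cross-correlation, no kernel
    flip, matching the convention used by CNN frameworks)."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field grows with dilation
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

def mdd_block(x, kernels, dilations):
    """Run several dilated convolutions in parallel and sum their
    (length-aligned) outputs -- a loose analogue of an MDD block."""
    outs = [dilated_conv1d(x, k, d) for k, d in zip(kernels, dilations)]
    n = min(len(o) for o in outs)
    return sum(o[:n] for o in outs)
```

Because each parallel branch uses a different dilation rate, the merged output mixes local and wider context, which is what helps the network localize occluded landmarks.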
Yu et al. [12] proposed a model that identifies fashion landmarks by enforcing structural layout relationships among landmarks through multiple stacked Layout Graph Reasoning (LGR) layers. The authors define a layout graph, a hierarchical structure with a root node, body-part nodes (e.g., upper body, lower body), coarse clothes-part nodes (e.g., sleeves), and leaf nodes. Each LGR layer maps features onto these structural graph nodes, performs reasoning over them using an LGR module, and then maps the graph nodes back to the features to enhance their representation. The reasoning module uses a graph clustering operation to obtain the representations of the intermediate nodes and performs a graph deconvolution operation over the entire graph. After stacking multiple such LGR layers in a convolutional network, a 1×1 convolution with a sigmoid activation function produces the final fashion landmark heatmaps. Although the approach performs well in detecting garment landmarks, the same performance cannot be expected in video surveillance scenarios, which often involve occluded garments that must be detected consistently on a per-frame basis.
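The final step above, a 1×1 convolution with a sigmoid producing one heatmap per landmark, is straightforward to sketch: a 1×1 convolution over a C×H×W feature map is simply a per-pixel linear map across channels. The numpy sketch below illustrates only that head, with shapes and names chosen for the example rather than taken from the paper:

```python
import numpy as np

def heatmap_head(features, weights, bias):
    """1x1 convolution followed by a sigmoid, producing per-landmark
    heatmaps from a feature map.

    features -- array of shape (C, H, W), the final feature map
    weights  -- array of shape (L, C), one row per landmark
    bias     -- array of shape (L,)
    """
    C, H, W = features.shape
    # A 1x1 conv over (C, H, W) is a matrix product over the channel axis.
    logits = weights @ features.reshape(C, H * W) + bias[:, None]
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid squashes into (0, 1)
    return probs.reshape(-1, H, W)         # (L, H, W) heatmaps
```

Each output plane is a per-pixel probability map for one landmark; the predicted landmark location is typically taken as the argmax of its heatmap.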
Ge et al. [13] proposed DeepFashion2, a benchmark for detection, pose estimation, segmentation, and re-identification of clothing images. In addition to creating an expansive dataset comprising 491,000 clothing images, the authors proposed a model called Match R-CNN, which builds on the Mask R-CNN object detection model of He et al. [9]. Match R-CNN is an end-to-end framework that jointly performs clothes detection, landmark estimation, instance segmentation, and consumer-to-shop retrieval. Several streams are used, and a siamese model is stacked on top of them to aggregate the learned features. Match R-CNN comprises three components: a Feature Network (FN), a Perception Network (PN), and a Matching Network (MN). The FN builds a pyramid of feature maps, and RoIAlign extracts features from different levels of the pyramid. The PN contains three network streams, for landmark estimation, clothes detection, and mask prediction, and the RoI features are fed into these streams. The MN contains a feature extractor and a similarity learning network for clothes retrieval, which is used for recognition. Although Match R-CNN is state-of-the-art at identifying garments, it is trained only on the fashion images available in the DeepFashion2 dataset, which, although it covers a wide array of clothing items, does not include garments such as Indian sarees.
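The retrieval step in the Matching Network reduces to scoring the similarity of two embedding vectors, one from the query image and one from each gallery item. As a stand-in for the learned similarity network, a simple cosine similarity conveys the idea; this is purely illustrative and not the MN’s actual scoring function:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two garment embedding vectors; a hypothetical
    stand-in for a learned matching score."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # 1.0 for identical directions, 0.0 for orthogonal embeddings.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In a retrieval setting, the gallery item with the highest score against the query embedding is returned as the match.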
Hara et al. [14] proposed a CNN-based algorithm for fashion item detection that incorporates human pose skeletons as contextual information. The authors account for the deformable, non-rigid behavior of garments while using human pose estimation models to obtain the coordinates of garments close to the detected pose keypoints. However, using R-CNN as the baseline object detector significantly increases training cost in both space and time and results in slow detection compared with more recent frameworks such as Mask R-CNN.
Kita et al. [15] proposed a deformable-model-driven method to identify hanging garments. The authors recognize the state of a garment