3.3.1.2 Person Detection
We utilize the Mask R-CNN [9] object detection model to obtain a mask for each person in a given video frame, since the customers must be identified in a frame before proceeding with further steps. Mask R-CNN is a state-of-the-art deep learning framework for instance segmentation. It improves upon Faster R-CNN [19] by replacing the existing RoI Pooling with a new method named RoI Align, which yields 10% to 50% more accurate masks [9]. RoI Align overcomes the location misalignment introduced when RoI Pooling quantizes a region of interest to fit the blocks of the input feature map. Its key steps are explained below.
Figure 3.2 GMG background subtraction model [18].
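Before walking through those steps, the following is a minimal sketch of how per-person masks can be obtained from a pretrained Mask R-CNN. It uses torchvision's off-the-shelf model, which ships with a ResNet-50-FPN backbone rather than the ResNet-101 backbone described below; the score and mask thresholds are likewise illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Off-the-shelf Mask R-CNN from torchvision (ResNet-50-FPN backbone);
# an illustrative stand-in for the model described in this section.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_persons(frame, score_thresh=0.7, mask_thresh=0.5):
    """Return binary masks and boxes for every person in one RGB frame."""
    with torch.no_grad():
        out = model([to_tensor(frame)])[0]
    # COCO class index 1 is "person"; the thresholds are assumed values.
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)
    masks = (out["masks"][keep, 0] > mask_thresh).numpy()   # (N, H, W) booleans
    boxes = out["boxes"][keep].numpy()                      # (N, 4) x1, y1, x2, y2
    return {"masks": masks, "boxes": boxes}
```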
1. Image Preprocessing: The input image is pre-processed by centering, rescaling, and padding. Pixels are centered channel-wise: the mean of each of the three color channels, computed across all training and test examples, is subtracted from the corresponding pixel values of the input image so that the values are centered around 0. The image is then scaled to a side length between 800 px and 1,333 px and padded so that each side becomes a multiple of 32. All images are resized to 1,024 × 1,024 × 3 to allow for batch training. (A sketch of this step appears after the list.)
2. ResNet-101 Backbone (Bottom-Up Traversal): Table 3.1 shows the series of network layers through which the input image is passed. Multiple layers are grouped together into stages Conv1 to Conv5. Each convolution layer is followed by a batch normalization layer and a ReLU activation unit.
3. Feature Pyramid Network (Top-Down Traversal): The output of Conv5 (convolution stage 5) is used directly as feature map M5. Each successive feature map is generated by upsampling the preceding top-down feature map by a factor of 2 and combining it with the output of the corresponding bottom-up convolution stage via a lateral connection. The bottom-up outputs are passed through a 1 × 1 convolution layer so that their depth matches that of the top-down layer, allowing element-wise addition. Feature maps M4 and M3 are generated in this manner.
4. Each feature map (M3-M5) is passed through a 3 × 3 convolution to generate the pyramid feature maps (P3-P5). P5 is passed through a max-pooling layer to generate the additional pyramid feature map P6. (A sketch of steps 3-4 follows the list.)
5. RPN Objectness Sub-net: This sub-net consists of three 1 × 1 convolutional layers with a depth of 18, followed by a sigmoid activation function. It predicts, for each anchor position of every pyramid feature map fed to it, whether an object is present.
6. RPN Box Detection Sub-net: This sub-net performs regression on the bounding boxes and consists of three 1 × 1 convolutional layers with a depth of 36. (A sketch of steps 5-6 follows the list.)
7. Region Proposal Sub-net: The region proposal stage takes the anchors and the outputs of both the RPN objectness and box detection sub-nets to generate region proposals, from which it selects the best 2,000.
8. Box Head: FPN-RoI mapping is performed, followed by RoI Align, to generate a 7 × 7 feature matrix for each RoI. The input is reshaped and fed through two fully connected layers with 1,024 nodes each, yielding a vector of length 1,024 for every RoI.
9. Classifier Sub-network: The classifier is a fully connected layer that predicts the object class. It has as many nodes as there are classes and uses softmax activation.
10. Bounding Box Regressor: The regressor is a fully connected layer that outputs delta values for the bounding box coordinates. It has four times as many nodes as there are classes and uses linear activation. (A sketch of steps 8-10 follows the list.)
11. Mask Head: The mask head runs parallel to the box head. The RoIs from the RoI Align operation are fed through four convolutional layers, each with 256 filters of size 3 × 3. The resulting output is passed through a 2 × 2 × 256 transposed convolution layer and then through a 1 × 1 convolutional layer whose number of output channels equals the number of classes, producing one mask per class for each detection. The masks are rescaled to the input image size using bilinear interpolation and applied to the input image. (A sketch of this step follows.)
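The sketch below illustrates the preprocessing of step 1 with NumPy and OpenCV. The per-channel means are dataset-dependent; the COCO-style constants used here are assumptions, not the authors' exact values.

```python
import cv2
import numpy as np

# Assumed per-channel pixel means; the real values come from the dataset.
CHANNEL_MEAN = np.array([123.7, 116.8, 103.9], dtype=np.float32)

def preprocess(image, min_side=800, max_side=1333, pad_multiple=32):
    """Step 1: channel-wise centering, rescaling, and padding."""
    img = image.astype(np.float32) - CHANNEL_MEAN        # center values around 0
    h, w = img.shape[:2]
    scale = min_side / min(h, w)                         # shorter side -> 800 px ...
    if max(h, w) * scale > max_side:                     # ... unless the longer side
        scale = max_side / max(h, w)                     # would exceed 1,333 px
    img = cv2.resize(img, (int(w * scale), int(h * scale)))
    nh, nw = img.shape[:2]
    pad_h = -nh % pad_multiple                           # pad so each side becomes
    pad_w = -nw % pad_multiple                           # a multiple of 32
    return np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))
```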
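Steps 3-4 can be summarized by the following PyTorch sketch of the top-down pathway. The input channel sizes follow the Conv3-Conv5 outputs of Table 3.1; the 256-channel pyramid depth is an assumption carried over from the original FPN design rather than something stated in the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Steps 3-4: build pyramid maps P3-P6 from backbone stages C3-C5."""
    def __init__(self, c3_ch=512, c4_ch=1024, c5_ch=2048, out_ch=256):
        super().__init__()
        # 1x1 lateral convolutions match the bottom-up depths to the top-down depth.
        self.lat5 = nn.Conv2d(c5_ch, out_ch, 1)
        self.lat4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat3 = nn.Conv2d(c3_ch, out_ch, 1)
        # 3x3 convolutions turn the merged maps M5-M3 into pyramid features.
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in range(3)])

    def forward(self, c3, c4, c5):
        m5 = self.lat5(c5)                                           # M5 directly from C5
        m4 = self.lat4(c4) + F.interpolate(m5, scale_factor=2)      # upsample + lateral
        m3 = self.lat3(c3) + F.interpolate(m4, scale_factor=2)
        p5, p4, p3 = (s(m) for s, m in zip(self.smooth, (m5, m4, m3)))
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)               # extra level for the RPN
        return p3, p4, p5, p6
```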
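A per-level sketch of the RPN heads in steps 5-6 is shown next; depths 18 and 36 correspond to 9 anchors with 2 objectness scores and 4 box deltas each. For brevity, a single 1 × 1 convolution stands in for the three mentioned above.

```python
import torch
import torch.nn as nn

class RPNHeads(nn.Module):
    """Steps 5-6: objectness and box-delta sub-nets for one pyramid level."""
    def __init__(self, in_ch=256, num_anchors=9):
        super().__init__()
        self.objectness = nn.Conv2d(in_ch, num_anchors * 2, kernel_size=1)  # depth 18
        self.box_deltas = nn.Conv2d(in_ch, num_anchors * 4, kernel_size=1)  # depth 36

    def forward(self, feature):
        scores = torch.sigmoid(self.objectness(feature))  # object present or not
        deltas = self.box_deltas(feature)                 # bounding-box regression
        return scores, deltas
```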
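The box head and its two sub-networks (steps 8-10) reduce to a short module; the class count of 81 (80 COCO classes plus background) is an assumed value.

```python
import torch.nn as nn

class BoxHead(nn.Module):
    """Steps 8-10: two 1,024-node FC layers feeding a classifier and a
    per-class bounding-box regressor."""
    def __init__(self, in_ch=256, roi_size=7, num_classes=81):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),                                    # (N, 256 * 7 * 7)
            nn.Linear(in_ch * roi_size * roi_size, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(1024, num_classes)       # softmax applied at inference
        self.box_regressor = nn.Linear(1024, num_classes * 4)  # 4 deltas per class

    def forward(self, roi_features):                         # (N, 256, 7, 7) from RoI Align
        x = self.fc(roi_features)
        return self.classifier(x), self.box_regressor(x)
```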
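Finally, a sketch of the mask head of step 11. The 14 × 14 RoI Align output commonly used for the mask branch is an assumption, as the text only specifies the 7 × 7 output of the box head.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """Step 11: four 3x3x256 convs, a 2x2 transposed conv, and a 1x1 conv
    that emits one mask per class."""
    def __init__(self, in_ch=256, num_classes=81):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = 256
        self.convs = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.predict = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, roi_features):          # e.g. (N, 256, 14, 14) RoI features
        x = self.convs(roi_features)
        x = F.relu(self.upsample(x))          # 14 x 14 -> 28 x 28
        return self.predict(x)                # per-class mask logits, rescaled later
```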
Table 3.1 Layers involved in ResNet-101 architecture.
Layer type | Number of iterations | Kernel size (h × w × d) | Number of filters | Stride
Conv1 (C1) | 1 | 7 × 7 × 3 | 64 | 2
Max pool | 1 | 3 × 3 | 1 | 2
Conv2 (C2) | 3 | 1 × 1 × 64 | 64 | 1
 | | 3 × 3 × 64 | 64 | 1
 | | 1 × 1 × 64 | 256 | 1
Conv3 (C3) | 4 | 1 × 1 × 256 | 128 | 1
 | | 3 × 3 × 128 | 128 | 1
 | | 1 × 1 × 128 | 512 | 1
Conv4 (C4) | 23 | 1 × 1 × 512 | 256 | 1
 | | 3 × 3 × 256 | 256 | 1
 | | 1 × 1 × 256 | 1,024 | 1
Conv5 (C5) | 3 | 1 × 1 × 1,024 | 512 | 1
 | | 3 × 3 × 512 | 512 | 1
 | | 1 × 1 × 512 | 2,048 | 1
Mask R-CNN is used to segment and construct a pixel-wise mask for each customer in a given video frame. The output of this step is a dictionary of masks and bounding-box coordinates enclosing the detected customers. This person-detection data is also used later, in Stage 3 of the framework.
To obtain the foreground information, we take the aggregate of the foreground masks produced by the background subtraction model and remove from it the regions it shares with the Mask R-CNN person masks. This step ensures that the clothing worn by the customers is excluded from the foreground. A minimal sketch of this operation follows.
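Assuming both models yield binary masks of the same frame size, the mask arithmetic could look as follows (function and variable names are illustrative):

```python
import numpy as np

def clean_foreground(bgs_masks, person_masks):
    """Remove regions shared with the person masks from the aggregated
    background-subtraction foreground, so customers' clothing is excluded."""
    foreground = np.logical_or.reduce(bgs_masks)   # aggregate of BGS foreground masks
    persons = np.logical_or.reduce(person_masks)   # union of Mask R-CNN person masks
    return foreground & ~persons                   # foreground minus common regions
```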