Handbook of Intelligent Computing and Optimization for Sustainable Development

3.3.3.2 Pose Estimation


We determine the coordinates of a person’s wrists using OpenPose [10], a state-of-the-art pose estimation framework. OpenPose is the first real-time multi-person 2D pose estimation framework to jointly detect body, hand, face, and foot keypoints from a single image, identifying a total of 135 keypoints per detected person. It does so with a multi-stage Convolutional Neural Network (CNN) that uses a nonparametric representation, Part Affinity Fields (PAFs), to learn to associate body parts with the individuals in the image. The OpenPose pipeline has three main steps:

1. The first set of stages predicts the PAFs from the input feature maps.

2. The second set of stages uses the PAFs from the preceding stages to refine the predicted confidence maps.

3. The final PAFs and confidence maps are passed to a greedy parsing algorithm, which approximates the globally optimal association and outputs the keypoints of each person in the input image.
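The greedy step in item 3 can be illustrated with a minimal sketch. It assumes that an association score has already been computed for every pair of candidate parts (for example, a candidate elbow and a candidate wrist); the function name `greedy_match`, the `min_score` threshold, and the example scores are hypothetical and are not taken from OpenPose itself.

```python
import numpy as np

def greedy_match(scores, min_score=0.0):
    """Pair row candidates with column candidates, best score first.

    scores[i, j] is an assumed association score between candidate i of
    one body part and candidate j of another. Each candidate is used at
    most once, so the result approximates the optimal assignment.
    """
    scores = np.asarray(scores, dtype=float)
    # Flat indices of all candidate pairs, highest score first.
    order = np.argsort(scores, axis=None)[::-1]
    used_rows, used_cols, pairs = set(), set(), []
    for flat in order:
        i, j = (int(k) for k in np.unravel_index(flat, scores.shape))
        if scores[i, j] <= min_score:
            break  # all remaining pairs score too low
        if i in used_rows or j in used_cols:
            continue  # one of the two parts is already assigned
        pairs.append((i, j))
        used_rows.add(i)
        used_cols.add(j)
    return pairs

# Two candidates per part; the greedy choice takes (0, 0) first.
example_scores = [[0.90, 0.80],
                  [0.85, 0.10]]
pairs = greedy_match(example_scores)
print(pairs)
```

In this example the greedy result pairs candidate 0 with 0 and 1 with 1 (total score 1.0), although the assignment 0 with 1 and 1 with 0 would score higher (1.65). This is the sense in which a greedy algorithm only approximates the global solution, in exchange for much lower cost.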

The basic convolution block of the multi-stage CNN consists of three consecutive 3×3 convolutional kernels, which stand in for a single 7×7 kernel; this preserves the receptive field while reducing the number of computations. The outputs of the consecutive kernels are concatenated to form the output of the block. Before it reaches the first stage of the network, the input image (in RGB color space) is passed through the first 10 layers of the VGG-19 network to generate a set of feature maps. These feature maps are then passed through the multi-stage CNN pipeline to generate part confidence maps and PAFs. A confidence map is a 2D representation of the belief that a given body part is located at a given pixel of the input image. PAFs are a set of 2D vector fields that encode the location and orientation of limbs in the image.
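To make the two representations concrete, the following sketch builds an idealized confidence map and an idealized PAF for a single body part and a single limb with NumPy. It illustrates the definitions above and is not the OpenPose implementation: the Gaussian form, the function names, and the `sigma` and `limb_width` parameters are assumptions chosen for the example.

```python
import numpy as np

def confidence_map(height, width, center, sigma=2.0):
    """Gaussian belief map that peaks at the part location center = (x, y)."""
    ys, xs = np.mgrid[0:height, 0:width]
    dist_sq = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-dist_sq / sigma ** 2)

def part_affinity_field(height, width, p1, p2, limb_width=1.0):
    """2D vector field holding the unit vector p1 -> p2 on the limb, 0 elsewhere."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    length = np.linalg.norm(p2 - p1)
    v = (p2 - p1) / length  # unit direction of the limb
    ys, xs = np.mgrid[0:height, 0:width]
    dx, dy = xs - p1[0], ys - p1[1]
    along = dx * v[0] + dy * v[1]            # projection onto the limb axis
    across = np.abs(dx * v[1] - dy * v[0])   # distance from the limb axis
    on_limb = (along >= 0) & (along <= length) & (across <= limb_width)
    paf = np.zeros((2, height, width))
    paf[0][on_limb] = v[0]  # x component of the field
    paf[1][on_limb] = v[1]  # y component of the field
    return paf

# A wrist at pixel (2, 2) and a horizontal forearm from (1, 2) to (7, 2).
cmap = confidence_map(5, 5, center=(2, 2))
paf = part_affinity_field(5, 9, p1=(1, 2), p2=(7, 2))
```

Here `cmap` is highest at the wrist pixel and decays with distance, while `paf` stores the direction of the forearm at every pixel that lies on it. The orientation stored in the field is what allows detected parts to be linked to the correct person.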

We use the OpenPose “BODY_25” pose model to extract the spatial coordinates of a person’s left and right wrist landmarks, P_L(x, y) and P_R(x, y), which correspond to keypoints 7 and 4, respectively, as shown in Figure 3.3.
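A minimal sketch of this extraction step is given below. It assumes that the pose output is available as an array of shape (people, 25, 3), where each keypoint is stored as (x, y, confidence); the function name `wrist_coordinates`, the `min_confidence` threshold, and the sample values are hypothetical. The sketch does not call OpenPose and only indexes an array of that shape.

```python
import numpy as np

WRIST_KEYPOINTS = (4, 7)  # BODY_25 indices of the two wrist landmarks

def wrist_coordinates(keypoints, person=0, min_confidence=0.1):
    """Return {keypoint index: (x, y) or None} for the wrists of one person.

    keypoints is assumed to have shape (people, 25, 3) with (x, y, confidence)
    per keypoint. A wrist whose confidence is below min_confidence is
    reported as None instead of an unreliable position.
    """
    kp = np.asarray(keypoints, dtype=float)
    result = {}
    for idx in WRIST_KEYPOINTS:
        x, y, conf = kp[person, idx]
        result[idx] = (float(x), float(y)) if conf >= min_confidence else None
    return result

# Synthetic output for one person: keypoint 4 is detected, keypoint 7 is not.
sample = np.zeros((1, 25, 3))
sample[0, 4] = [100.0, 200.0, 0.9]
sample[0, 7] = [300.0, 210.0, 0.05]
wrists = wrist_coordinates(sample)
print(wrists)
```

Filtering on the confidence value is a design choice of this sketch: a keypoint that is occluded or lies outside the frame is typically reported with low confidence, so treating it as missing avoids passing a spurious position to later processing.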


Figure 3.3 Keypoints for pose output [10].
