Wednesday, October 16, 2019, I've described how to implement PoseEstimation in PyTorch.

JetsonNano - Human Pose estimation using Pytorch, torchvision

It was explained based on JetPack 4.4 and PyTorch 1.6. But now some links are not complete and contents are outdated. Therefore, I will try to implement PoseEstimation again after installing PyTorch 1.9 from JetPack 4.6, the most recent version as of August 2021. In terms of content, there is no significant difference from the previous article.

Prerequisites

Before you build Pytorch, torchvision, you must pre install these packages.

apt-get install libjpeg-dev zlib1g-dev

Install PyTorch

PyTorch should not be downloaded from the PyTorch website, but must be downloaded from the link below and installed. The installation file below is built for NVidia Jetson series.

Before installing pytorch , visit this site to check the latest pytorch version.

Before installing torchvision , visit this site to check the latest torchvision version.

The latest version at this time is PyTorch 1.9. Download and install the file below.

Delete old versions of PyTorch

First check pre-installed PyTorch. If there is no PyTorch version already installed, proceed to the next step.

root@spypiggy-nano:/usr/local/src/detr# pip3 freeze|grep torch
torch==1.1.0
torchvision==0.3.0

root@spypiggy-nano:/usr/local/src/detr# pip3 uninstall  torchvision==0.3.0
root@spypiggy-nano:/usr/local/src/detr# pip3 uninstall  torch==1.1.0

Download and install Pytorch whl file

We always use Python 3.X. Therefore, download the whl file that can be used in Python 3.6. And install the necessary packages as follows.

apt-get install python3-pip libopenblas-base libopenmpi-dev 
pip3 install Cython
wget -O torch-1.9.0-cp36-cp36m-linux_aarch64.whl https://nvidia.box.com/shared/static/h1z9sw4bb1ybi0rm3tu8qdj8hs05ljbm.whl
pip3 install torch-1.9.0-cp36-cp36m-linux_aarch64.whl

Download and install Pytorch whl file

If you have successfully installed PyTorch 1.9.0, install Torchvision 0.10.0. The latest version of torchvision can be found at https://github.com/pytorch/vision/releases.

sudo apt-get install libjpeg-dev zlib1g-dev libfreetype6-dev
wget https://github.com/pytorch/vision/archive/v0.10.0.tar.gz
tar -xvzf v0.10.0.tar.gz
cd vision-0.10.0
#This takes very long time, have a coffee time
sudo python3 setup.py install

Let's check whether the installation is correct. If you see the screen like this, the installation is successful.

root@spypiggyNano:/usr/local/src# python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> torch.__version__
'1.9.0'
>>> torchvision.__version__
'0.10.0a0'

Be Careful : You must use python3, pip3 commands. PyTorch 1.5 and later, Python 2 is no longer supported.

Installation Sample Codes for Pose Estimation

Now we have finished installing Pytorch, torchvision. It's time to install sample python codes to proceed. I'll use the codes from https://github.com/kairess/torchvision_walkthrough.git .

cd /usr/local/src
git clone https://github.com/kairess/torchvision_walkthrough.git
cd /usr/local/src//torchvision_walkthrough

Now you can find several sample files to test. some files are jupyter notebook files. The author of these codes use a MacBook. So the sample codes do not take GPU(cuda) into account. I'm going to modify the sample codes to sue CUDA. Using CUDA in pytorch is about 10 times faster!

Keypoint detection comparison of performance with or without cuda

The example code below is slightly modified to use CUDA.

import torch
import torchvision
from torchvision import models
import torchvision.transforms as T

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.path import Path
import matplotlib.patches as patches
import argparse
import sys, time

IMG_SIZE = 480
THRESHOLD = 0.95


parser = argparse.ArgumentParser(description="Keypoint detection. - Pytorch")
parser.add_argument("--cuda", action="store_true")
args = parser.parse_args()

if True == torch.cuda.is_available():
    print('pytorch:%s GPU support'% torch.__version__)
else:
    print('pytorch:%s GPU Not support ==> Error:Jetson should support cuda'% torch.__version__)
    sys.exit()
print('torchvision', torchvision.__version__)

model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True).eval()
if(args.cuda):
    model = model.cuda()

#img = Image.open('imgs/07.jpg')
img = Image.open('imgs/apink1.jpg')
img = img.resize((IMG_SIZE, int(img.height * IMG_SIZE / img.width)))

plt.figure(figsize=(16, 16))
plt.imshow(img)


trf = T.Compose([
        T.ToTensor()
        ])

input_img = trf(img)
print(input_img.shape)
if(args.cuda):
    input_img = input_img.cuda()

#The first result is time consuming. After the second, check the processing time with the result.
model([input_img])
fps_time  = time.perf_counter()
out = model([input_img])[0]print(out.keys())


codes = [
    Path.MOVETO,
    Path.LINETO,
    Path.LINETO
]

fig, ax = plt.subplots(1, figsize=(16, 16))
ax.imshow(img)

for box, score, keypoints in zip(out['boxes'], out['scores'], out['keypoints']):
    if(args.cuda):
        score = score.cpu().detach().numpy()
    else:        
        score = score.detach().numpy()

    if score < THRESHOLD:
        continue

    if(args.cuda):
        box = box.to(torch.int16).cpu().numpy()
        keypoints = keypoints.to(torch.int16).cpu().numpy()[:, :2]
    else:
        box = box.detach().numpy()
        keypoints = keypoints.detach().numpy()[:, :2]

    rect = patches.Rectangle((box[0], box[1]), box[2]-box[0], box[3]-box[1], linewidth=2, edgecolor='b', facecolor='none')
    ax.add_patch(rect)

    # 17 keypoints
    for k in keypoints:
        circle = patches.Circle((k[0], k[1]), radius=2, facecolor='r')
        ax.add_patch(circle)
    
    # draw path
    # left arm
    path = Path(keypoints[5:10:2], codes)
    line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
    ax.add_patch(line)
    
    # right arm
    path = Path(keypoints[6:11:2], codes)
    line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
    ax.add_patch(line)
    
    # left leg
    path = Path(keypoints[11:16:2], codes)
    line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
    ax.add_patch(line)
    
    # right leg
    path = Path(keypoints[12:17:2], codes)
    line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
    ax.add_patch(line)

plt.savefig('result.jpg')
fps = 1.0 / (time.perf_counter() - fps_time)

if(args.cuda):
    print('FPS(cuda support):%f'%(fps))
else:    
    print('FPS(cuda not support):%f'%(fps))

<keypoints2.py>

Let's run above code without --cuda options and with --cuda options.

spypiggy@spypiggyNano:/usr/local/src/torchvision_walkthrough$ sudo python3 keypoints2.py
pytorch:1.9.0 GPU support
torchvision 0.10.0a0
torch.Size([3, 335, 480])
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /media/nvidia/NVME/pytorch/pytorch-v1.9.0/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
dict_keys(['boxes', 'labels', 'scores', 'keypoints', 'keypoints_scores'])
FPS(cuda not support):0.021109
spypiggy@spypiggyNano:/usr/local/src/torchvision_walkthrough$ sudo python3 keypoints2.py --cuda
pytorch:1.9.0 GPU support
torchvision 0.10.0a0
torch.Size([3, 335, 480])
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /media/nvidia/NVME/pytorch/pytorch-v1.9.0/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
dict_keys(['boxes', 'labels', 'scores', 'keypoints', 'keypoints_scores'])
FPS(cuda support):0.158586

Be Careful : Note that the model is processed twice in the source code. FPS is calculated as the processing time of the second model. The reason is that the first execution result after loading the model takes a lot of time.

The saved result.jpg file is as follows.

<result.jpg>

Even with CUDA, the speed is only 0.15 FPS. At this speed, it takes about 6.7 seconds to process one frame, making it unsuitable for real-time video stream processing. The main reason is that the resnet50 model used in this article is quite heavy instead of recording excellent accuracy.

Under the hood

Now let's dig deeper.

Torchvision keypoint number and human parts

Torchvision's keypoint numbering is different from OpenPose or Tensorflow's models.

The values are like this.

COCO_PERSON_KEYPOINT_NAMES = [
    'nose', 
    'left_eye',
    'right_eye',
    'left_ear',
    'right_ear',
    'left_shoulder',
    'right_shoulder',
    'left_elbow',
    'right_elbow',
    'left_wrist',
    'right_wrist',
    'left_hip',
    'right_hip',
    'left_knee',
    'right_knee',
    'left_ankle',
    'right_ankle'
]

Result from model

Unlike TensorFlow 1.X, Pytorch is so intuitive that it makes code easier to understand.

Only three lines of code are enough.

First convert a numpy image to tensor, move the variable to cuda if using GPU.

Then insert the tensor to model, the return value is the list of dictionary type. As I inserted one image to model, index 0 of the list is sufficient.

input_img = trf(img)    # Make image to Pytorch tensor
input_img = input_img.to(device)
out = model([input_img])[0]

If you print the out variable's dictionary keys, you can see these key values.

print(out.keys())

dict_keys(['boxes', 'labels', 'scores', 'keypoints', 'keypoints_scores'])

You can see these values' explanations at https://pytorch.org/vision/stable/models.html#object-detection-instance-segmentation-and-person-keypoint-detection

But the document is incomplete. They don't explain the keypoints_scores. The remaining values are explained as follows.

boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with values between 0 and H and 0 and W
labels (Int64Tensor[N]): the class label for each ground-truth box
keypoints (FloatTensor[N, K, 3]): the K keypoints location for each of the N instances, in the format [x, y, visibility], where visibility=0 means that the keypoint is not visible.

Be careful : keypoints visibility value seems to be not correct. 1 means the point is visible, 0 means the point is invisible(hidden human parts). However, this value is always 1. Therefore, this value has no meaning so far. It is a good way to determine the accuracy of the keypoint region with the keypoints_scores value.

If you input image inference to the network model(keypointrcnn_resnet50_fpn), you can get output dictionary value. In this code, 'for loop' iterates for human counts, and prints keypoints and keypoint_score values.

out = model([input_img])[0]
for box, score, keypoints, kscores in zip(out['boxes'], out['scores'], out['keypoints'], out['keypoints_scores'] ):
    score = score.cpu().detach().numpy()
    box = box.cpu().detach().numpy()
    points = keypoints.cpu().detach().numpy()
    kscores = kscores.cpu().detach().numpy()
    print(kscores)
    print(points)

Let's check the keypoints_score and keypoints of this picture. This is an inference image. The lower body is not visible in this picture.

<03.jpg>

Run this command.

spypiggy@spypiggyNano:/usr/local/src/torchvision_walkthrough$ sudo python3 keypoints2.py --image=./imgs/03.jpg --cuda
pytorch:1.9.0 GPU support
torchvision 0.10.0a0
torch.Size([3, 719, 480])
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /media/nvidia/NVME/pytorch/pytorch-v1.9.0/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
dict_keys(['boxes', 'labels', 'scores', 'keypoints', 'keypoints_scores'])
scores: 0.998314380645752
kscores: [  1.25098372   0.82812518   2.83464313  10.02956772  10.867136
   7.50433922   9.46248245   5.73677111   4.82222462  -1.08437061
  -2.38515258  -2.25649142  -2.69200468  -3.09163022  -2.31611848
  -1.32797825  -1.90648782]
keypoints: [[386 295]
 [193 245]
 [368 264]
 [224 298]
 [352 298]
 [127 452]
 [412 470]
 [112 706]
 [420 706]
 [174 409]
 [419 706]
 [205 706]
 [354 706]
 [217 706]
 [434 578]
 [113 706]
 [421 706]]
FPS(cuda support):0.148066

Let's compare the keypoints_score and it's name. The score at the lower wrist is very low. If you see the above picture, you might understand enough.

nose                 1.250984
left_eye             0.828125
right_eye            2.834643
left_ear             10.029568
right_ear            10.867136
left_shoulder        7.504339
right_shoulder       9.462482
left_elbow           5.736771
right_elbow          4.822225
left_wrist           -1.084371
right_wrist          -2.385153
left_hip             -2.256491
right_hip            -2.692005
left_knee            -3.091630
right_knee           -2.316118
left_ankle           -1.327978
right_ankle          -1.906488

The kscore values of the lower body key points are negative. And looking at the output image result.jpg, the line connecting the feet from the waist was drawn strangely because the -value kscore was not taken into account. The line from the elbow to the wrist was also drawn strangely.

<result.jpg>

Example reflecting kscores

The code below is an improvement not to draw a line when the score of the connection keypoint is - when connecting the arm and leg joints.

import torch
import torchvision
from torchvision import models
import torchvision.transforms as T

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.path import Path
import matplotlib.patches as patches
import argparse
import sys, time

IMG_SIZE = 480


parser = argparse.ArgumentParser(description="Keypoint detection. - Pytorch")
parser.add_argument('--image', type=str, default="./imgs/03.jpg", help='inference image')
parser.add_argument('--accuracy', type=float, default=0.9, help='accuracy. default=0.6')
parser.add_argument("--cuda", action="store_true")
args = parser.parse_args()

if True == torch.cuda.is_available():
    print('pytorch:%s GPU support'% torch.__version__)
else:
    print('pytorch:%s GPU Not support ==> Error:Jetson should support cuda'% torch.__version__)
    sys.exit()
print('torchvision', torchvision.__version__)

model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True).eval()
if(args.cuda):
    model = model.cuda()

img = Image.open(args.image)
#img = Image.open('imgs/apink1.jpg')
img = img.resize((IMG_SIZE, int(img.height * IMG_SIZE / img.width)))

plt.figure(figsize=(16, 16))
plt.imshow(img)


trf = T.Compose([
        T.ToTensor()
        ])

input_img = trf(img)
print(input_img.shape)
if(args.cuda):
    input_img = input_img.cuda()


#The first result is time consuming. After the second, check the processing time with the result.
#model([input_img])
fps_time  = time.perf_counter()
out = model([input_img])[0]
print(out.keys())
t_human = 0
r_human = 0



codes = [
    Path.MOVETO,
    #Path.LINETO,
    Path.LINETO
]

fig, ax = plt.subplots(1, figsize=(16, 16))
ax.imshow(img)
t_human = 0
r_human = 0
for box, score, keypoints, kscores  in zip(out['boxes'], out['scores'], out['keypoints'], out['keypoints_scores'] ):
    if(args.cuda):
        score = score.cpu().detach().numpy()
        kscores = kscores.cpu().detach().numpy()    
        box = box.to(torch.int16).cpu().numpy()
        keypoints = keypoints.to(torch.int16).cpu().numpy()[:, :2]
    else:        
        score = score.detach().numpy()
        box = box.detach().numpy()
        keypoints = keypoints.detach().numpy()[:, :2]
        kscores = kscores.detach().numpy()    

    t_human += 1
    if score < args.accuracy:
        continue
    r_human += 1

    rect = patches.Rectangle((box[0], box[1]), box[2]-box[0], box[3]-box[1], linewidth=2, edgecolor='b', facecolor='none')
    ax.add_patch(rect)

    # 17 keypoints
    #for k in keypoints:
    for x in range(len(keypoints)):
        k = keypoints[x]
        if kscores[x] > 0:
            if x == 5:
                circle = patches.Circle((k[0], k[1]), radius=4, facecolor='r')
            else:
                circle = patches.Circle((k[0], k[1]), radius=2, facecolor='r')
            ax.add_patch(circle)
    
    # draw path
    # left arm
    if kscores[5] > 0 and kscores[7] > 0:
        path = Path(keypoints[5:8:2], codes)
        line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
        ax.add_patch(line)
    if kscores[7] > 0 and kscores[9] > 0:
        path = Path(keypoints[7:10:2], codes)
        line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
        ax.add_patch(line)
    
    # right arm
    if kscores[6] > 0 and kscores[8] > 0:
        path = Path(keypoints[6:9:2], codes)
        line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
        ax.add_patch(line)
    if kscores[8] > 0 and kscores[10] > 0:
        path = Path(keypoints[8:11:2], codes)
        line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
        ax.add_patch(line)


    # left leg
    if kscores[11] > 0 and kscores[13] > 0:
        path = Path(keypoints[11:14:2], codes)
        line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
        ax.add_patch(line)
    if kscores[13] > 0  and kscores[15] > 0:
        path = Path(keypoints[13:16:2], codes)
        line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
        ax.add_patch(line)
    

    # right leg
    if kscores[12] > 0 and kscores[14] > 0:
        path = Path(keypoints[12:15:2], codes)
        line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
        ax.add_patch(line)
    if kscores[14] > 0 and kscores[16] > 0:
        path = Path(keypoints[14:17:2], codes)
        line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
        ax.add_patch(line)

plt.savefig('result.jpg')
fps = 1.0 / (time.perf_counter() - fps_time)
print('total human:%d  real human:%d'%(t_human, r_human))

if(args.cuda):
    print('FPS(cuda support):%f'%(fps))
else:    
    print('FPS(cuda not support):%f'%(fps))

The result of executing the above code is as follows.

Wrapping up

After about two years since October 2019, I looked at keypoint recognition in PyTorch again. PyTorch has been upgraded in the meantime, but there doesn't seem to be much change in the model and documentation for keypoint recognition.

This Resnet50-based model has high accuracy, but the processing speed is too low to be suitable for real-time video processing on the Jetson Nano. It is well worth considering for image file processing purposes. If you want to detect keypoints in real-time video files on Jetson Nano, please refer to the following links.

The source code of this text can be downloaded from my github.