Last updated 2021.08.24: added a new link to the introduction.
This article was written a long time ago, so some links may be broken. If you are trying to implement pose estimation with PyTorch 1.9 or higher on JetPack 4.5 or 4.6, I recommend the following link instead.
I used a Jetson Nano with the official Ubuntu 18.04 image, logged in as root.
In my previous article, I explained pose estimation using TensorFlow and OpenPose.
In this article, I'm going to use PyTorch to do pose estimation.
First, I'll examine the torchvision package.
The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision. Naturally, it requires the PyTorch framework.
At the time of writing (2019.10), torchvision 0.4.x requires PyTorch 1.2 or higher.
The torchvision datasets include MNIST, CIFAR, COCO, and many more. You can find the full list of datasets here.
torchvision also supports many models, such as AlexNet, ResNet, Inception V3, GoogLeNet, MobileNet V2, and more.
For "Pose Estimation", the torchvision supports "Keypoint R-CNN ResNet-50 FPN" model.
You can find a detailed explanations here.
This table shows how much memory the models need. "Keypoint R-CNN ResNet-50 FPN" needs 6.8 GB, which far exceeds the Jetson Nano's 4 GB of memory. But let's try it out and see whether it succeeds.
Network | train time (s / it) | test time (s / it) | memory (GB)
---|---|---|---
Faster R-CNN ResNet-50 FPN | 0.2288 | 0.0590 | 5.2
Mask R-CNN ResNet-50 FPN | 0.2728 | 0.0903 | 5.4
Keypoint R-CNN ResNet-50 FPN | 0.3789 | 0.1242 | 6.8
Prerequisites
Before you build PyTorch and torchvision, you must install these packages first:
apt-get install libjpeg-dev zlib1g-dev
Installation (JetPack 4.3)
Be careful: These packages are upgraded from time to time, so check the site first and install the latest version. PyTorch versions below 1.3 have a problem with CUDA (PyTorch issue #8103), so I strongly recommend using version 1.3 or higher. If you are using JetPack 4.4, skip to the next section.
Before installing PyTorch 1.3, visit this site to check the latest PyTorch version. Before installing torchvision 0.4.2, visit this site to check the latest torchvision version.
cd /usr/local/src
# First install torch 1.3, numpy 1.16.5
wget https://nvidia.box.com/shared/static/phqe92v26cbhqjohwtvxorrwnmrnfx1o.whl -O torch-1.3.0-cp36-cp36m-linux_aarch64.whl
pip3 install numpy torch-1.3.0-cp36-cp36m-linux_aarch64.whl
# Next install torchvision 0.4.2
git clone -b v0.4.2 https://github.com/pytorch/vision torchvision
cd torchvision
python3 setup.py install
Be careful: If you encounter errors about numpy, remove numpy and reinstall version 1.16.5 (pip3 install numpy==1.16.5).
apt-get remove python3-numpy
pip3 install numpy==1.16.5
Let's check whether the installation is correct.
If you see a screen like this, the installation was successful.
root@spypiggy-desktop:/usr/local/src/study/torchvision_walkthrough# python3
Python 3.6.8 (default, Aug 20 2019, 17:12:48)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
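If you also want to confirm the versions and CUDA support in one step, a quick sanity check like this works (the version comments reflect my JetPack 4.3 install):

import torch
import torchvision

print(torch.__version__)          # 1.3.0 in my case
print(torchvision.__version__)    # 0.4.2 in my case
print(torch.cuda.is_available())  # should print True on the Jetson Nano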
Installation (JetPack 4.4)
In my other article, Jetson Nano - JetPack 4.4 (production release) and PyTorch 1.6.0 installation, I explained how to install PyTorch and torchvision.
Install the Sample Code
Now that we have finished installing PyTorch and torchvision, it's time to install the sample Python code. I'll use the code from https://github.com/kairess/torchvision_walkthrough.git.
cd /usr/local/src
git clone https://github.com/kairess/torchvision_walkthrough.git
cd /usr/local/src/torchvision_walkthrough
Now you can find several sample files to test; some of them are Jupyter notebook files. The author of this code used a MacBook, so the samples do not take the GPU (CUDA) into account. My GPU-aware versions are at https://github.com/raspberry-pi-maker/NVIDIA-Jetson/tree/master/tf-pose-estimation. Using CUDA in PyTorch is about 10 times faster!
Download models
We will use the keypointrcnn_resnet50_fpn model.
In PyTorch, this model can be accessed as models.detection.keypointrcnn_resnet50_fpn.
If you load it as in the example code below,
model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True).eval()
PyTorch automatically stores the model in the current user's cache directory. The storage path is as follows:
~/.cache/torch/hub/checkpoints/keypointrcnn_resnet50_fpn_coco-fc266e95.pth
If you want to use the model from a specific directory instead of the cache directory, you can download it there in advance and change the code as follows.
First, download the model to the specific directory.
wget http://download.pytorch.org/models/keypointrcnn_resnet50_fpn_coco-fc266e95.pth -O "filename"
Then load the model from the local system.
model = models.detection.keypointrcnn_resnet50_fpn(pretrained=False).eval()
model.load_state_dict(torch.load('/home/spypiggy/src/torchvision_walkthrough/models/keypointrcnn_resnet50_fpn_coco-fc266e95.pth'))
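As an alternative, you can relocate PyTorch's cache itself with the TORCH_HOME environment variable instead of calling load_state_dict. This is a minimal sketch; the directory path here is just an example, not something the original code uses:

import os
# Must be set before torchvision downloads anything. The path is hypothetical.
os.environ['TORCH_HOME'] = '/usr/local/src/torch_models'

from torchvision import models
# Weights are then cached under $TORCH_HOME/checkpoints
# (hub/checkpoints on newer torchvision versions).
model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True).eval()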
Keypoint detection: comparing performance with and without CUDA
import torch
import torchvision
from torchvision import models
import torchvision.transforms as T
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.path import Path
import matplotlib.patches as patches
import argparse
import sys, time

IMG_SIZE = 480
THRESHOLD = 0.95

parser = argparse.ArgumentParser(description="Keypoint detection. - Pytorch")
parser.add_argument("--cuda", action="store_true")
args = parser.parse_args()

if torch.cuda.is_available():
    print('pytorch:%s GPU support' % torch.__version__)
else:
    print('pytorch:%s GPU Not support ==> Error:Jetson should support cuda' % torch.__version__)
    sys.exit()

print('torchvision', torchvision.__version__)

model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True).eval()
if args.cuda:
    model = model.cuda()

#img = Image.open('imgs/07.jpg')
img = Image.open('imgs/apink1.jpg')
img = img.resize((IMG_SIZE, int(img.height * IMG_SIZE / img.width)))

plt.figure(figsize=(16, 16))
plt.imshow(img)

trf = T.Compose([
    T.ToTensor()
])

input_img = trf(img)
print(input_img.shape)
if args.cuda:
    input_img = input_img.cuda()

fps_time = time.perf_counter()
out = model([input_img])[0]
print(out.keys())

codes = [
    Path.MOVETO,
    Path.LINETO,
    Path.LINETO
]

fig, ax = plt.subplots(1, figsize=(16, 16))
ax.imshow(img)

for box, score, keypoints in zip(out['boxes'], out['scores'], out['keypoints']):
    if args.cuda:
        score = score.cpu().detach().numpy()
    else:
        score = score.detach().numpy()
    if score < THRESHOLD:
        continue

    if args.cuda:
        box = box.to(torch.int16).cpu().numpy()
        keypoints = keypoints.to(torch.int16).cpu().numpy()[:, :2]
    else:
        box = box.detach().numpy()
        keypoints = keypoints.detach().numpy()[:, :2]

    rect = patches.Rectangle((box[0], box[1]), box[2]-box[0], box[3]-box[1], linewidth=2, edgecolor='b', facecolor='none')
    ax.add_patch(rect)

    # 17 keypoints
    for k in keypoints:
        circle = patches.Circle((k[0], k[1]), radius=2, facecolor='r')
        ax.add_patch(circle)

    # draw path
    # left arm
    path = Path(keypoints[5:10:2], codes)
    line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
    ax.add_patch(line)
    # right arm
    path = Path(keypoints[6:11:2], codes)
    line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
    ax.add_patch(line)
    # left leg
    path = Path(keypoints[11:16:2], codes)
    line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
    ax.add_patch(line)
    # right leg
    path = Path(keypoints[12:17:2], codes)
    line = patches.PathPatch(path, linewidth=2, facecolor='none', edgecolor='r')
    ax.add_patch(line)

plt.savefig('result.jpg')

fps = 1.0 / (time.perf_counter() - fps_time)
if args.cuda:
    print('FPS(cuda support):%f' % (fps))
else:
    print('FPS(cuda not support):%f' % (fps))
<keypoints2.py>
Let's run the above code without the --cuda option.
root@spypiggy-desktop:/usr/local/src/study/torchvision_walkthrough# python3 keypoints2.py
pytorch:1.3.0 GPU support
torchvision 0.4.2
torch.Size([3, 720, 480])
dict_keys(['boxes', 'labels', 'scores', 'keypoints', 'keypoints_scores'])
FPS(cuda not support):0.002254
Now let's run the above code with the --cuda option.
root@spypiggy-desktop:/usr/local/src/study/torchvision_walkthrough# python3 keypoints2.py --cuda
pytorch:1.3.0 GPU support
torchvision 0.4.2
torch.Size([3, 720, 480])
dict_keys(['boxes', 'labels', 'scores', 'keypoints', 'keypoints_scores'])
FPS(cuda support):0.070780
As you can see from these results, using CUDA achieves a speed improvement of more than 10 times (0.070780 FPS versus 0.002254 FPS, roughly 30x for this single inference).
Under the hood
Now let's dig deeper.
GPU support and model loading
This code checks whether CUDA is supported, then loads the model. The first time, it may take a few seconds to download the model from the server. The following two snippets are equivalent.
if torch.cuda.is_available():
    device = torch.device('cuda')
    model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
    model = model.to(device)
    model.eval()
else:
    device = torch.device('cpu')
    model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True).eval()
model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True).eval()
if torch.cuda.is_available():
    model.cuda()
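Whichever form you use, it is worth wrapping the forward pass in torch.no_grad() at inference time so PyTorch doesn't store gradients; on the Nano's limited memory, every bit helps. A minimal sketch:

# Inference-only sketch: no_grad() skips gradient bookkeeping and saves memory.
with torch.no_grad():
    out = model([input_img])[0]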
Torchvision keypoint numbers and human parts
Torchvision's keypoint numbering is different from OpenPose's and TensorFlow's models. The values are as follows.
COCO_PERSON_KEYPOINT_NAMES = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle'
]
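The index of each name in this list is the keypoint's row index in the model output, which is exactly why the drawing code above slices keypoints[5:10:2] for the left arm. A quick check:

# Print the index -> name mapping.
for i, name in enumerate(COCO_PERSON_KEYPOINT_NAMES):
    print(i, name)

# The slice 5:10:2 picks indices 5, 7, 9:
print([COCO_PERSON_KEYPOINT_NAMES[i] for i in range(5, 10, 2)])
# ['left_shoulder', 'left_elbow', 'left_wrist']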
Result from model
Unlike TensorFlow, PyTorch is so intuitive that it makes code easier to understand. Only three lines of code are needed.
First, convert the image to a PyTorch tensor and move it to CUDA if using the GPU. Then pass the tensor to the model; the return value is a list of dictionaries. Since I passed a single image to the model, index 0 of the list is sufficient.
input_img = trf(img)  # Convert the image to a PyTorch tensor
input_img = input_img.to(device)
out = model([input_img])[0]
If you print the out variable's dictionary keys, you can see these key values.
print(out.keys())

dict_keys(['boxes', 'labels', 'scores', 'keypoints', 'keypoints_scores'])
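Each value is a tensor whose first dimension is the number of detected people, N. A small sketch to inspect the shapes (the shapes in the comment are what I expect from this model):

# Inspect the shape of each output tensor.
for key, value in out.items():
    print(key, tuple(value.shape))
# boxes (N, 4), labels (N,), scores (N,),
# keypoints (N, 17, 3), keypoints_scores (N, 17)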
You can find explanations of these values at https://pytorch.org/docs/stable/torchvision/models.html#object-detection-instance-segmentation-and-person-keypoint-detection.
But at this point (2019.10), the documentation is incomplete: it doesn't explain keypoints_scores. The remaining values are explained as follows.
Be careful: the keypoints visibility value does not seem to be correct. In principle, 1 means the point is visible and 0 means the point is invisible (a hidden body part).
- boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with values between 0 and H and 0 and W
- labels (Int64Tensor[N]): the class label for each ground-truth box
- keypoints (FloatTensor[N, K, 3]): the K keypoints location for each of the N instances, in the format [x, y, visibility], where visibility=0 means that the keypoint is not visible.
If you run an image through the network model (keypointrcnn_resnet50_fpn), you get a dictionary of outputs. In the following code, the for loop iterates over the detected humans and prints the keypoints and keypoints_scores values.
out = model([input_img])[0]
for box, score, keypoints, kscores in zip(out['boxes'], out['scores'], out['keypoints'], out['keypoints_scores']):
    score = score.cpu().detach().numpy()
    box = box.cpu().detach().numpy()
    points = keypoints.cpu().detach().numpy()
    kscores = kscores.cpu().detach().numpy()
    print(kscores)
    print(points)
keypoints_scores
Some images may show only the torso or parts of the body, like this one that I used in my article "Human Pose Estimation using OpenPose."
<imgs/COCO_294.jpg>
python3 keypoints_gpu.py --image=imgs/COCO_294.jpg
You will get console output like this.
[12.644188   13.438369   14.29085    11.607738   13.448702    4.9779096
  7.0458913   7.259503   10.004989    7.2911468   9.224331    0.04336043
  1.5927595  -0.7652377   1.4325492  -1.3891729  -1.781479  ]
[[110.76687   99.818794   1.       ]
 [121.55979   90.219604   1.       ]
 [ 99.57419   89.81964    1.       ]
 [134.35143   99.41882    1.       ]
 [ 81.18623   98.6189     1.       ]
 [100.77341  145.01497    1.       ]
 [ 83.58466  143.81506    1.       ]
 [169.12865  258.20538    1.       ]
 [174.72498  208.60957    1.       ]
 [246.27812  251.40593    1.       ]
 [215.89803  151.41441    1.       ]
 [100.37367  302.20163    1.       ]
 [ 79.98702  303.8015     1.       ]
 [224.29253  193.0109     1.       ]
 [147.54279  221.4085     1.       ]
 [249.47603  251.40593    1.       ]
 [247.07762  252.20589    1.       ]]
Let's compare the keypoints_scores with the keypoint names. The scores below the waist (hips, knees, ankles) are very low. If you look at the picture above, you can see why: those parts are not visible.
Keypoint Name | Keypoint Score
---|---
nose | 12.644188
left_eye | 13.438369
right_eye | 14.29085
left_ear | 11.607738
right_ear | 13.448702
left_shoulder | 4.9779096
right_shoulder | 7.0458913
left_elbow | 7.259503
right_elbow | 10.004989
left_wrist | 7.2911468
right_wrist | 9.224331
left_hip | 0.04336043
right_hip | 1.5927595
left_knee | -0.7652377
right_knee | 1.4325492
left_ankle | -1.3891729
right_ankle | -1.781479
I can't find an explanation of the range of these values. If you know of one, let me know.
You can set a threshold value near 3. If the keypoints_scores value is below this threshold, discard the keypoint at that index, as in the sketch below.
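Here is a minimal filtering sketch, reusing the points and kscores arrays from the loop above. The 3.0 threshold is my assumption; tune it for your own images:

KEYPOINT_THRESHOLD = 3.0  # assumed value, adjust for your data

for i, (point, kscore) in enumerate(zip(points, kscores)):
    if kscore < KEYPOINT_THRESHOLD:
        continue  # discard low-confidence (likely hidden) keypoints
    x, y = point[0], point[1]
    print('%s: (%.1f, %.1f)' % (COCO_PERSON_KEYPOINT_NAMES[i], x, y))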
Be careful: As you can see, the man's left shoulder (index 5) is hidden, so the keypoint visibility value of left_shoulder should be 0. But it isn't. Perhaps a future version of this model (keypointrcnn_resnet50_fpn) will fix this bug, but for now you must not rely on the visibility value.
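You can verify this yourself; in my output above, the third column is 1 for every keypoint, hidden or not:

# The visibility flag is the 3rd value of each keypoint row.
visibility = points[:, 2]
print(visibility)  # prints all 1.0 here, even for the hidden left shoulder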
I set the threshold value to 3.5 in my Python code and intentionally omitted the lines connecting low-score keypoints.
If you do not consider keypoint scores, the following picture is created instead. Here I set the threshold value to -3.5 in my Python code, so every keypoint passes.
FPS check
https://github.com/kairess/torchvision_walkthrough.git provides sample code (video_keypoints.py) for keypoint detection on video files. I modified video_keypoints.py to display the FPS and to speed up processing with CUDA. I set the video width to 480 pixels; if you change the output video size, the FPS may change.

import torch
import torchvision
from torchvision import models
import torchvision.transforms as T
import time
import cv2
import numpy as np
import gc
import sys

print('pytorch', torch.__version__)
print('torchvision', torchvision.__version__)

IMG_SIZE = 480
THRESHOLD = 0.7
fps_time = 0

def process_frame(img):
    #out = None
    torch.cuda.empty_cache()
    gc.collect()
    fps_time = time.perf_counter()
    img = cv2.resize(img, (IMG_SIZE, int(img.shape[0] * IMG_SIZE / img.shape[1])))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    trf = T.Compose([
        T.ToTensor()
    ])
    input_tensor = trf(img)
    input_img = [input_tensor.to(device)]
    out = model(input_img)[0]
    print(len(out['boxes']))

    for box, score, keypoints in zip(out['boxes'], out['scores'], out['keypoints']):
        score_np = score.cpu().detach().numpy()
        print(score_np)
        if score_np < THRESHOLD:
            continue

        box_np = box.to(torch.int16).cpu().numpy()
        keypoints_np = keypoints.to(torch.int16).cpu().numpy()[:, :2]

        cv2.rectangle(img, pt1=(int(box_np[0]), int(box_np[1])), pt2=(int(box_np[2]), int(box_np[3])), thickness=2, color=(0, 0, 255))

        for k in keypoints_np:
            cv2.circle(img, center=tuple(k.astype(int)), radius=2, color=(255, 0, 0), thickness=-1)

        cv2.polylines(img, pts=[keypoints_np[5:10:2].astype(int)], isClosed=False, color=(255, 0, 0), thickness=2)
        cv2.polylines(img, pts=[keypoints_np[6:11:2].astype(int)], isClosed=False, color=(255, 0, 0), thickness=2)
        cv2.polylines(img, pts=[keypoints_np[11:16:2].astype(int)], isClosed=False, color=(255, 0, 0), thickness=2)
        cv2.polylines(img, pts=[keypoints_np[12:17:2].astype(int)], isClosed=False, color=(255, 0, 0), thickness=2)

    fps = 1.0 / (time.perf_counter() - fps_time)
    new_img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    cv2.putText(new_img, "FPS: %f" % (fps), (10, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    out_video.write(new_img)
    input_tensor.cpu()

if torch.cuda.is_available():
    device = torch.device('cuda')
    model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
    model = model.to(device)
    model.eval()
else:
    device = torch.device('cpu')
    model = models.detection.keypointrcnn_resnet50_fpn(pretrained=True).eval()

cap = cv2.VideoCapture('imgs/02.mp4')
ret, img = cap.read()
fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
out_video = cv2.VideoWriter('imgs/output.mp4', fourcc, cap.get(cv2.CAP_PROP_FPS), (IMG_SIZE, int(img.shape[0] * IMG_SIZE / img.shape[1])))

count = 1
while cap.isOpened():
    ret, img = cap.read()
    if ret == False:
        break
    process_frame(img)
    sys.stdout.flush()
    print('Frame count[%d]' % count)
    count += 1

out_video.release()
cap.release()
<video_gpu.py>
AMD Ryzen 7 2700X + RTX 2070 + Ubuntu 18.04
This captured image is part of the video made on the workstation (AMD Ryzen 2700X, 64 GB DDR4, NVIDIA RTX 2070 GPU, Ubuntu 18.04). As you can see, the FPS is around 8 to 10 frames.
Jetson Nano
This captured image is part of the video made on the Jetson Nano. As you can see, the FPS is around 0.3 frames. I think this FPS is too poor for real-time projects.
Wrapping Up
Torchvision's pose estimation performance is very poor on the Jetson Nano. I'll test the same torchvision pose estimation on the Jetson TX2 soon. If you want the most satisfactory human pose estimation performance on the Jetson Nano, see the following article: https://spyjetson.blogspot.com/2019/12/jetsonnano-human-pose-estimation-using.html. The NVIDIA team introduces human pose estimation using models optimized for TensorRT.