Saturday, August 1, 2020

Jetson Xavier NX - Human Pose estimation using tensorflow (mpii)

In the previous post, I ran https://github.com/ildoonet/tf-pose-estimation, which implements pose estimation in TensorFlow using the mobilenet model, on the Xavier NX. Because the mobilenet model ildoonet uses is lightweight, it delivers satisfactory performance above 15 FPS, and with the TensorRT option the Xavier NX can even reach about 30 FPS, so speed is not a problem. However, mobilenet was designed to run on mobile devices such as smartphones and therefore focuses on speed, with the side effect of lower accuracy. In AI models, speed and accuracy are usually inversely related: you must either give up accuracy or speed to suit your application, or compromise at the right point. In this article, I introduce a TensorFlow model that uses ResNet for higher accuracy.
This article summarizes the contents of https://github.com/eldar/pose-tensorflow.

Prerequisites

Before you build "eldar/pose-tensorflow", you must install these packages first.

spypiggy@XavierNX:~$ sudo apt-get install python3-tk
spypiggy@XavierNX:~$ source /home/spypiggy/python/bin/activate
(python) spypiggy@XavierNX:~$ pip3 install easydict munkres
(python) spypiggy@XavierNX:~$ pip3 install scikit-image pillow pyyaml matplotlib cython
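
Before running the demos, it may also be worth checking that TensorFlow can actually see the Xavier NX GPU. A quick check like the following (TensorFlow 1.x style, matching the session-based code used in this post) is enough; this is just a sanity check, not part of the eldar repository.

import tensorflow as tf

# Print the installed TensorFlow version and whether a GPU device is visible
print(tf.__version__)
print('GPU available:', tf.test.is_gpu_available())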

Download and build code from eldar

Now clone eldar's github.

(python) spypiggy@XavierNX:~$ cd src
(python) spypiggy@XavierNX:~/src$ git clone https://github.com/eldar/pose-tensorflow.git

#for multiperson models (this compiles the nms_grid module)
(python) spypiggy@XavierNX:~/src$ cd pose-tensorflow/
(python) spypiggy@XavierNX:~/src/pose-tensorflow$ ./compile.sh
#Download models
(python) spypiggy@XavierNX:~/src/pose-tensorflow$ cd models/mpii/
(python) spypiggy@XavierNX:~/src/pose-tensorflow/models/mpii$ ./download_models.sh
(python) spypiggy@XavierNX:~/src/pose-tensorflow/models/mpii$ cd ../coco/
(python) spypiggy@XavierNX:~/src/pose-tensorflow/models/coco$ ./download_models.sh


Before proceeding with the test, some modifications to the source code are required.
The eldar source code uses scipy's imread/imsave functions (scipy.misc), but these functions are no longer supported as of scipy 1.3.0rc1. Therefore, they should be replaced with their PIL equivalents.
I uploaded the changed Python files to my GitHub, so you can simply overwrite the originals.
However, the _npcircle function in util/visualize.py is not a complete conversion; for simplicity, features such as transparency adjustment are omitted (a possible way to restore the transparency effect is sketched after the modified code below).

## Image read: conversion to PIL
#image = imread(file_name, mode='RGB')          # old scipy.misc call
image = Image.open(file_name).convert('RGB')    # PIL equivalent

## Image draw: conversion to PIL
def _npcircle(image, cx, cy, radius, color, transparency=0.0):
    """Draw a circle on a PIL image (transparency adjustment is omitted in this version)."""
    draw = ImageDraw.Draw(image)
    clr = (color[0], color[1], color[2])  # array -> tuple
    draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius), outline=clr, width=2)
    return image

    # Original numpy-based implementation, kept below for reference:
    '''
    """Draw a circle on an image using only numpy methods."""
    radius = int(radius)
    cx = int(cx)
    cy = int(cy)
    y, x = np.ogrid[-radius: radius, -radius: radius]
    index = x**2 + y**2 <= radius**2

    image = np.asarray(image, dtype="uint8")
    image[cy-radius:cy+radius, cx-radius:cx+radius][index] = (
        image[cy-radius:cy+radius, cx-radius:cx+radius][index].astype('float32') * transparency +
        np.array(color).astype('float32') * (1.0 - transparency)).astype('uint8')
    '''

<Modified python codes>
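
If you want to keep the transparency effect of the original numpy implementation, one option is to draw the circle on a separate overlay and alpha-composite it onto the image. The following is a minimal sketch of that idea using only Pillow; the function name _npcircle_alpha is mine and is not part of the repository code.

from PIL import Image, ImageDraw

def _npcircle_alpha(image, cx, cy, radius, color, transparency=0.0):
    # Sketch: draw a semi-transparent filled circle on a PIL RGB image.
    # transparency=0.0 means a fully opaque circle, as in the original function.
    overlay = Image.new('RGBA', image.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    alpha = int(255 * (1.0 - transparency))
    draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius),
                 fill=(color[0], color[1], color[2], alpha))
    # Blend the overlay onto the original image and return an RGB result
    return Image.alpha_composite(image.convert('RGBA'), overlay).convert('RGB')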


Models

The models provided on this page include the mpii model for single-person recognition, built on the MPII (Max Planck Institut Informatik) dataset, and the coco model for multi-person recognition. Both models use ResNet-101, so the accuracy is quite good.

<mpii keypoints                                                   coco keypoints>

For more information on the MPII Human Pose models, visit https://pose.mpi-inf.mpg.de/.
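
For reference, the keypoint index order assumed by the drawing code later in this post is summarized below (it is taken from the comments in the demo scripts: 14 points for mpii, 17 points for coco). Keeping it as a small Python dictionary makes the drawing code easier to follow.

# Keypoint index order used by the drawing functions in this post
MPII_KEYPOINTS = {
    0: 'R ankle',    1: 'R knee',     2: 'R hip',      3: 'L hip',
    4: 'L knee',     5: 'L ankle',    6: 'R wrist',    7: 'R elbow',
    8: 'R shoulder', 9: 'L shoulder', 10: 'L elbow',   11: 'L wrist',
    12: 'chin',      13: 'forehead',
}

COCO_KEYPOINTS = {
    0: 'nose',       1: 'L eye',      2: 'R eye',      3: 'L ear',
    4: 'R ear',      5: 'L shoulder', 6: 'R shoulder', 7: 'L elbow',
    8: 'R elbow',    9: 'L wrist',    10: 'R wrist',   11: 'L hip',
    12: 'R hip',     13: 'L knee',    14: 'R knee',    15: 'L ankle',
    16: 'R ankle',
}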


Testing with mpii model

Testing uses the singleperson.py file in the demo directory. However, the original file uses scipy's imread function, which is no longer supported, so this part needs correction. The original also draws the keypoints using the package's visualizer, but since I want to draw directly with PIL, I modified that part as well. The following is the modified singleperson.py. Another advantage is that the code is very concise and easy to understand.

import os
import sys

sys.path.append(os.path.dirname(__file__) + "/../")

#from scipy.misc import imread
from PIL import Image, ImageDraw, ImageFont
import time
from util.config import load_config
from nnet import predict
from util import visualize
from dataset.pose_dataset import data_to_input
import argparse


def draw_mpii_points(image, pose):
    fontname = '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc'
    fnt = ImageFont.truetype(fontname, 15)
    draw = ImageDraw.Draw(image)
    radius = 3
    clr = (0,255,0)
    for i in range(len(pose)):
        p = pose[i]
        cx = p[0]
        cy = p[1]
        accuracy = p[2]
        draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius), outline = clr, width=3)
        draw.text((cx + 10, cy), "%d"%i, font=fnt, fill=(255,255,255))
    
    #draw
    #all_joints: [[0, 5], [1, 4], [2, 3], [6, 11], [7, 10], [8, 9], [12], [13]]
    #all_joints_names: ['ankle', 'knee', 'hip', 'wrist', 'elbow', 'shoulder', 'chin', 'forehead']    
    #draw Rankle -> RKnee (0-> 1)
    if all(pose[0]) and all(pose[1]):
        draw.line([tuple(pose[0][:2]), tuple(pose[1][:2])],width = 2, fill=(255,255,0))
    #draw RKnee -> Rhip (1-> 2)
    if all(pose[1]) and all(pose[2]):
        draw.line([tuple(pose[1][:2]), tuple(pose[2][:2])],width = 2, fill=(255,255,0))
    #draw Rhip -> Lhip (2-> 3)
    if all(pose[2]) and all(pose[3]):
        draw.line([tuple(pose[2][:2]), tuple(pose[3][:2])],width = 2, fill=(255,255,0))
    #draw Lhip -> Lknee (3-> 4)
    if all(pose[3]) and all(pose[4]):
        draw.line([tuple(pose[3][:2]), tuple(pose[4][:2])],width = 2, fill=(255,255,0))
    #draw Lknee -> Lankle (4-> 5)
    if all(pose[4]) and all(pose[5]):
        draw.line([tuple(pose[4][:2]), tuple(pose[5][:2])],width = 2, fill=(255,255,0))

    #draw Rwrist -> Relbow (6-> 7)
    if all(pose[6]) and all(pose[7]):
        draw.line([tuple(pose[6][:2]), tuple(pose[7][:2])],width = 2, fill=(255,255,0))
    #draw Relbow -> Rshoulder (7-> 8)
    if all(pose[7]) and all(pose[8]):
        draw.line([tuple(pose[7][:2]), tuple(pose[8][:2])],width = 2, fill=(255,255,0))
    #draw Rshoulder -> Lshoulder (8-> 9)
    if all(pose[8]) and all(pose[9]):
        draw.line([tuple(pose[8][:2]), tuple(pose[9][:2])],width = 2, fill=(255,255,0))
    #draw Lshoulder -> Lelbow (9-> 10)
    if all(pose[9]) and all(pose[10]):
        draw.line([tuple(pose[9][:2]), tuple(pose[10][:2])],width = 2, fill=(255,255,0))
    #draw Lelbow -> Lwrist (10-> 11)
    if all(pose[10]) and all(pose[11]):
        draw.line([tuple(pose[10][:2]), tuple(pose[11][:2])],width = 2, fill=(255,255,0))

    #draw chin -> forehead (12-> 13)
    if all(pose[12]) and all(pose[13]):
        draw.line([tuple(pose[12][:2]), tuple(pose[13][:2])],width = 2, fill=(255,255,0))

    #draw chin -> Rshoulder (12-> 8)
    if all(pose[12]) and all(pose[8]):
        draw.line([tuple(pose[12][:2]), tuple(pose[8][:2])],width = 2, fill=(255,255,0))

    #draw chin -> Lshoulder (12-> 9)
    if all(pose[12]) and all(pose[9]):
        draw.line([tuple(pose[12][:2]), tuple(pose[9][:2])],width = 2, fill=(255,255,0))

    #draw Rshoulder -> Rhip (8-> 2)
    if all(pose[8]) and all(pose[2]):
        draw.line([tuple(pose[8][:2]), tuple(pose[2][:2])],width = 2, fill=(255,255,0))
    #draw Lshoulder -> Lhip (9-> 3)
    if all(pose[9]) and all(pose[3]):
        draw.line([tuple(pose[9][:2]), tuple(pose[3][:2])],width = 2, fill=(255,255,0))
        
    image.save('./single_mpii_result.png')

parser = argparse.ArgumentParser(description="Tensorflow Pose Estimation Example")
parser.add_argument("--image", type=str, default = "demo/image.png", help="image file name")
args = parser.parse_args()

cfg = load_config("demo/pose_cfg.yaml")

# Load and setup CNN part detector
sess, inputs, outputs = predict.setup_pose_prediction(cfg)

# Read image from file
#image = imread(file_name, mode='RGB')
image = Image.open(args.image).convert('RGB')

image_batch = data_to_input(image)

start = time.time()
# Compute prediction with the CNN
outputs_np = sess.run(outputs, feed_dict={inputs: image_batch})
scmap, locref, _ = predict.extract_cnn_output(outputs_np, cfg)
# Extract maximum scoring location from the heatmap, assume 1 person
pose = predict.argmax_pose_predict(scmap, locref, cfg.stride)
end = time.time()
print('===== Net FPS :%f ====='%( 1 / (end - start))) 
print(pose)
draw_mpii_points(image, pose)
end = time.time()
print('===== FPS :%f ====='%( 1 / (end - start))) 

# Visualise
#visualize.show_heatmaps(cfg, image, scmap, pose)
#visualize.waitforbuttonpress()
<singleperson.py>

You can test the framework with images like this.

(python) spypiggy@XavierNX:~/src/pose-tensorflow$ python3 demo/singleperson.py

...
===== Net FPS :0.036074 =====
[[135.67195415 445.9567318    0.95374119]
 [163.71490151 382.52134919   0.9635005 ]
 [169.98097157 302.69817305   0.90425462]
 [204.34468675 298.58998299   0.90001243]
 [201.57585382 389.34515202   0.920187  ]
 [176.91971612 453.19376218   0.98083913]
 [ 65.59110856 285.69996548   0.96955967]
 [105.54865456 254.32479572   0.97965389]
 [144.6519146  219.47418821   0.87828785]
 [194.27917898 180.16952927   0.87499219]
 [212.95916617 120.75187397   0.95590681]
 [203.37560266  61.67314732   0.94463277]
 [170.00758362 193.16299033   0.96691549]
 [163.23280644 140.34924659   0.96262753]]
===== FPS :0.035663 =====

The FPS of 0.035663 is not something to worry about. I repeated the test several times; after loading the network model, the first inference always takes a long time, so it is more accurate to measure the inference time from the second run onward. We will check the real FPS value while processing a video later.
The output list is the keypoint information from the mpii model: the x, y coordinates and the probability value of the 14 points, indexed 0 to 13.
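
One simple way to avoid timing the slow first inference is to run a throw-away warm-up inference before starting the timer. The snippet below is a minimal sketch of that idea, combined with a score threshold on the 14 keypoints. It reuses the variables already defined in singleperson.py above (sess, inputs, outputs, image_batch, cfg), and the 0.5 threshold is just an illustrative value.

# Warm-up: the first sess.run after loading the model is always slow,
# so run it once before timing and discard the result.
_ = sess.run(outputs, feed_dict={inputs: image_batch})

start = time.time()
outputs_np = sess.run(outputs, feed_dict={inputs: image_batch})
scmap, locref, _ = predict.extract_cnn_output(outputs_np, cfg)
pose = predict.argmax_pose_predict(scmap, locref, cfg.stride)
print('===== Net FPS :%f =====' % (1 / (time.time() - start)))

# Keep only keypoints whose score (third column) exceeds a threshold
THRESHOLD = 0.5  # illustrative value, tune for your images
for i, (x, y, score) in enumerate(pose):
    if score >= THRESHOLD:
        print('keypoint %2d: x=%.1f y=%.1f score=%.2f' % (i, x, y, score))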


<single_mpii_result.png>

Testing with coco model

Testing uses the demo_multiperson.py file in the demo directory. Like singleperson.py, the demo_multiperson.py file is partially modified before use.

import os
import sys

import numpy as np

sys.path.append(os.path.dirname(__file__) + "/../")

#from scipy.misc import imread, imsave
from PIL import Image, ImageDraw, ImageFont
import time

from util.config import load_config
from dataset.factory import create as create_dataset
from nnet import predict
from util import visualize
from dataset.pose_dataset import data_to_input

from multiperson.detections import extract_detections
from multiperson.predict import SpatialModel, eval_graph, get_person_conf_multicut
import argparse

#from multiperson.visualize import PersonDraw, visualize_detections
#import matplotlib.pyplot as plt

'''
Total 17 points in COCO 
'''
def validate_coco_pose(pose):
    err = 0
    for p in pose:
        if p[0] < 0.1 or p[1] < 0.1 :
            err += 1
    if err > 8 : 
        return False
    return True
    
def draw_coco_points(image, persons):
    fontname = '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc'
    fnt = ImageFont.truetype(fontname, 15)
    draw = ImageDraw.Draw(image)
    radius = 3
    clr = (0,255,0)
    draw_person = 0
    thickness = 3
    for j in range(len(persons)):
        pose = persons[j]
        if False == validate_coco_pose(pose):
            continue
        draw_person += 1    
        #if j < 11:
        #    continue
        for i in range(len(pose)):
            p = pose[i]
            cx = p[0]
            cy = p[1]
            if cx < 0.1 and cy < 0.1 :
                continue
            draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius), outline = clr, width=3)
            draw.text((cx + 10, cy), "%d"%i, font=fnt, fill=(255,255,255))
    
        #draw nose -> REye (0-> 2)
        if all(pose[0]) and all(pose[2]):
            draw.line([tuple(pose[0]), tuple(pose[2])],width = thickness, fill=(219,0,219))
        #draw nose -> LEye (0-> 1)
        if all(pose[0]) and all(pose[1]):
            draw.line([tuple(pose[0]), tuple(pose[1])],width = thickness, fill=(219,0,219))

        #draw LEye ->LEar(1-> 3)
        if all(pose[1]) and all(pose[3]):
            draw.line([tuple(pose[1]), tuple(pose[3])],width = thickness, fill=(219,0,219))
        #draw REye ->REar(2-> 4)
        if all(pose[2]) and all(pose[4]):
            draw.line([tuple(pose[2]), tuple(pose[4])],width = thickness, fill=(219,0,219))

        #draw RShoulder ->RHip(6-> 12)
        if all(pose[6]) and all(pose[12]):
            draw.line([tuple(pose[6]), tuple(pose[12])],width = thickness, fill=(153,0,51))
        #draw LShoulder ->LHip(5-> 11)
        if all(pose[5]) and all(pose[11]):
            draw.line([tuple(pose[5]), tuple(pose[11])],width = thickness, fill=(153,0,51))

        #draw RShoulder -> LShoulder (6-> 5)
        if all(pose[6]) and all(pose[5]):
            draw.line([tuple(pose[6]), tuple(pose[5])],width = thickness, fill=(255,102,51))

        #draw RShoulder -> RElbow(6-> 8)
        if all(pose[6]) and all(pose[8]):
            draw.line([tuple(pose[6]), tuple(pose[8])],width = thickness, fill=(255,255,51))
        #draw RElbow -> RWrist (8 ->10)
        if all(pose[8]) and all(pose[10]):
            draw.line([tuple(pose[8]), tuple(pose[10])],width = thickness, fill=(255,255,51))

        #draw LShoulder -> LElbow (5-> 7 )
        if all(pose[5]) and all(pose[7]):
            draw.line([tuple(pose[5]), tuple(pose[7])],width = thickness, fill=(51,255,51))
        #draw LElbow -> LWrist (7 ->9)
        if all(pose[7]) and all(pose[9]):
            draw.line([tuple(pose[7]), tuple(pose[9])],width = thickness, fill=(51,255,51))

        #draw RHip -> RKnee (12 ->14)
        if all(pose[12]) and all(pose[14]):
            draw.line([tuple(pose[12]), tuple(pose[14])],width = thickness, fill=(51,102,51))
        #draw RKnee -> RFoot (14 ->16)
        if all(pose[14]) and all(pose[16]):
            draw.line([tuple(pose[14]), tuple(pose[16])],width = thickness, fill=(51,102,51))

        #draw LHip -> LKnee(11 ->13)
        if all(pose[11]) and all(pose[13]):
            draw.line([tuple(pose[11]), tuple(pose[13])],width = thickness, fill=(51,51,204))
        #draw LKnee -> LFoot (13 ->15)
        if all(pose[13]) and all(pose[15]):
            draw.line([tuple(pose[13]), tuple(pose[15])],width = thickness, fill=(51,51,204))
    
    return image, draw_person


parser = argparse.ArgumentParser(description="Tensorflow Pose Estimation Example")
parser.add_argument("--image", type=str, default = "demo/image_multi.png", help="image file name")
args = parser.parse_args()

cfg = load_config("demo/pose_cfg_multi.yaml")

dataset = create_dataset(cfg)

sm = SpatialModel(cfg)
sm.load()

#draw_multi = PersonDraw()

# Load and setup CNN part detector
sess, inputs, outputs = predict.setup_pose_prediction(cfg)

# Read image from file
file_name = args.image
#image = imread(file_name, mode='RGB')
image = Image.open(file_name).convert('RGB')
image_batch = data_to_input(image)
start = time.time()
# Compute prediction with the CNN
outputs_np = sess.run(outputs, feed_dict={inputs: image_batch})
scmap, locref, pairwise_diff = predict.extract_cnn_output(outputs_np, cfg, dataset.pairwise_stats)

detections = extract_detections(cfg, scmap, locref, pairwise_diff)
unLab, pos_array, unary_array, pwidx_array, pw_array = eval_graph(sm, detections)
person_conf_multi = get_person_conf_multicut(sm, unLab, unary_array, pos_array)
end = time.time()
print(person_conf_multi)
print('===== Net FPS :%f ====='%( 1 / (end - start)))
image, draw_person = draw_coco_points(image, person_conf_multi)
image.save('./multi_coco_result[%d].png'%draw_person)

end = time.time()
print('===== FPS :%f ====='%( 1 / (end - start))) 

'''
img = np.copy(image)
visim_multi = img.copy()
fig = plt.imshow(visim_multi)
draw_multi.draw(visim_multi, dataset, person_conf_multi)
fig.axes.get_xaxis().set_visible(False)
fig.axes.get_yaxis().set_visible(False)

plt.show()
visualize.waitforbuttonpress()
'''
<demo_multiperson.py>


You can test the framework with images like this.

(python) spypiggy@XavierNX:~/src/pose-tensorflow$ python3 demo/demo_multiperson.py
.....
num_people:  19
[[[ 66.20384562 101.43434691]
  [ 69.89566374  95.22151852]
  [ 59.09224147  95.70417881]
  [ 72.68740892  98.12449586]
  [ 45.67895818  98.08686912]
  [ 81.53892303 124.13856415]
  [ 33.19268489 124.87683523]
  [ 86.51360297 156.70794803]
  [  0.           0.        ]
  [ 90.48762476 189.30409873]
  [  0.           0.        ]
  [ 77.88260317 197.96489549]
  [ 60.26689658 197.26194894]
  [  0.           0.        ]
  [ 73.94066644 259.91335434]
  [  0.           0.        ]
  [  0.           0.        ]]
.......

 [[  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [547.84042802 252.4423849 ]]]
===== Net FPS :0.031365 =====
===== FPS :0.030672 =====

You can see that the multiperson model provides only coordinate values, not probability values. You can also see entries in which most keypoint coordinates are zero; most of those entries are not actually people, so it is better to exclude them. After setting a threshold, exclude entries that are missing too many keypoints. When I tested the multiperson model, I found that this kind of error occurs quite often. In the source code above, the validate_coco_pose function plays this role: if the number of undetected keypoints exceeds 8, that person's data is ignored.

The number of valid detections is included in the output file name. The model judged that there were a total of 19 people, but the result was produced using only 11 of them, after excluding 8 entries with mostly missing coordinates.
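
The same filtering can also be written compactly with numpy. Below is a small sketch that is equivalent in spirit to validate_coco_pose above; the shape assumption (an array of persons, each with 17 (x, y) pairs where (0, 0) means "not detected") comes from the printed output, and the function name is mine.

import numpy as np

def count_valid_persons(persons, max_missing=8):
    # persons: array-like of shape (N, 17, 2); (0, 0) means the keypoint was not found
    persons = np.asarray(persons)
    # A keypoint counts as missing when either coordinate is (near) zero
    missing = np.any(persons < 0.1, axis=2)            # shape (N, 17)
    valid_mask = missing.sum(axis=1) <= max_missing    # keep persons with at most 8 missing points
    return int(valid_mask.sum()), valid_mask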


<multi_coco_result[11].png>

Accuracy comparison with mobilenet

The following pictures are the test results for resnet-101, mobilenet-thin, mobilenet-v2-small, and mobilenet-v2-large. The mobilenet results were produced with the setup introduced in my earlier posts, https://spyjetson.blogspot.com/2019/09/jetsonnano-human-pose-estimation-using.html and https://spyjetson.blogspot.com/2020/07/xavier-nx-human-pose-estimation-using.html.

A few example images show that resnet-101 is more accurate.

resnet-101 test images




mobilenet-thin test images


mobilenet-v2-small  test images



mobilenet-v2-large  test images



Video file test and check the FPS

This time I will check the FPS value while processing the frames of a video file. The processing speed is affected by the size of the inference image, so keep in mind that changing the inference resolution will change the FPS value.

import os
import sys
import numpy as np

sys.path.append(os.path.dirname(__file__) + "/../")

#from scipy.misc import imread, imsave
from PIL import Image, ImageDraw, ImageFont
import cv2
import time

from util.config import load_config
from dataset.factory import create as create_dataset
from nnet import predict
from util import visualize
from dataset.pose_dataset import data_to_input

from multiperson.detections import extract_detections
from multiperson.predict import SpatialModel, eval_graph, get_person_conf_multicut
import argparse

sample_dir = '/home/spypiggy/src/test_images/'

parser = argparse.ArgumentParser(description="Tensorflow Pose Estimation Example")
parser.add_argument("--video", type=str, default = sample_dir + "video.avi", help="video file name")
parser.add_argument("--res", type=str, default = "640x320", help="video file resolution")
args = parser.parse_args()

res = args.res.split('x')
inference_w, inference_h = int(res[0]), int(res[1])

'''
Total 17 points in COCO 
'''
def validate_coco_pose(pose):
    err = 0
    for p in pose:
        if p[0] < 0.1 or p[1] < 0.1 :
            err += 1
    if err > 8 : 
        return False
    return True
    
def draw_coco_points(image, persons):
    fontname = '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc'
    fnt = ImageFont.truetype(fontname, 15)
    draw = ImageDraw.Draw(image)
    radius = 3
    clr = (0,255,0)
    draw_person = 0
    thickness = 3
    for j in range(len(persons)):
        pose = persons[j]
        if False == validate_coco_pose(pose):
            continue
        draw_person += 1    
        #if j < 11:
        #    continue
        for i in range(len(pose)):
            p = pose[i]
            cx = p[0]
            cy = p[1]
            if cx < 0.1 and cy < 0.1 :
                continue
            draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius), outline = clr, width=3)
            draw.text((cx + 10, cy), "%d"%i, font=fnt, fill=(255,255,255))
    
        #draw nose -> REye (0-> 2)
        if all(pose[0]) and all(pose[2]):
            draw.line([tuple(pose[0]), tuple(pose[2])],width = thickness, fill=(219,0,219))
        #draw nose -> LEye (0-> 1)
        if all(pose[0]) and all(pose[1]):
            draw.line([tuple(pose[0]), tuple(pose[1])],width = thickness, fill=(219,0,219))

        #draw LEye ->LEar(1-> 3)
        if all(pose[1]) and all(pose[3]):
            draw.line([tuple(pose[1]), tuple(pose[3])],width = thickness, fill=(219,0,219))
        #draw REye ->REar(2-> 4)
        if all(pose[2]) and all(pose[4]):
            draw.line([tuple(pose[2]), tuple(pose[4])],width = thickness, fill=(219,0,219))

        #draw RShoulder ->RHip(6-> 12)
        if all(pose[6]) and all(pose[12]):
            draw.line([tuple(pose[6]), tuple(pose[12])],width = thickness, fill=(153,0,51))
        #draw LShoulder ->LHip(5-> 11)
        if all(pose[5]) and all(pose[11]):
            draw.line([tuple(pose[5]), tuple(pose[11])],width = thickness, fill=(153,0,51))

        #draw RShoulder -> LShoulder (6-> 5)
        if all(pose[6]) and all(pose[5]):
            draw.line([tuple(pose[6]), tuple(pose[5])],width = thickness, fill=(255,102,51))

        #draw RShoulder -> RElbow(6-> 8)
        if all(pose[6]) and all(pose[8]):
            draw.line([tuple(pose[6]), tuple(pose[8])],width = thickness, fill=(255,255,51))
        #draw RElbow -> RWrist (8 ->10)
        if all(pose[8]) and all(pose[10]):
            draw.line([tuple(pose[8]), tuple(pose[10])],width = thickness, fill=(255,255,51))

        #draw LShoulder -> LElbow (5-> 7 )
        if all(pose[5]) and all(pose[7]):
            draw.line([tuple(pose[5]), tuple(pose[7])],width = thickness, fill=(51,255,51))
        #draw LElbow -> LWrist (7 ->9)
        if all(pose[7]) and all(pose[9]):
            draw.line([tuple(pose[7]), tuple(pose[9])],width = thickness, fill=(51,255,51))

        #draw RHip -> RKnee (12 ->14)
        if all(pose[12]) and all(pose[14]):
            draw.line([tuple(pose[12]), tuple(pose[14])],width = thickness, fill=(51,102,51))
        #draw RKnee -> RFoot (14 ->16)
        if all(pose[14]) and all(pose[16]):
            draw.line([tuple(pose[14]), tuple(pose[16])],width = thickness, fill=(51,102,51))

        #draw LHip -> LKnee(11 ->13)
        if all(pose[11]) and all(pose[13]):
            draw.line([tuple(pose[11]), tuple(pose[13])],width = thickness, fill=(51,51,204))
        #draw LKnee -> LFoot (13 ->15)
        if all(pose[13]) and all(pose[15]):
            draw.line([tuple(pose[13]), tuple(pose[15])],width = thickness, fill=(51,51,204))
    
    return image, draw_person
    
cfg = load_config("demo/pose_cfg_multi.yaml")
dataset = create_dataset(cfg)
sm = SpatialModel(cfg)
sm.load()
# Load and setup CNN part detector
sess, inputs, outputs = predict.setup_pose_prediction(cfg)
    
    
cap = cv2.VideoCapture(args.video)
if cap is None:
    print("Video[%s] Open Error"%(args.video))
    sys.exit(0)

ret_val, img = cap.read()
if ret_val == False:
    print('No valid video frame')
    sys.exit(0)
height, width, _ = img.shape    
fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
out_video = cv2.VideoWriter('/tmp/resnet_output.mp4', fourcc, cap.get(cv2.CAP_PROP_FPS), (inference_w, inference_h))
    
count = 0
t_netfps_time = 0
t_fps_time = 0
start = time.time()
while cap.isOpened():
    ret_val, dst = cap.read()
    if ret_val == False:
        print("Frame read End")
        break
    image = Image.fromarray(np.uint8(dst))  #OpenCV format -> PIL Format
    image = image.resize((inference_w, inference_h))
    image_batch = data_to_input(image)
    net_start = time.time()
    # Compute prediction with the CNN
    outputs_np = sess.run(outputs, feed_dict={inputs: image_batch})
    scmap, locref, pairwise_diff = predict.extract_cnn_output(outputs_np, cfg, dataset.pairwise_stats)
    detections = extract_detections(cfg, scmap, locref, pairwise_diff)
    unLab, pos_array, unary_array, pwidx_array, pw_array = eval_graph(sm, detections)
    person_conf_multi = get_person_conf_multicut(sm, unLab, unary_array, pos_array)
    net_end = time.time()
    netfps = 1.0 / (net_end - net_start)
    print('Frame[%d] ===== Net FPS :%f ====='%(count + 1,  netfps))
    image, draw_person = draw_coco_points(image, person_conf_multi)
    img = np.asarray(image, dtype="uint8") #PIL Format -> OpenCV format
    fps = 1.0 / (time.time() - start)
    cv2.putText(img , "FPS[%4.1f] NET_FPS[%4.1f]"%(fps, netfps), (20, 40),  cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    out_video.write(img)
    start = time.time()
    t_netfps_time += netfps
    t_fps_time += fps    
    count += 1

print("==== Summary ====")
print("Video Resolution W[%d] H[%d] -> inference size W[%d] H[%d]"%(width, height, inference_w, inference_h))
if count:
    print("avg fps[%f] avg net_fps[%f]"%(t_fps_time / count, t_netfps_time / count))

cv2.destroyAllWindows()
out_video.release()
cap.release()
<video_multiperson.py>


For testing, we will use the video file used by OpenPose.

(python) spypiggy@XavierNX:~/src/pose-tensorflow$ python3 demo/video_multiperson.py --video='../test_images/video.avi' --res=432x368
......
==== Summary ====
Video Resolution W[1280] H[720] -> inference size W[640] H[480]
avg fps[1.478159] avg net_fps[1.709562]

KeyPoint detection was performed by resizing the original video (1280X720) down to the inference size. At the 640X480 size shown in the summary above, roughly 1.5 FPS overall (about 1.7 FPS for the network alone) is obtained; with a smaller inference size such as 432X368, roughly 2.7 FPS can be obtained.

I had tested the same video file using Mobilenet in a previous blog. Let's compare the results using MobileNet and ResNet-101.

<mobilenet_thin result video>


<resnet-101 result video>


Wrapping Up

As you can see from the pictures, resnet-101 detects keypoints much more accurately. However, because ResNet-101 is a heavy model, the processing speed is relatively low: about 1.7 FPS. If the inference size is reduced to about 480X320, 2.7 to 3 FPS can be obtained. However, if the image is reduced too much, keypoint detection accuracy may drop, so it is best to determine a proper inference size through testing.
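
For example, to trade some accuracy for speed, the video test above can simply be rerun with a smaller inference size via the --res option defined in video_multiperson.py:

(python) spypiggy@XavierNX:~/src/pose-tensorflow$ python3 demo/video_multiperson.py --video='../test_images/video.avi' --res=480x320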

You can download the source code at https://github.com/raspberry-pi-maker/NVIDIA-Jetson.
