In the previous post, I used https://github.com/ildoonet/tf-pose-estimation, which implements pose estimation in TensorFlow using the mobilenet model, on the Xavier NX. Because the mobilenet model used by ildoonet is lightweight, it delivers satisfactory performance of over 15 FPS. With the TensorRT option, which the Xavier NX also supports, it can reach about 30 FPS, so speed is not a problem. However, mobilenet was designed to run fast on mobile devices such as smartphones, and the side effect is lower accuracy. In AI models, speed and accuracy are usually inversely related: you must give up one or the other to suit your application, or compromise at the right point. In this article, I will introduce a TensorFlow model that uses ResNet for higher accuracy.
This article summarizes the contents of https://github.com/eldar/pose-tensorflow.
Prerequisites
Before you build "eldar/pose-tensorflow", you must install these packages first.
- OpenCV : JetPack 4.3 and later versions ship with OpenCV, so there is no need to install it separately. The Xavier NX comes with JetPack 4.4 or higher.
- Tensorflow : https://spyjetson.blogspot.com/2020/07/jetson-xavier-nx-python-virtual.html explains how to use the Python virtual environment and install TensorFlow.
spypiggy@XavierNX:~$ sudo apt-get install python3-tk
spypiggy@XavierNX:~$ source /home/spypiggy/python/bin/activate
(python) spypiggy@XavierNX:~$ pip3 install easydict munkres
(python) spypiggy@XavierNX:~$ pip3 install scikit-image pillow pyyaml matplotlib cython
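Before building, it is worth confirming that the packages are visible from inside the virtual environment. The snippet below is only a minimal sanity-check sketch of mine (it assumes the TensorFlow 1.x build installed by the link above), not part of the repository.

# sanity_check.py : a minimal sketch, not part of the repository.
# Run inside the virtual environment to confirm the prerequisites are importable.
import cv2
import tensorflow as tf

print("OpenCV     :", cv2.__version__)
print("TensorFlow :", tf.__version__)
# TF 1.x API; on the Xavier NX this should report True when CUDA is working
print("GPU available:", tf.test.is_gpu_available())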
Download and build code from eldar
Now clone eldar's GitHub repository.
(python) spypiggy@XavierNX:~$ cd src
(python) spypiggy@XavierNX:~/src$ git clone https://github.com/eldar/pose-tensorflow.git

#for multiperson models (nms_grid module compile this)
(python) spypiggy@XavierNX:~/src$ cd pose-tensorflow/
(python) spypiggy@XavierNX:~/src/pose-tensorflow$ ./compile.sh

#Download models
(python) spypiggy@XavierNX:~/src/pose-tensorflow$ cd models/mpii/
(python) spypiggy@XavierNX:~/src/pose-tensorflow/models/mpii$ ./download_models.sh
(python) spypiggy@XavierNX:~/src/pose-tensorflow/models/mpii$ cd ../coco/
(python) spypiggy@XavierNX:~/src/pose-tensorflow/models/coco$ ./download_models.sh
Before proceeding with the test, some modifications to the source code are required.
eldar's source code uses scipy's imread and imsave (from scipy.misc). However, these functions are no longer available as of scipy 1.3.0rc1, so they should be replaced with their PIL equivalents.
I uploaded the changed Python files to my GitHub; you can simply overwrite the original files with them.
Note, however, that the _npcircle function in util/visualize.py is not a complete conversion; for simplicity, features such as transparency adjustment are omitted.
## Image read : conversion to PIL
#image = imread(file_name, mode='RGB')
image = Image.open(file_name).convert('RGB')

## Image draw : conversion to PIL
def _npcircle(image, cx, cy, radius, color, transparency=0.0):
    draw = ImageDraw.Draw(image)
    clr = (color[0], color[1], color[2])    #array -> tuple
    draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius), outline = clr, width=2)
    return image
    """Draw a circle on an image using only numpy methods."""
    '''
    radius = int(radius)
    cx = int(cx)
    cy = int(cy)
    y, x = np.ogrid[-radius: radius, -radius: radius]
    index = x**2 + y**2 <= radius**2
    image = np.asarray(image, dtype="uint8")
    image[cy-radius:cy+radius, cx-radius:cx+radius][index] = (
        image[cy-radius:cy+radius, cx-radius:cx+radius][index].astype('float32') * transparency +
        np.array(color).astype('float32') * (1.0 - transparency)).astype('uint8')
    '''
<Modified python codes >
Models
The models provided by this repository include the MPII model, created by MPII (Max Planck Institut Informatik) for single-person recognition, and the COCO model for multi-person recognition. Both models use ResNet-101, so the accuracy is quite good.
<mpii keypoints coco keypoints>
For more information on the MPII Human Pose models, visit https://pose.mpi-inf.mpg.de/.
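Both models are driven by the same prediction code; which one is used is decided by the config file that is loaded. The snippet below is just a sketch of that selection, using the yaml paths shipped in the repository's demo folder.

# A short sketch of how the demos choose a model (paths as in the repo's demo folder).
from util.config import load_config
from nnet import predict

# single-person MPII model (14 keypoints)
cfg = load_config("demo/pose_cfg.yaml")
# the multi-person COCO model (17 keypoints) would use:
# cfg = load_config("demo/pose_cfg_multi.yaml")

sess, inputs, outputs = predict.setup_pose_prediction(cfg)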
Testing with mpii model
Testing uses the singleperson.py file in the demo directory. However, the original file uses scipy's imread function, which is no longer available, so this part needs to be corrected. The original code also draws the keypoints with the package's visualizer, but since I want to draw directly with PIL, I modified that part as well. The following is the modified singleperson.py. Another advantage is that the code is quite concise and easy to understand.

import os
import sys
sys.path.append(os.path.dirname(__file__) + "/../")

#from scipy.misc import imread
from PIL import Image, ImageDraw, ImageFont
import time

from util.config import load_config
from nnet import predict
from util import visualize
from dataset.pose_dataset import data_to_input
import argparse


def draw_mpii_points(image, pose):
    fontname = '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc'
    fnt = ImageFont.truetype(fontname, 15)
    draw = ImageDraw.Draw(image)
    radius = 3
    clr = (0,255,0)
    for i in range(len(pose)):
        p = pose[i]
        cx = p[0]
        cy = p[1]
        accuracy = p[2]
        draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius), outline = clr, width=3)
        draw.text((cx + 10, cy), "%d"%i, font=fnt, fill=(255,255,255))

    #all_joints: [[0, 5], [1, 4], [2, 3], [6, 11], [7, 10], [8, 9], [12], [13]]
    #all_joints_names: ['ankle', 'knee', 'hip', 'wrist', 'elbow', 'shoulder', 'chin', 'forehead']
    #draw Rankle -> RKnee (0-> 1)
    if all(pose[0]) and all(pose[1]):
        draw.line([tuple(pose[0][:2]), tuple(pose[1][:2])], width = 2, fill=(255,255,0))
    #draw RKnee -> Rhip (1-> 2)
    if all(pose[1]) and all(pose[2]):
        draw.line([tuple(pose[1][:2]), tuple(pose[2][:2])], width = 2, fill=(255,255,0))
    #draw Rhip -> Lhip (2-> 3)
    if all(pose[2]) and all(pose[3]):
        draw.line([tuple(pose[2][:2]), tuple(pose[3][:2])], width = 2, fill=(255,255,0))
    #draw Lhip -> Lknee (3-> 4)
    if all(pose[3]) and all(pose[4]):
        draw.line([tuple(pose[3][:2]), tuple(pose[4][:2])], width = 2, fill=(255,255,0))
    #draw Lknee -> Lankle (4-> 5)
    if all(pose[4]) and all(pose[5]):
        draw.line([tuple(pose[4][:2]), tuple(pose[5][:2])], width = 2, fill=(255,255,0))
    #draw Rwrist -> Relbow (6-> 7)
    if all(pose[6]) and all(pose[7]):
        draw.line([tuple(pose[6][:2]), tuple(pose[7][:2])], width = 2, fill=(255,255,0))
    #draw Relbow -> Rshoulder (7-> 8)
    if all(pose[7]) and all(pose[8]):
        draw.line([tuple(pose[7][:2]), tuple(pose[8][:2])], width = 2, fill=(255,255,0))
    #draw Rshoulder -> Lshoulder (8-> 9)
    if all(pose[8]) and all(pose[9]):
        draw.line([tuple(pose[8][:2]), tuple(pose[9][:2])], width = 2, fill=(255,255,0))
    #draw Lshoulder -> Lelbow (9-> 10)
    if all(pose[9]) and all(pose[10]):
        draw.line([tuple(pose[9][:2]), tuple(pose[10][:2])], width = 2, fill=(255,255,0))
    #draw Lelbow -> Lwrist (10-> 11)
    if all(pose[10]) and all(pose[11]):
        draw.line([tuple(pose[10][:2]), tuple(pose[11][:2])], width = 2, fill=(255,255,0))
    #draw chin -> forehead (12-> 13)
    if all(pose[12]) and all(pose[13]):
        draw.line([tuple(pose[12][:2]), tuple(pose[13][:2])], width = 2, fill=(255,255,0))
    #draw chin -> Rshoulder (12-> 8)
    if all(pose[12]) and all(pose[8]):
        draw.line([tuple(pose[12][:2]), tuple(pose[8][:2])], width = 2, fill=(255,255,0))
    #draw chin -> Lshoulder (12-> 9)
    if all(pose[12]) and all(pose[9]):
        draw.line([tuple(pose[12][:2]), tuple(pose[9][:2])], width = 2, fill=(255,255,0))
    #draw Rshoulder -> Rhip (8-> 2)
    if all(pose[8]) and all(pose[2]):
        draw.line([tuple(pose[8][:2]), tuple(pose[2][:2])], width = 2, fill=(255,255,0))
    #draw Lshoulder -> Lhip (9-> 3)
    if all(pose[9]) and all(pose[3]):
        draw.line([tuple(pose[9][:2]), tuple(pose[3][:2])], width = 2, fill=(255,255,0))
    image.save('./single_mpii_result.png')


parser = argparse.ArgumentParser(description="Tensorflow Pose Estimation Example")
parser.add_argument("--image", type=str, default = "demo/image.png", help="image file name")
args = parser.parse_args()

cfg = load_config("demo/pose_cfg.yaml")

# Load and setup CNN part detector
sess, inputs, outputs = predict.setup_pose_prediction(cfg)

# Read image from file
#image = imread(file_name, mode='RGB')
image = Image.open(args.image).convert('RGB')
image_batch = data_to_input(image)

start = time.time()
# Compute prediction with the CNN
outputs_np = sess.run(outputs, feed_dict={inputs: image_batch})
scmap, locref, _ = predict.extract_cnn_output(outputs_np, cfg)

# Extract maximum scoring location from the heatmap, assume 1 person
pose = predict.argmax_pose_predict(scmap, locref, cfg.stride)
end = time.time()
print('===== Net FPS :%f ====='%( 1 / (end - start)))
print(pose)

draw_mpii_points(image, pose)
end = time.time()
print('===== FPS :%f ====='%( 1 / (end - start)))

# Visualise
#visualize.show_heatmaps(cfg, image, scmap, pose)
#visualize.waitforbuttonpress()
<singleperson.py>
You can test the framework with images like this.
(python) spypiggy@XavierNX:~/src/pose-tensorflow$ python3 demo/singleperson.py
...
===== Net FPS :0.036074 =====
[[135.67195415 445.9567318    0.95374119]
 [163.71490151 382.52134919   0.9635005 ]
 [169.98097157 302.69817305   0.90425462]
 [204.34468675 298.58998299   0.90001243]
 [201.57585382 389.34515202   0.920187  ]
 [176.91971612 453.19376218   0.98083913]
 [ 65.59110856 285.69996548   0.96955967]
 [105.54865456 254.32479572   0.97965389]
 [144.6519146  219.47418821   0.87828785]
 [194.27917898 180.16952927   0.87499219]
 [212.95916617 120.75187397   0.95590681]
 [203.37560266  61.67314732   0.94463277]
 [170.00758362 193.16299033   0.96691549]
 [163.23280644 140.34924659   0.96262753]]
===== FPS :0.035663 =====
The FPS value of 0.035663 is nothing to worry about yet. No matter how many times I repeat the test, the first inference after loading the network model always takes a long time; measuring the inference time from the second run onward is more accurate. We will check the real FPS value later while processing video.
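If you want a more realistic number from a single image, you can discard the first run and average the following ones. The helper below is a hypothetical sketch of mine, not part of the repository; it reuses the same sess.run() call as singleperson.py.

# Hypothetical helper (not in the repository): discard the first, slow inference
# and average the following runs to get a more realistic single-image net FPS.
import time

def measure_net_fps(sess, inputs, outputs, image_batch, warmup=1, runs=5):
    for _ in range(warmup):
        sess.run(outputs, feed_dict={inputs: image_batch})   # warm-up, not timed
    start = time.time()
    for _ in range(runs):
        sess.run(outputs, feed_dict={inputs: image_batch})
    return runs / (time.time() - start)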
The output list contains the keypoints found by the MPII model: the x, y coordinates and the probability value of the 14 points, indexed 0 to 13.
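For reference, the index-to-joint mapping can be turned into a small lookup. The joint order below is inferred from the comments in the drawing code above (right-side joints first), so treat it as an assumption rather than official documentation.

# Index-to-joint mapping inferred from draw_mpii_points above (an assumption,
# not taken from official MPII documentation).
MPII_JOINTS = ['r_ankle', 'r_knee', 'r_hip', 'l_hip', 'l_knee', 'l_ankle',
               'r_wrist', 'r_elbow', 'r_shoulder', 'l_shoulder', 'l_elbow',
               'l_wrist', 'chin', 'forehead']

def pose_to_dict(pose, min_prob=0.5):
    """Return {joint_name: (x, y)} for keypoints whose probability is high enough."""
    return {MPII_JOINTS[i]: (float(p[0]), float(p[1]))
            for i, p in enumerate(pose) if p[2] >= min_prob}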
Testing with coco model
Testing uses the demo_multiperson.py file in the demo directory. Like singleperson.py, demo_multiperson.py is partially modified before use.

import os
import sys
import numpy as np
sys.path.append(os.path.dirname(__file__) + "/../")

#from scipy.misc import imread, imsave
from PIL import Image, ImageDraw, ImageFont
import time

from util.config import load_config
from dataset.factory import create as create_dataset
from nnet import predict
from util import visualize
from dataset.pose_dataset import data_to_input
from multiperson.detections import extract_detections
from multiperson.predict import SpatialModel, eval_graph, get_person_conf_multicut
import argparse
#from multiperson.visualize import PersonDraw, visualize_detections
#import matplotlib.pyplot as plt

'''
Total 17 points in COCO
'''
def validate_coco_pose(pose):
    err = 0
    for p in pose:
        if p[0] < 0.1 or p[1] < 0.1 :
            err += 1
    if err > 8 :
        return False
    return True


def draw_coco_points(image, persons):
    fontname = '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc'
    fnt = ImageFont.truetype(fontname, 15)
    draw = ImageDraw.Draw(image)
    radius = 3
    clr = (0,255,0)
    draw_person = 0
    thickness = 3
    for j in range(len(persons)):
        pose = persons[j]
        if False == validate_coco_pose(pose):
            continue
        draw_person += 1
        #if j < 11:
        #    continue
        for i in range(len(pose)):
            p = pose[i]
            cx = p[0]
            cy = p[1]
            if cx < 0.1 and cy < 0.1 :
                continue
            draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius), outline = clr, width=3)
            draw.text((cx + 10, cy), "%d"%i, font=fnt, fill=(255,255,255))

        #draw nose -> REye (0-> 2)
        if all(pose[0]) and all(pose[2]):
            draw.line([tuple(pose[0]), tuple(pose[2])], width = thickness, fill=(219,0,219))
        #draw nose -> LEye (0-> 1)
        if all(pose[0]) and all(pose[1]):
            draw.line([tuple(pose[0]), tuple(pose[1])], width = thickness, fill=(219,0,219))
        #draw LEye -> LEar (1-> 3)
        if all(pose[1]) and all(pose[3]):
            draw.line([tuple(pose[1]), tuple(pose[3])], width = thickness, fill=(219,0,219))
        #draw REye -> REar (2-> 4)
        if all(pose[2]) and all(pose[4]):
            draw.line([tuple(pose[2]), tuple(pose[4])], width = thickness, fill=(219,0,219))
        #draw RShoulder -> RHip (6-> 12)
        if all(pose[6]) and all(pose[12]):
            draw.line([tuple(pose[6]), tuple(pose[12])], width = thickness, fill=(153,0,51))
        #draw LShoulder -> LHip (5-> 11)
        if all(pose[5]) and all(pose[11]):
            draw.line([tuple(pose[5]), tuple(pose[11])], width = thickness, fill=(153,0,51))
        #draw RShoulder -> LShoulder (6-> 5)
        if all(pose[6]) and all(pose[5]):
            draw.line([tuple(pose[6]), tuple(pose[5])], width = thickness, fill=(255,102,51))
        #draw RShoulder -> RElbow (6-> 8)
        if all(pose[6]) and all(pose[8]):
            draw.line([tuple(pose[6]), tuple(pose[8])], width = thickness, fill=(255,255,51))
        #draw RElbow -> RWrist (8-> 10)
        if all(pose[8]) and all(pose[10]):
            draw.line([tuple(pose[8]), tuple(pose[10])], width = thickness, fill=(255,255,51))
        #draw LShoulder -> LElbow (5-> 7)
        if all(pose[5]) and all(pose[7]):
            draw.line([tuple(pose[5]), tuple(pose[7])], width = thickness, fill=(51,255,51))
        #draw LElbow -> LWrist (7-> 9)
        if all(pose[7]) and all(pose[9]):
            draw.line([tuple(pose[7]), tuple(pose[9])], width = thickness, fill=(51,255,51))
        #draw RHip -> RKnee (12-> 14)
        if all(pose[12]) and all(pose[14]):
            draw.line([tuple(pose[12]), tuple(pose[14])], width = thickness, fill=(51,102,51))
        #draw RKnee -> RFoot (14-> 16)
        if all(pose[14]) and all(pose[16]):
            draw.line([tuple(pose[14]), tuple(pose[16])], width = thickness, fill=(51,102,51))
        #draw LHip -> LKnee (11-> 13)
        if all(pose[11]) and all(pose[13]):
            draw.line([tuple(pose[11]), tuple(pose[13])], width = thickness, fill=(51,51,204))
        #draw LKnee -> LFoot (13-> 15)
        if all(pose[13]) and all(pose[15]):
            draw.line([tuple(pose[13]), tuple(pose[15])], width = thickness, fill=(51,51,204))
    return image, draw_person


parser = argparse.ArgumentParser(description="Tensorflow Pose Estimation Example")
parser.add_argument("--image", type=str, default = "demo/image_multi.png", help="image file name")
args = parser.parse_args()

cfg = load_config("demo/pose_cfg_multi.yaml")

dataset = create_dataset(cfg)

sm = SpatialModel(cfg)
sm.load()

#draw_multi = PersonDraw()

# Load and setup CNN part detector
sess, inputs, outputs = predict.setup_pose_prediction(cfg)

# Read image from file
file_name = args.image
#image = imread(file_name, mode='RGB')
image = Image.open(file_name).convert('RGB')
image_batch = data_to_input(image)

start = time.time()
# Compute prediction with the CNN
outputs_np = sess.run(outputs, feed_dict={inputs: image_batch})
scmap, locref, pairwise_diff = predict.extract_cnn_output(outputs_np, cfg, dataset.pairwise_stats)

detections = extract_detections(cfg, scmap, locref, pairwise_diff)
unLab, pos_array, unary_array, pwidx_array, pw_array = eval_graph(sm, detections)
person_conf_multi = get_person_conf_multicut(sm, unLab, unary_array, pos_array)
end = time.time()
print(person_conf_multi)
print('===== Net FPS :%f ====='%( 1 / (end - start)))

image, draw_person = draw_coco_points(image, person_conf_multi)
image.save('./multi_coco_result[%d].png'%draw_person)
end = time.time()
print('===== FPS :%f ====='%( 1 / (end - start)))

'''
img = np.copy(image)
visim_multi = img.copy()
fig = plt.imshow(visim_multi)
draw_multi.draw(visim_multi, dataset, person_conf_multi)
fig.axes.get_xaxis().set_visible(False)
fig.axes.get_yaxis().set_visible(False)
plt.show()
visualize.waitforbuttonpress()
'''
<demo_multiperson.py>
You can test the framework with images like this.
(python) spypiggy@XavierNX:~/src/pose-tensorflow$ python3 demo/demo_multiperson.py
.....
num_people: 19
[[[ 66.20384562 101.43434691]
  [ 69.89566374  95.22151852]
  [ 59.09224147  95.70417881]
  [ 72.68740892  98.12449586]
  [ 45.67895818  98.08686912]
  [ 81.53892303 124.13856415]
  [ 33.19268489 124.87683523]
  [ 86.51360297 156.70794803]
  [  0.           0.        ]
  [ 90.48762476 189.30409873]
  [  0.           0.        ]
  [ 77.88260317 197.96489549]
  [ 60.26689658 197.26194894]
  [  0.           0.        ]
  [ 73.94066644 259.91335434]
  [  0.           0.        ]
  [  0.           0.        ]]
 .......
 [[  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]
  [547.84042802 252.4423849 ]]]
===== Net FPS :0.031365 =====
===== FPS :0.030672 =====
The number of valid people is appended to the output file name. The model judged that there were 19 people in total, but only 11 of them were actually drawn, after excluding 8 detections with too many missing coordinates.
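The filtering rule is the one implemented by validate_coco_pose in the modified demo: keypoints that were not found come back as coordinates near (0, 0), and a person with more than 8 missing keypoints out of 17 is dropped. A compact sketch of that counting logic:

# Sketch of the filtering rule used above (mirrors validate_coco_pose):
# missing keypoints are returned as (0, 0), and a person with more than
# 8 missing keypoints out of 17 is ignored.
def count_valid_persons(person_conf_multi, max_missing=8):
    valid = 0
    for pose in person_conf_multi:
        missing = sum(1 for p in pose if p[0] < 0.1 or p[1] < 0.1)
        if missing <= max_missing:
            valid += 1
    return valid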
Accuracy comparison with mobilenet
A few example images show that resnet-101 is more accurate.
resnet-101 test images
mobilenet-thin test images
mobilenet-v2-small test images
mobilenet-v2-large test images
Video file test and check the FPS
The following video_multiperson.py adapts the multi-person demo for video: it reads frames with OpenCV, resizes them to the inference resolution, runs the COCO model on each frame, and writes the annotated frames to /tmp/resnet_output.mp4 while printing the per-frame FPS and net FPS.

import os
import sys
import numpy as np
sys.path.append(os.path.dirname(__file__) + "/../")

#from scipy.misc import imread, imsave
from PIL import Image, ImageDraw, ImageFont
import cv2
import time

from util.config import load_config
from dataset.factory import create as create_dataset
from nnet import predict
from util import visualize
from dataset.pose_dataset import data_to_input
from multiperson.detections import extract_detections
from multiperson.predict import SpatialModel, eval_graph, get_person_conf_multicut
import argparse

sample_dir = '/home/spypiggy/src/test_images/'
parser = argparse.ArgumentParser(description="Tensorflow Pose Estimation Example")
parser.add_argument("--video", type=str, default = sample_dir + "video.avi", help="video file name")
parser.add_argument("--res", type=str, default = "640x320", help="video file resolution")
args = parser.parse_args()
res = args.res.split('x')
inference_w, inference_h = int(res[0]), int(res[1])

'''
Total 17 points in COCO
'''
def validate_coco_pose(pose):
    err = 0
    for p in pose:
        if p[0] < 0.1 or p[1] < 0.1 :
            err += 1
    if err > 8 :
        return False
    return True


def draw_coco_points(image, persons):
    fontname = '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc'
    fnt = ImageFont.truetype(fontname, 15)
    draw = ImageDraw.Draw(image)
    radius = 3
    clr = (0,255,0)
    draw_person = 0
    thickness = 3
    for j in range(len(persons)):
        pose = persons[j]
        if False == validate_coco_pose(pose):
            continue
        draw_person += 1
        #if j < 11:
        #    continue
        for i in range(len(pose)):
            p = pose[i]
            cx = p[0]
            cy = p[1]
            if cx < 0.1 and cy < 0.1 :
                continue
            draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius), outline = clr, width=3)
            draw.text((cx + 10, cy), "%d"%i, font=fnt, fill=(255,255,255))

        #draw nose -> REye (0-> 2)
        if all(pose[0]) and all(pose[2]):
            draw.line([tuple(pose[0]), tuple(pose[2])], width = thickness, fill=(219,0,219))
        #draw nose -> LEye (0-> 1)
        if all(pose[0]) and all(pose[1]):
            draw.line([tuple(pose[0]), tuple(pose[1])], width = thickness, fill=(219,0,219))
        #draw LEye -> LEar (1-> 3)
        if all(pose[1]) and all(pose[3]):
            draw.line([tuple(pose[1]), tuple(pose[3])], width = thickness, fill=(219,0,219))
        #draw REye -> REar (2-> 4)
        if all(pose[2]) and all(pose[4]):
            draw.line([tuple(pose[2]), tuple(pose[4])], width = thickness, fill=(219,0,219))
        #draw RShoulder -> RHip (6-> 12)
        if all(pose[6]) and all(pose[12]):
            draw.line([tuple(pose[6]), tuple(pose[12])], width = thickness, fill=(153,0,51))
        #draw LShoulder -> LHip (5-> 11)
        if all(pose[5]) and all(pose[11]):
            draw.line([tuple(pose[5]), tuple(pose[11])], width = thickness, fill=(153,0,51))
        #draw RShoulder -> LShoulder (6-> 5)
        if all(pose[6]) and all(pose[5]):
            draw.line([tuple(pose[6]), tuple(pose[5])], width = thickness, fill=(255,102,51))
        #draw RShoulder -> RElbow (6-> 8)
        if all(pose[6]) and all(pose[8]):
            draw.line([tuple(pose[6]), tuple(pose[8])], width = thickness, fill=(255,255,51))
        #draw RElbow -> RWrist (8-> 10)
        if all(pose[8]) and all(pose[10]):
            draw.line([tuple(pose[8]), tuple(pose[10])], width = thickness, fill=(255,255,51))
        #draw LShoulder -> LElbow (5-> 7)
        if all(pose[5]) and all(pose[7]):
            draw.line([tuple(pose[5]), tuple(pose[7])], width = thickness, fill=(51,255,51))
        #draw LElbow -> LWrist (7-> 9)
        if all(pose[7]) and all(pose[9]):
            draw.line([tuple(pose[7]), tuple(pose[9])], width = thickness, fill=(51,255,51))
        #draw RHip -> RKnee (12-> 14)
        if all(pose[12]) and all(pose[14]):
            draw.line([tuple(pose[12]), tuple(pose[14])], width = thickness, fill=(51,102,51))
        #draw RKnee -> RFoot (14-> 16)
        if all(pose[14]) and all(pose[16]):
            draw.line([tuple(pose[14]), tuple(pose[16])], width = thickness, fill=(51,102,51))
        #draw LHip -> LKnee (11-> 13)
        if all(pose[11]) and all(pose[13]):
            draw.line([tuple(pose[11]), tuple(pose[13])], width = thickness, fill=(51,51,204))
        #draw LKnee -> LFoot (13-> 15)
        if all(pose[13]) and all(pose[15]):
            draw.line([tuple(pose[13]), tuple(pose[15])], width = thickness, fill=(51,51,204))
    return image, draw_person


cfg = load_config("demo/pose_cfg_multi.yaml")
dataset = create_dataset(cfg)

sm = SpatialModel(cfg)
sm.load()

# Load and setup CNN part detector
sess, inputs, outputs = predict.setup_pose_prediction(cfg)

cap = cv2.VideoCapture(args.video)
if cap is None:
    print("Video[%s] Open Error"%(args.video))
    sys.exit(0)

ret_val, img = cap.read()
if ret_val == False:
    print('No valid video frame')
    sys.exit(0)

height, width, _ = img.shape
fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
out_video = cv2.VideoWriter('/tmp/resnet_output.mp4', fourcc, cap.get(cv2.CAP_PROP_FPS), (inference_w, inference_h))

count = 0
t_netfps_time = 0
t_fps_time = 0
start = time.time()
while cap.isOpened():
    ret_val, dst = cap.read()
    if ret_val == False:
        print("Frame read End")
        break
    image = Image.fromarray(np.uint8(dst))    #OpenCV format -> PIL format
    image = image.resize((inference_w, inference_h))
    image_batch = data_to_input(image)

    net_start = time.time()
    # Compute prediction with the CNN
    outputs_np = sess.run(outputs, feed_dict={inputs: image_batch})
    scmap, locref, pairwise_diff = predict.extract_cnn_output(outputs_np, cfg, dataset.pairwise_stats)
    detections = extract_detections(cfg, scmap, locref, pairwise_diff)
    unLab, pos_array, unary_array, pwidx_array, pw_array = eval_graph(sm, detections)
    person_conf_multi = get_person_conf_multicut(sm, unLab, unary_array, pos_array)
    net_end = time.time()
    netfps = 1.0 / (net_end - net_start)
    print('Frame[%d] ===== Net FPS :%f ====='%(count + 1, netfps))

    image, draw_person = draw_coco_points(image, person_conf_multi)
    img = np.asarray(image, dtype="uint8")    #PIL format -> OpenCV format
    fps = 1.0 / (time.time() - start)
    cv2.putText(img, "FPS[%4.1f] NET_FPS[%4.1f]"%(fps, netfps), (20, 40), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    out_video.write(img)
    start = time.time()
    t_netfps_time += netfps
    t_fps_time += fps
    count += 1

print("==== Summary ====")
print("Video Resolution W[%d] H[%d] -> inference size W[%d] H[%d]"%(width, height, inference_w, inference_h))
if count:
    print("avg fps[%f] avg net_fps[%f]"%(t_fps_time / count, t_netfps_time / count))
cv2.destroyAllWindows()
out_video.release()
cap.release()
<video_multiperson.py>
For testing, we will use the video file used by OpenPose.
(python) spypiggy@XavierNX:~/src/pose-tensorflow$ python3 demo/video_multiperson.py --video='../test_images/video.avi' --res=432x368
......
==== Summary ====
Video Resolution W[1280] H[720] -> inference size W[640] H[480]
avg fps[1.478159] avg net_fps[1.709562]
Keypoint detection was performed by converting the original video resolution (1280x720) to the inference size (432x368). Roughly 2.7 FPS can be obtained.
I tested the same video file with mobilenet in a previous post. Let's compare the MobileNet and ResNet-101 results.
<resnet-101 result video>
Wrapping Up
As you can see from the pictures, ResNet-101 detects keypoints much more accurately. However, because the model is heavier, the processing speed is relatively low: about 1.7 FPS in this test. If the inference size is reduced to about 480x320, a performance of 2.7 to 3 FPS can be obtained. However, if the image is reduced too much, keypoint detection accuracy may suffer, so it is best to determine the proper inference size through testing, as sketched below.
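One simple way to pick that size is to time a few candidate resolutions on a single frame before committing to a full video run. The helper below is only a hypothetical sketch of that idea (the candidate sizes are examples, not recommendations), reusing the same calls as video_multiperson.py.

# A hypothetical helper (not part of the repository): time one frame at a few
# candidate inference sizes to find an acceptable speed/accuracy trade-off.
import time
from dataset.pose_dataset import data_to_input

def benchmark_inference_sizes(sess, inputs, outputs, frame,
                              sizes=((640, 480), (480, 320), (432, 368))):
    results = {}
    for w, h in sizes:
        batch = data_to_input(frame.resize((w, h)))      # frame is a PIL image
        sess.run(outputs, feed_dict={inputs: batch})     # warm-up, not timed
        start = time.time()
        sess.run(outputs, feed_dict={inputs: batch})
        results[(w, h)] = 1.0 / (time.time() - start)    # rough net FPS
    return results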