Wednesday, December 18

I described object detection using NVidia's DNN vision library in another blog post:
<object detection using the NVidia DNN vision library>
In this post, we will compare the performance and accuracy of Facebook's DETR and NVidia's DNN vision library described in the previous article. However, there are some restrictions.
A direct network performance comparison is difficult because the models supported by NVidia's detectNet and the models supported by DETR are different.
| | Supported models |
|---|---|
| NVidia detectNet | ssd-mobilenet-v2, ssd-inception-v2, pednet, multiped, facenet, coco-airplane, coco-chair, coco-dog |
| FaceBook DETR | detr_resnet50, detr_resnet50_dc5, detr_resnet50_dc5_panoptic, detr_resnet50_panoptic, detr_resnet101, detr_resnet101_dc5, detr_resnet101_dc5_panoptic, detr_resnet101_panoptic |
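You can check which DETR variants are actually published on torch.hub with the snippet below (the same call appears commented out in the test script further down). This is a quick sketch that needs network access to GitHub the first time it runs:

import torch as th

# list the DETR entry points published in the facebookresearch/detr hub repo
print(th.hub.list('facebookresearch/detr'))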
For testing, NVidia detectNet used ssd-mobilenet-v2 and DETR used detr_resnet50. For the inference images, some of the sample images provided with detectNet were used. "https://spyjetson.blogspot.com/2019/12/jetsonnano-hello-ai-world-nvidia-dnn.html" explains how to install the NVidia DNN vision library.
Test codes
I will proceed with the test in the following order.
- Create the "/usr/local/src/test_images" directory in advance, then copy the test images described above into it. 52 files were used for the test.
- The resulting files are stored in "/usr/local/src/result".
- The saved file name format is "original file name + model name + FPS".
- It does not matter which of the two models you test first.
- Each script processes every jpg file in the source image directory and saves the resulting image. A small sanity check for this layout follows below.
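Before running either test, a few lines of Python can verify the layout above. This is a minimal sketch assuming the two paths used throughout this post:

import os, glob

# check that the source directory holds the test images and make sure
# the result directory exists (create it if necessary)
src_path = '/usr/local/src/test_images'
dest_path = '/usr/local/src/result'
os.makedirs(dest_path, exist_ok=True)
imgs = glob.glob(os.path.join(src_path, '*.jpg'))
print('%d test images found' % len(imgs))  # 52 in my setup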
The following code is a modification of the detr.py code I wrote while describing DETR in the previous blog post.
import torch as th
import torchvision
import torchvision.transforms as T
import requests, sys, time, os
from PIL import Image, ImageDraw, ImageFont
import argparse
import gc

print('pytorch', th.__version__)
print('torchvision', torchvision.__version__)

parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default="resnet50",
                    help='network model -> resnet50 or resnet101 or resnet50_dc5 or resnet50_panoptic')
parser.add_argument("--threshold", type=float, default=0.7,
                    help="minimum detection threshold to use")
args = parser.parse_args()

'''
# if you want to view supported models, use these codes.
name = th.hub.list('facebookresearch/detr')
print(name)
'''

if args.model == 'resnet50':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
elif args.model == 'resnet50_dc5':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet50_dc5', pretrained=True)
elif args.model == 'resnet50_dc5_panoptic':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet50_dc5_panoptic', pretrained=True)
elif args.model == 'resnet50_panoptic':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet50_panoptic', pretrained=True)
elif args.model == 'resnet101':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet101', pretrained=True)
elif args.model == 'resnet101_dc5':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet101_dc5', pretrained=True)
elif args.model == 'resnet101_dc5_panoptic':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet101_dc5_panoptic', pretrained=True)
elif args.model == 'resnet101_panoptic':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet101_panoptic', pretrained=True)
else:
    print('Unknown network name[%s]' % (args.model))
    sys.exit(0)

# move the network to the GPU and record the load time
t1 = time.time()
model.eval()
model = model.cuda()
print('model[%s] load success' % args.model)
t2 = time.time()
print("======== Network Load time:%f" % (t2 - t1))

# standard ImageNet normalization that DETR was trained with
transform = T.Compose([
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# COCO class names; 'N/A' marks class ids that COCO does not use
CLASSES = [
    'N/A', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
    'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A',
    'backpack', 'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase',
    'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat',
    'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle',
    'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana',
    'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A',
    'dining table', 'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse',
    'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
    'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase', 'scissors',
    'teddy bear', 'hair drier', 'toothbrush'
]

# semi-transparent RGBA colors for the box overlays
COLORS = [(0, 45, 74, 127), (85, 32, 98, 127), (93, 69, 12, 127),
          (49, 18, 55, 127), (46, 67, 18, 127), (30, 74, 93, 127)]

src_path = '/usr/local/src/test_images'
dest_path = '/usr/local/src/result'
fnt = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf', 16)
fnt2 = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 30)

for root, dirs, files in os.walk(src_path):
    for fname in files:
        gc.collect()
        full_fname = os.path.join(root, fname)
        img = Image.open(full_fname).convert('RGB')
        W, H = img.size

        # time the preprocessing + inference to get the FPS value
        t1 = time.time()
        img_tens = transform(img).unsqueeze(0).cuda()
        with th.no_grad():
            output = model(img_tens)
        elapsed = time.time() - t1
        fps = 1.0 / elapsed
        print("FPS:%f" % (fps))

        im2 = img.copy()
        drw = ImageDraw.Draw(im2, 'RGBA')
        pred_logits = output['pred_logits'][0]
        pred_boxes = output['pred_boxes'][0]
        color_index = 0
        for logits, box in zip(pred_logits, pred_boxes):
            m = th.nn.Softmax(dim=0)
            prob = m(logits)
            top3 = th.topk(logits, 3)
            # skip the "no object" class and low-confidence detections
            if top3.indices[0] >= len(CLASSES) or prob[top3.indices[0]] < args.threshold:
                continue
            print(' ===== print top3 values =====')
            print('top3', top3)
            print('top 1: Label[%-20s] probability[%5.3f]' % (CLASSES[top3.indices[0]], prob[top3.indices[0]] * 100))
            if top3.indices[1] < len(CLASSES):
                print('top 2: Label[%-20s] probability[%5.3f]' % (CLASSES[top3.indices[1]], prob[top3.indices[1]] * 100))
            if top3.indices[2] < len(CLASSES):
                print('top 3: Label[%-20s] probability[%5.3f]' % (CLASSES[top3.indices[2]], prob[top3.indices[2]] * 100))
            cls = top3.indices[0]
            label = '%s-%4.2f' % (CLASSES[cls], prob[cls] * 100)
            print(label)
            # DETR boxes are (center x, center y, w, h) normalized to [0, 1]
            box = box.cpu() * th.Tensor([W, H, W, H])
            x, y, w, h = box
            x0, x1 = x - w // 2, x + w // 2
            y0, y1 = y - h // 2, y + h // 2
            color = COLORS[color_index % len(COLORS)]
            color_index += 1
            drw.rectangle([x0, y0, x1, y1], fill=color, width=5)
            drw.text((x, y), label, font=fnt, fill='white')
        output = None
        th.cuda.empty_cache()

        # stamp the FPS on the image and save it as "name_detr_<model>_<fps>.jpg"
        drw.text((5, 5), 'FPS-%4.2f' % (fps), font=fnt2, fill='green')
        s = os.path.splitext(fname)[0]
        out_name = os.path.join(dest_path, s + '_detr_%s_%4.2f.jpg' % (args.model, fps))
        im2.save(out_name)
<detr_dir.py>
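To run the DETR test over the whole directory, pass the backbone name on the command line, for example "python3 detr_dir.py --model resnet50". The detection threshold defaults to 0.7 and can be changed with --threshold.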
The following code is a modification of the detectnet-console.py code from the previous detectNet article.
#!/usr/bin/python3
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#
import jetson.inference
import jetson.utils

import argparse
import sys, time, os

# parse the command line
parser = argparse.ArgumentParser(description="Locate objects in an image using an object detection DNN.",
                                 formatter_class=argparse.RawTextHelpFormatter,
                                 epilog=jetson.inference.detectNet.Usage())
parser.add_argument("--network", type=str, default="ssd-mobilenet-v2",
                    help="pre-trained model to load (see below for options)")
parser.add_argument("--overlay", type=str, default="box,labels,conf",
                    help="detection overlay flags (e.g. --overlay=box,labels,conf)\nvalid combinations are: 'box', 'labels', 'conf', 'none'")
parser.add_argument("--threshold", type=float, default=0.5,
                    help="minimum detection threshold to use")

try:
    opt = parser.parse_known_args()[0]
except:
    print("")
    parser.print_help()
    sys.exit(0)

# load the object detection network
t1 = time.time()
net = jetson.inference.detectNet(opt.network, sys.argv, opt.threshold)
t2 = time.time()
print("======== Network Load time:%f" % (t2 - t1))

src_path = '/usr/local/src/test_images'
dest_path = '/usr/local/src/result'

for root, dirs, files in os.walk(src_path):
    for fname in files:
        full_fname = os.path.join(root, fname)
        img, width, height = jetson.utils.loadImageRGBA(full_fname)

        t1 = time.time()
        detections = net.Detect(img, width, height, opt.overlay)
        elapsed = time.time() - t1

        # print the detections
        print("detected {:d} objects in image".format(len(detections)))
        fps = 1.0 / elapsed
        print("FPS:%f" % (fps))
        for detection in detections:
            print(detection)

        # print out timing info
        net.PrintProfilerTimes()

        # save the output image with the bounding box overlays
        s = os.path.splitext(fname)[0]
        out_name = os.path.join(dest_path, s + '_detectnet_%s_%4.2f.jpg' % (opt.network, fps))
        jetson.utils.saveImageRGBA(out_name, img, width, height)
<detectnet_dir.py>
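The detectNet test runs the same way, for example "python3 detectnet_dir.py --network ssd-mobilenet-v2"; here the default threshold is 0.5.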
Test Results
Performance
In terms of recognition speed, NVidia's detectNet is overwhelmingly faster. detectNet using ssd-mobilenet-v2 recorded more than 20 FPS, while DETR using resnet50 only reached around 0.x to 1.x FPS.
<The FPS value appears at the end of each result file name.>
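Since the FPS is embedded in every result file name, the per-framework averages can be summarized after the fact. The helper below is a hypothetical sketch I did not use in the original test; it assumes the result directory and the "_<model>_<fps>.jpg" naming format produced by the two scripts:

import os, re

# hypothetical helper: average the FPS values embedded at the end of the
# result file names written by detr_dir.py and detectnet_dir.py
def mean_fps(tag, dest_path='/usr/local/src/result'):
    vals = []
    for f in os.listdir(dest_path):
        m = re.search(r'_(\d+\.\d+)\.jpg$', f)
        if ('_%s_' % tag) in f and m:
            vals.append(float(m.group(1)))
    return sum(vals) / len(vals) if vals else 0.0

print('detectNet average FPS: %.2f' % mean_fps('detectnet'))
print('DETR average FPS: %.2f' % mean_fps('detr'))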
Accuracy
The following is a comparison of some of the test results. In each pair, the left image is the detectNet result and the right image is the DETR result.
<bird_2.jpg>
<dog_0.jpg>
<dog_3.jpg>
<peds_3.jpg>
Overall, DETR's detections seem more accurate. However, DETR sometimes produces noticeably overlapping detections. In peds_3.jpg, for example, you can see that the suitcase was detected more than once.
In the following picture, you can see the horse being detected repeatedly as well.
<horse_2.jpg>
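DETR is trained to avoid duplicate predictions by design, so these overlaps are a bit surprising. If they matter for your use case, a standard non-maximum suppression pass can filter them after the fact. The sketch below is my own addition rather than part of DETR or of detr_dir.py, and it assumes the boxes have already been converted to pixel (x0, y0, x1, y1) coordinates as in that script:

import torch as th
import torchvision.ops as ops

# filter overlapping boxes: keep only the highest-scoring box in each
# group of boxes whose IoU exceeds the threshold
def suppress_duplicates(boxes_xyxy, scores, iou_threshold=0.5):
    # boxes_xyxy: (N, 4) float tensor, scores: (N,) float tensor
    keep = ops.nms(boxes_xyxy, scores, iou_threshold)
    return boxes_xyxy[keep], scores[keep]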
Wrapping up
Because ssd-mobilenet-v2 is much lighter than the resnet-50 model, this comparison alone is not enough to conclude that DETR is slow. However, considering the limited set of models currently usable on the Jetson Nano, the slow processing speed of DETR, which requires a resnet50-class backbone for object detection, is a real problem.
In the future, I expect DETR to keep developing and to gain models suitable for small devices such as the Jetson Nano.
You can download the source code and image files at https://github.com/raspberry-pi-maker/NVIDIA-Jetson .