Wednesday, December 18

I described object detection using NVidia's DNN vision library in another blog post:
<object detection using the NVidia DNN vision library>
In this post, we will compare the performance and accuracy of Facebook's DETR and NVidia's DNN vision library described in the previous article. However, there are some restrictions.
A direct network performance comparison is difficult because the models supported by NVidia's detectNet and the models supported by DETR are different.
| | Supported models |
|---|---|
| NVidia detectNet | ssd-mobilenet-v2, ssd-inception-v2, pednet, multiped, facenet, coco-airplane, coco-chair, coco-dog |
| FaceBook DETR | detr_resnet50, detr_resnet50_dc5, detr_resnet50_dc5_panoptic, detr_resnet50_panoptic, detr_resnet101, detr_resnet101_dc5, detr_resnet101_dc5_panoptic, detr_resnet101_panoptic |
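You can check which DETR variants are actually published on torch.hub with the snippet below (the same call appears commented out in the test script further down). This is a quick sketch that needs network access to GitHub the first time it runs:

import torch as th

# list the DETR entry points published in the facebookresearch/detr hub repo
print(th.hub.list('facebookresearch/detr'))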
For testing, NVidia detectNet used ssd-mobilenet-v2 and DETR used detr_resnet50. For the inference images, some of the sample images provided with detectNet were used. "https://spyjetson.blogspot.com/2019/12/jetsonnano-hello-ai-world-nvidia-dnn.html" explains how to install the NVidia DNN vision library.
Test codes
I will proceed with the test in the following order.
- Create the "/usr/local/src/test_images" directory in advance, then copy the test images described above into it. 52 files were used for the test.
- The resulting files are stored in "/usr/local/src/result".
- The saved file name format is "original file name + model name + FPS".
- It does not matter which of the two models you test first.
- Each script processes every jpg file in the source image directory and saves the resulting image. A small sanity check for this layout follows below.
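Before running either test, a few lines of Python can verify the layout above. This is a minimal sketch assuming the two paths used throughout this post:

import os, glob

# check that the source directory holds the test images and make sure
# the result directory exists (create it if necessary)
src_path = '/usr/local/src/test_images'
dest_path = '/usr/local/src/result'
os.makedirs(dest_path, exist_ok=True)
imgs = glob.glob(os.path.join(src_path, '*.jpg'))
print('%d test images found' % len(imgs))  # 52 in my setup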
The following code is a modification of the detr.py code I wrote while describing DETR in the previous blog post.
import torch as th
import torchvision
import torchvision.transforms as T
import requests, sys, time, os
from PIL import Image, ImageDraw, ImageFont
import argparse
import gc

print('pytorch', th.__version__)
print('torchvision', torchvision.__version__)

parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default="resnet50",
                    help='network model -> resnet50 or resnet101 or resnet50_dc5 or resnet50_panoptic')
parser.add_argument("--threshold", type=float, default=0.7,
                    help="minimum detection threshold to use")
args = parser.parse_args()

'''
# if you want to view supported models, use these codes.
name = th.hub.list('facebookresearch/detr')
print(name)
'''

if args.model == 'resnet50':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
elif args.model == 'resnet50_dc5':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet50_dc5', pretrained=True)
elif args.model == 'resnet50_dc5_panoptic':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet50_dc5_panoptic', pretrained=True)
elif args.model == 'resnet50_panoptic':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet50_panoptic', pretrained=True)
elif args.model == 'resnet101':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet101', pretrained=True)
elif args.model == 'resnet101_dc5':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet101_dc5', pretrained=True)
elif args.model == 'resnet101_dc5_panoptic':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet101_dc5_panoptic', pretrained=True)
elif args.model == 'resnet101_panoptic':
    model = th.hub.load('facebookresearch/detr', 'detr_resnet101_panoptic', pretrained=True)
else:
    print('Unknown network name[%s]' % (args.model))
    sys.exit(0)

# move the network to the GPU and record the load time
t1 = time.time()
model.eval()
model = model.cuda()
print('model[%s] load success' % args.model)
t2 = time.time()
print("======== Network Load time:%f" % (t2 - t1))

# standard ImageNet normalization that DETR was trained with
transform = T.Compose([
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# COCO class names; 'N/A' marks class ids that COCO does not use
CLASSES = [
    'N/A', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
    'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A',
    'backpack', 'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase',
    'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat',
    'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle',
    'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana',
    'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A',
    'dining table', 'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse',
    'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
    'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase', 'scissors',
    'teddy bear', 'hair drier', 'toothbrush'
]

# semi-transparent RGBA colors for the box overlays
COLORS = [(0, 45, 74, 127), (85, 32, 98, 127), (93, 69, 12, 127),
          (49, 18, 55, 127), (46, 67, 18, 127), (30, 74, 93, 127)]

src_path = '/usr/local/src/test_images'
dest_path = '/usr/local/src/result'
fnt = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf', 16)
fnt2 = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 30)

for root, dirs, files in os.walk(src_path):
    for fname in files:
        gc.collect()
        full_fname = os.path.join(root, fname)
        img = Image.open(full_fname).convert('RGB')
        W, H = img.size

        # time the preprocessing + inference to get the FPS value
        t1 = time.time()
        img_tens = transform(img).unsqueeze(0).cuda()
        with th.no_grad():
            output = model(img_tens)
        elapsed = time.time() - t1
        fps = 1.0 / elapsed
        print("FPS:%f" % (fps))

        im2 = img.copy()
        drw = ImageDraw.Draw(im2, 'RGBA')
        pred_logits = output['pred_logits'][0]
        pred_boxes = output['pred_boxes'][0]
        color_index = 0
        for logits, box in zip(pred_logits, pred_boxes):
            m = th.nn.Softmax(dim=0)
            prob = m(logits)
            top3 = th.topk(logits, 3)
            # skip the "no object" class and low-confidence detections
            if top3.indices[0] >= len(CLASSES) or prob[top3.indices[0]] < args.threshold:
                continue
            print(' ===== print top3 values =====')
            print('top3', top3)
            print('top 1: Label[%-20s] probability[%5.3f]' % (CLASSES[top3.indices[0]], prob[top3.indices[0]] * 100))
            if top3.indices[1] < len(CLASSES):
                print('top 2: Label[%-20s] probability[%5.3f]' % (CLASSES[top3.indices[1]], prob[top3.indices[1]] * 100))
            if top3.indices[2] < len(CLASSES):
                print('top 3: Label[%-20s] probability[%5.3f]' % (CLASSES[top3.indices[2]], prob[top3.indices[2]] * 100))
            cls = top3.indices[0]
            label = '%s-%4.2f' % (CLASSES[cls], prob[cls] * 100)
            print(label)
            # DETR boxes are (center x, center y, w, h) normalized to [0, 1]
            box = box.cpu() * th.Tensor([W, H, W, H])
            x, y, w, h = box
            x0, x1 = x - w // 2, x + w // 2
            y0, y1 = y - h // 2, y + h // 2
            color = COLORS[color_index % len(COLORS)]
            color_index += 1
            drw.rectangle([x0, y0, x1, y1], fill=color, width=5)
            drw.text((x, y), label, font=fnt, fill='white')
        output = None
        th.cuda.empty_cache()

        # stamp the FPS on the image and save it as "name_detr_<model>_<fps>.jpg"
        drw.text((5, 5), 'FPS-%4.2f' % (fps), font=fnt2, fill='green')
        s = os.path.splitext(fname)[0]
        out_name = os.path.join(dest_path, s + '_detr_%s_%4.2f.jpg' % (args.model, fps))
        im2.save(out_name)
<detr_dir.py>
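To run the DETR test over the whole directory, pass the backbone name on the command line, for example "python3 detr_dir.py --model resnet50". The detection threshold defaults to 0.7 and can be changed with --threshold.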
The following code is a modification of the detectnet-console.py code from the previous detectNet article.
#!/usr/bin/python3
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#
import jetson.inference
import jetson.utils

import argparse
import sys, time, os

# parse the command line
parser = argparse.ArgumentParser(description="Locate objects in an image using an object detection DNN.",
                                 formatter_class=argparse.RawTextHelpFormatter,
                                 epilog=jetson.inference.detectNet.Usage())
parser.add_argument("--network", type=str, default="ssd-mobilenet-v2",
                    help="pre-trained model to load (see below for options)")
parser.add_argument("--overlay", type=str, default="box,labels,conf",
                    help="detection overlay flags (e.g. --overlay=box,labels,conf)\nvalid combinations are: 'box', 'labels', 'conf', 'none'")
parser.add_argument("--threshold", type=float, default=0.5,
                    help="minimum detection threshold to use")

try:
    opt = parser.parse_known_args()[0]
except:
    print("")
    parser.print_help()
    sys.exit(0)

# load the object detection network
t1 = time.time()
net = jetson.inference.detectNet(opt.network, sys.argv, opt.threshold)
t2 = time.time()
print("======== Network Load time:%f" % (t2 - t1))

src_path = '/usr/local/src/test_images'
dest_path = '/usr/local/src/result'

for root, dirs, files in os.walk(src_path):
    for fname in files:
        full_fname = os.path.join(root, fname)
        img, width, height = jetson.utils.loadImageRGBA(full_fname)

        t1 = time.time()
        detections = net.Detect(img, width, height, opt.overlay)
        elapsed = time.time() - t1

        # print the detections
        print("detected {:d} objects in image".format(len(detections)))
        fps = 1.0 / elapsed
        print("FPS:%f" % (fps))
        for detection in detections:
            print(detection)

        # print out timing info
        net.PrintProfilerTimes()

        # save the output image with the bounding box overlays
        s = os.path.splitext(fname)[0]
        out_name = os.path.join(dest_path, s + '_detectnet_%s_%4.2f.jpg' % (opt.network, fps))
        jetson.utils.saveImageRGBA(out_name, img, width, height)
<detectnet_dir.py>
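The detectNet test runs the same way, for example "python3 detectnet_dir.py --network ssd-mobilenet-v2"; here the default threshold is 0.5.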
Test Results
Performance
In terms of recognition speed, NVidia's detectNet is overwhelmingly faster. detectNet using ssd-mobilenet-v2 recorded more than 20 FPS, while DETR using resnet50 only reached around 0.x to 1.x FPS.
<The FPS value appears at the end of each result file name.>
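Since the FPS is embedded in every result file name, the per-framework averages can be summarized after the fact. The helper below is a hypothetical sketch I did not use in the original test; it assumes the result directory and the "_<model>_<fps>.jpg" naming format produced by the two scripts:

import os, re

# hypothetical helper: average the FPS values embedded at the end of the
# result file names written by detr_dir.py and detectnet_dir.py
def mean_fps(tag, dest_path='/usr/local/src/result'):
    vals = []
    for f in os.listdir(dest_path):
        m = re.search(r'_(\d+\.\d+)\.jpg$', f)
        if ('_%s_' % tag) in f and m:
            vals.append(float(m.group(1)))
    return sum(vals) / len(vals) if vals else 0.0

print('detectNet average FPS: %.2f' % mean_fps('detectnet'))
print('DETR average FPS: %.2f' % mean_fps('detr'))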
Accuracy
The following is a comparison of some of the test results. In each pair, the left image is the detectNet result and the right image is the DETR result.
<bird_2.jpg>
<dog_0.jpg>
<dog_3.jpg>
<peds_3.jpg>
Overall, DETR's detections seem more accurate. However, DETR sometimes produces noticeably overlapping detections. In peds_3.jpg, for example, you can see that the suitcase was detected more than once.
In the following picture, you can see the horse being detected repeatedly as well.
<horse_2.jpg>
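DETR is trained to avoid duplicate predictions by design, so these overlaps are a bit surprising. If they matter for your use case, a standard non-maximum suppression pass can filter them after the fact. The sketch below is my own addition rather than part of DETR or of detr_dir.py, and it assumes the boxes have already been converted to pixel (x0, y0, x1, y1) coordinates as in that script:

import torch as th
import torchvision.ops as ops

# filter overlapping boxes: keep only the highest-scoring box in each
# group of boxes whose IoU exceeds the threshold
def suppress_duplicates(boxes_xyxy, scores, iou_threshold=0.5):
    # boxes_xyxy: (N, 4) float tensor, scores: (N,) float tensor
    keep = ops.nms(boxes_xyxy, scores, iou_threshold)
    return boxes_xyxy[keep], scores[keep]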
Wrapping up
Because ssd-mobilenet-v2 is much lighter than the resnet-50 model, this comparison alone is not enough to conclude that DETR is slow. However, considering the limited set of models currently usable on the Jetson Nano, the slow processing speed of DETR, which requires a resnet50-class backbone for object detection, is a real problem.
In the future, I expect DETR to keep developing and to gain models suitable for small devices such as the Jetson Nano.
You can download the source code and image files at https://github.com/raspberry-pi-maker/NVIDIA-Jetson .