Transformers are a deep learning architecture that has gained popularity in recent years, particularly for sequential data, as in natural language processing (NLP) tasks such as language modelling and machine translation.
Transformers have also been extended to tasks such as speech recognition, symbolic mathematics, and reinforcement learning. In the computer vision field, however, DETR (DEtection TRansformer), released by Facebook, was the first object detection model built around a Transformer.
“Current detectors required several years of improvements to cope with similar issues, and we expect future work to successfully address them for DETR,” the paper authors wrote.
Prerequisites
Models
Model Zoo
We provide baseline DETR and DETR-DC5 models, and plan to include more in the future. AP is computed on COCO 2017 val5k, and inference time is measured over the first 100 val5k COCO images, with the torchscript transformer.
| # | name | backbone | schedule | inf_time | box AP | url | size |
|---|----------|----------|----------|----------|--------|----------|-------|
| 0 | DETR | R50 | 500 | 0.036 | 42.0 | download | 159Mb |
| 1 | DETR-DC5 | R50 | 500 | 0.083 | 43.3 | download | 159Mb |
| 2 | DETR | R101 | 500 | 0.050 | 43.5 | download | 232Mb |
| 3 | DETR-DC5 | R101 | 500 | 0.097 | 44.9 | download | 232Mb |

COCO val5k evaluation results can be found in this gist.
COCO panoptic val5k models:
| # | name | backbone | box AP | segm AP | PQ | url | size |
|---|----------|----------|--------|---------|------|----------|-------|
| 0 | DETR | R50 | 38.8 | 31.1 | 43.4 | download | 165Mb |
| 1 | DETR-DC5 | R50 | 40.2 | 31.9 | 44.6 | download | 165Mb |
| 2 | DETR | R101 | 40.1 | 33.0 | 45.1 | download | 237Mb |

The models are also available via torch hub; to load DETR R50 with pretrained weights, simply do:
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
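Once the hub call above returns, a quick sanity check is to push a dummy batch through the model and look at the output shapes. This is only a minimal sketch, continuing from the line above; the random tensor is a placeholder for a real normalized image, and the shape comments reflect the COCO-pretrained checkpoints (100 object queries, 91 class ids plus a "no object" slot).

import torch

model.eval()

# Dummy batch: one RGB image; any reasonable size works, DETR is size-agnostic
dummy = torch.randn(1, 3, 800, 1200)

with torch.no_grad():
    out = model(dummy)

# DETR predicts a fixed set of 100 object queries per image:
#   'pred_logits' -> [1, 100, 92] class scores (91 COCO ids + one "no object" slot)
#   'pred_boxes'  -> [1, 100, 4] boxes as normalized (cx, cy, w, h)
print(out['pred_logits'].shape, out['pred_boxes'].shape)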
Using models in the github repo
torch.hub.load(github, model, *args, **kwargs)
Load a model from a github repo, with pretrained weights.
- Parameters
github (string) – a string with format “repo_owner/repo_name[:tag_name]” with an optional tag/branch. The default branch is master if not specified. Example: ‘pytorch/vision[:hub]’
model (string) – a string of entrypoint name defined in repo’s hubconf.py
*args (optional) – the corresponding args for callable model.
force_reload (bool, optional) – whether to force a fresh download of github repo unconditionally. Default is False.
verbose (bool, optional) – If False, mute messages about hitting local caches. Note that the message about the first download cannot be muted. Default is True.
**kwargs (optional) – the corresponding kwargs for callable model.
- Returns
a single model with corresponding pretrained weights.
Example
>>> model = torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)
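For the DETR repo specifically, the same API applies. Below is a small sketch (my own, using only the parameters documented above plus torch.hub.list, which the test script also mentions) that lists the available entrypoints and forces a fresh clone of the repo:

import torch

# List the entrypoints defined in facebookresearch/detr's hubconf.py
print(torch.hub.list('facebookresearch/detr'))
# e.g. ['detr_resnet50', 'detr_resnet50_dc5', 'detr_resnet101', ...]

# Force a fresh download of the repo and silence local-cache messages
model = torch.hub.load('facebookresearch/detr', 'detr_resnet101',
                       pretrained=True, force_reload=True, verbose=False)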
Test DETR
import torch as th
import torchvision
import torchvision.transforms as T
import requests, sys, time, os
from PIL import Image, ImageDraw, ImageFont
import argparse
import gc

print('pytorch', th.__version__)
print('torchvision', torchvision.__version__)

parser = argparse.ArgumentParser()
parser.add_argument('--file', type=str, default="", help='filename to load')
parser.add_argument('--model', type=str, default="resnet50",
                    help='network model -> resnet50 or resnet101 or resnet50_dc5 or resnet50_panoptic')
parser.add_argument("--size", type=str, default='300X200', help="inference size")
parser.add_argument("--threshold", type=float, default=0.7, help="minimum detection threshold to use") args = parser.parse_args() ''' #if you want to view supported models, use these codes. name = th.hub.list('facebookresearch/detr'); print(name) ''' if args.model == 'resnet50': model = th.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True) elif args.model == 'resnet50_dc5': model = th.hub.load('facebookresearch/detr', 'detr_resnet50_dc5', pretrained=True) elif args.model == 'resnet50_dc5_panoptic': model = th.hub.load('facebookresearch/detr', 'detr_resnet50_dc5_panoptic', pretrained=True) elif args.model == 'resnet50_panoptic': model = th.hub.load('facebookresearch/detr', 'detr_resnet50_panoptic', pretrained=True) elif args.model == 'resnet101': model = th.hub.load('facebookresearch/detr', 'detr_resnet101', pretrained=True) elif args.model == 'resnet101_dc5': model = th.hub.load('facebookresearch/detr', 'detr_resnet101_dc5', pretrained=True) elif args.model == 'resnet101_dc5_panoptic': model = th.hub.load('facebookresearch/detr', 'detr_resnet101_dc5_panoptic', pretrained=True) elif args.model == 'resnet101_panoptic': model = th.hub.load('facebookresearch/detr', 'detr_resnet101_panoptic', pretrained=True) else: print('Unknown network name[%s]'%(args.model)) sys.exit(0) model.eval() model = model.cuda() print('model[%s] load success'%args.model) transform = T.Compose([ T.ToTensor(), T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) ]) CLASSES = [ 'N/A', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table', 'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush' ] COLORS = [(0, 45, 74, 127), (85, 32, 98, 127), (93, 69, 12, 127), (49, 18, 55, 127), (46, 67, 18, 127), (30, 74, 93, 127)] tmp = args.size.split('X') W = int(tmp[0]) H = int(tmp[1])
if args.file == '':
    url = 'https://i.ytimg.com/vi/vrlX3cwr3ww/maxresdefault.jpg'
    img = Image.open(requests.get(url, stream=True).raw).resize((W, H)).convert('RGB')
    filename = 'maxresdefault'
else:
    img = Image.open(args.file).convert('RGB')
    filename = os.path.splitext(os.path.basename(args.file))[0]
    W, H = img.size
print('Image load success')

img_tens = transform(img).unsqueeze(0).cuda()

count = 0
tfps = 0
fnt = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf', 16)
fnt2 = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 30)

for i in range(2):
    fps_time = time.perf_counter()
    th.cuda.empty_cache()
    gc.collect()
    with th.no_grad():
        output = model(img_tens)
    fps = 1.0 / (time.perf_counter() - fps_time)
    print("Net FPS: %f" % (fps))
    if i > 0:
        tfps += fps
        count += 1

    im2 = img.copy()
    drw = ImageDraw.Draw(im2, 'RGBA')
    pred_logits = output['pred_logits'][0]
    pred_boxes = output['pred_boxes'][0]
    color_index = 0
    for logits, box in zip(pred_logits, pred_boxes):
        m = th.nn.Softmax(dim=0)
        prob = m(logits)
        top3 = th.topk(logits, 3)
        if top3.indices[0] >= len(CLASSES) or prob[top3.indices[0]] < args.threshold:
            continue
        print(' ===== print top3 values =====')
        print('top3', top3)
        print('top 1: Label[%-20s] probability[%5.3f]' % (CLASSES[top3.indices[0]], prob[top3.indices[0]] * 100))
        if top3.indices[1] < len(CLASSES):
            print('top 2: Label[%-20s] probability[%5.3f]' % (CLASSES[top3.indices[1]], prob[top3.indices[1]] * 100))
        if top3.indices[2] < len(CLASSES):
            print('top 3: Label[%-20s] probability[%5.3f]' % (CLASSES[top3.indices[2]], prob[top3.indices[2]] * 100))
        cls = top3.indices[0]
        label = '%s-%4.2f' % (CLASSES[cls], prob[cls] * 100)
        #print(label)
        box = box.cpu() * th.Tensor([W, H, W, H])
        x, y, w, h = box
        x0, x1 = x - w // 2, x + w // 2
        y0, y1 = y - h // 2, y + h // 2
        color = COLORS[color_index % len(COLORS)]
        color_index += 1
        #drw.rectangle([x0, y0, x1, y1], outline='red', width=5)
        drw.rectangle([x0, y0, x1, y1], fill=color)
        drw.text((x, y), label, font=fnt, fill='white')
    fps = 1.0 / (time.perf_counter() - fps_time)
    print("FPS: %f" % (fps))
    output = None
    th.cuda.empty_cache()

print('Processing success')
if count > 0:
    print('AVG FPS:%f' % (tfps / count))
    drw.text((5, 5), 'FPS-%4.2f' % (tfps / count), font=fnt2, fill='green')
im2.save("./%s-%s.png" % (filename, args.model))
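As a side note, the per-box loop above can be expressed more compactly. The sketch below is an alternative I find convenient, not part of the original script: the postprocess helper name is mine, and it relies on torchvision.ops.box_convert, which is available in recent torchvision releases, to turn DETR's normalized (cx, cy, w, h) boxes into pixel-space corners in one shot.

import torch
from torchvision.ops import box_convert

def postprocess(output, img_w, img_h, threshold=0.7):
    # Class probabilities for every query; drop the trailing "no object" column
    probs = output['pred_logits'].softmax(-1)[0, :, :-1]
    scores, labels = probs.max(-1)
    keep = scores > threshold

    # Scale normalized (cx, cy, w, h) boxes to pixels and convert to corners
    boxes = output['pred_boxes'][0, keep].cpu()
    boxes = boxes * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
    boxes = box_convert(boxes, in_fmt='cxcywh', out_fmt='xyxy')
    return scores[keep].cpu(), labels[keep].cpu(), boxes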
While testing, I hit a segmentation fault coming from "libcudnn_ops_infer.so.8.0.0". After several Google searches, I found something that could be a clue at https://forums.developer.nvidia.com/t/agx-xavier-segmenation-fault-on-deepstream-people-detection-flowtron/125088.
python3 detr.py --file='./maxresdefault.jpg'
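If you run into the same cuDNN-related segmentation fault, one quick sanity check (a diagnostic sketch, not a fix) is to print the CUDA and cuDNN builds that PyTorch actually sees, so you can compare them against the libcudnn version installed on the device:

import torch

# Print the CUDA / cuDNN versions this PyTorch build was compiled against,
# to compare with the installed libcudnn_ops_infer.so version.
print('torch        :', torch.__version__)
print('CUDA build   :', torch.version.cuda)
print('cuDNN build  :', torch.backends.cudnn.version())
print('cuDNN enabled:', torch.backends.cudnn.enabled)
print('GPU          :', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')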
Wrapping up
Each FPS value below is the average over 10 repeated inference runs on the same image; a minimal sketch of this warm-up-and-average timing loop follows the table.
| Model | Inference Size | FPS | Remark |
|-----------|----------------|----------|---------------------------|
| Resnet101 | 300X200 | 1.820813 | Detection quality is poor |
| Resnet101 | 600X400 | 0.543098 | |
| Resnet101 | 800X600 | 0.432346 | |
| Resnet50 | 300X200 | 2.553107 | Detection quality is poor |
| Resnet50 | 600X400 | 1.082547 | |
| Resnet50 | 800X600 | 0.726071 | |
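For reference, the averaging itself is simple. Here is a minimal timing sketch in the spirit of the test script above (the benchmark_fps name and the default number of runs are mine; it assumes the model and img_tens variables from that script and a CUDA device):

import time
import torch as th

def benchmark_fps(model, img_tens, runs=10):
    """Average FPS over `runs` timed inferences, after one untimed warm-up."""
    with th.no_grad():
        model(img_tens)                      # warm-up (CUDA init, cuDNN autotune)
        th.cuda.synchronize()
        total = 0.0
        for _ in range(runs):
            start = time.perf_counter()
            model(img_tens)
            th.cuda.synchronize()            # wait for the GPU before stopping the clock
            total += time.perf_counter() - start
    return runs / total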