Tuesday, March 14, 2023

Xavier NX - YOLOv8 Video Object Detection (JetPack 5.1)

I mainly use OpenCV for image and video processing. However, PyTorch has added a new facility for video processing: in addition to the existing read_video function, torchvision.io now provides a VideoReader class.

As of March 2023 it is still in the beta stage, but I expect it to be used widely for video processing in the future.

The following is the VideoReader guide page on the PyTorch homepage.


READING/WRITING IMAGES AND VIDEOS

The torchvision.io package provides functions for performing IO operations. They are currently specific to reading and writing video and images.

Video

read_video(filename[, start_pts, end_pts, ...])

Reads a video from a file, returning both the video frames as well as the audio frames

read_video_timestamps(filename[, pts_unit])

List the video frames timestamps.

write_video(filename, video_array, fps[, ...])

Writes a 4d tensor in [T, H, W, C] format in a video file

Fine-grained video API

In addition to the read_video function, we provide a high-performance lower-level API for more fine-grained control compared to the read_video function. It does all this whilst fully supporting torchscript.

WARNING

The fine-grained video API is in Beta stage, and backward compatibility is not guaranteed.

VideoReader(path[, stream, num_threads, device])

Fine-grained video-reading API


For video processing, I will compare the performance of OpenCV, which I have mainly used until now, with the new VideoReader of torchvision.io.

To support OpenCV video processing inside the Anaconda virtual environment, I built OpenCV 4.7 from source, so I will be using OpenCV 4.7 in the Anaconda environment.

First, we will compare the video processing speed by reading video frames using OpenCV and then inputting them to the YOLOv8 model.


YOLOv8 Video Detection

ML models take a long time to load, and the first inference is also much slower than the ones that follow. Therefore, I run the first frame through the model right after loading it, exclude that frame from the measurement, and start measuring performance from the second frame.
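The warm-up idea can be sketched without any model at all. In this minimal sketch, `measure_fps` and `fake_infer` are hypothetical names used only for illustration (they are not part of Ultralytics or OpenCV); `fake_infer` stands in for a call such as model(img) and is artificially slow on its first call.

```python
import time


def measure_fps(process, frames, warmup=1):
    """Run `process` on every frame, timing only the frames after the warm-up.

    Returns (timed_frame_count, fps). `process` is a stand-in for a model call.
    """
    frames = iter(frames)
    # Warm-up: run the first `warmup` frames without timing them.
    for _ in range(warmup):
        try:
            process(next(frames))
        except StopIteration:
            return 0, 0.0

    count = 0
    elapsed = 0.0
    for frame in frames:
        start = time.perf_counter()
        process(frame)
        elapsed += time.perf_counter() - start
        count += 1
    fps = count / elapsed if elapsed > 0 else 0.0
    return count, fps


def fake_infer(frame):
    """Hypothetical stand-in for a model: the first call is artificially slow."""
    if not fake_infer.warmed_up:
        time.sleep(0.05)  # simulate the one-time initialisation cost
        fake_infer.warmed_up = True
    time.sleep(0.001)


fake_infer.warmed_up = False

count, fps = measure_fps(fake_infer, range(11))
print(f"timed frames: {count}, fps: {fps:.1f}")
```

Of the 11 dummy frames, the first is consumed by the warm-up and only the remaining 10 are timed, which mirrors how the scripts below treat the first video frame.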

The video used for the test is a 340x256 video with a playback time of 10.922 seconds.

<Test Video file WUzgd7C1pWA.mp4>

This file can be downloaded from https://github.com/pytorch/vision/blob/main/test/assets/videos.


YOLOv8 video processing using only OpenCV

First, I will process the video frames using OpenCV. Opening a video file and reading its frames with OpenCV has been covered many times in previous examples, and it is very easy to use because there are many examples available.

For OpenCV video processing in the Anaconda virtual environment, I use the OpenCV 4.7 that I built for Anaconda myself.

For the build steps, refer to "Installing the Latest Version of OpenCV (ver 4.7) on Xavier NX (JetPack 5.1)".


from ultralytics import YOLO
import cv2
import time, sys
import torchvision
import torchvision.transforms as T

colors = [(255,0 , 0), (0,255,0), (0,0,255)]
font = cv2.FONT_HERSHEY_SIMPLEX   
fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')

def draw(img, boxes):
    index = 0
    for box in boxes.data:
        p1 =  (int(box[0].item()), int(box[1].item()))
        p2 =  (int(box[2].item()), int(box[3].item()))
        img = cv2.rectangle(img, p1, p2, colors[index % len(colors)], 3)
        text = label_map[int(box[5].item())] + " %4.2f"%(box[4].item()) 
        cv2.putText(img, text, (p1[0], p1[1] - 10), font, fontScale = 1, color = colors[index % len(colors)], thickness = 2)
        index += 1
    # cv2.imshow("draw", img)
    # cv2.waitKey(1)
    out_video.write(img)


# Load a model
model = YOLO("yolov8n.pt")  # load an official model
label_map = model.names

f = 0
net_total = 0.0
total = 0.0

cap = cv2.VideoCapture("./WUzgd7C1pWA.mp4")
# Skip First frame
ret, img = cap.read()
if not ret:
    print('Video File Read Error')
    sys.exit(1)  # non-zero exit code signals the failure

results = model(img)  # predict on an image
h, w, c = img.shape
print('Video Frame shape H:%d, W:%d, Channel:%d'%(h, w, c))

out_video = cv2.VideoWriter('./cv_result.mp4', fourcc, cap.get(cv2.CAP_PROP_FPS), (w, h))


while cap.isOpened():
    s = time.time()  # timer starts before the read, so both FPS values include decoding
    ret, img = cap.read()
    if not ret:
        break

    results = model(img)  # predict on an image
    net_e = time.time()
    for result in results:
        draw(result.orig_img, result.boxes)
    e = time.time()
    net_total += (net_e - s)
    total += (e - s)
    f += 1

    
fps = f / total 
net_fps = f / net_total 

print("Total processed frames:%d"%f)
print("FPS:%4.2f"%fps)
print("Net FPS:%4.2f"%net_fps)
cv2.destroyAllWindows()
cap.release()
out_video.release()

<video_detect_cv.py>

Now let's run the Python code. You must run it in the YOLOv8 virtual environment that we have created so far.

(yolov8) spypiggy@spypiggy-NX:~/src/yolov8$ python video_detect_cv.py 

0: 512x640 4 persons, 100.9ms
Speed: 3.9ms preprocess, 100.9ms inference, 12.0ms postprocess per image at shape (1, 3, 640, 640)
Video Frame shape H:256, W:340, Channel:3

......

0: 512x640 4 persons, 1 car, 1 truck, 35.2ms
Speed: 1.6ms preprocess, 35.2ms inference, 5.8ms postprocess per image at shape (1, 3, 640, 640)

0: 512x640 4 persons, 1 truck, 35.6ms
Speed: 1.5ms preprocess, 35.6ms inference, 5.7ms postprocess per image at shape (1, 3, 640, 640)
Total processed frames:326
FPS:15.20
Net FPS:18.27

This program also creates a cv_result.mp4 video that shows the recognized objects with bounding boxes. If you open the video, you can see that they are displayed properly.

<cv_result.mp4>


The output reports two FPS values. The first (FPS) covers the entire process: reading a frame, feeding it into the YOLOv8 model and receiving the result, drawing the result on the frame, and writing the frame to the output video.

The second (Net FPS) leaves out the drawing and video writing, so it covers only reading the frame and the YOLOv8 model call.
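The two FPS values can be turned back into average per-frame times, which makes the cost of drawing and writing visible. The following is a small arithmetic sketch that uses only the numbers printed by the run above; `per_frame_ms` is a hypothetical helper name. Note that in the script the timer starts before the frame is read, so both values include decoding.

```python
def per_frame_ms(fps):
    """Average time per frame in milliseconds for a given frames-per-second value."""
    return 1000.0 / fps


frames = 326          # "Total processed frames" from the run above
total_fps = 15.20     # read + inference + drawing + writing
net_fps = 18.27       # read + inference only

total_ms = per_frame_ms(total_fps)
net_ms = per_frame_ms(net_fps)
overhead_ms = total_ms - net_ms   # drawing boxes and writing the output video

print(f"total: {total_ms:.1f} ms/frame, net: {net_ms:.1f} ms/frame")
print(f"draw + write overhead: {overhead_ms:.1f} ms/frame")
print(f"wall time for {frames} frames: {frames / total_fps:.1f} s")
```

In other words, drawing the boxes and writing the video cost roughly 11 ms per frame on average in this run, on top of about 55 ms for reading and inference.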

Considering that this is YOLOv8's lightest model, yolov8n.pt, FPS 15.20 and Net FPS 18.27 are not excellent performance.


YOLOv8 video processing using OpenCV and PyTorch VideoReader

Second, I will use the VideoReader provided by PyTorch (torchvision). However, VideoReader is only used to read frames and is not directly related to the YOLOv8 model, so I do not expect the performance to differ much.

from ultralytics import YOLO
import cv2
import time, sys
import torchvision
import torchvision.transforms as T

colors = [(255,0 , 0), (0,255,0), (0,0,255)]
font = cv2.FONT_HERSHEY_SIMPLEX   
fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')

def draw(img, boxes):
    index = 0
    for box in boxes.data:
        p1 =  (int(box[0].item()), int(box[1].item()))
        p2 =  (int(box[2].item()), int(box[3].item()))
        img = cv2.rectangle(img, p1, p2, colors[index % len(colors)], 3)
        text = label_map[int(box[5].item())] + " %4.2f"%(box[4].item()) 
        cv2.putText(img, text, (p1[0], p1[1] - 10), font, fontScale = 1, color = colors[index % len(colors)], thickness = 2)
        index += 1
    # cv2.imshow("draw", img)
    # cv2.waitKey(1)
    out_video.write(img)


# Load a model
model = YOLO("yolov8n.pt")  # load an official model
label_map = model.names

f = 0
net_total = 0.0
total = 0.0
reader = torchvision.io.VideoReader("./WUzgd7C1pWA.mp4", "video")
meta = reader.get_metadata()
fps = meta["video"]['fps'][0]
frame = next(reader)
img = T.ToPILImage()(frame["data"])
results = model(img)  # predict on an image

shape = frame['data'].shape     # pytorch image tensor shape is C H W
size = (frame['data'].shape[2], frame['data'].shape[1]) 
out_video = cv2.VideoWriter('./torch_result.mp4', fourcc, fps, size)


while frame:
    s = time.time()
    try:
        frame = next(reader)
    except StopIteration:    
        break
    img = T.ToPILImage()(frame["data"])
    results = model(img)  # predict on an image
    net_e = time.time()

    for result in results:
        draw(result.orig_img, result.boxes)
    e = time.time()
    net_total += (net_e - s)
    total += (e - s)
    f += 1

    
fps = f / total 
net_fps = f / net_total 

print("Total processed frames:%d"%f)
print("FPS:%4.2f"%fps)
print("Net FPS:%4.2f"%net_fps)
cv2.destroyAllWindows()
out_video.release()

<video_detect_torch.py>
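One detail in this script is easy to get wrong: the frame tensor from VideoReader is laid out as (C, H, W), while cv2.VideoWriter expects its frame size as (width, height). A minimal sketch of the conversion follows; the helper names `writer_size_from_chw` and `writer_size_from_hwc` are hypothetical and exist only for this illustration.

```python
def writer_size_from_chw(shape):
    """Convert a (C, H, W) tensor shape into the (width, height) tuple
    that cv2.VideoWriter expects as its frame size."""
    channels, height, width = shape
    return (width, height)


def writer_size_from_hwc(shape):
    """Same conversion for an OpenCV-style (H, W, C) image shape."""
    height, width, channels = shape
    return (width, height)


# Shapes of the 340x256 test video in both layouts.
print(writer_size_from_chw((3, 256, 340)))
print(writer_size_from_hwc((256, 340, 3)))
```

Both calls yield the same (width, height) pair, which is why the OpenCV script passes (w, h) and the torchvision script passes (shape[2], shape[1]).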


Let's run the torchvision-based Python code.

(yolov8) spypiggy@spypiggy-NX:~/src/yolov8$ python video_detect_torch.py 

0: 512x640 4 persons, 83.5ms
Speed: 2.3ms preprocess, 83.5ms inference, 10.8ms postprocess per image at shape (1, 3, 640, 640)

0: 512x640 2 persons, 50.0ms
Speed: 2.3ms preprocess, 50.0ms inference, 7.0ms postprocess per image at shape (1, 3, 640, 640)

......

0: 512x640 4 persons, 1 car, 1 truck, 33.7ms
Speed: 1.4ms preprocess, 33.7ms inference, 5.4ms postprocess per image at shape (1, 3, 640, 640)

0: 512x640 4 persons, 1 truck, 33.5ms
Speed: 1.5ms preprocess, 33.5ms inference, 5.4ms postprocess per image at shape (1, 3, 640, 640)
Total processed frames:326
FPS:14.22
Net FPS:16.87

Compared to the OpenCV-only case, the speed decreased slightly (FPS 15.20 vs. 14.22). However, it is difficult to evaluate the exact performance of VideoReader from this, because it is the result of a single test using a 10-second low-resolution video.

What affects the actual FPS most is the performance of the YOLO model itself. Next, we will test with a slightly larger model. Let's change the model name in the previous source code from the smallest, "yolov8n.pt", to the medium, "yolov8m.pt".
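To make a comparison like this more trustworthy, each script could be run several times and the results summarized. The sketch below shows one way to do that with the standard library; `summarize_runs` and `relative_gap` are hypothetical helpers, and the values in `example_runs` are made up purely for illustration (they are not measurements from this article).

```python
import statistics


def summarize_runs(fps_values):
    """Return (mean, sample standard deviation) of FPS values from repeated runs."""
    mean = statistics.mean(fps_values)
    # stdev needs at least two samples; report 0.0 for a single run.
    spread = statistics.stdev(fps_values) if len(fps_values) > 1 else 0.0
    return mean, spread


def relative_gap(a, b):
    """Relative difference of b with respect to a, as a percentage."""
    return (b - a) / a * 100.0


# The two single-run results reported above.
opencv_fps = 15.20
videoreader_fps = 14.22
print(f"gap: {relative_gap(opencv_fps, videoreader_fps):.1f}%")

# Hypothetical FPS values from repeated runs, made up for illustration only.
example_runs = [15.0, 15.4, 14.8, 15.2]
mean, spread = summarize_runs(example_runs)
print(f"mean: {mean:.2f} FPS, stdev: {spread:.2f}")
```

If the spread between repeated runs of the same script turned out to be similar in size to the gap between the two readers, the single-run difference would not say much about VideoReader itself.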


YOLOv8 Video Processing Performance

The table below shows the performance (FPS) when executing Object Detection using sample videos and storing videos showing bounding boxes.

<Video Processing Performance by Model>


If you use a higher resolution video file, it will probably slow down a little bit.

I use 10 FPS as the baseline when evaluating a model. Below 10 FPS, I judge the model to be difficult to use in commercial projects. For YOLOv8, the models from yolov8l.pt upward are not easy to use on Xavier NX because their processing speed is too low.
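A 10 FPS target means a budget of 100 ms per frame. As a rough check, the "Speed:" lines printed during a run can be summed and compared against that budget. The sketch below assumes the log format shown in the yolov8n output earlier in this post; `model_ms_per_frame` and `meets_target` are hypothetical helper names, and the result is only an optimistic bound because frame reading, drawing, and video writing are not included in these log figures.

```python
import re

SPEED_RE = re.compile(
    r"Speed: ([\d.]+)ms preprocess, ([\d.]+)ms inference, ([\d.]+)ms postprocess"
)


def model_ms_per_frame(log_line):
    """Sum preprocess, inference and postprocess times from one 'Speed:' log line."""
    match = SPEED_RE.search(log_line)
    if match is None:
        raise ValueError("not a Speed line")
    return sum(float(value) for value in match.groups())


def meets_target(ms_per_frame, target_fps=10.0):
    """True if the per-frame time fits inside the budget implied by target_fps."""
    return ms_per_frame <= 1000.0 / target_fps


# Log line copied from the yolov8n run above.
line = ("Speed: 1.6ms preprocess, 35.2ms inference, 5.8ms postprocess "
        "per image at shape (1, 3, 640, 640)")
ms = model_ms_per_frame(line)
print(f"{ms:.1f} ms/frame, meets 10 FPS: {meets_target(ms)}")
```

The same check can be applied to the log lines of the larger models to see how quickly the 100 ms budget is used up.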


Wrapping up

We briefly looked at the method and performance of using the YOLOv8 object detection model for video.

Personally, I feel that Xavier NX has lower performance than I thought.

In the next article, we will learn how to convert the YOLOv8 model to TensorRT, which is optimized for NVIDIA GPUs, to increase performance.

The source code can be downloaded from my GitHub.








