I mainly use OpenCV for image and video processing. However, PyTorch has added new functionality for video processing: a VideoReader class is now available in torchvision.io, in addition to the previously available read_video function. As of this writing (March 2023) it is in the beta stage, but it looks like it will become the main way to handle video in torchvision going forward.
The following is an excerpt from the VideoReader guide page on the PyTorch homepage.
READING/WRITING IMAGES AND VIDEOS

The torchvision.io package provides functions for performing IO operations. They are currently specific to reading and writing video and images.

Video
- read_video(filename[, start_pts, end_pts, ...]) : Reads a video from a file, returning both the video frames as well as the audio frames.
- read_video_timestamps(filename[, pts_unit]) : List the video frames timestamps.
- write_video(filename, video_array, fps[, ...]) : Writes a 4d tensor in [T, H, W, C] format in a video file.

Fine-grained video API

In addition to the read_video function, we provide a high-performance lower-level API for more fine-grained control compared to the read_video function. It does all this whilst fully supporting torchscript.

WARNING: The fine-grained video API is in Beta stage, and backward compatibility is not guaranteed.

- VideoReader(path[, stream, num_threads, device]) : Fine-grained video-reading API.
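To make the difference concrete, here is a minimal sketch of the two APIs (assuming the test clip used later in this post; any short .mp4 will do). read_video decodes the whole clip into memory at once, while VideoReader is an iterator that yields one frame at a time:

import torchvision

path = "./WUzgd7C1pWA.mp4"  # assumed local test clip

# High-level API: decode everything in one call.
# frames is a [T, H, W, C] uint8 tensor; aframes holds the audio samples.
frames, aframes, info = torchvision.io.read_video(path, pts_unit="sec")
print(frames.shape, info)  # info contains e.g. 'video_fps'

# Fine-grained API (beta): iterate frame by frame.
reader = torchvision.io.VideoReader(path, "video")
print(reader.get_metadata())
for frame in reader:
    data = frame["data"]  # [C, H, W] uint8 tensor for this frame
    pts = frame["pts"]    # presentation timestamp in seconds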
For video processing, I will compare the performance of OpenCV, which I have mainly used so far, and the new VideoReader in torchvision.io.
To enable OpenCV video processing in the Anaconda virtual environment, I built OpenCV 4.7 from source. So I will be using OpenCV 4.7 in the Anaconda environment.
First, we will compare the video processing speed by reading video frames using OpenCV and then inputting them to the YOLOv8 model.
YOLOv8 Video Detection
All ML models take a long time to load, and the first inference is also much slower than the following ones. Therefore, I load the model and process the first frame before starting the measurement, and measure performance from the second frame onward.
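In outline, the measurement looks like the following sketch (the full scripts are shown below); the model load and the first inference are kept out of the timed loop:

import time
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                   # model loading: not timed
cap = cv2.VideoCapture("./WUzgd7C1pWA.mp4")

ret, img = cap.read()
model(img)                                   # warm-up: the first inference is always slow

frames, total = 0, 0.0
while cap.isOpened():
    s = time.time()
    ret, img = cap.read()
    if not ret:
        break
    model(img)                               # timed from the second frame onward
    total += time.time() - s
    frames += 1
print("FPS:%4.2f" % (frames / total))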
The video to be used for the test is a 340x256 video with a playback time of 10.922 seconds.
This file can be downloaded from https://github.com/pytorch/vision/blob/main/test/assets/videos.
YOLOv8 video processing using only OpenCV
First, I will process the video frames using OpenCV. Opening a video file and reading its frames with OpenCV has been covered many times in previous examples, and it is very easy to use because there are so many examples available.
As mentioned above, for OpenCV video processing in the Anaconda virtual environment I built OpenCV 4.7 myself. For the build process, refer to Installing the Latest Version of OpenCV (ver 4.7) on Xavier NX (JetPack 5.1).
from ultralytics import YOLO
import cv2
import time, sys

colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
font = cv2.FONT_HERSHEY_SIMPLEX
fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')

def draw(img, boxes):
    index = 0
    for box in boxes.data:
        p1 = (int(box[0].item()), int(box[1].item()))
        p2 = (int(box[2].item()), int(box[3].item()))
        img = cv2.rectangle(img, p1, p2, colors[index % len(colors)], 3)
        text = label_map[int(box[5].item())] + " %4.2f"%(box[4].item())
        cv2.putText(img, text, (p1[0], p1[1] - 10), font, fontScale=1,
                    color=colors[index % len(colors)], thickness=2)
        index += 1
    # cv2.imshow("draw", img)
    # cv2.waitKey(1)
    out_video.write(img)

# Load a model
model = YOLO("yolov8n.pt")  # load an official model
label_map = model.names

f = 0
net_total = 0.0
total = 0.0
cap = cv2.VideoCapture("./WUzgd7C1pWA.mp4")

# Skip the first frame (model warm-up); it is excluded from the measurement
ret, img = cap.read()
if not ret:
    print('Video File Read Error')
    sys.exit(0)
results = model(img)  # predict on an image
h, w, c = img.shape
print('Video Frame shape H:%d, W:%d, Channel:%d'%(h, w, c))
out_video = cv2.VideoWriter('./cv_result.mp4', fourcc, cap.get(cv2.CAP_PROP_FPS), (w, h))

while cap.isOpened():
    s = time.time()
    ret, img = cap.read()
    if not ret:
        break
    results = model(img)  # predict on an image
    net_e = time.time()
    for result in results:
        draw(result.orig_img, result.boxes)
    e = time.time()
    net_total += (net_e - s)
    total += (e - s)
    f += 1

fps = f / total
net_fps = f / net_total
print("Total processed frames:%d"%f)
print("FPS:%4.2f"%fps)
print("Net FPS:%4.2f"%net_fps)

cv2.destroyAllWindows()
cap.release()
out_video.release()
<video_detect_cv.py>
Now let's run the Python code. Be sure to run it in the YOLOv8 virtual environment we have set up so far.
(yolov8) spypiggy@spypiggy-NX:~/src/yolov8$ python video_detect_cv.py
0: 512x640 4 persons, 100.9ms
Speed: 3.9ms preprocess, 100.9ms inference, 12.0ms postprocess per image at shape (1, 3, 640, 640)
Video Frame shape H:256, W:340, Channel:3
......
0: 512x640 4 persons, 1 car, 1 truck, 35.2ms
Speed: 1.6ms preprocess, 35.2ms inference, 5.8ms postprocess per image at shape (1, 3, 640, 640)
0: 512x640 4 persons, 1 truck, 35.6ms
Speed: 1.5ms preprocess, 35.6ms inference, 5.7ms postprocess per image at shape (1, 3, 640, 640)
Total processed frames:326
FPS:15.20
Net FPS:18.27
This program also creates a cv_result.mp4 video that shows the recognized objects with bounding boxes. If you open the video, you can see that the results are displayed properly.
Two FPS values are printed. The first (FPS) is calculated over the entire pipeline: reading a frame, feeding it to the YOLOv8 model, drawing the results, and saving the output video.
The second (Net FPS) is calculated over only the frame read and YOLOv8 inference, excluding the drawing and video writing.
Considering that yolov8n.pt is YOLOv8's lightest model, FPS:15.20 and Net FPS:18.27 are not what I would call excellent performance.
YOLOv8 video processing using OpenCV and PyTorch VideoReader
Second, I will use the VideoReader provided by torchvision. However, VideoReader is only used to read frames and is not directly related to the YOLOv8 model itself, so the performance should not be very different.
from ultralytics import YOLO
import cv2
import time, sys
import torchvision
import torchvision.transforms as T

colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
font = cv2.FONT_HERSHEY_SIMPLEX
fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')

def draw(img, boxes):
    index = 0
    for box in boxes.data:
        p1 = (int(box[0].item()), int(box[1].item()))
        p2 = (int(box[2].item()), int(box[3].item()))
        img = cv2.rectangle(img, p1, p2, colors[index % len(colors)], 3)
        text = label_map[int(box[5].item())] + " %4.2f"%(box[4].item())
        cv2.putText(img, text, (p1[0], p1[1] - 10), font, fontScale=1,
                    color=colors[index % len(colors)], thickness=2)
        index += 1
    # cv2.imshow("draw", img)
    # cv2.waitKey(1)
    out_video.write(img)

# Load a model
model = YOLO("yolov8n.pt")  # load an official model
label_map = model.names

f = 0
net_total = 0.0
total = 0.0
reader = torchvision.io.VideoReader("./WUzgd7C1pWA.mp4", "video")
meta = reader.get_metadata()
fps = meta["video"]['fps'][0]

# Skip the first frame (model warm-up); it is excluded from the measurement
frame = next(reader)
img = T.ToPILImage()(frame["data"])
results = model(img)  # predict on an image
shape = frame['data'].shape  # a PyTorch image tensor shape is [C, H, W]
size = (frame['data'].shape[2], frame['data'].shape[1])
out_video = cv2.VideoWriter('./torch_result.mp4', fourcc, fps, size)

while True:
    s = time.time()
    try:
        frame = next(reader)
    except StopIteration:
        break
    img = T.ToPILImage()(frame["data"])
    results = model(img)  # predict on an image
    net_e = time.time()
    for result in results:
        draw(result.orig_img, result.boxes)
    e = time.time()
    net_total += (net_e - s)
    total += (e - s)
    f += 1

fps = f / total
net_fps = f / net_total
print("Total processed frames:%d"%f)
print("FPS:%4.2f"%fps)
print("Net FPS:%4.2f"%net_fps)

cv2.destroyAllWindows()
out_video.release()
<video_detect_torch.py>
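As a side note, frame["data"] is an RGB tensor in [C, H, W] order, and the script above converts it to a PIL image before inference. If you prefer to stay with NumPy/OpenCV images, the tensor can also be converted directly; a sketch of that alternative (not what the script above does):

import cv2
import torchvision
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
reader = torchvision.io.VideoReader("./WUzgd7C1pWA.mp4", "video")

for frame in reader:
    # [C, H, W] RGB tensor -> [H, W, C] contiguous RGB array
    rgb = frame["data"].permute(1, 2, 0).contiguous().numpy()
    bgr = cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)  # OpenCV-style images are BGR
    results = model(bgr)                        # predict, just like the OpenCV version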
Now let's run the torchvision-based Python code.
(yolov8) spypiggy@spypiggy-NX:~/src/yolov8$ python video_detect_torch.py
0: 512x640 4 persons, 83.5ms
Speed: 2.3ms preprocess, 83.5ms inference, 10.8ms postprocess per image at shape (1, 3, 640, 640)
0: 512x640 2 persons, 50.0ms
Speed: 2.3ms preprocess, 50.0ms inference, 7.0ms postprocess per image at shape (1, 3, 640, 640)
......
0: 512x640 4 persons, 1 car, 1 truck, 33.7ms
Speed: 1.4ms preprocess, 33.7ms inference, 5.4ms postprocess per image at shape (1, 3, 640, 640)
0: 512x640 4 persons, 1 truck, 33.5ms
Speed: 1.5ms preprocess, 33.5ms inference, 5.4ms postprocess per image at shape (1, 3, 640, 640)
Total processed frames:326
FPS:14.22
Net FPS:16.87
Compared to using only OpenCV, the speed decreased slightly. However, it is difficult to judge VideoReader's exact performance from this, because it is the result of a single test using a 10-second, low-resolution video.
And it is the YOLO model itself that has the greatest impact on the actual FPS. This time, let's test with a slightly larger model by changing the model name in the previous source code from the smallest "yolov8n.pt" to the medium "yolov8m.pt", as shown below.
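The only change needed is the model file name; everything else in video_detect_cv.py and video_detect_torch.py stays the same (ultralytics downloads the weights automatically on first use):

# Load a model
model = YOLO("yolov8m.pt")  # was: YOLO("yolov8n.pt")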
YOLOv8 Video Processing Performance
The table below summarizes the performance (FPS) measured above when running object detection on the sample video and saving a video showing the bounding boxes.

Model        Frame reading              FPS      Net FPS
yolov8n.pt   OpenCV                     15.20    18.27
yolov8n.pt   torchvision VideoReader    14.22    16.87
If you use a higher resolution video file, it will probably slow down a little bit.
I use 10 FPS as my baseline when evaluating a model: below 10 FPS, I judge it difficult to use in a commercial project. For YOLOv8, models from yolov8l.pt upward are not easy to use on the Xavier NX because the processing speed is too low.
Wrapping up
We briefly looked at how to run the YOLOv8 object detection model on video, and at the performance you can expect.
Personally, I feel that the Xavier NX's performance is lower than I expected.
In the next article, we will learn how to convert the YOLOv8 model to TensorRT, which optimizes models for NVIDIA GPUs, to increase performance.
The source code can be downloaded from my GitHub.