I used Jetson series boards (Nano, TX2) with the official Ubuntu 18.04 image and a root account.
This article quotes heavily from "How to inspect a pre-trained TensorFlow model" and "Speed up TensorFlow Inference on GPUs with TensorRT", and I also consulted many official NVIDIA documents.
Brief introduction to TensorRT
NVIDIA® TensorRT™ is a deep learning platform that optimizes neural network models and speeds up inference across GPU-accelerated platforms running in the datacenter, in embedded devices, and in automotive devices. The Jetson series does not need TensorRT installed separately because the official image already ships with TensorRT.
The following figure shows TensorRT defined as part high-performance inference optimizer and part runtime engine. It can take in neural networks trained on these popular frameworks, optimize the neural network computation, and generate a light-weight runtime engine (which is the only thing you need to deploy to your production environment) that maximizes throughput and minimizes latency on these GPU platforms.
It is designed to work in a complementary fashion with training frameworks such as TensorFlow, Caffe, PyTorch, MXNet, etc. It focuses specifically on running an already trained network quickly and efficiently on a GPU for the purpose of generating a result (a process that is referred to in various places as scoring, detecting, regression, or inference).
Some training frameworks such as TensorFlow have integrated TensorRT so that it can be used to accelerate inference within the framework. Alternatively, TensorRT can be used as a library within a user application. It includes parsers for importing existing models from Caffe, ONNX, or TensorFlow, and C++ and Python APIs for building models programmatically.
TensorRT performs several important transformations and optimizations to the neural network graph (Fig 2). First, layers with unused output are eliminated to avoid unnecessary computation. Next, where possible, convolution, bias, and ReLU layers are fused to form a single layer. Another transformation is horizontal layer fusion, or layer aggregation, along with the required division of aggregated layers to their respective outputs. Horizontal layer fusion improves performance by combining layers that take the same source tensor and apply the same operations with similar parameters. Note that these graph optimizations do not change the underlying computation in the graph: instead, they restructure the graph so that the operations run much faster and more efficiently.
Figure 2 (a): An example convolutional neural network with multiple convolutional and activation layers. (b) TensorRT’s vertical and horizontal layer fusion and layer elimination optimizations simplify the GoogLeNet Inception module graph, reducing computation and memory overhead.
<Figure: (a) ResNet-50 graph in TensorBoard. (b) ResNet-50 after TensorRT optimizations have been applied and the sub-graph replaced with a TensorRT node.>
TensorRT optimizes the largest subgraphs possible in the TensorFlow graph. The more compute in the subgraph, the greater the benefit obtained from TensorRT. For best performance, you want most of the graph optimized and replaced with the smallest number of TensorRT nodes. Depending on the operations in your graph, the final graph might still contain more than one TensorRT node.
With the TensorFlow API, you can specify the minimum number of nodes a subgraph must contain for it to be converted to a TensorRT node. Any sub-graph with fewer than the specified number of nodes will not be converted to a TensorRT engine, even if it is compatible with TensorRT. This is useful for models containing small compatible sub-graphs separated by incompatible nodes, which would otherwise produce many tiny TensorRT engines; the sketch below shows the corresponding minimum_segment_size argument.
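As a minimal sketch (using the TrtGraphConverter API described below, with a hypothetical frozen_graph GraphDef and hypothetical output node names), you could raise this threshold and then count how many TRTEngineOp nodes ended up in the converted graph:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Require at least 5 TensorFlow nodes in a subgraph before it is
# replaced by a TensorRT engine (the default is 3).
converter = trt.TrtGraphConverter(
    input_graph_def=frozen_graph,           # hypothetical frozen GraphDef
    nodes_blacklist=['logits', 'classes'],  # hypothetical output nodes
    minimum_segment_size=5)
trt_graph = converter.convert()

# Count the TensorRT engine nodes in the converted graph.
num_engines = len([n for n in trt_graph.node if n.op == 'TRTEngineOp'])
print('TRTEngineOp nodes: %d' % num_engines)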
TensorFlow TensorRT Workflow
The following diagram shows the typical workflow in deploying a trained model for inference.
Figure 1. Deploying a trained model workflow.
In order to optimize the model using TF-TRT, the workflow changes to one of the following diagrams, depending on whether the model is saved in the SavedModel format or as regular checkpoints. Optimizing with TF-TRT is the extra step that needs to take place before deploying your model for inference.
Figure 2. Showing the SavedModel format.
Figure 3. Showing a Frozen graph.
Conversion API
The original Python function create_inference_graph, which was used in TensorFlow 1.13 and earlier, is deprecated in TensorFlow versions after 1.13 and removed in TensorFlow 2.0. Use TrtGraphConverter (TensorFlow 1.x) or TrtGraphConverterV2 (TensorFlow 2.x) instead.
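A minimal sketch of the version check (the full conversion script at the end of this article uses the same pattern):

import tensorflow as tf

ver = tf.__version__.split('.')
if int(ver[0]) == 1 and int(ver[1]) <= 13:
    # TensorFlow 1.13 and earlier: contrib module with create_inference_graph
    import tensorflow.contrib.tensorrt as trt
else:
    # TensorFlow 1.14+ and 2.x: trt_convert with TrtGraphConverter / TrtGraphConverterV2
    from tensorflow.python.compiler.tensorrt import trt_convert as trt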
TrtGraphConverter API
- input_saved_model_dir
Default value is None. This is the directory from which to load the SavedModel containing the input graph to transform. It is used only when input_graph_def is None.
- input_saved_model_tags
Default value is None. This is a list of tags to load
the SavedModel.
- input_saved_model_signature_key
Default value is None. This is the key of the signature
to optimize the graph for.
- input_graph_def
Default value is None. This is a GraphDef object
containing a model to be transformed. If set to None,
the graph will be read from the SavedModel loaded from
input_saved_model_dir.
- nodes_blacklist
Default value is None. This is a list of node names to
prevent the converter from touching.
- session_config
Default value is None. This is the ConfigProto
used to create a Session. It's also used as a
template to create a TRT-enabled ConfigProto for
conversion. If not specified, a default ConfigProto is
used.
- max_batch_size
Default value is 1. This is the max size for the input
batch.
- max_workspace_size_bytes
Default value is 1GB. This is the maximum GPU temporary memory which the
TensorRT engine can use at execution time. This corresponds to the
workspaceSize parameter of
nvinfer1::IBuilder::setMaxWorkspaceSize().
- precision_mode
Default value is TrtPrecisionMode.FP32. This is one of
TrtPrecisionMode.supported_precision_modes() , in
other words, "FP32", "FP16" or "INT8" (lowercase is also supported).
- minimum_segment_size
Default value is 3. This is the minimum number of nodes
required for a subgraph to be replaced by
TRTEngineOp.
- is_dynamic_op
Default value is False. Whether to generate dynamic
TensorRT ops which will build the TensorRT network and engine at run
time.
- maximum_cached_engines
Default value is 1. This is the max number of cached
TensorRT engines in dynamic TensorRT ops. If the number of cached
engines is already at max but none of them can serve the input, the
TRTEngineOp will fall back to run the TensorFlow
function based on which the TRTEngineOp is created.
- use_calibration
Default value is
True. This argument is ignored if
precision_mode is not
INT8. If set
to
True, a calibration graph will be created to
calibrate the missing ranges. The calibration graph must be converted to
an inference graph by running calibration with
calibrate(). If set to
False,
quantization nodes will be expected for every tensor in the graph
(excluding those which will be fused). If a range is missing, an error
will occur.
Note: Accuracy may be negatively affected if there is a
mismatch between which tensors TensorRT quantizes and which tensors
were trained with fake quantization.
- use_function_backup
Default value is True. If set to True,
it will create a FunctionDef for each subgraph that is
converted to TensorRT op, and if TensorRT ops fail to execute at
runtime, it'll invoke that function as a fallback.
The main methods you can use in the
TrtGraphConverter class are the following:
- TrtGraphConverter.convert()
This method runs the conversion and returns the converted
GraphDef. The conversion and optimization that are
performed depends on the arguments passed to the constructor as
explained above.
In dynamic mode, where the TensorRT engines are built at runtime, this
method only segments the graph in order to separate the TensorRT
subgraphs, i.e. optimizing each TensorRT subgraph happens later during
runtime. In static mode, the optimization also happens in this method
and thus this method becomes time consuming.
- TrtGraphConverter.calibrate(fetch_names, num_runs, feed_dict_fn,
input_map_fn)
This method runs the INT8 calibration and returns the calibrated
GraphDef. This method should be called after
convert() in order to execute the calibration on
the converted graph. The method accepts the following arguments:
- fetch_names
A list of output tensor names to fetch during calibration.
- num_runs
Number of runs of the graph during calibration.
- feed_dict_fn
A function that returns a dictionary mapping input names (as strings) in the GraphDef to be calibrated to values (e.g. Python lists, NumPy arrays, etc.). One and only one of feed_dict_fn and input_map_fn should be specified.
- input_map_fn
A function that returns a dictionary mapping input names (as strings) in the GraphDef to be calibrated to Tensor objects. The values of the named input tensors in the GraphDef to be calibrated will be re-mapped to the respective Tensor values during calibration. One and only one of feed_dict_fn and input_map_fn should be specified.
- TrtGraphConverter.save(output_saved_model_dir)
This method saves the converted graph as a
SavedModel.
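Putting the arguments and methods together, here is a minimal FP16 sketch that loads a SavedModel from a hypothetical directory, converts it, and saves the result. For INT8 you would pass precision_mode='INT8' and call calibrate() between convert() and save().

from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir='/path/to/saved_model',   # hypothetical SavedModel directory
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,               # 1GB of GPU workspace for TensorRT
    precision_mode='FP16',
    minimum_segment_size=3,
    is_dynamic_op=False)

trt_graph = converter.convert()                     # returns the converted GraphDef
converter.save('/path/to/saved_model_trt')          # hypothetical output directory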
TrtGraphConverterV2 API
- input_saved_model_dir
Default value is None. This is the directory from which to load the SavedModel containing the input graph to transform.
- input_saved_model_tags
Default value is None. This is a list of tags to load the
SavedModel.
- input_saved_model_signature_key
Default value is None. This is the key of the signature to optimize
the graph for.
- conversion_params
Default value is DEFAULT_TRT_CONVERSION_PARAMS. This is an instance of the namedtuple TrtConversionParams consisting of the following items:
- rewriter_config_template
A template RewriterConfig proto used to create a TRT-enabled
RewriterConfig. If None, it will use a
default one.
- max_workspace_size_bytes
Default value is 1GB. The maximum GPU temporary memory which
the TensorRT engine can use at execution time. This corresponds to the
workspaceSize parameter of
nvinfer1::IBuilder::setMaxWorkspaceSize().
- precision_mode
Default value is TrtPrecisionMode.FP32. This is one of
TrtPrecisionMode.supported_precision_modes(), in other
words, FP32, FP16 or INT8
(lowercase is also supported).
- minimum_segment_size
Default value is 3. This is the minimum number of nodes
required for a subgraph to be replaced by TRTEngineOp.
- maximum_cached_engines
Default value is 1. This is the maximum number of cached
TensorRT engines in dynamic TensorRT ops. If the number of cached engines is
already at max but none of them can serve the input, the
TRTEngineOp will fall back to run the TensorFlow function
based on which the TRTEngineOp is created.
- use_calibration
Default value is
True. This argument is ignored if
precision_mode is not
INT8. If set to
True, a calibration graph will be created to calibrate the
missing ranges. The calibration graph must be converted to an inference graph
by running calibration with
calibrate(). If set to
False, quantization nodes will be expected for every tensor
in the graph (excluding those which will be fused). If a range is missing, an
error will occur.
Note: Accuracy may be negatively affected if there is a
mismatch between which tensors TensorRT quantizes and which tensors were
trained with fake quantization.
The main methods you can use in the
TrtGraphConverterV2 class are the following:
- TrtGraphConverterV2.convert(calibration_input_fn)
This method runs the conversion and returns the converted TensorFlow function (note
that this method returns the converted GraphDef in TensorFlow 1.x).
The conversion and optimization that are performed depends on the arguments passed to
the constructor as explained above.
This method only segments the graph in order to separate the TensorRT subgraphs, i.e.
optimizing each TensorRT subgraph happens later during runtime (in TensorFlow 1.x this
behaviour depends on is_dynamic_op, but this argument is not
supported in TensorFlow 2.0 anymore; i.e. only is_dynamic_op=True is
supported).
This method has only one optional argument which should be used in case INT8
calibration is desired. The argument calibration_input_fn is a
generator function that yields input data as a list or tuple, which will be used to
execute the converted signature for INT8 calibration. All the returned input data
should have the same shape. Note that in TensorFlow 1.x, the INT8 calibration was
performed using the separate method calibrate() which is removed from
TensorFlow 2.0.
- TrtGraphConverterV2.build(input_fn)
This method optimizes the converted function (returned by convert())
by building TensorRT engines. This is useful in case the user wants to perform the
optimizations before runtime. The optimization is done by running inference on the
converted function using the input data received from the argument
input_fn. This argument is a generator function that yields input
data as a list or tuple.
- TrtGraphConverterV2.save(output_saved_model_dir)
This method saves the converted function as a SavedModel. Note that the saved TensorFlow model is still not optimized with TensorRT (engines are not built) if build() is not called.
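Here is a minimal TensorFlow 2.x sketch under the same assumptions: the SavedModel paths are hypothetical, and my_input_fn is a hypothetical generator that yields inputs matching your model's signature (used by build() to create the engines ahead of time).

import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode='FP16',
    max_workspace_size_bytes=1 << 30,
    maximum_cached_engines=1)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='/path/to/saved_model',   # hypothetical SavedModel directory
    conversion_params=params)
converter.convert()                                 # segment the graph and insert TRTEngineOp nodes

def my_input_fn():
    # Hypothetical input shape; yield batches matching your model's input signature.
    yield (np.zeros((1, 224, 224, 3), dtype=np.float32),)

converter.build(input_fn=my_input_fn)               # build the TensorRT engines before deployment
converter.save('/path/to/saved_model_trt')          # hypothetical output directory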
Conversion Example
If you have a frozen graph of your
TensorFlow model, you first need to load the frozen
graph file and parse it to create a deserialized
GraphDef. Then you can use
the
GraphDef to create a
TensorRT inference graph, for
example:
This example reads a TensorFlow model and then converts it to a TensorRT model (trt_graph):
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt
with tf.Session() as sess:
    # First deserialize your frozen graph:
    with tf.gfile.GFile('/path/to/your/frozen/graph.pb', 'rb') as f:
        frozen_graph = tf.GraphDef()
        frozen_graph.ParseFromString(f.read())
    # Now you can create a TensorRT inference graph from your
    # frozen graph:
    converter = trt.TrtGraphConverter(
        input_graph_def=frozen_graph,
        nodes_blacklist=['logits', 'classes'])  # output nodes
    trt_graph = converter.convert()
    # Import the TensorRT graph into a new graph and run:
    output_node = tf.import_graph_def(
        trt_graph,
        return_elements=['logits', 'classes'])
    sess.run(output_node)
<https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#using-frozengraph>
Be careful: only use TrtGraphConverter with TensorFlow version 1.14 or later. In version 1.13 and earlier, use the create_inference_graph function instead.
Everything except nodes_blacklist (the output nodes) seems straightforward. So what is nodes_blacklist, and how can you find it?
To find this output node value, you must inspect the TensorFlow model first: this value is the last node of the model.
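TensorBoard (below) is the easiest way to see it, but as a quick alternative you can also list candidate output nodes programmatically, i.e. nodes whose output is not consumed by any other node. This is my own helper sketch, not part of the TF-TRT API:

import tensorflow as tf

def find_output_nodes(graph_def):
    """Return the names of nodes that no other node consumes as an input."""
    consumed = set()
    for node in graph_def.node:
        for inp in node.input:
            # Strip the control-dependency prefix (^name) and the port suffix (name:0).
            consumed.add(inp.lstrip('^').split(':')[0])
    return [n.name for n in graph_def.node if n.name not in consumed]

graph_def = tf.GraphDef()
with tf.gfile.GFile('./models/graph/cmu/graph_opt.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
print(find_output_nodes(graph_def))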
Viewing the model in TensorBoard
To get the last node information, you need to view the pb file (frozen graph).
The following simple Python script creates the TensorBoard log files; run it first.
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser(description='tf-model-conversion to TensorRT')
parser.add_argument('--model_dir', type=str, default='')
parser.add_argument('--log_dir', type=str, default='')
args = parser.parse_args()

with tf.Session() as sess:
    model_filename = args.model_dir
    with tf.gfile.GFile(model_filename, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        g_in = tf.import_graph_def(graph_def)
    LOGDIR = args.log_dir
    train_writer = tf.summary.FileWriter(LOGDIR)
    train_writer.add_graph(sess.graph)
    train_writer.flush()
    train_writer.close()
<pb_viewer.py>
I'll use the tf-pose-estimation example; please see
my other article. After running the script, there is a new log file in the ./logs directory that describes the graph_opt.pb file.
# python3 pb_viewer.py --model_dir ./models/graph/cmu/graph_opt.pb --log_dir ./logs
# ls -al logs
total 204436
drwxr-xr-x 2 root root 4096 Nov 25 23:16 .
drwxr-xr-x 16 root root 4096 Nov 25 23:15 ..
-rw-r--r-- 1 root root 209333789 Nov 25 23:16 events.out.tfevents.1574691381.spytx-desktop
Now run TensorBoard like this. If successful, the last line of output will show the URL of the TensorBoard instance you started.
root@spytx-desktop:/work/src/pose_estimation/tf-pose-estimation# tensorboard --logdir=/work/src/pose_estimation/tf-pose-estimation/logs
.........
.........
TensorBoard 1.14.0 at http://spytx-desktop:6006/ (Press CTRL+C to quit)
Then access TensorBoard from your desktop using a web browser. Because the log file is quite large, it may take a while for the page to load.
When you first open the model in TensorBoard, you’ll just see one node called “import”. At the top right you’ll see “Subgraph: 276 nodes”, which is the hint that there is more to see.
You can double click on the “import” node and it will expand to show you the full graph. You’ll need to zoom out and scroll around to get it to fit nicely, but you should be able to see the full graph on one screen:
You can also zoom in and see the output node at the top of the screen:
In
my other article, I made Python code like this. The output node value "Openpose/concat_stage7" comes from the picture above.
# convert (optimize) frozen model to TensorRT model
your_outputs = ["Openpose/concat_stage7"]
start = time.time()
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,          # frozen model
    outputs=your_outputs,
    is_dynamic_op=True,
    minimum_segment_size=3,
    maximum_cached_engines=int(1e3),
    max_batch_size=1,                      # specify your max batch size
    max_workspace_size_bytes=2*(10**9),    # specify the max workspace
    precision_mode="FP16")                 # precision, can be "FP32" (32-bit floating point) or "FP16"
Save the TensorRT model
Modify the model directory and output node names for your purpose. This code was created to save the TensorRT model from my article "
Human Pose estimation using tensorflow".
import argparse
import sys, os
import time
import tensorflow as tf

ver = tf.__version__.split(".")
if int(ver[0]) == 1 and int(ver[1]) <= 13:
    # if tensorflow version <= 1.13.1 use this module
    print('tf Version <= 1.13')
    import tensorflow.contrib.tensorrt as trt
else:
    # if tensorflow version > 1.13.1 use this module instead
    print('tf Version > 1.13')
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

def get_frozen_graph(graph_file):
    """Read Frozen Graph file from disk."""
    with tf.gfile.FastGFile(graph_file, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    return graph_def

parser = argparse.ArgumentParser(description='tf network model conversion to tensorrt')
parser.add_argument('--model', type=str, default='mobilenet_v2_small',
                    help='cmu / mobilenet_thin / mobilenet_v2_large / mobilenet_v2_small')
args = parser.parse_args()

model_dir = 'models/graph/'
frozen_name = model_dir + args.model + '/graph_opt.pb'
frozen_graph = get_frozen_graph(frozen_name)
print('=======Frozen Name:%s=======' % (frozen_name))

# convert (optimize) frozen model to TensorRT model
your_outputs = ["Openpose/concat_stage7"]
start = time.time()
if int(ver[0]) == 1 and int(ver[1]) <= 13:
    trt_graph = trt.create_inference_graph(
        input_graph_def=frozen_graph,          # frozen model
        outputs=your_outputs,
        is_dynamic_op=True,
        minimum_segment_size=3,
        maximum_cached_engines=int(1e3),
        max_batch_size=1,                      # specify your max batch size
        max_workspace_size_bytes=2*(10**9),    # specify the max workspace
        precision_mode="FP16")                 # precision, can be "FP32" or "FP16"
else:
    converter = trt.TrtGraphConverter(
        input_graph_def=frozen_graph,          # frozen model
        max_batch_size=1,
        precision_mode="FP16",
        minimum_segment_size=3,
        is_dynamic_op=True,
        nodes_blacklist=your_outputs)
    trt_graph = converter.convert()
elapsed = time.time() - start
print('Tensorflow model => TensorRT model takes : %f' % (elapsed))

# write the TensorRT model to be used later for inference
rt_name = model_dir + args.model + '/graph_opt_rt.pb'
with tf.gfile.FastGFile(rt_name, 'wb') as f:
    f.write(trt_graph.SerializeToString())
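Once graph_opt_rt.pb has been written, you can load it back and run inference. A minimal sketch: the output tensor name comes from the model above, while the input placeholder name ('image') and the dummy input resolution are assumptions based on the tf-pose-estimation graph and should be adjusted to your model.

import numpy as np
import tensorflow as tf

# Load the TensorRT-optimized graph written by the conversion script.
trt_graph = tf.GraphDef()
with tf.gfile.GFile('models/graph/cmu/graph_opt_rt.pb', 'rb') as f:
    trt_graph.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(trt_graph, name='')

with tf.Session(graph=graph) as sess:
    input_t = graph.get_tensor_by_name('image:0')                    # assumed input placeholder
    output_t = graph.get_tensor_by_name('Openpose/concat_stage7:0')  # output node found earlier
    dummy = np.zeros((1, 368, 432, 3), dtype=np.float32)             # assumed input resolution
    heatmaps = sess.run(output_t, feed_dict={input_t: dummy})
    print(heatmaps.shape)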
Wrapping up
After writing the TensorFlow model's graph to a log, you can use TensorBoard to find the final output node information. Then use the TrtGraphConverter or create_inference_graph function with this information to build a model for TensorRT. I'll continue this story in the next article.