TensorRT is a platform for high-performance deep learning inference that can be used to optimize trained models. This is done by replacing each TensorRT-compatible subgraph with a single TRTEngineOp that is used to build a TensorRT engine. These engines are networks of layers with well-defined input shapes, and they run inference using the TensorRT libraries (see Conversion Parameters for more details). After a model is optimized with TensorRT, the traditional TensorFlow workflow is still used for inference, including TensorFlow Serving.
TensorRT-compatible subgraphs consist of ops supported by TensorFlow with TensorRT (TF-TRT) (see Supported Ops for more details) and are directed acyclic graphs (DAGs). TensorFlow ops that are not compatible with TF-TRT, including custom ops, are run using TensorFlow.
TensorRT can also run at lower precision (FP16 and INT8) with minimal loss of accuracy; INT8 requires an additional calibration step. Using a lower precision mode reduces bandwidth requirements and allows for faster computation. It also allows for the use of Tensor Cores, which multiply two 4×4 FP16 matrices and add the result to a 4×4 FP16 or FP32 matrix.
This tutorial explains how to convert a model to a TensorRT-optimized model, describes some of the parameters that can be used for the conversion, shows how to run an upstream example in the WML CE (Watson Machine Learning Community Edition) environment, and compares statistics between native and TensorRT-optimized runs.
Note: TensorRT engines are optimized for the GPUs available during conversion, so the conversion should take place on the system that will be running inference.
Optimizing pre-trained models
In this section, we use the ResNet-50 v2 (FP32) model from the official TensorFlow models repository, saved into the /tmp/resnet directory:
# mkdir -p /tmp/resnet
# curl -s https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | tar --strip-components=2 -C /tmp/resnet -xvz
- Saved models can be optimized by using the saved_model_cli script included with the TensorFlow conda package:
# saved_model_cli convert --dir /tmp/resnet/1538687457/ --output_dir /home/user/example/4/ --tag_set serve tensorrt --is_dynamic=True
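To verify the result, the converted saved model can be inspected with the saved_model_cli show command (a sketch, assuming the optimized model was written directly into the output directory used above):
# saved_model_cli show --dir /home/user/example/4/ --all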
- Saved models and frozen graphs can also be optimized by using the TensorFlow Python TrtGraphConverter class.
- For saved models, you need to pass in input_saved_model_dir=dir, where dir/saved_model.pb exists.
from tensorflow.python.compiler.tensorrt import trt_convert as trt
# Convert a saved model
converter = trt.TrtGraphConverter(input_saved_model_dir='/tmp/resnet/1538687457/')
graph_def = converter.convert()
converter.save('/home/user/example/1/')
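After conversion, inference still uses the normal TensorFlow workflow. The following is a minimal sketch (not part of the original example) that loads the optimized saved model with the TF 1.x Session API; the tag set ('serve') and the tensor names ('input_tensor:0', 'softmax_tensor:0') are assumptions based on the model used in this tutorial.
import tensorflow as tf
# Load the TensorRT-optimized saved model and run it like any other model
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ['serve'], '/home/user/example/1/')
    output_t = sess.graph.get_tensor_by_name('softmax_tensor:0')
    # jpeg_bytes is a serialized JPEG image, for example read from disk
    # preds = sess.run(output_t, feed_dict={'input_tensor:0': [jpeg_bytes]})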
- For frozen graphs, you need to pass in the input_graph_def and nodes_blacklist parameters. nodes_blacklist is a list of output node names.
Because this example model is in the saved model format, we first need to create a frozen graph to demonstrate this path:
freeze_graph --input_saved_model_dir=/tmp/resnet/1538687457/ --output_graph=/tmp/resnet/frozen_graph.pb --saved_model_tags serve --output_node_names=softmax_tensor
Next, we load the frozen graph into a TensorFlow GraphDef:
import tensorflow as tf
# Load and convert a frozen graph
graph_def = tf.GraphDef()
with tf.gfile.GFile("/tmp/resnet/frozen_graph.pb", 'rb') as f:
    graph_def.ParseFromString(f.read())
Finally, we optimize the frozen graph using TensorRT:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
converter = trt.TrtGraphConverter(input_graph_def=graph_def, nodes_blacklist=['softmax_tensor'])
graph_def = converter.convert()
converter.save('/home/user/example/2/')
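The optimized GraphDef can likewise be run with plain TensorFlow. Below is a brief sketch (not from the original article); the output name matches the nodes_blacklist entry above, and the input tensor name is an assumption for this model.
import tensorflow as tf
# Import the TensorRT-optimized GraphDef into a fresh graph and run it
with tf.Graph().as_default():
    tf.import_graph_def(graph_def, name='')
    with tf.Session() as sess:
        output_t = sess.graph.get_tensor_by_name('softmax_tensor:0')
        # preds = sess.run(output_t, feed_dict={'input_tensor:0': [jpeg_bytes]})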
- When using INT8 precision mode, an additional calibration step is required to finish the optimization. The calibration data set should be representative of the problem data set. For information about INT8 calibration see NVIDIA’s 8-bit Inference with TensorRT.
# Get calibration data
import requests
IMAGE_URL = 'https://tensorflow.org/images/blogs/serving/cat.jpg'
data = requests.get(IMAGE_URL, stream=True).content
# Convert and calibrate model
from tensorflow.python.compiler.tensorrt import trt_convert as trt
import numpy as np
converter = trt.TrtGraphConverter(input_saved_model_dir='/tmp/resnet/1538687457/', precision_mode='INT8')
converted_graph_def = converter.convert()
calibrated_graph_def = converter.calibrate(
    fetch_names=['softmax_tensor'],
    num_runs=1,
    feed_dict_fn=lambda: {'input_tensor:0': np.array([data])})
converter.save('/home/user/example/3/')
The calibrate function accepts either feed_dict_fn or input_map_fn for mapping input tensors to data.
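As a rough sketch (not from the original article), the same calibration could instead use input_map_fn, which returns a dictionary mapping input tensor names to TensorFlow tensors; the tensor names below are carried over from the example above and are assumptions for this model.
import tensorflow as tf
# input_map_fn maps graph input names to Tensor objects used during calibration
calibrated_graph_def = converter.calibrate(
    fetch_names=['softmax_tensor'],
    num_runs=1,
    input_map_fn=lambda: {'input_tensor:0': tf.constant([data])})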
Conversion parameters
There are additional parameters that can be passed to saved_model_cli and TrtGraphConverter; a combined Python example follows the list:
- precision_mode: The precision mode to use (FP32, FP16, or INT8).
- minimum_segment_size: The minimum number of TensorFlow nodes required for a TensorRT subgraph to be valid.
- is_dynamic_op: TensorRT engines are converted and built at model run time instead of during the converter.convert() call. This is required if there are tensors with unknown or dynamic shapes.
- use_calibration: Only used if precision_mode='INT8'. If True, a calibration graph will be created, and converter.calibrate() should be called. This is the recommended option. If False, all tensors that will not be fused must have quantization nodes. See NVIDIA's INT8 Quantization for details.
- max_batch_size: Used when is_dynamic_op=False. This is the maximum batch size for TensorRT engines. At run time, smaller batch sizes can be used, but a larger batch size will result in an error.
- maximum_cached_engines: Used when is_dynamic_op=True. This limits the number of TensorRT engines that are cached, per TRTEngineOp.
Running the object detection example
Image classification and object detection examples can be found at github.com/tensorflow/tensorrt. The object detection example provides performance output for various models and configurations with and without TensorRT.
- Get the example source code (this was verified with commit 3ddfab).
# git clone https://github.com/tensorflow/tensorrt --recursive
- Set up the environment:
# conda create -n tf-trt tensorflow-gpu requests pillow cython -y
# conda activate tf-trt
# cd tensorrt
# pushd tftrt/examples/object_detection
# ./install_dependencies.sh
# popd
- Download the COCO validation data set:
# python
>>> from tftrt.examples.object_detection import download_dataset
>>> download_dataset('val2017', output_dir='coco')
- Create a test.json file with the following contents:
{
  "model_config": {
    "model_name": "ssd_inception_v2_coco",
    "output_dir": "models"
  },
  "optimization_config": {
    "use_trt": true,
    "precision_mode": "FP16"
  },
  "benchmark_config": {
    "images_dir": "coco/val2017",
    "annotation_path": "coco/annotations/instances_val2017.json",
    "batch_size": 1,
    "image_shape": [600, 600],
    "num_images": 2048,
    "output_path": "stats/ssd_inception_v2_coco_trt_fp16.json"
  }
}
- For additional test configuration options, run:
# python
>>> import tftrt.examples.object_detection as object_detection
>>> help(object_detection.test)
>>> help(object_detection.optimize_model)
>>> help(object_detection.benchmark_model)
- Run the test:
# python -m tftrt.examples.object_detection.test test.json
- Below are results from three different runs of the object_detection example: native (no TensorRT), FP32 (TensorRT optimized), and FP16 (TensorRT optimized). The TensorRT optimized models show an increase in performance with minimal to no loss of precision. These results were gathered on an IBM Power® System AC922 server with 16 GB NVIDIA Tesla V100 GPUs:
| Metric             | Native             | FP32               | FP16               |
|--------------------|--------------------|--------------------|--------------------|
| avg_latency_ms     | 19.106557061364345 | 14.386320138001466 | 13.284107108970543 |
| avg_throughput_fps | 52.3380532027989   | 69.51047873309172  | 75.27792359674037  |
| map                | 0.273695258874402  | 0.273695258874402  | 0.273              |
Summary
Converting a model to a TensorRT optimized model is a straightforward process and can enhance performance with little to no loss of accuracy. The image classification and object detection examples can be easily run to compare the performance of different models, with or without TensorRT.
Originally published on IBM Developer by Taylor Jakobson