Nvidia™ Triton Server inference engine
The Nvidia™ Triton Server is an open-source inference serving software that enables the user to deploy trained AI models from any framework on GPU or CPU infrastructure. It supports all major frameworks like TensorFlow, TensorRT, PyTorch, and ONNX Runtime, as well as custom framework backends. With specific backends, it is also possible to run Python scripts, mainly for pre- and post-processing purposes, and to exploit the DALI building block for optimized operations. For more details about the Triton Server, please refer to the official website.
The ESF Triton Server component is the implementation of the inference engine APIs and provides methods for interacting with a local or remote Nvidia™ Triton Server. As presented below, the component enables the user to configure a local server running on the gateway or to communicate with an external server to load specific models.
The parameters used to configure the Triton Service are the following:
- Nvidia Triton Server address: the address of the Nvidia Triton Server.
- Nvidia Triton Server ports: the ports used to connect to the server for the HTTP, GRPC, and Metrics services.
- Inference Models: a comma-separated list of inference model names that the server will load. The models have to be already present on the filesystem where the server is running. This option simply tells the server to load the given models from a local or remote repository.
- Local Nvidia Triton Server: If enabled, a local native Nvidia Triton Server is started on the gateway. In this case, the model repository and backends paths are mandatory. Moreover, the server address property is overridden and set to localhost. Be aware that the Triton Server has to be already installed on the system.
- Local model repository path: Only for a local instance, specify the path on the filesystem where the models are stored.
- Local backends path: Only for a local instance, specify the path on the filesystem where the backends are stored.
- Optional configuration for the local backends: Only for a local instance, a semicolon-separated list of configurations for the backends, e.g. tensorflow,version=2;tensorflow,allow-soft-placement=false (see the sketch below).
- Timeout (in seconds) for time consuming tasks: Timeout (in seconds) for time-consuming tasks like server startup, shutdown, or model load. If the task exceeds the timeout, the operation will be terminated with an error.
- Max. GRPC message size (bytes): this field controls the maximum allowed size for the GRPC calls to the server instance. By default, a size of 4194304 bytes (4 MiB) is used. Increase this value to be able to send large amounts of data as input to the Triton Server (like Full HD images). The Kura logs will show the following error when this limit is exceeded:
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 4194304
Pay attention to the ports used for communicating with the Triton Server. The default ports are 8000-8002, but these are typically used by ESF for debug purposes.
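As a rough sketch, each entry of the semicolon-separated backend configuration list above is expected to map to a separate --backend-config flag on the tritonserver command line, following Triton's <backend>,<setting>=<value> format; the flags below are only an illustration of that mapping:
tritonserver --backend-config=tensorflow,version=2 \
             --backend-config=tensorflow,allow-soft-placement=false \
             ...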
Nvidia™ Triton Server installation
Before running ESF's Triton Server Service, you must install the Triton Inference Server. Here you can find the necessary steps for the two suggested installation methods.
Native Triton installation on Jetson devices
A release of Triton for JetPack is provided as a tar file attached to the Triton Inference Server release notes. Full documentation is available here.
Installation steps:
- Before running the executable, you need to install the Runtime Dependencies for Triton.
- After doing so, you can extract the tar file and run the executable in the bin folder.
- It is highly recommended to add the tritonserver executable to your path or to symlink it to /usr/local/bin (see the sketch below).
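A minimal sketch of the extraction and symlink steps is shown below; the archive name and the /opt/tritonserver destination are only examples and depend on the release you downloaded:
# Illustrative archive name and destination; adjust them to your release
sudo mkdir -p /opt/tritonserver
sudo tar xzf tritonserver2.35.0-jetpack5.1.tgz -C /opt/tritonserver
# Make the executable reachable system-wide
sudo ln -s /opt/tritonserver/bin/tritonserver /usr/local/bin/tritonserver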
Triton Docker image installation
Before you can use the Triton Docker image, you must install Docker. If you plan on using a GPU for inference, you must also install the NVIDIA Container Toolkit.
Pull the image using the following command.
$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3
Where <xx.yy> is the version of Triton that you want to pull.
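For example, assuming you want the 23.10 release (the tag below is only an illustration, pick the version that matches your deployment):
$ docker pull nvcr.io/nvidia/tritonserver:23.10-py3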
Triton Server setup
The Triton Inference Server serves models from one or more model repositories that are specified when the server is started. The model repository is the directory where you place the models that you want Triton to serve. Be sure to follow the instructions to set up the model repository directory.
Further information about an example Triton Server setup can be found in the official documentation.
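As a sketch, a minimal model repository containing a single ONNX model could look like the layout below; the repository path and model name are illustrative:
/opt/triton/models/           # model repository root
└── mymodel/                  # one sub-directory per model
    ├── config.pbtxt          # model configuration
    └── 1/                    # numeric version directory
        └── model.onnx        # model file expected by the ONNX Runtime backend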
Configuration for a local native Triton Server
Requirement: the tritonserver executable needs to be available in the path for the kurad user. Be sure to have a working Triton Server installation before configuring the local native Triton Server instance through the ESF UI.
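A quick, approximate way to verify this, assuming the executable was placed in a system-wide location such as /usr/local/bin as suggested above, is:
# Check that the executable resolves from a system-wide location,
# so that the kurad user can find it as well
command -v tritonserver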
When the Local Nvidia Triton Server option is set to true, a local instance of the Nvidia™ Triton Server is started on the gateway. The following configuration is required:
- Nvidia Triton Server address: localhost
- Nvidia Triton Server ports: <mandatory>
- Inference Models: <mandatory>. Note that the models have to be already present on the filesystem.
- Local Nvidia Triton Server: true
- Local model repository path: <mandatory>
- Local backends path: <mandatory>
The typical command used to start the Triton Server is the following:
tritonserver --model-repository=<model_repository_path> \
--backend-directory=<backend_repository_path> \
--backend-config=<backend_config> \
--http-port=<http_port> \
--grpc-port=<grpc_port> \
--metrics-port=<metrics_port> \
--model-control-mode=explicit \
--load-model=<model_name_1> \
--load-model=<model_name_2> \
...
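For instance, a fully expanded invocation could look like the one below; all paths, ports, and model names are placeholders chosen for illustration (ports 4000-4002 are used to avoid the default 8000-8002 mentioned in the note above):
tritonserver --model-repository=/opt/triton/models \
             --backend-directory=/opt/tritonserver/backends \
             --backend-config=tensorflow,version=2 \
             --http-port=4000 \
             --grpc-port=4001 \
             --metrics-port=4002 \
             --model-control-mode=explicit \
             --load-model=preprocessor \
             --load-model=classifier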
Configuration for a local Triton Server running in a Docker container
If the Nvidia™ Triton Server is running as a Docker container in the gateway, the following configuration is required:
- Nvidia Triton Server address: localhost
- Nvidia Triton Server ports: <mandatory>
- Inference Models: <mandatory>. The models have to be already present on the filesystem.
- Local Nvidia Triton Server: false
In order to correctly load the models at runtime, configure the server with the --model-control-mode=explicit option. The typical command used for running the Docker container is as follows. Note that the ports are forwarded so that they do not interfere with ESF.
docker run --rm \
-p4000:8000 \
-p4001:8001 \
-p4002:8002 \
--shm-size=150m \
-v path/to/models:/models \
nvcr.io/nvidia/tritonserver:[version] \
tritonserver --model-repository=/models --model-control-mode=explicit
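Once the container is up, a quick way to verify that the server is reachable (assuming the HTTP port was mapped to 4000 as in the example above) is to query the readiness endpoint, which returns HTTP 200 when the server is ready:
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:4000/v2/health/ready
200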
Configuration for a remote Triton Server
When the Nvidia™ Triton Server is running on a remote server, the following configuration is needed:
- Nvidia Triton Server address: <mandatory>
- Nvidia Triton Server ports: <mandatory>
- Inference Models: <mandatory>. The models have to be already present on the filesystem.
- Local Nvidia Triton Server: false
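As with the local setups, you can verify connectivity and model availability from the gateway. In the sketch below, the server address, port, and model name are placeholders to be replaced with the values of your remote deployment:
# Query the metadata of a loaded model on the remote server (illustrative values)
curl http://triton.example.com:8000/v2/models/mymodel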