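Serving a Torch-TensorRT model with Triton

Optimization and deployment go hand in hand in any discussion of machine learning infrastructure. Once network-level optimizations have been applied to get the maximum performance, the next step is to deploy the model.

However, serving this optimized model brings its own set of considerations and challenges, such as building infrastructure that supports concurrent model execution and supporting clients over HTTP or gRPC.

The Triton Inference Server solves these problems and more. Let's go step by step through optimizing a model with Torch-TensorRT, deploying it on Triton Inference Server, and building a client to query the model.

Step 1: Optimize your model with Torch-TensorRT

Most Torch-TensorRT users will be familiar with this step. For demonstration purposes, we will use a ResNet50 model from Torchhub.

We will be working in the //examples/triton directory, which contains the scripts used in this tutorial.

First, pull the NGC PyTorch Docker container. You may need to create an NGC account and obtain an API key, then log in with that key (follow the NGC instructions after signing up).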

# YY.MM is the year and month of the publishing tag for NVIDIA's PyTorch
# container, e.g. 24.08
# NOTE: Use the same publishing tag for both the PyTorch and the Triton containers

docker run -it --gpus all -v ${PWD}:/scratch_space nvcr.io/nvidia/pytorch:YY.MM-py3
cd /scratch_space

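Using the container, we can export the model into the right directory of our Triton model repository. The export script uses the Dynamo frontend of Torch-TensorRT to compile the PyTorch model to TensorRT. We then save the model as TorchScript, a serialization format that Triton supports.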

import torch
import torch_tensorrt
import torchvision

# Skip torch.hub's forked-repository validation, which can fail in some environments
torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

# load model
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")

# Compile with Torch TensorRT;
trt_model = torch_tensorrt.compile(model,
    inputs= [torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions= {torch_tensorrt.dtype.f16}
)

ts_trt_model = torch.jit.trace(trt_model, torch.rand(1, 3, 224, 224).to("cuda"))

# Save the model
torch.jit.save(ts_trt_model, "/triton_example/model_repository/resnet50/1/model.pt")

You can run the script with the following command (from //examples/triton):

docker run --gpus all -it --rm -v ${PWD}:/triton_example nvcr.io/nvidia/pytorch:YY.MM-py3 python /triton_example/export.py

This saves the serialized TorchScript version of the ResNet model in the right directory of the model repository.
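As far as we know, `torch.jit.save` does not create missing parent directories, so the version folder has to exist before the export script writes `model.pt`. The following is a minimal sketch of a helper that builds the target path and creates the folders, using only the standard library. The name `versioned_model_path` is hypothetical and is not part of Torch-TensorRT or Triton.

```python
from pathlib import Path


def versioned_model_path(repo_root, model_name, version=1, filename="model.pt"):
    """Return <repo_root>/<model_name>/<version>/<filename>, creating the folders.

    Hypothetical helper for illustration; the layout mirrors the one used in
    this tutorial.
    """
    version_dir = Path(repo_root) / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    return version_dir / filename


# Example use inside the export script (path taken from this tutorial):
#   target = versioned_model_path("/triton_example/model_repository", "resnet50")
#   torch.jit.save(ts_trt_model, str(target))
```

Calling the helper before saving makes the export script safe to run on a fresh checkout where the `resnet50/1` folder is not present yet.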

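Step 2: Set up Triton Inference Server

If you are new to the Triton Inference Server and want to learn more, we highly recommend checking out our GitHub repository.

To use Triton, we need to create a model repository. A model repository, as the name suggests, is a repository of the models that the inference server hosts. While Triton can serve models from multiple repositories, this example covers the simplest possible form of a model repository.

The structure of this repository should look something like this: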

model_repository
|
+-- resnet50
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.pt
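Before starting the server, it can help to confirm that the layout above is actually in place. The sketch below checks only what this tutorial relies on: a `config.pbtxt` next to at least one numerically named version folder that contains `model.pt`. The function name `check_model_repository` is an assumption made for illustration.

```python
from pathlib import Path


def check_model_repository(repo_root, model_name, model_file="model.pt"):
    """Return a list of problems found under <repo_root>/<model_name>; empty if none."""
    model_dir = Path(repo_root) / model_name
    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")

    # Version folders are the numerically named subdirectories, e.g. "1".
    versions = []
    if model_dir.is_dir():
        versions = sorted(
            d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()
        )
    if not versions:
        problems.append("no numeric version directory")
    for version_dir in versions:
        if not (version_dir / model_file).is_file():
            problems.append(f"version {version_dir.name} has no {model_file}")
    return problems
```

An empty list means the repository matches the tree shown above; any other result names what is missing.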

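Triton needs two files to serve the model: the model itself and a model configuration file, typically provided as config.pbtxt. For the model we prepared in step 1, the following configuration can be used: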

name: "resnet50"
backend: "pytorch"
max_batch_size : 0
input [
  {
    name: "x"
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [1, 1000]
  }
]

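The config.pbtxt file describes the exact model configuration, with details such as the names and shapes of the input and output layers, data types, and scheduling and batching settings. If you are new to Triton, we highly recommend reading the relevant section of our documentation for more details.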
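One way to keep the tensor names and shapes in config.pbtxt in sync with the export script is to generate the file from Python values. The sketch below renders only the fields used in this tutorial. `render_config` is a hypothetical helper, not a Triton API, and it does not cover scheduling or batching options.

```python
def render_config(name, backend, inputs, outputs, max_batch_size=0):
    """Render a minimal config.pbtxt string.

    inputs and outputs are lists of (tensor_name, data_type, dims) tuples.
    """

    def tensor_block(kind, tensors):
        entries = []
        for tensor_name, data_type, dims in tensors:
            dims_text = ", ".join(str(d) for d in dims)
            entries.append(
                "  {\n"
                f'    name: "{tensor_name}"\n'
                f"    data_type: {data_type}\n"
                f"    dims: [ {dims_text} ]\n"
                "  }"
            )
        # Entries of a repeated field in list syntax are comma-separated.
        return f"{kind} [\n" + ",\n".join(entries) + "\n]"

    lines = [
        f'name: "{name}"',
        f'backend: "{backend}"',
        f"max_batch_size: {max_batch_size}",
        tensor_block("input", inputs),
        tensor_block("output", outputs),
    ]
    return "\n".join(lines) + "\n"


config_text = render_config(
    "resnet50",
    "pytorch",
    inputs=[("x", "TYPE_FP32", [1, 3, 224, 224])],
    outputs=[("output0", "TYPE_FP32", [1, 1000])],
)
```

Writing `config_text` to `model_repository/resnet50/config.pbtxt` should give a file equivalent to the configuration shown earlier.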

With the model repository set up, we can launch the Triton server with the docker command below. Refer to the container's NGC page for the pull tag.

# Make sure that the TensorRT version in the Triton container
# and the TensorRT version in the environment used to optimize the model
# are the same. As a rule of thumb, identical publishing tags ship the same TensorRT version

docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:YY.MM-py3 tritonserver --model-repository=/triton_example/model_repository

This should spin up a Triton Inference Server. The next step is to build a simple HTTP client to query the server.
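The server may need some time to load the model before it accepts requests. Triton's HTTP endpoint answers `GET /v2/health/ready` with status 200 once it is ready, so a client script can poll that route first. The sketch below uses only the standard library and assumes the default HTTP port 8000 published by the docker command above. The function names `http_ready` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ready(base_url="http://localhost:8000", timeout_s=2.0):
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    try:
        url = f"{base_url}/v2/health/ready"
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status: not ready yet.
        return False


def wait_until_ready(probe=http_ready, attempts=30, interval_s=1.0):
    """Call probe() up to `attempts` times; return True as soon as it succeeds."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(interval_s)
    return False
```

Passing the probe as a parameter keeps the retry loop independent of the transport, so the same loop could wrap the `tritonclient` readiness calls instead of the raw HTTP request.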

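Step 3: Build a Triton client to query the server

Before proceeding, make sure you have a sample image on hand. If you don't, download one to test inference. In this section we go over a very basic client. For more fleshed-out examples, refer to the Triton client repository.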

wget  -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"

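We then need to install the dependencies for building a Python client. These change from client to client. For the full list of languages Triton supports, refer to Triton's client repository.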

pip install torchvision
pip install attrdict
pip install nvidia-pyindex
pip install tritonclient[all]

Let's jump into the client. First, we write a small preprocessing function to resize and normalize the query image.

import numpy as np
from torchvision import transforms
from PIL import Image
import tritonclient.http as httpclient
from tritonclient.utils import triton_to_np_dtype

# preprocessing function
def rn50_preprocess(img_path="/triton_example/img1.jpg"):
  img = Image.open(img_path)
  preprocess = transforms.Compose(
      [
          transforms.Resize(256),
          transforms.CenterCrop(224),
          transforms.ToTensor(),
          transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
      ]
  )
  return preprocess(img).unsqueeze(0).numpy()

transformed_img = rn50_preprocess()
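To make the normalization step concrete: `ToTensor` scales 8-bit pixel values into the range [0, 1], and `Normalize` then computes `(x - mean) / std` for each channel. The sketch below reproduces that arithmetic for a single RGB pixel in plain Python. `normalize_pixel` is a hypothetical helper used only for illustration.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means used in rn50_preprocess
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations used there


def normalize_pixel(rgb_uint8):
    """Map one 8-bit RGB pixel to the values the model receives."""
    # Scale to [0, 1], as ToTensor does for 8-bit images.
    scaled = [value / 255.0 for value in rgb_uint8]
    # Subtract the mean and divide by the standard deviation, as Normalize does.
    return [(x - m) / s for x, m, s in zip(scaled, MEAN, STD)]
```

A black pixel `(0, 0, 0)` therefore maps to roughly -2.12 in the red channel, and a white pixel `(255, 255, 255)` to roughly 2.25, which is why the tensor sent to the server contains negative values.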

Building a client involves three basic steps. First, we set up a connection with the Triton Inference Server.

# Setting up client
client = httpclient.InferenceServerClient(url="localhost:8000")

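Second, we specify the names of the model's input and output layers. These can be obtained during export and should already be specified in your config.pbtxt file.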

inputs = httpclient.InferInput("x", transformed_img.shape, datatype="FP32")
inputs.set_data_from_numpy(transformed_img, binary_data=True)

outputs = httpclient.InferRequestedOutput("output0", binary_data=True, class_count=1000)

Lastly, we send an inference request to the Triton Inference Server.

# Querying the server
results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
inference_output = results.as_numpy('output0')
print(inference_output[:5])

The output should look like this:

[b'12.468750:90' b'11.523438:92' b'9.664062:14' b'8.429688:136'
 b'8.234375:11']

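The output format here is <confidence_score>:<classification_index>. To learn how to map these to label names and more, refer to Triton Inference Server's documentation.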
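Since each entry arrives as a byte string, a small parser makes the result easier to work with. The sketch below splits entries of the form shown above into a score and a class index. The optional third field for a label name is an assumption about how entries look when a labels file is configured, and `parse_classification` is a hypothetical helper.

```python
def parse_classification(entries):
    """Turn classification strings into (score, class_index, label) tuples.

    Each entry looks like b"12.468750:90"; label is None when no label
    field is present.
    """
    parsed = []
    for entry in entries:
        text = entry.decode("utf-8") if isinstance(entry, bytes) else str(entry)
        # Split at most twice so a label containing ":" stays intact.
        score, index, *label = text.split(":", 2)
        parsed.append((float(score), int(index), label[0] if label else None))
    return parsed


sample = [b"12.468750:90", b"11.523438:92", b"9.664062:14"]
top = parse_classification(sample)
```

In the client, `parse_classification(inference_output[:5])` would give the five highest-scoring classes as numeric pairs that can be looked up in a label list.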

You can try out this client quickly with:

# Remember to use the same publishing tag for all steps (e.g. 24.08)

docker run -it --net=host -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:YY.MM-py3-sdk bash -c "pip install torchvision && python /triton_example/client.py"
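Because all three containers need the same publishing tag, a launcher script can verify the image references before running anything. This is a minimal sketch that assumes the `YY.MM-py3` and `YY.MM-py3-sdk` tag forms used in this tutorial. The helper names are hypothetical.

```python
import re

# Matches ":YY.MM-py3" or ":YY.MM-py3-sdk" at the end of an image reference.
TAG_PATTERN = re.compile(r":(\d{2}\.\d{2})-py3(?:-sdk)?$")


def publishing_tag(image):
    """Extract the YY.MM publishing tag from an image reference, or None."""
    match = TAG_PATTERN.search(image)
    return match.group(1) if match else None


def same_publishing_tag(*images):
    """Return True if every image reference carries the same YY.MM tag."""
    tags = {publishing_tag(image) for image in images}
    return len(tags) == 1 and None not in tags
```

For example, `same_publishing_tag` returns True for the PyTorch, Triton server, and Triton SDK images when all of them use 24.08, and False as soon as one of them uses a different month.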
