Torch Export with Cudagraphs

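CUDA Graphs allow multiple GPU operations to be launched through a single CPU operation, which reduces launch overhead and improves GPU utilization. Torch-TensorRT provides a simple interface for enabling CUDA Graphs, so users can benefit from their performance advantages without manually managing the complexity of capture and replay.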

This interactive script gives an overview of how to use the Torch-TensorRT Cudagraphs integration in the ir="dynamo" path. The feature works similarly in the torch.compile path.

Imports and Model Definition

import torch
import torch_tensorrt
import torchvision.models as models

Compilation with torch_tensorrt.compile Using Default Settings

# We begin by defining and initializing a model
model = models.resnet18(pretrained=True).eval().to("cuda")

# Define sample inputs
inputs = torch.randn((16, 3, 224, 224)).cuda()

# Next, we compile the model using torch_tensorrt.compile
# We use the `ir="dynamo"` flag here, and `ir="torch_compile"` should
# work with cudagraphs as well.
opt = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=torch_tensorrt.Input(
        min_shape=(1, 3, 224, 224),
        opt_shape=(8, 3, 224, 224),
        max_shape=(16, 3, 224, 224),
        dtype=torch.float,
        name="x",
    ),
)

Inference Using the Cudagraphs Integration

# We can enable the cudagraphs API with a context manager
with torch_tensorrt.runtime.enable_cudagraphs(opt) as cudagraphs_module:
    out_trt = cudagraphs_module(inputs)

# Alternatively, we can set the cudagraphs mode for the session
torch_tensorrt.runtime.set_cudagraphs_mode(True)
out_trt = opt(inputs)

# We can also turn off cudagraphs mode and perform inference as normal
torch_tensorrt.runtime.set_cudagraphs_mode(False)
out_trt = opt(inputs)

# If we provide new input shapes, cudagraphs will re-record the graph
inputs_2 = torch.randn((8, 3, 224, 224)).cuda()
inputs_3 = torch.randn((4, 3, 224, 224)).cuda()

with torch_tensorrt.runtime.enable_cudagraphs(opt) as cudagraphs_module:
    out_trt_2 = cudagraphs_module(inputs_2)
    out_trt_3 = cudagraphs_module(inputs_3)
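
The re-recording behavior can be pictured with a small, library-free toy model. This is only an illustrative sketch: the `ShapeAwareRecorder` class and its `record_count` attribute are hypothetical names and do not reflect how Torch-TensorRT implements re-recording internally. It assumes, for simplicity, that only the most recently recorded shape is kept.

```python
class ShapeAwareRecorder:
    """Toy stand-in for a CUDA Graph wrapper (hypothetical, not a Torch-TensorRT class)."""

    def __init__(self, fn):
        self.fn = fn
        self.recorded_shape = None  # shape the current "graph" was recorded for
        self.record_count = 0       # number of (re-)recordings so far

    def __call__(self, shape):
        if shape != self.recorded_shape:
            # A captured graph is tied to fixed buffer sizes,
            # so a new input shape requires a fresh recording.
            self.recorded_shape = shape
            self.record_count += 1
        return self.fn(shape)


# The lambda just returns the batch size; it stands in for real inference.
recorder = ShapeAwareRecorder(lambda shape: shape[0])
for shape in [(16, 3, 224, 224), (16, 3, 224, 224), (8, 3, 224, 224), (4, 3, 224, 224)]:
    recorder(shape)

print(recorder.record_count)  # 3: the repeated (16, ...) call reuses the recording
```

The practical takeaway is that frequently alternating input shapes can trigger repeated recordings, which may offset part of the launch-overhead savings.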

Cuda Graphs Using a Module That Contains Graph Breaks

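When CUDA Graphs are applied to a TensorRT model that contains graph breaks, each break introduces additional overhead, because graph breaks prevent the whole model from executing as a single, continuous optimized unit. As a result, some of the performance benefits that CUDA Graphs typically provide, such as reduced kernel launch overhead and improved execution efficiency, may be diminished.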

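Using a wrapped runtime module with CUDA Graphs lets you encapsulate sequences of operations into graphs that execute efficiently even in the presence of graph breaks. If the TensorRT module has graph breaks, the CUDA Graph context manager returns a wrapped_module. This module captures the entire execution graph, which enables efficient replay during subsequent inference by reducing kernel launch overhead.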

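Note that initializing with the wrapper module requires a warm-up phase in which the module is executed several times. The warm-up ensures that memory allocations and initializations are not recorded in the CUDA Graphs, which helps maintain a consistent execution path and optimize performance.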
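
For context, the warm-up-then-capture pattern looks roughly as follows when written by hand in plain PyTorch. This is a minimal sketch using `torch.cuda.CUDAGraph` and `torch.cuda.graph`, not Torch-TensorRT's internal implementation; the helper name `capture_and_replay`, the `warmup_iters` parameter, and the small `Linear` model are assumptions made for illustration. The example only runs its demo when a CUDA device is available.

```python
import torch


def capture_and_replay(model, example, warmup_iters=3):
    """Warm up `model`, then capture one forward pass into a CUDA graph.

    Returns the graph plus the static input/output tensors it reads and writes.
    """
    static_in = example.clone()

    # Warm up on a side stream so that allocations and lazy initialization
    # happen before capture and are not recorded in the graph.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream), torch.no_grad():
        for _ in range(warmup_iters):
            model(static_in)
    torch.cuda.current_stream().wait_stream(side_stream)

    # Capture a single forward pass.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph), torch.no_grad():
        static_out = model(static_in)
    return graph, static_in, static_out


if torch.cuda.is_available():
    model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU()).cuda().eval()
    x = torch.randn(4, 8, device="cuda")
    graph, static_in, static_out = capture_and_replay(model, x)

    # Replay: copy new data into the static input, then relaunch the graph.
    new_x = torch.randn(4, 8, device="cuda")
    static_in.copy_(new_x)
    graph.replay()

    with torch.no_grad():
        print(torch.allclose(static_out, model(new_x), atol=1e-5))
```

The context manager shown in this tutorial hides these steps, including the warm-up runs, behind a single call.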

class SampleModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu((x + 2) * 0.5)


model = SampleModel().eval().cuda()
input = torch.randn((1, 3, 224, 224)).to("cuda")

# The 'torch_executed_ops' compiler option is used in this example to intentionally introduce graph breaks within the module.
# Note: The Dynamo backend is required for the CUDA Graph context manager to handle modules in an Ahead-Of-Time (AOT) manner.
opt_with_graph_break = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[input],
    min_block_size=1,
    pass_through_build_failures=True,
    torch_executed_ops={"torch.ops.aten.mul.Tensor"},
)

If the module has graph breaks, the whole submodule is recorded and replayed by CUDA Graphs.

with torch_tensorrt.runtime.enable_cudagraphs(
    opt_with_graph_break
) as cudagraphs_module:
    cudagraphs_module(input)
