Torch-TensorRT (FX Frontend) User Guide

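Torch-TensorRT (FX Frontend) is a tool that converts a PyTorch model, through torch.fx, into a TensorRT engine optimized for execution on NVIDIA GPUs. TensorRT is the inference engine developed by NVIDIA; it applies optimizations such as kernel fusion, graph optimization, and low-precision execution. The tool is developed in a Python environment, which makes the workflow easily accessible to researchers and engineers. Using it involves a few stages, which are introduced here.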

> Torch-TensorRT (FX Frontend) is in Beta, and it is recommended to use it with the PyTorch nightly build.

# Test an example by
$ python py/torch_tensorrt/fx/example/lower_example.py

Converting a PyTorch Model to a TensorRT Engine

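In general, users can call compile() to convert a model to a TensorRT engine. It is a wrapper API that covers the major steps of the conversion. For an example of its usage, see the lower_example.py file under examples/fx.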

def compile(
    module: nn.Module,
    input,
    max_batch_size=2048,
    max_workspace_size=33554432,
    explicit_batch_dimension=False,
    lower_precision=LowerPrecision.FP16,
    verbose_log=False,
    timing_cache_prefix="",
    save_timing_cache=False,
    cuda_graph_batch_size=-1,
    dynamic_batch=True,
) -> nn.Module:

    """
    Takes in original module, input and lowering setting, run lowering workflow to turn module
    into lowered module, or so called TRTModule.

    Args:
        module: Original module for lowering.
        input: Input for module.
        max_batch_size: Maximum batch size (must be >= 1 to be set, 0 means not set)
        max_workspace_size: Maximum size of workspace given to TensorRT.
        explicit_batch_dimension: Use explicit batch dimension in TensorRT if set True, otherwise use implicit batch dimension.
        lower_precision: lower_precision config given to TRTModule.
        verbose_log: Enable verbose log for TensorRT if set True.
        timing_cache_prefix: Timing cache file name for timing cache used by fx2trt.
        save_timing_cache: Update timing cache with current timing cache data if set to True.
        cuda_graph_batch_size: Cuda graph batch size, default to be -1.
        dynamic_batch: batch dimension (dim=0) is dynamic.
    Returns:
        A torch.nn.Module lowered by TensorRT.
    """

In this section, we walk through an example to illustrate the major steps of the FX path. Users can refer to the fx2trt_example.py file under examples/fx.

Step 1: Trace the model with acc_tracer

Acc_tracer is a tracer that inherits from the FX tracer. It comes with an args normalizer that converts all args to kwargs before they are passed to the TRT converters.

import torch_tensorrt.fx.tracer.acc_tracer.acc_tracer as acc_tracer

# Build the model which needs to be a PyTorch nn.Module.
my_pytorch_model = build_model()

# Prepare inputs to the model. Inputs have to be a List of Tensors
inputs = [Tensor, Tensor, ...]

# Trace the model with acc_tracer.
acc_mod = acc_tracer.trace(my_pytorch_model, inputs)

Common errors

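symbolically traced variables cannot be used as inputs to control flow: this error means the model contains dynamic control flow. See the "Dynamic Control Flow" section of the FX guide.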
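The reason for this error can be illustrated with a small pure-Python sketch. It is not the FX implementation: the `Proxy` class and `try_trace` helper are hypothetical stand-ins that only mimic the idea that a traced value has a name but no concrete data, so a branch cannot be decided while tracing.

```python
class Proxy:
    """Hypothetical stand-in for a symbolically traced value (no concrete data)."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # Comparisons are recorded symbolically and yield another proxy.
        return Proxy(f"gt({self.name}, {other})")

    def __bool__(self):
        # A branch needs a real True/False, which a proxy cannot provide.
        raise TypeError(
            "symbolically traced variables cannot be used as inputs to control flow"
        )


def relu_like(x):
    if x > 0:  # the branch depends on the value of x: dynamic control flow
        return x
    return 0


def try_trace(fn):
    """Call fn with a proxy input and report whether 'tracing' succeeded."""
    try:
        fn(Proxy("x"))
        return "traced"
    except TypeError as err:
        return f"trace failed: {err}"


status_branching = try_trace(relu_like)
status_straight_line = try_trace(lambda x: x)
```

A function without data-dependent branches traces fine in this sketch, while `relu_like` fails as soon as the `if` asks the proxy for a truth value.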

Step 2: Build the TensorRT engine

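TensorRT handles the batch dimension in one of two modes: explicit batch dimension and implicit batch dimension. Implicit batch mode was used by early versions of TensorRT and is now deprecated, though it is still supported for backward compatibility. In explicit batch mode, all dimensions are explicit and can be dynamic, meaning their lengths can change at execution time. Many new features, such as dynamic shapes and loops, are available only in this mode. Users can still choose implicit batch mode by setting explicit_batch_dimension=False in compile(). We do not recommend it, because it will lack support in future TensorRT versions.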

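Explicit batch is the default mode, and it must be set for dynamic shapes. For most vision tasks, users who want an effect similar to implicit mode, where only the batch dimension changes, can enable dynamic_batch in compile(). It has the following requirements:
1. The shapes of inputs, outputs, and activations are fixed except for the batch dimension.
2. Inputs, outputs, and activations have the batch dimension as the major dimension.
3. No operator in the model modifies the batch dimension (like permute, transpose, split, etc.) or computes over the batch dimension (like sum, softmax, etc.).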

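As an example of the last requirement, given a 3D tensor t of shape (batch, sequence, dimension), an operation such as torch.transpose(t, 0, 2) violates it. If any of these three requirements is not satisfied, we need to specify InputTensorSpec as inputs with a dynamic range.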

# Import the FX frontend classes from the torch_tensorrt.fx package
# (the same package TRTModule is imported from in Step 3).
import torch
from torch_tensorrt.fx import InputTensorSpec, TRTInterpreter

# InputTensorSpec is a dataclass we use to store input information.
# There're two ways we can build input_specs.
# Option 1, build it manually.
input_specs = [
  InputTensorSpec(shape=(1, 2, 3), dtype=torch.float32),
  InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]
# Option 2, build it from sample inputs provided by the user.
inputs = [
    torch.rand((1, 2, 3), dtype=torch.float32),
    torch.rand((1, 4, 5), dtype=torch.float32),
]
input_specs = InputTensorSpec.from_tensors(inputs)

# IMPORTANT: If dynamic shape is needed, we need to build it slightly differently.
input_specs = [
    InputTensorSpec(
        shape=(-1, 2, 3),
        dtype=torch.float32,
        # Currently only one set of dynamic ranges is supported. Users may make other
        # dimensions dynamic, but that is not guaranteed to work for every model.
        # (min_shape, optimize_target_shape, max_shape)
        # For more information refer to fx/input_tensor_spec.py
        shape_ranges = [
            ((1, 2, 3), (4, 2, 3), (100, 2, 3)),
        ],
    ),
    InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]

# Build a TRT interpreter. Set explicit_batch_dimension to True for explicit
# batch mode, or to False for implicit batch mode.
interpreter = TRTInterpreter(
    acc_mod, input_specs, explicit_batch_dimension=True
)

# The output of TRTInterpreter run() is wrapped as TRTInterpreterResult.
# The TRTInterpreterResult contains required parameter to build TRTModule,
# and other informational output from TRTInterpreter run.
class TRTInterpreterResult(NamedTuple):
    engine: Any
    input_names: Sequence[str]
    output_names: Sequence[str]
    serialized_cache: bytearray

# max_batch_size: set to the maximum batch size you will use.
# max_workspace_size: set to the maximum size we can afford for temporary buffers.
# lower_precision: the precision the model layers run in (TensorRT will choose the best-performing precision).
# sparse_weights: allow the builder to examine weights and use optimized functions when weights have suitable sparsity.
# force_fp32_output: force output to be fp32.
# strict_type_constraints: usually set to False unless we want to control the precision of certain layers for numeric reasons.
# algorithm_selector: set up algorithm selection for certain layers.
# timing_cache: enable timing cache for TensorRT.
# profiling_verbosity: TensorRT logging level.
trt_interpreter_result = interpreter.run(
    max_batch_size=64,
    max_workspace_size=1 << 25,
    sparse_weights=False,
    force_fp32_output=False,
    strict_type_constraints=False,
    algorithm_selector=None,
    timing_cache=None,
    profiling_verbosity=None,
)
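The (min_shape, optimize_target_shape, max_shape) triples passed as shape_ranges above have to agree with the declared shape. The following is a minimal sketch of such a consistency check, assuming the convention shown in the example (a -1 entry marks a dynamic dimension, all other entries are fixed). `check_shape_range` is a hypothetical helper written for illustration; it is not part of Torch-TensorRT or TensorRT.

```python
def check_shape_range(shape, shape_range):
    """Return True if (min, opt, max) is consistent with a shape that uses -1 for dynamic dims.

    Hypothetical helper for illustration only.
    """
    min_shape, opt_shape, max_shape = shape_range
    if not (len(shape) == len(min_shape) == len(opt_shape) == len(max_shape)):
        return False
    for dim, lo, opt, hi in zip(shape, min_shape, opt_shape, max_shape):
        # The optimization target must lie inside the allowed range.
        if not lo <= opt <= hi:
            return False
        # A static dimension must have the same value in all three shapes.
        if dim != -1 and not (lo == opt == hi == dim):
            return False
    return True


# The range from the example above: only the batch dimension varies.
valid = check_shape_range((-1, 2, 3), ((1, 2, 3), (4, 2, 3), (100, 2, 3)))
# A static dimension (3) that changes across the triple is inconsistent.
invalid = check_shape_range((-1, 2, 3), ((1, 2, 3), (4, 2, 4), (100, 2, 3)))
```

Running a check like this before building the interpreter can surface a mistyped range early, since engine builds are comparatively slow.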

Common errors

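RuntimeError: Conversion of function xxx not supported! - This means the xxx operator is not supported yet. For further instructions, see the "How to Add a Missing Op" section below.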

Step 3: Run the model

One way is to use TRTModule, which is essentially a PyTorch nn.Module.

from torch_tensorrt.fx import TRTModule
mod = TRTModule(
    trt_interpreter_result.engine,
    trt_interpreter_result.input_names,
    trt_interpreter_result.output_names)
# Just like all other PyTorch modules
outputs = mod(*inputs)
torch.save(mod, "trt.pt")
reload_trt_mod = torch.load("trt.pt")
reload_model_output = reload_trt_mod(*inputs)

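So far, we have explained in detail the major steps for converting a PyTorch model into a TensorRT engine. Users can refer to the source code for explanations of some parameters. The conversion scheme has two important components. One is the acc tracer, which converts a PyTorch model into an acc graph. The other is the FX path converter, which converts the operations in the acc graph to the corresponding TensorRT operations and builds the TensorRT engine.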

Acc Tracer

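Acc tracer is a custom FX symbolic tracer. It does a couple more things than the default FX symbolic tracer. We mainly rely on it to convert PyTorch ops or builtin ops to acc ops. fx2trt uses acc ops for two main reasons: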

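1. Many PyTorch ops and builtin ops perform similar operations, for example torch.add, builtin.add, and torch.Tensor.add. With the acc tracer, we normalize these three ops to a single acc_ops.add. This reduces the number of converters we need to write.
2. acc ops take only kwargs, which makes converters easier to write because no extra logic is needed to search for arguments in both args and kwargs.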
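The many-to-one normalization described above can be pictured with a small lookup table. This is an illustrative sketch only: `acc_add`, `ACC_OP_FOR`, and `normalize` are hypothetical names, and the string keys are placeholders for the real torch targets so that the example runs without PyTorch installed.

```python
import operator


def acc_add(*, input, other):
    """Hypothetical stand-in for acc_ops.add: keyword-only, like every acc op."""
    return input + other


# Every (op, target) pair that means "add" points at the same acc op.
# The string targets are placeholders for torch.add and torch.Tensor.add.
ACC_OP_FOR = {
    ("call_function", operator.add): acc_add,
    ("call_function", "torch.add"): acc_add,
    ("call_method", "add"): acc_add,
}


def normalize(op, target, args):
    """Map an original node to (acc op, kwargs), turning positional args into kwargs."""
    acc_op = ACC_OP_FOR[(op, target)]
    return acc_op, {"input": args[0], "other": args[1]}


results = []
for op, target in ACC_OP_FOR:
    acc_op, kwargs = normalize(op, target, (2, 3))
    results.append(acc_op(**kwargs))

# Three different source ops, but only one distinct acc op needs a converter.
distinct_acc_ops = len(set(ACC_OP_FOR.values()))
```

Because all three entries resolve to the same function, a single converter registered for that function would cover every variant.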

FX2TRT

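After symbolic tracing, we have a graph representation of the PyTorch model. fx2trt builds on fx.Interpreter, which walks the whole graph node by node and calls the function each node represents. fx2trt overrides this behavior by invoking the corresponding converter for each node. Each converter function adds the corresponding TensorRT layer(s).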

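Below is an example of a converter function. The decorator registers the converter function with the corresponding node. In this example, we register the converter for FX nodes whose target is acc_ops.sigmoid.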

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The rest arguments are attributes of fx node.
    """
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)
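The dispatch mechanism behind this decorator can be sketched without TensorRT. The code below is a simplified model under stated assumptions, not the fx2trt implementation: `CONVERTERS`, `converter`, `run_interpreter`, and the list standing in for a TensorRT network are all hypothetical.

```python
CONVERTERS = {}  # maps a node target to its converter function


def converter(target):
    """Hypothetical decorator that registers a converter for the given target."""
    def register(fn):
        CONVERTERS[target] = fn
        return fn
    return register


def acc_sigmoid(*, input):
    """Placeholder acc op; only its identity is used as a lookup key."""


@converter(acc_sigmoid)
def convert_sigmoid(network, target, args, kwargs, name):
    # Record a "layer" in the stand-in network and return a handle to its output.
    network.append(("SIGMOID", name, kwargs["input"]))
    return name


def run_interpreter(nodes):
    """Walk the nodes in order and call the registered converter for each one."""
    network, env = [], {}
    for name, target, kwargs in nodes:
        if target not in CONVERTERS:
            raise RuntimeError(f"no converter registered for node {name!r}")
        # Replace references to earlier nodes with the outputs they produced.
        resolved = {key: env.get(value, value) for key, value in kwargs.items()}
        env[name] = CONVERTERS[target](network, target, (), resolved, name)
    return network


graph = [
    ("sig1", acc_sigmoid, {"input": "x"}),
    ("sig2", acc_sigmoid, {"input": "sig1"}),
]
network = run_interpreter(graph)
```

The value returned by each converter is what later nodes receive as input, which mirrors how the output of a converter is handed on to the children of the node.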

How to Add a Missing Op

You can actually add it wherever you want; just remember to import the file so that all acc ops and mappers are registered before tracing with acc_tracer.

Step 1: Add a new acc op

TODO: Explain the logistics of acc ops in more detail, such as when to break down an op and when to reuse other ops.

In the acc tracer, we convert a node in the graph to an acc op if a mapping to an acc op is registered for that node.

Converting to acc ops requires two things: an acc op function must be defined, and a mapping must be registered.

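Defining an acc op is simple: write a function and register it as an acc op via the register_acc_op decorator from acc_normalizer.py. For example, the following code adds an acc op named foo() that adds two given inputs, scaling the second by alpha.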

# NOTE: all acc ops should only take kwargs as inputs, therefore we need the "*"
# at the beginning.
@register_acc_op
def foo(*, input, other, alpha):
    return input + alpha * other
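The effect of the leading "*" can be checked in plain Python. In this sketch the register_acc_op decorator is deliberately left off so the snippet runs without Torch-TensorRT; `call_positionally` is a hypothetical helper used only to show the failure.

```python
def foo(*, input, other, alpha):
    # Same body as the acc op above, without the registration decorator.
    return input + alpha * other


def call_positionally():
    """Try to call foo with positional arguments and report what happens."""
    try:
        foo(1, 2, 3)
    except TypeError:
        return "rejected"
    return "accepted"


# Keyword arguments work as expected: 1 + 3 * 2.
result = foo(input=1, other=2, alpha=3)
positional_outcome = call_positionally()
```

Because positional calls are rejected, a converter for this op can rely on finding every argument in kwargs.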

There are two ways to register a mapping. The first is register_acc_op_mapping(). Here we map torch.add to the foo() defined above by adding the register_acc_op_mapping decorator to it.

this_arg_is_optional = True

@register_acc_op_mapping(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

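op_and_target determines which nodes trigger this mapping. op and target are attributes of an FX node. During acc normalization, the mapping is triggered when a node has the same op and target as those set in op_and_target. Since we want to map from torch.add, op is call_function and target is torch.add.

arg_replacement_tuples determines how the kwargs of the new acc op node are constructed from the args and kwargs of the original node. Each tuple in arg_replacement_tuples represents one argument mapping rule and contains two or three elements. The first element is the argument name in the original node; its value is passed to the acc op node as the argument named by the second element. The third element is a boolean that indicates whether this kwarg is optional in the original node, and it only needs to be specified when it is True.

The order of the tuples matters, because the position of a tuple determines the position of the argument in the args of the original node. We use this information to map args of the original node to kwargs of the acc op node. We do not need to specify arg_replacement_tuples if neither of the following is true: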

1. The kwargs of the original node and the acc op node have different names.
2. There are optional arguments.
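The mapping rules for arg_replacement_tuples can be approximated in a few lines. This is a sketch of the described behavior, not the code in acc_normalizer.py; `build_acc_kwargs` is a hypothetical helper, and details such as how defaults for omitted optional arguments are handled are assumptions.

```python
def build_acc_kwargs(args, kwargs, arg_replacement_tuples):
    """Build acc op kwargs from the args/kwargs of the original node (illustrative only)."""
    new_kwargs = {}
    for position, entry in enumerate(arg_replacement_tuples):
        orig_name, new_name = entry[0], entry[1]
        optional = len(entry) == 3 and entry[2]
        if position < len(args):
            # The tuple's position identifies the matching positional arg.
            new_kwargs[new_name] = args[position]
        elif orig_name in kwargs:
            new_kwargs[new_name] = kwargs[orig_name]
        elif not optional:
            raise TypeError(f"missing required argument {orig_name!r}")
        # Assumption: an omitted optional argument is simply left out.
    return new_kwargs


rules = [
    ("input", "input"),
    ("other", "other"),
    ("alpha", "alpha", True),
]

# torch.add(a, b) style call: both values arrive positionally.
from_args = build_acc_kwargs((1, 2), {}, rules)
# torch.add(a, other=b, alpha=0.5) style call: a mix of args and kwargs.
mixed = build_acc_kwargs((1,), {"other": 2, "alpha": 0.5}, rules)
```

Reordering the tuples in `rules` would change which positional value lands in which kwarg, which is why their order matters.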

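The other way to register a mapping is register_custom_acc_mapper_fn(). It is designed to reduce redundant op registration, since it lets a single function map to one or more existing acc ops through some combination. Inside the function you can do anything you need. The following example shows how it works.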

@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

@register_custom_acc_mapper_fn(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
def custom_mapper(node: torch.fx.Node, _: nn.Module) -> torch.fx.Node:
    """
    `node` is original node, which is a call_function node with target
    being torch.add.
    """
    alpha = 1
    if "alpha" in node.kwargs:
        alpha = node.kwargs["alpha"]
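    # torch.fx.Node is not subscriptable; read the normalized arguments from node.kwargs.
    foo_kwargs = {"input": node.kwargs["input"], "other": node.kwargs["other"], "alpha": alpha}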
    with node.graph.inserting_before(node):
        foo_node = node.graph.call_function(foo, kwargs=foo_kwargs)
        foo_node.meta = node.meta.copy()
        return foo_node

In the custom mapper function, we construct an acc op node and return it. The returned node takes over all the children of the original node (see acc_normalizer.py).
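What "taking over the children" means can be shown on a toy graph. The `Node` class and `replace_all_uses` function below are hypothetical and far simpler than torch.fx; they only illustrate that every consumer of the original node is rewired to read from the returned node.

```python
class Node:
    """Toy graph node: a name plus the list of nodes it reads from."""

    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)


def replace_all_uses(nodes, old, new):
    """Rewire every node that consumed `old` so that it consumes `new` instead."""
    for node in nodes:
        node.inputs = [new if inp is old else inp for inp in node.inputs]


x = Node("x")
add = Node("torch_add", [x, x])   # original node
relu = Node("relu", [add])        # child of the original node
foo = Node("foo", [x, x])         # acc op node returned by the mapper

replace_all_uses([relu], old=add, new=foo)
child_inputs = [inp.name for inp in relu.inputs]
```

After the rewiring nothing reads from the original node any more, so it can be dropped from the graph.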

The last step is to add unit tests for the new acc op or mapper function. Unit tests belong in test_acc_tracer.py.

Step 2: Add a new converter

All converters developed for acc ops live in acc_op_converter.py, which provides good examples of how to add a converter.

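Essentially, a converter is the mechanism that maps acc ops to TensorRT layers. Once we have found all the TensorRT layers we need, we can start adding a converter for the node using the TensorRT APIs.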

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The rest arguments are attributes of fx node.
    """
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)

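We register a converter with the tensorrt_converter decorator. The argument of the decorator is the target of the FX node we need to convert. Inside the converter, the inputs of the FX node are available in kwargs. In the example, the original node is acc_ops.sigmoid, which has only one argument, "input", in acc_ops.py. We fetch the input and check that it is a TensorRT tensor. We then add a sigmoid layer to the TensorRT network and return the output of that layer. fx.Interpreter passes the returned output on to the children of the acc_ops.sigmoid node.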

What if we cannot find a TensorRT layer that provides the same functionality as the node?

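In this case, we need to do a bit more work. TensorRT provides plugins, which serve as custom layers. We have not implemented this feature yet and will update this guide once it is enabled.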

The last step is to add unit tests for the new converter. Users can add the corresponding unit tests in this folder.
