Engine Caching¶
As model sizes increase, so does the cost of compilation. With AOT (ahead-of-time) methods such as torch_tensorrt.dynamo.compile, this cost is paid upfront. However, if the weights change, the session ends, or you are using a JIT (just-in-time) method such as torch.compile, graphs get invalidated and recompiled, and the cost is paid again and again. Engine caching mitigates this cost by saving built engines to disk and reusing them when possible. This tutorial demonstrates how to use engine caching with TensorRT in PyTorch. Engine caching can significantly speed up subsequent model compilations by reusing previously built TensorRT engines.
We'll explore two approaches:

- Using torch_tensorrt.dynamo.compile
- Using torch.compile with the TensorRT backend

The example uses a pre-trained ResNet18 model and shows the difference between compilation without caching, with caching enabled, and when reusing cached engines.
import os
from typing import Dict, Optional
import numpy as np
import torch
import torch_tensorrt as torch_trt
import torchvision.models as models
from torch_tensorrt.dynamo._defaults import TIMING_CACHE_PATH
from torch_tensorrt.dynamo._engine_cache import BaseEngineCache
np.random.seed(0)
torch.manual_seed(0)
model = models.resnet18(pretrained=True).eval().to("cuda")
enabled_precisions = {torch.float}
debug = False
min_block_size = 1
use_python_runtime = False
def remove_timing_cache(path=TIMING_CACHE_PATH):
    if os.path.exists(path):
        os.remove(path)
Engine Caching for JIT Compilation¶
The primary goal of engine caching is to help speed up JIT workflows. torch.compile offers great flexibility in model construction, which makes it a good first tool to reach for when trying to speed up a workflow. However, compilation cost, and recompilation cost in particular, has historically been a barrier for many users. Before engine caching was added, if a subgraph was invalidated for any reason, the graph was rebuilt from scratch. Now, when an engine is built with cache_built_engines=True, it is saved to disk, keyed by a hash of the corresponding PyTorch subgraph. On a subsequent compilation, whether in the current session or a new one, the cache pulls the built engine and refits the weights, which can reduce compilation time by orders of magnitude. Consequently, for a new engine to be inserted into the cache (i.e. cache_built_engines=True), the engine must be refittable (immutable_weights=False). See Refitting Torch-TensorRT Programs with New Weights for more details.
def torch_compile(iterations=3):
    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # The 1st iteration is to measure the compilation time without engine caching
    # The 2nd and 3rd iterations are to measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100, 3, 224, 224)).to("cuda")]
        # remove timing cache and reset dynamo just for engine caching measurement
        remove_timing_cache()
        torch._dynamo.reset()

        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        compiled_model = torch.compile(
            model,
            backend="tensorrt",
            options={
                "use_python_runtime": True,
                "enabled_precisions": enabled_precisions,
                "debug": debug,
                "min_block_size": min_block_size,
                "immutable_weights": False,
                "cache_built_engines": cache_built_engines,
                "reuse_cached_engines": reuse_cached_engines,
            },
        )
        compiled_model(*inputs)  # trigger the compilation
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------torch_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


torch_compile()
Engine Caching for AOT Compilation¶
Similarly to the JIT workflow, AOT workflows can also benefit from engine caching. As the same architecture, or common subgraphs, get recompiled, the cache pulls the previously built engines and refits the weights.
def dynamo_compile(iterations=3):
    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    example_inputs = (torch.randn((100, 3, 224, 224)).to("cuda"),)
    # Mark the dim0 of inputs as dynamic
    batch = torch.export.Dim("batch", min=1, max=200)
    exp_program = torch.export.export(
        model, args=example_inputs, dynamic_shapes={"x": {0: batch}}
    )

    # The 1st iteration is to measure the compilation time without engine caching
    # The 2nd and 3rd iterations are to measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100 + i, 3, 224, 224)).to("cuda")]
        remove_timing_cache()  # remove timing cache just for engine caching measurement
        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        trt_gm = torch_trt.dynamo.compile(
            exp_program,
            tuple(inputs),
            use_python_runtime=use_python_runtime,
            enabled_precisions=enabled_precisions,
            debug=debug,
            min_block_size=min_block_size,
            immutable_weights=False,
            cache_built_engines=cache_built_engines,
            reuse_cached_engines=reuse_cached_engines,
            engine_cache_size=1 << 30,  # 1GB
        )
        # output = trt_gm(*inputs)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------dynamo_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


dynamo_compile()
Custom Engine Cache¶
By default, the engine cache is stored in the system's temporary directory. Both the cache directory and the size limit can be customized by passing engine_cache_dir and engine_cache_size, as sketched below. Users can also define their own engine cache implementation by extending the BaseEngineCache class. This allows for remote or shared caching if so desired.
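For instance, a disk-backed cache in a custom location with a custom size limit can be requested directly on the compile call. The following is a minimal sketch, not part of the original script: the directory path is a placeholder, and exp_program and inputs are assumed to be an exported program and example inputs as in the AOT example above.

# Minimal sketch (placeholder values): customize where the default disk-based
# engine cache lives and how large it may grow.
# `exp_program` and `inputs` are assumed to come from the AOT example above.
trt_gm = torch_trt.dynamo.compile(
    exp_program,
    tuple(inputs),
    immutable_weights=False,
    cache_built_engines=True,
    reuse_cached_engines=True,
    engine_cache_dir="/tmp/torch_trt_engine_cache",  # placeholder directory
    engine_cache_size=1 << 30,  # 1 GB size limit
)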
A custom engine cache should implement the following methods:

- save: save the engine blob to the cache.
- load: load the engine blob from the cache.

The hash provided by the cache system is a weight-agnostic hash of the originating PyTorch subgraph (post lowering). The blob contains a serialized engine, calling spec data, and weight map information in the pickle format.
Below is an example of a custom engine cache implementation that implements a RAMEngineCache.
class RAMEngineCache(BaseEngineCache):
    def __init__(
        self,
    ) -> None:
        """
        Constructs a user held engine cache in memory.
        """
        self.engine_cache: Dict[str, bytes] = {}

    def save(
        self,
        hash: str,
        blob: bytes,
    ):
        """
        Insert the engine blob to the cache.

        Args:
            hash (str): The hash key to associate with the engine blob.
            blob (bytes): The engine blob to be saved.

        Returns:
            None
        """
        self.engine_cache[hash] = blob

    def load(self, hash: str) -> Optional[bytes]:
        """
        Load the engine blob from the cache.

        Args:
            hash (str): The hash key of the engine to load.

        Returns:
            Optional[bytes]: The engine blob if found, None otherwise.
        """
        if hash in self.engine_cache:
            return self.engine_cache[hash]
        else:
            return None
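Before wiring the cache into torch.compile, its save/load behavior can be exercised directly. This is an illustrative snippet only; the hash and blob below are dummy values, not real engine data.

# Illustrative only: exercise RAMEngineCache.save/load with dummy data.
_demo_cache = RAMEngineCache()
_demo_cache.save("dummy_hash", b"serialized-engine-bytes")
assert _demo_cache.load("dummy_hash") == b"serialized-engine-bytes"
assert _demo_cache.load("missing_hash") is None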
def torch_compile_my_cache(iterations=3):
    times = []
    engine_cache = RAMEngineCache()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # The 1st iteration is to measure the compilation time without engine caching
    # The 2nd and 3rd iterations are to measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100, 3, 224, 224)).to("cuda")]
        # remove timing cache and reset dynamo just for engine caching measurement
        remove_timing_cache()
        torch._dynamo.reset()

        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        compiled_model = torch.compile(
            model,
            backend="tensorrt",
            options={
                "use_python_runtime": True,
                "enabled_precisions": enabled_precisions,
                "debug": debug,
                "min_block_size": min_block_size,
                "immutable_weights": False,
                "cache_built_engines": cache_built_engines,
                "reuse_cached_engines": reuse_cached_engines,
                "custom_engine_cache": engine_cache,
            },
        )
        compiled_model(*inputs)  # trigger the compilation
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------torch_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


torch_compile_my_cache()
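Because torch_compile_my_cache creates its RAMEngineCache inside the function, the cached blobs are discarded when it returns. To inspect what was cached, keep the cache instance at module scope. The following is a minimal sketch under that assumption, not part of the original script.

# Minimal sketch: hold the cache at module scope so it can be inspected after
# compilation. Each entry maps a weight-agnostic subgraph hash to an engine blob.
inspectable_cache = RAMEngineCache()
compiled = torch.compile(
    model,
    backend="tensorrt",
    options={
        "use_python_runtime": True,
        "enabled_precisions": enabled_precisions,
        "min_block_size": min_block_size,
        "immutable_weights": False,
        "cache_built_engines": True,
        "reuse_cached_engines": True,
        "custom_engine_cache": inspectable_cache,
    },
)
compiled(torch.rand((100, 3, 224, 224)).to("cuda"))  # trigger compilation
for engine_hash, blob in inspectable_cache.engine_cache.items():
    print(engine_hash, "->", len(blob), "bytes")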