Weight Streaming
Weight streaming in TensorRT is a powerful feature designed to overcome GPU memory constraints when working with large models. It enables running models larger than the available GPU memory by streaming weight data from host (CPU) memory to GPU memory during inference.

Streaming more memory will likely result in lower performance. However, if weight streaming allows the user to run larger batch sizes, it can improve throughput, and this increased throughput can sometimes outweigh the slowdown caused by streaming weights. The optimal amount of memory to stream varies with the specific model and hardware, so experimenting with different memory limits helps find the best balance between streaming overhead and batch size benefits; a sketch of such a sweep appears at the end of this example.

This example uses a pre-trained Llama-2 model and demonstrates how to use the weight streaming feature with Torch-TensorRT.

Compilation option - Build the TRT engine with the weight streaming feature
Runtime API - Control the weight streaming budget through a context manager
Imports and Model Definition
import copy
import timeit
import numpy as np
import torch
import torch_tensorrt
from transformers import AutoModelForCausalLM
from utils import export_llm
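The export_llm helper imported above ships alongside this example rather than with the torch_tensorrt package. Below is a minimal sketch of what such a helper might look like, assuming it simply wraps torch.export with a dynamic sequence-length dimension; the name export_llm_sketch and its defaults are illustrative, not the actual implementation.

from torch.export import Dim, export

def export_llm_sketch(model, inputs, min_seq_len=2, max_seq_len=64):
    # Hypothetical stand-in for the export_llm helper used in this example.
    # Mark dim 1 (sequence length) as dynamic so the exported program, and the
    # TRT engine later built from it, accepts the growing inputs produced by
    # the greedy decoding loop in time_generate below.
    seq_len = Dim("seq_len", min=min_seq_len, max=max_seq_len)
    with torch.no_grad():
        return export(model, (inputs,), dynamic_shapes=({1: seq_len},))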
def time_generate(model, inputs, output_seq_length, iterations=10):
    """
    Measure the time for generating a sentence over a certain number of iterations
    """
    # We only support single input (B x seq_len) for LLMs now
    input_seq = inputs[0]
    with torch.no_grad():
        timings = []
        for _ in range(iterations):
            start_time = timeit.default_timer()
            inputs_copy = copy.copy(input_seq)
            # Greedy decoding of the model. This generates up to max_tokens.
            while inputs_copy.shape[1] <= output_seq_length:
                outputs = model(inputs_copy)
                logits = outputs.logits
                next_token_logits = logits[:, -1, :]
                next_tokens = torch.argmax(next_token_logits, dim=-1)
                inputs_copy = torch.cat([inputs_copy, next_tokens[:, None]], dim=-1)
            torch.cuda.synchronize()
            end_time = timeit.default_timer()
            timings.append(end_time - start_time)

    times = np.array(timings)
    time_mean_ms = np.mean(times) * 1000

    return time_mean_ms
# Load the LLaMA-2 model
DEVICE = torch.device("cuda:0")
llama_path = "meta-llama/Llama-2-7b-chat-hf"
with torch.no_grad():
    model = AutoModelForCausalLM.from_pretrained(
        llama_path, use_cache=False, attn_implementation="eager"
    ).eval()
# Set input and output sequence lengths
isl = 128
osl = 256
# Create random input tensors
input_tensors = [torch.randint(0, 5, (1, isl), dtype=torch.int64).cuda()]
# Convert the model to half precision (FP16)
model = model.half()
# Exports the LLM model into an ExportedProgram with dynamic shapes.
llama2_ep = export_llm(model, input_tensors[0], max_seq_len=osl)
Compiler options

Building an engine with the weight streaming feature requires the enable_weight_streaming=True option and the use_explicit_typing=True option. The use_explicit_typing=True option creates a strongly typed network, and only float32 precision is allowed in the enabled_precisions option.
# Create a TensorRT-compiled model
trt_model = torch_tensorrt.dynamo.compile(
    llama2_ep,
    inputs=input_tensors,
    enabled_precisions={torch.float32},
    truncate_double=True,
    device=DEVICE,
    use_explicit_typing=True,
    enable_weight_streaming=True,
)
# Warm up for 3 iterations
_ = time_generate(trt_model, input_tensors, osl, 3)
Running with automatic budget size

Once the enable_weight_streaming compile option is specified, an automatic budget size is configured. This automatic size may not always provide the optimal solution, because the automatically determined budget lacks insight into the user's specific memory constraints and usage patterns.
# Weight streaming context to get current weight budget information
weight_streaming_ctx = torch_tensorrt.runtime.weight_streaming(trt_model)
# Measure the mean latency of the model with weight streaming
mean_latency = time_generate(trt_model, input_tensors, osl, 1)
# Calculate the percentage of current weight budget used
weight_budget_pct = (
    weight_streaming_ctx.device_budget / weight_streaming_ctx.total_device_budget * 100
)
print(
    f"Set weight streaming budget as {weight_budget_pct}%. {weight_streaming_ctx.device_budget} bytes out of {weight_streaming_ctx.total_device_budget}. mean latency = {mean_latency} ms"
)
Running with weight streaming context manager

The weight streaming budget can be limited by using the weight streaming context manager. The permitted range of the budget size is from 0 to ctx.total_device_budget. 0 means maximum memory savings by using the minimum amount of memory; a value equal to ctx.total_device_budget disables weight streaming. If multiple TRT engines are created, the budget is distributed proportionally among them.
# Use a context manager for weight streaming
with torch_tensorrt.runtime.weight_streaming(trt_model) as weight_streaming_ctx:
    # Get the total size of streamable weights in the engine
    streamable_budget = weight_streaming_ctx.total_device_budget

    # Scenario 1: Automatic weight streaming budget
    # Get the automatically determined weight streaming budget
    requested_budget = weight_streaming_ctx.get_automatic_weight_streaming_budget()
    # Set the device budget to the automatically determined value
    weight_streaming_ctx.device_budget = requested_budget
    # Measure the mean latency with automatic budget
    mean_latency = time_generate(trt_model, input_tensors, osl, 1)
    # Calculate the percentage of the weight budget used
    weight_budget_pct = (
        weight_streaming_ctx.device_budget
        / weight_streaming_ctx.total_device_budget
        * 100
    )
    print(
        f"Set auto weight streaming budget as {weight_budget_pct}%. {weight_streaming_ctx.device_budget} bytes out of {weight_streaming_ctx.total_device_budget}. mean latency = {mean_latency} ms"
    )

    # Scenario 2: Manual 10% weight streaming budget
    # Set the budget to 10% of the total streamable weights
    requested_budget = int(streamable_budget * 0.1)
    weight_streaming_ctx.device_budget = requested_budget
    # Measure the mean latency with 10% budget
    mean_latency = time_generate(trt_model, input_tensors, osl, 1)
    # Calculate the percentage of the weight budget used
    weight_budget_pct = (
        weight_streaming_ctx.device_budget
        / weight_streaming_ctx.total_device_budget
        * 100
    )
    print(
        f"Set weight streaming budget as {weight_budget_pct}%. {weight_streaming_ctx.device_budget} bytes out of {weight_streaming_ctx.total_device_budget}. mean latency = {mean_latency} ms"
    )
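As noted at the start, the optimal streaming budget depends on the model and hardware. Below is a minimal sketch of the experimentation suggested there, reusing only the runtime API shown above to sweep a few budget fractions and compare latencies; the fraction values are arbitrary choices for illustration.

# Sweep several weight streaming budget fractions and report mean latency.
# A fraction of 1.0 disables streaming (budget == total_device_budget).
with torch_tensorrt.runtime.weight_streaming(trt_model) as ws_ctx:
    total_budget = ws_ctx.total_device_budget
    for fraction in (0.05, 0.1, 0.25, 0.5, 1.0):
        ws_ctx.device_budget = int(total_budget * fraction)
        latency = time_generate(trt_model, input_tensors, osl, 1)
        print(f"budget fraction {fraction:.0%}: mean latency = {latency} ms")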