Profiling to understand torch.compile performance

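What to use torch.profiler for:

torch.profiler is helpful for understanding the performance of your program at kernel-level granularity: for example, it can show graph breaks and resource utilization across the whole program. The data it provides often helps you decide which parts of the model's performance to investigate further.

To understand the performance of an individual kernel in more depth, use other tools, such as NVIDIA Nsight Compute, AMD Omnitrace, Intel® VTune™ Profiler, or inductor's profiling tools.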

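Basics of using torch.profiler and viewing traces

Example program: we'll use this example of profiling resnet18. Note the following parts of this example program:

Include a warm-up run to wait for compilation to complete (this also warms up systems such as the CUDA caching allocator).
Use the torch.profiler.profile() context to profile the section we are interested in.
Use prof.export_chrome_trace("trace.json") to export the profiling artifact.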

import torch
from torchvision.models import resnet18

device = 'cuda'      # or 'cpu', 'xpu', etc.
model = resnet18().to(device)

inputs = [torch.randn((5, 3, 224, 224), device=device) for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

# warm up
fwd_bwd(inputs[0])

with torch.profiler.profile() as prof:
    for i in range(1, 4):
        fwd_bwd(inputs[i])
        prof.step()

prof.export_chrome_trace("trace.json")

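Viewing chrome traces: in the Chrome browser, open chrome://tracing and load the json file. Use the "w" and "s" keys to zoom in and out, and the "a" and "d" keys to scroll left and right. Pressing "?" shows a help screen with a list of shortcuts.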

Example of a basic chrome trace, visualized in the chrome://tracing viewer

Here, we observe:

CompiledFunction and CompiledFunctionBackward events, which correspond to the dynamo-compiled regions.
CPU events at the top, and GPU events at the bottom.
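The same observations can be made programmatically, since the exported file is plain JSON. The following is a minimal sketch, assuming the file follows the usual Chrome trace layout (a "traceEvents" list of events with "ph", "cat", and "name" fields). The helper names `load_trace_events` and `summarize`, the category strings, and the sample events are illustrative assumptions, not part of any PyTorch API.

```python
import json
from collections import Counter


def load_trace_events(path):
    """Load the event list from a Chrome trace JSON file."""
    with open(path) as f:
        data = json.load(f)
    # A Chrome trace is either a bare list of events or a dict with "traceEvents".
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize(events):
    """Count complete ("X") events per category and per CompiledFunction* name."""
    by_category = Counter()
    compiled = Counter()
    for ev in events:
        if ev.get("ph") != "X":  # skip metadata, flow, and instant events
            continue
        by_category[ev.get("cat", "<none>")] += 1
        if ev.get("name", "").startswith("CompiledFunction"):
            compiled[ev["name"]] += 1
    return by_category, compiled


# Hand-written events standing in for a real trace (illustrative only;
# the category names are assumptions about what a typical trace contains).
sample_events = [
    {"ph": "X", "cat": "cpu_op", "name": "CompiledFunction", "ts": 0, "dur": 50},
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaLaunchKernel", "ts": 10, "dur": 5},
    {"ph": "X", "cat": "kernel", "name": "some_gpu_kernel", "ts": 20, "dur": 8},
    {"ph": "X", "cat": "cpu_op", "name": "CompiledFunctionBackward", "ts": 60, "dur": 40},
    {"ph": "M", "name": "process_name"},
]

by_category, compiled = summarize(sample_events)
print(dict(by_category))
print(dict(compiled))
```

To run this on a real trace, pass the result of `load_trace_events("trace.json")` to `summarize` instead of `sample_events`.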

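Flows between CPU and accelerator events

Every kernel on the accelerator runs after being launched by code running on the CPU. The profiler can draw connections (i.e. "flows") between accelerator and CPU events to show which CPU event launched an accelerator kernel. This is particularly helpful because, with a few exceptions, accelerator kernels are launched asynchronously.

To view a flow connection, click on a GPU kernel and then click "ac2g":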

Visualization in the chrome://tracing viewer, showing an async flow between a kernel and its launching location.

Alternatively, turn on all flows with the "Flow events" dropdown at the top.
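Flows can also be inspected outside the viewer. The sketch below assumes that flows are stored in the trace as Chrome-trace flow events: a start event with "ph" set to "s" and a finish event with "ph" set to "f", sharing the same "id" and the category "ac2g". Under that assumption, the difference between the two timestamps approximates how long a kernel waited between its CPU-side launch and its start on the accelerator. The function name `launch_latencies` and the sample events are hypothetical.

```python
def launch_latencies(events, flow_cat="ac2g"):
    """Pair flow start ("s") and finish ("f") events by id.

    Returns a dict mapping flow id to (finish ts - start ts), in the trace's
    timestamp unit (microseconds in the Chrome trace format).
    """
    starts, finishes = {}, {}
    for ev in events:
        if ev.get("cat") != flow_cat:
            continue
        if ev.get("ph") == "s":
            starts[ev["id"]] = ev["ts"]
        elif ev.get("ph") == "f":
            finishes[ev["id"]] = ev["ts"]
    # Keep only flows for which both ends were recorded.
    return {fid: finishes[fid] - starts[fid] for fid in starts if fid in finishes}


# Hand-written flow events standing in for a real trace (illustrative only).
sample_flows = [
    {"ph": "s", "cat": "ac2g", "id": 1, "ts": 100},
    {"ph": "f", "cat": "ac2g", "id": 1, "ts": 130},
    {"ph": "s", "cat": "ac2g", "id": 2, "ts": 200},
    {"ph": "f", "cat": "ac2g", "id": 2, "ts": 450},
    {"ph": "s", "cat": "ac2g", "id": 3, "ts": 500},  # no matching finish event
]

latencies = launch_latencies(sample_flows)
print(latencies)
```

A latency that keeps growing over the course of a step suggests the CPU is running ahead of the GPU, while latencies that stay near zero suggest the GPU is waiting on the CPU.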

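Working around CUDA Graph profiling issues

When CUDA graphs are enabled, some CUDA configurations (driver version < 525.85.12 or CUDA < 12) can run into issues between the profiling tools and CUDA graphs. To fix these issues, add an empty profiling context at the top of your program: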

import torch

# Run an empty profiling context up front (workaround for the issue described above).
torch.profiler._utils._init_for_cuda_graphs()

# ... rest of program

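Understanding compilation time

To understand why compilation is taking a long time, you can profile the first invocation of a torch.compile-ed program. Keep in mind that profile traces of compilation can be more distorted than typical profiles, because compilation workloads can be quite different from typical PyTorch workloads. In some cases, trace files may also be quite large. Traces larger than 1GB can be difficult to open with the chrome tracing tool.

Note: roughly the same information can also be obtained in non-graphical format with torch._dynamo.utils.compile_times(). This utility won't show when the compilation steps occur, but it will show the amount of time spent on each step, and the times will not be affected by any profiling overhead.

See the example below: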

import torch
from torchvision.models import resnet18

# user can switch between cuda and xpu
device = 'cuda'
model = resnet18().to(device)
inputs = [torch.randn((5, 3, 224, 224), device=device) for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

def warmup_compile():
    def fn(x):
        return x.sin().relu()

    x = torch.rand((2, 2), device=device, requires_grad=True)
    fn_c = torch.compile(fn)
    out = fn_c(x)
    out.sum().backward()

with torch.profiler.profile() as prof:
    with torch.profiler.record_function("warmup compile"):
        warmup_compile()

    with torch.profiler.record_function("resnet18 compile"):
        fwd_bwd(inputs[0])

prof.export_chrome_trace("trace_compile.json")

A visualization in the chrome://tracing viewer, showing dynamo and inductor compilation steps

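Note a few things:

The first invocation should occur during profiling in order to capture compilation.
Add a warm-up compilation in order to initialize any systems that need to be lazily initialized.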

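Finding graph breaks: "Torch-Compiled Region" and "CompiledFunction"

Although there are logging tools for identifying graph breaks, the profiler provides a quick visual method of identifying them. There are two profiler events to look for: Torch-Compiled Region and CompiledFunction.

Torch-Compiled Region, introduced in PyTorch 2.2, is a profiler event that covers the entire compiled region. Graph breaks almost always look the same: nested "Torch-Compiled Region" events.

If you run two separate functions with torch.compile() applied independently to each of them, you should generally expect to see two adjacent (i.e. not stacked or nested) Torch-Compiled Regions. If instead you encounter graph breaks (or disable()/skipped regions), expect to see nested "Torch-Compiled Region" events.

CompiledFunction, introduced in PyTorch 2.0, is a profiler event that appears only when Autograd is involved, i.e. when some of the input tensors to the graph have requires_grad=True. Each graph break interrupts a CompiledFunction block, splitting it in two.

When a CompiledFunction appears in a trace, it is typically paired with a CompiledFunctionBackward event in the backward pass. If the backward function is called, a "fwd-bwd link" connecting the two should appear in the trace.

If your use case includes a graph that doesn't require grad and doesn't include "Torch-Compiled Region" events, it can be harder to tell whether torch.compile is being applied correctly. One clue can be the existence of Inductor-generated Triton kernels.

See the synthetic example below for a demonstration: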

import torch
import torch._dynamo
# user can switch between cuda and xpu
device = 'cuda'

class ModelWithBreaks(torch.nn.Module):
    def __init__(self):
        super().__init__()
        def create_sequential():
            return torch.nn.Sequential(
                torch.nn.Linear(128, 128),
                torch.nn.ReLU(),
                torch.nn.Linear(128, 128),
                torch.nn.ReLU(),
            )
        self.mod1 = create_sequential()
        self.mod2 = create_sequential()
        self.mod3 = create_sequential()
        self.mod4 = create_sequential()

    def forward(self, inp):
        mod1 = self.mod1(inp)
        torch._dynamo.graph_break()
        mod2 = self.mod2(mod1)
        torch._dynamo.graph_break()
        mod3 = self.mod3(mod2)
        torch._dynamo.graph_break()
        mod4 = self.mod4(mod3)
        return mod4

model = ModelWithBreaks().to(device)
inputs = [torch.randn((128, 128), device=device) for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

# warm up
fwd_bwd(inputs[0])

with torch.profiler.profile() as prof:
    for i in range(1, 4):
        fwd_bwd(inputs[i])
        prof.step()

prof.export_chrome_trace("trace_break.json")

Visualization in the chrome://tracing viewer, showing nested Torch-Compiled Region events and multiple CompiledFunction events - indicating graph breaks.
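The nesting check can also be automated on the exported JSON. The sketch below is one possible approach, assuming that compiled regions appear as complete ("X") events whose names start with "Torch-Compiled Region" and that nesting is only meaningful among events on the same process and thread. The function name `nested_compiled_regions` and the sample events are hypothetical.

```python
from collections import defaultdict

REGION_PREFIX = "Torch-Compiled Region"


def nested_compiled_regions(events, prefix=REGION_PREFIX):
    """Return (name, ts, depth) for every compiled region nested inside another."""
    per_thread = defaultdict(list)
    for ev in events:
        if ev.get("ph") == "X" and ev.get("name", "").startswith(prefix):
            per_thread[(ev.get("pid"), ev.get("tid"))].append(ev)

    nested = []
    for regions in per_thread.values():
        # Sort by start time; for equal starts, the longer (outer) event comes first.
        regions.sort(key=lambda ev: (ev["ts"], -ev["dur"]))
        open_ends = []  # end timestamps of regions that are still open
        for ev in regions:
            # Close every region that ended before this one started.
            while open_ends and open_ends[-1] <= ev["ts"]:
                open_ends.pop()
            if open_ends:
                nested.append((ev["name"], ev["ts"], len(open_ends)))
            open_ends.append(ev["ts"] + ev["dur"])
    return nested


# Hand-written events standing in for a real trace (illustrative only).
sample_regions = [
    {"ph": "X", "name": "Torch-Compiled Region", "pid": 1, "tid": 1, "ts": 0, "dur": 100},
    {"ph": "X", "name": "Torch-Compiled Region", "pid": 1, "tid": 1, "ts": 10, "dur": 20},
    {"ph": "X", "name": "Torch-Compiled Region", "pid": 1, "tid": 1, "ts": 40, "dur": 20},
    {"ph": "X", "name": "Torch-Compiled Region", "pid": 1, "tid": 1, "ts": 200, "dur": 50},
    {"ph": "X", "name": "aten::mm", "pid": 1, "tid": 1, "ts": 12, "dur": 5},
]

nested = nested_compiled_regions(sample_regions)
print(nested)
```

An empty result means all compiled regions are adjacent; any entry in the list is a candidate graph break (or a disabled/skipped region) worth inspecting in the viewer.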

Operator kernels

When an operator is launched, we expect to see a few events:

CPU-side event
Kernel launch (if dealing with a GPU kernel)
GPU-side event

Visualization in the chrome://tracing viewer, showing the three types of events: CPU-side event, kernel launch, and GPU-side event

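Inductor-generated Triton kernels

The CPU-side event should appear as an event prefixed with "triton_". These events currently carry minimal information (the kernel name and a launch), which is less than typical aten kernel launches (which include input shapes, types, etc.).
The kernel launch should appear as cuLaunchKernel instead of cudaLaunchKernel (cudaLaunchKernel is typical for aten ops).
The GPU-side event should appear; how descriptive its name is depends on the unique_kernel_names setting in the inductor config.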

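Non-Inductor-generated Triton kernels

The CPU-side event may not appear in traces; the machinery for automatically inserting a profiler event is currently implemented at the Inductor level, so Triton kernels that bypass Inductor may not appear in traces unless users have annotated them manually.
The kernel launch should appear as cuLaunchKernel instead of cudaLaunchKernel (cudaLaunchKernel is typical for aten ops).
The GPU-side event should appear, named similarly to the Triton kernel that was authored.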

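Inductor-generated CPU kernels

The CPU-side event will not appear in traces; we haven't added profiling for this yet.
The kernel launch and GPU-side events don't exist.

Non-Triton kernels (i.e. aten kernels or custom ops) should also be expected to appear in traces sometimes. Inductor sometimes falls back to the original op implementation, in which case you will see a call to the aten op.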
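The naming conventions above make it possible to get a rough breakdown of Triton versus aten launches from the exported JSON. This is a minimal sketch, assuming GPU-side events carry the category "kernel" (so that a "triton_"-prefixed event with any other category is treated as CPU-side). The function name `classify_launches`, the category strings, and the sample events are illustrative assumptions.

```python
from collections import Counter


def classify_launches(events):
    """Count launch calls and "triton_"-prefixed events in a list of trace events."""
    counts = Counter()
    for ev in events:
        if ev.get("ph") != "X":  # only complete events carry the names we need
            continue
        name = ev.get("name", "")
        if name in ("cuLaunchKernel", "cudaLaunchKernel"):
            counts[name] += 1
        elif name.startswith("triton_"):
            # Assumption: GPU-side events use the "kernel" category.
            side = "gpu" if ev.get("cat") == "kernel" else "cpu"
            counts[f"triton_{side}_events"] += 1
    return counts


# Hand-written events standing in for a real trace (illustrative only).
sample_ops = [
    {"ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_relu_0"},
    {"ph": "X", "cat": "cuda_driver", "name": "cuLaunchKernel"},
    {"ph": "X", "cat": "kernel", "name": "triton_poi_fused_relu_0"},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm"},
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaLaunchKernel"},
    {"ph": "X", "cat": "kernel", "name": "sgemm_kernel"},
]

launch_counts = classify_launches(sample_ops)
print(dict(launch_counts))
```

A trace with many cudaLaunchKernel calls and few cuLaunchKernel calls suggests that most work is still going through aten ops rather than Inductor-generated Triton kernels.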

Launch overhead

One common issue is poor GPU utilization. A quick way to identify this is to check whether there are large gaps between kernels on the GPU:

Visualization in the chrome://tracing viewer, showing large gaps between GPU kernels. This indicates that the model is CPU bound, likely due to overhead during kernel launches.

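This is often the result of CPU overhead, e.g. when the time spent on the CPU between kernel launches is larger than the time the GPU spends processing the kernels. The issue is more common with small batch sizes.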
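Gaps between kernels can be quantified from the exported JSON rather than judged by eye. The sketch below assumes GPU kernels appear as complete ("X") events with the category "kernel" and with "ts" and "dur" in microseconds, and it treats all GPU streams as a single timeline. The function names `gpu_idle_gaps` and `gpu_busy_fraction` and the sample events are hypothetical.

```python
def gpu_idle_gaps(events, min_gap_us=0.0):
    """Return (gap_start, gap_length) for idle periods between GPU kernels."""
    kernels = sorted(
        (ev for ev in events if ev.get("ph") == "X" and ev.get("cat") == "kernel"),
        key=lambda ev: ev["ts"],
    )
    gaps = []
    busy_until = None  # latest end time of any kernel seen so far
    for ev in kernels:
        if busy_until is not None and ev["ts"] - busy_until > min_gap_us:
            gaps.append((busy_until, ev["ts"] - busy_until))
        end = ev["ts"] + ev["dur"]
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


def gpu_busy_fraction(events):
    """Fraction of the time between first kernel start and last kernel end spent busy."""
    kernels = [ev for ev in events if ev.get("ph") == "X" and ev.get("cat") == "kernel"]
    if not kernels:
        return 0.0
    start = min(ev["ts"] for ev in kernels)
    end = max(ev["ts"] + ev["dur"] for ev in kernels)
    idle = sum(length for _, length in gpu_idle_gaps(events))
    return (end - start - idle) / (end - start)


# Hand-written kernel events standing in for a real trace (illustrative only).
sample_kernels = [
    {"ph": "X", "cat": "kernel", "name": "k0", "ts": 0, "dur": 10},
    {"ph": "X", "cat": "kernel", "name": "k1", "ts": 50, "dur": 10},
    {"ph": "X", "cat": "kernel", "name": "k2", "ts": 55, "dur": 15},  # overlaps k1
    {"ph": "X", "cat": "kernel", "name": "k3", "ts": 100, "dur": 10},
]

gaps = gpu_idle_gaps(sample_kernels)
busy = gpu_busy_fraction(sample_kernels)
print(gaps, round(busy, 3))
```

A low busy fraction combined with many short kernels is consistent with the CPU-bound pattern described above; raising `min_gap_us` filters out small gaps that are unlikely to matter.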

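When using inductor, enabling CUDA graphs (for example with torch.compile(model, mode="reduce-overhead")) can often help improve performance when launch overhead is a concern.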