Profiling to understand torch.compile performance

What to use torch.profiler for:

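torch.profiler helps you understand the performance of your program at kernel-level granularity. For example, it can show graph breaks and GPU utilization at the level of the whole program. The data the profiler provides often tells you where to investigate further to understand model performance.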

Other tools are available for digging into the performance of individual kernels, such as NVIDIA's ncu tool or Inductor's own profiling tools.

Basics of using torch.profiler and viewing traces

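Example program: we will use this example of profiling resnet18. Note the following parts of the example program:

It includes a warm-up run so that compilation finishes before profiling starts (this also warms up systems such as the CUDA caching allocator).
It uses a torch.profiler.profile() context to profile the section we are interested in.
It uses prof.export_chrome_trace("trace.json") to export the profiling artifact.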

import torch
from torchvision.models import resnet18

model = resnet18().cuda()
inputs = [torch.randn((5, 3, 224, 224), device='cuda') for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

# warm up
fwd_bwd(inputs[0])

with torch.profiler.profile() as prof:
    for i in range(1, 4):
        fwd_bwd(inputs[i])
        prof.step()

prof.export_chrome_trace("trace.json")

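Viewing Chrome traces: in the Chrome browser, open chrome://tracing and load the json file. Use the "w" and "s" keys to zoom in and out, and the "a" and "d" keys to scroll left and right. "?" shows a help screen with a list of shortcuts.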

Example of a basic chrome trace, visualized in the chrome://tracing viewer

Here, we observe:
* CompiledFunction and CompiledFunctionBackward events, which correspond to the dynamo-compiled regions.
* CPU events at the top and GPU events at the bottom.
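For a quick textual look at the same trace, the exported file can also be read as plain JSON. The following is a minimal sketch, assuming the file follows the Chrome trace event layout (a top-level "traceEvents" list whose complete events have "ph" set to "X" and carry "name" and "dur" fields). The helpers load_trace and summarize_trace are hypothetical names introduced here, and the sample events are hand-written stand-ins for real profiler output.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Read an exported Chrome trace (e.g. "trace.json") into a dict."""
    with open(path) as f:
        return json.load(f)


def summarize_trace(trace, top=5):
    """Return the `top` event names ranked by total duration (microseconds)."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # Assumption: "X" marks a complete event with a start ("ts") and a duration ("dur").
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Tiny hand-written trace standing in for the result of load_trace("trace.json").
sample_trace = {
    "traceEvents": [
        {"ph": "X", "name": "CompiledFunction", "ts": 0, "dur": 120},
        {"ph": "X", "name": "CompiledFunctionBackward", "ts": 200, "dur": 300},
        {"ph": "X", "name": "CompiledFunction", "ts": 600, "dur": 80},
        {"ph": "M", "name": "process_name"},  # metadata event, ignored
    ]
}

summary = summarize_trace(sample_trace)
for name, total_us in summary:
    print(f"{name}: {total_us:.0f} us")
```

On a real trace, replacing sample_trace with load_trace("trace.json") should give a rough ranking of where time goes before you open the viewer.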

Flows between CPU and GPU events

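Every kernel on the GPU runs only after code on the CPU has launched it. The profiler can draw connections (i.e. "flows") between GPU and CPU events to show which CPU event launched a given GPU kernel. This is particularly helpful because, with a few exceptions, GPU kernels are launched asynchronously.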

To view a flow connection, click on a GPU kernel and then click "ac2g":

Visualization in the chrome://trace viewer, showing an async flow between a kernel and its launching location.

Alternatively, turn on *all* flows with the "Flow events" dropdown at the top.
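The asynchronous launch behaviour can also be estimated from the trace file itself. The sketch below assumes that flows are stored as pairs of Chrome trace flow events (phase "s" on the CPU side, phase "f" on the GPU side) that share an "id" and use the category "ac2g". That encoding is an assumption about the exported file, so check a real trace before relying on it. launch_delays is a hypothetical helper and the sample events are hand-written.

```python
def launch_delays(trace, flow_cat="ac2g"):
    """Map each flow id to the time (us) between its start and finish events."""
    starts = {}
    delays = {}
    events = sorted(trace.get("traceEvents", []), key=lambda e: e.get("ts", 0))
    for event in events:
        if event.get("cat") != flow_cat:
            continue
        if event.get("ph") == "s":
            # Flow start: recorded where the CPU launches the kernel (assumption).
            starts[event["id"]] = event["ts"]
        elif event.get("ph") == "f" and event["id"] in starts:
            # Flow finish: recorded where the kernel begins on the GPU (assumption).
            delays[event["id"]] = event["ts"] - starts[event["id"]]
    return delays


# Hand-written flow events standing in for a real exported trace.
sample_trace = {
    "traceEvents": [
        {"ph": "s", "cat": "ac2g", "name": "ac2g", "id": 1, "ts": 10},
        {"ph": "s", "cat": "ac2g", "name": "ac2g", "id": 2, "ts": 20},
        {"ph": "f", "cat": "ac2g", "name": "ac2g", "id": 1, "ts": 45},
        {"ph": "f", "cat": "ac2g", "name": "ac2g", "id": 2, "ts": 50},
    ]
}

delays = launch_delays(sample_trace)
print(delays)
```

A large delay suggests the GPU was still busy with earlier work when the kernel was launched. A delay near zero suggests the GPU was waiting on the CPU.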

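Working around CUDA Graph profiling issues

When CUDA graphs are enabled, some CUDA configurations (driver version older than 525.85.12, or CUDA < 12) can run into issues between the profiling tools and CUDA graphs. To work around these issues, add an empty profiling context at the top of your program: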

import torch

torch.profiler._utils._init_for_cuda_graphs()

# ... rest of program

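Understanding compilation time

To understand why compilation takes a long time, you can profile the first invocation of a torch.compile-ed program. Keep in mind that profile traces of compilation can be more distorted than typical profiles, because compilation workloads can be quite different from typical PyTorch workloads. In some cases, trace files can also be very large. Traces larger than 1GB can be difficult to open with the Chrome tracing tool.

Note: roughly the same information is available in non-graphical form from torch._dynamo.utils.compile_times(). This utility will not show when the compilation steps occur, but it will show the amount of time spent on each step, and those times are not affected by any profiling overhead.

See the example below: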

import torch
from torchvision.models import resnet18

model = resnet18().cuda()
inputs = [torch.randn((5, 3, 224, 224), device='cuda') for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

def warmup_compile():
    def fn(x):
        return x.sin().relu()

    x = torch.rand((2, 2), device='cuda', requires_grad=True)
    fn_c = torch.compile(fn)
    out = fn_c(x)
    out.sum().backward()

with torch.profiler.profile() as prof:
    with torch.profiler.record_function("warmup compile"):
        warmup_compile()

    with torch.profiler.record_function("resnet18 compile"):
        fwd_bwd(inputs[0])

prof.export_chrome_trace("trace_compile.json")

A visualization in the chrome://trace viewer, showing dynamo and inductor compilation steps

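Note a few things:

The first invocation should occur *during* profiling so that compilation is captured.
A warm-up compilation is added so that any systems requiring lazy initialization are initialized first.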
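The non-graphical route mentioned in the note above can be sketched as follows. This is a minimal example under stated assumptions: it uses backend="eager" so that it runs on a CPU-only machine without a GPU or a compiler toolchain, which means only the Dynamo steps are timed. With the default Inductor backend the report would be expected to include more steps.

```python
import torch
import torch._dynamo.utils


def fn(x):
    return x.sin().relu()


# Assumption: backend="eager" keeps this sketch runnable without a GPU or a
# C++/Triton toolchain; use the default backend to also time Inductor.
fn_c = torch.compile(fn, backend="eager")

# The first call triggers compilation, which is what compile_times() reports on.
out = fn_c(torch.rand(2, 2))

# With default arguments, compile_times() returns a human-readable string.
report = torch._dynamo.utils.compile_times()
print(report)
```

Because this runs without a profiler attached, the reported times are not affected by profiling overhead.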

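Finding graph breaks: "Torch-Compiled Region" and "CompiledFunction"

Although there are logging tools for identifying graph breaks, the profiler provides a quick visual way to spot them. There are two profiler events to look for: **Torch-Compiled Region** and **CompiledFunction**.

**Torch-Compiled Region** (introduced in PyTorch 2.2) is a profiler event that covers the entire compiled region. Graph breaks almost always look the same: nested "Torch-Compiled Region" events.

If you run two separate functions with torch.compile() applied independently to each, you should generally expect to see two adjacent (i.e. not stacked or nested) Torch-Compiled Regions. If instead you hit graph breaks (or disable()'ed or skipped regions), expect nested "Torch-Compiled Region" events.

**CompiledFunction** (introduced in PyTorch 2.0) is a profiler event that appears when gradients are required for any inputs. Each graph break interrupts a CompiledFunction block, splitting it in two. CompiledFunction events only appear when Autograd is involved, i.e. when some of the input tensors to the graph have requires_grad=True.

When a CompiledFunction appears in a trace, it is typically paired with a CompiledFunctionBackward event in the backward pass. If the backward function is called, a "fwd-bwd link" connecting the two should appear in the trace.

If your use case includes a graph that does not require gradients and does not include "Torch-Compiled Region" events, it can be harder to tell whether torch.compile is being applied correctly. One clue is the presence of Inductor-generated Triton kernels.

See the synthetic example below for a demonstration: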

import torch
import torch._dynamo

class ModelWithBreaks(torch.nn.Module):
    def __init__(self):
        super().__init__()
        def create_sequential():
            return torch.nn.Sequential(
                torch.nn.Linear(128, 128),
                torch.nn.ReLU(),
                torch.nn.Linear(128, 128),
                torch.nn.ReLU(),
            )
        self.mod1 = create_sequential()
        self.mod2 = create_sequential()
        self.mod3 = create_sequential()
        self.mod4 = create_sequential()

    def forward(self, inp):
        mod1 = self.mod1(inp)
        torch._dynamo.graph_break()
        mod2 = self.mod2(mod1)
        torch._dynamo.graph_break()
        mod3 = self.mod3(mod2)
        torch._dynamo.graph_break()
        mod4 = self.mod4(mod3)
        return mod4


model = ModelWithBreaks().cuda()
inputs = [torch.randn((128, 128), device='cuda') for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

# warm up
fwd_bwd(inputs[0])

with torch.profiler.profile() as prof:
    for i in range(1, 4):
        fwd_bwd(inputs[i])
        prof.step()

prof.export_chrome_trace("trace_break.json")

Visualization in the chrome://trace viewer, showing nested Torch-Compiled Region events and multiple CompiledFunction events - indicating graph breaks.
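The nesting pattern described in this section can also be checked programmatically. The following sketch assumes the exported trace stores compiled regions as complete ("X") events whose names start with "Torch-Compiled Region", and that nesting can be inferred from overlapping time ranges on the same pid/tid. nested_regions is a hypothetical helper, and the sample events are hand-written.

```python
from collections import defaultdict

REGION_PREFIX = "Torch-Compiled Region"


def nested_regions(trace):
    """Return (name, ts) for each compiled-region event nested inside another."""
    by_thread = defaultdict(list)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X" and event.get("name", "").startswith(REGION_PREFIX):
            by_thread[(event.get("pid"), event.get("tid"))].append(event)

    nested = []
    for events in by_thread.values():
        # Sort by start time; on ties the longer (outer) event comes first.
        events.sort(key=lambda e: (e["ts"], -e["dur"]))
        open_ends = []  # end timestamps of regions that are still open
        for event in events:
            # Close every region that ended before this one starts.
            while open_ends and open_ends[-1] <= event["ts"]:
                open_ends.pop()
            if open_ends:
                nested.append((event["name"], event["ts"]))
            open_ends.append(event["ts"] + event["dur"])
    return nested


# Hand-written sample: one outer region containing two inner regions
# (a graph-break pattern), followed by a separate, adjacent region.
sample_trace = {
    "traceEvents": [
        {"ph": "X", "name": "Torch-Compiled Region", "pid": 1, "tid": 1, "ts": 0, "dur": 100},
        {"ph": "X", "name": "Torch-Compiled Region", "pid": 1, "tid": 1, "ts": 10, "dur": 20},
        {"ph": "X", "name": "Torch-Compiled Region", "pid": 1, "tid": 1, "ts": 40, "dur": 20},
        {"ph": "X", "name": "Torch-Compiled Region", "pid": 1, "tid": 1, "ts": 200, "dur": 50},
    ]
}

nested = nested_regions(sample_trace)
print(f"{len(nested)} nested region(s) found")
```

A non-empty result is a hint that graph breaks (or disabled or skipped regions) are present. An empty result alone does not prove that the whole model was compiled as one graph.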

Operator kernels

When an operator is launched, we expect to see a few events:

CPU-side event
Kernel launch (if dealing with a GPU kernel)
GPU-side event

Visualization in the chrome://trace viewer, showing the three types of events: CPU-side event, kernel launch, and GPU-side event

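Inductor-generated Triton kernels

1. The CPU-side event should appear as an event prefixed with "triton_". These events currently carry minimal information: the kernel name and a launch, but less than typical aten kernel launches (which include input shapes, types, etc.).
2. The kernel launch should appear as cuLaunchKernel rather than cudaLaunchKernel (cudaLaunchKernel is typical for aten ops).
3. The GPU-side event should appear; how descriptive its name is depends on the Inductor config setting unique_kernel_names.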

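Non-Inductor-generated Triton kernels

The CPU-side event may not appear in traces. The machinery that automatically inserts a profiler event is currently implemented at the Inductor level, so Triton kernels that bypass Inductor may not appear in traces unless users have annotated them manually.
The kernel launch should appear as cuLaunchKernel rather than cudaLaunchKernel (cudaLaunchKernel is typical for aten ops).
The GPU-side event should appear, named similarly to the Triton kernel as it was authored.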

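Inductor-generated CPU kernels

The CPU-side event will not appear in traces; profiling has not been added for this yet.
The kernel launch and GPU-side events do not exist.

Non-Triton kernels (i.e. aten kernels or custom ops) should also be expected to appear in traces from time to time. Sometimes Inductor falls back to the original op implementation, in which case you will see a call to the aten op.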
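The naming conventions above can be turned into a rough tally over an exported trace. This is a heuristic sketch, not a definitive classifier: it assumes GPU kernels are tagged with the category "kernel" in the trace (an assumption about the export format) and otherwise relies only on the "triton_" prefix and the two launch function names mentioned in this section. classify_events is a hypothetical helper and the sample events are hand-written.

```python
from collections import Counter


def classify_events(trace, kernel_cat="kernel"):
    """Count launch calls and Triton-named events in a Chrome trace dict."""
    counts = Counter()
    for event in trace.get("traceEvents", []):
        if event.get("ph") != "X":
            continue
        name = event.get("name", "")
        on_gpu = event.get("cat") == kernel_cat  # assumption: GPU kernels use this category
        if name in ("cuLaunchKernel", "cudaLaunchKernel"):
            # cuLaunchKernel is typical for Triton kernels, cudaLaunchKernel for aten ops.
            counts[name] += 1
        elif on_gpu:
            counts["triton_gpu_kernel" if name.startswith("triton_") else "other_gpu_kernel"] += 1
        elif name.startswith("triton_"):
            counts["triton_cpu_event"] += 1
    return counts


# Hand-written events standing in for a real trace; the kernel names are made up.
sample_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_relu_0", "ts": 0, "dur": 5},
        {"ph": "X", "cat": "cuda_driver", "name": "cuLaunchKernel", "ts": 1, "dur": 2},
        {"ph": "X", "cat": "kernel", "name": "triton_poi_fused_relu_0", "ts": 10, "dur": 4},
        {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 20, "dur": 6},
        {"ph": "X", "cat": "cuda_runtime", "name": "cudaLaunchKernel", "ts": 21, "dur": 2},
        {"ph": "X", "cat": "kernel", "name": "sgemm_kernel", "ts": 30, "dur": 8},
    ]
}

counts = classify_events(sample_trace)
print(dict(counts))
```

A trace with many cudaLaunchKernel calls and few Triton-named kernels may indicate that Inductor is falling back to aten ops more often than expected.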

Launch overhead

One common issue is poor GPU utilization. A quick way to spot it is to look for large gaps between kernels on the GPU:

Visualization in the chrome://trace viewer, showing large gaps between GPU kernels. This indicates that the model is CPU bound, likely due to overhead during kernel launches.

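This is often the result of CPU overhead, e.g. when the time spent on the CPU between kernel launches is larger than the time the GPU spends processing the kernels. The issue is more common with small batch sizes.

When using Inductor, enabling CUDA graphs can often help improve performance when launch overhead is a concern.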
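The gaps between GPU kernels can be quantified from the exported trace. The sketch below assumes GPU kernels appear as complete ("X") events with the category "kernel" (an assumption about the export format), and it treats all streams as a single timeline, which is a simplification. gpu_idle_fraction is a hypothetical helper and the sample events are hand-written.

```python
def gpu_idle_fraction(trace, kernel_cat="kernel"):
    """Return (idle fraction, list of (gap_start, gap_length)) for GPU kernels."""
    kernels = sorted(
        (
            e
            for e in trace.get("traceEvents", [])
            if e.get("ph") == "X" and e.get("cat") == kernel_cat
        ),
        key=lambda e: e["ts"],
    )
    if not kernels:
        return 0.0, []

    gaps = []
    idle = 0
    busy_until = kernels[0]["ts"]
    for kernel in kernels:
        if kernel["ts"] > busy_until:
            # No kernel was running between busy_until and this kernel's start.
            gaps.append((busy_until, kernel["ts"] - busy_until))
            idle += kernel["ts"] - busy_until
        busy_until = max(busy_until, kernel["ts"] + kernel["dur"])

    span = busy_until - kernels[0]["ts"]
    return (idle / span if span else 0.0), gaps


# Hand-written kernel events (timestamps in microseconds).
sample_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "k0", "ts": 0, "dur": 10},
        {"ph": "X", "cat": "kernel", "name": "k1", "ts": 50, "dur": 10},
        {"ph": "X", "cat": "kernel", "name": "k2", "ts": 60, "dur": 20},
        {"ph": "X", "cat": "kernel", "name": "k3", "ts": 100, "dur": 20},
    ]
}

fraction, gaps = gpu_idle_fraction(sample_trace)
print(f"GPU idle for {fraction:.0%} of the kernel timeline; gaps: {gaps}")
```

A high idle fraction over the profiled steps is consistent with the CPU-bound pattern described above. Comparing the figure before and after enabling CUDA graphs gives a rough measure of how much launch overhead was removed.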