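Getting Started¶

Before reading this section, make sure you have read torch.compiler.

Let's start with a simple torch.compile example that shows how to use torch.compile for inference. The example uses torch.cos() and torch.sin(), which are pointwise operators: they operate element by element on a tensor. This example may not show a significant performance gain, but it should give you an intuitive sense of how to use torch.compile in your own programs.

Note

Running this script as written requires at least one GPU on your machine. If you do not have a GPU, remove the .to(device="cuda:0") call in the snippet below and the code will run on the CPU. You can also set the device to xpu:0 to run on Intel® GPUs.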

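import torch

def fn(x):
    # Two pointwise operators applied back to back
    a = torch.cos(x)
    b = torch.sin(a)
    return b

# Compile fn with the TorchInductor backend
new_fn = torch.compile(fn, backend="inductor")
input_tensor = torch.randn(10000).to(device="cuda:0")
# The first call triggers compilation
a = new_fn(input_tensor)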

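A better-known pointwise operator you might want to use is torch.relu(). Pointwise operators are suboptimal in eager mode because each one has to read a tensor from memory, apply some changes, and then write the results back. The single most important optimization that inductor performs is fusion. In the example above, fusion turns 2 reads (x, a) and 2 writes (a, b) into 1 read (x) and 1 write (b). This matters most on newer GPUs, where the bottleneck is memory bandwidth (how quickly data can be moved between GPU memory and the compute units) rather than compute (how quickly the GPU can crunch floating-point operations).

Another major optimization that inductor provides is automatic support for CUDA graphs. CUDA graphs help eliminate the overhead of launching individual kernels from a Python program, which is especially relevant for newer GPUs.

TorchDynamo supports many different backends, but TorchInductor specifically works by generating Triton kernels. Save the example above to a file named example.py. You can then inspect the generated Triton kernel code by running TORCH_COMPILE_DEBUG=1 python example.py. As the script executes, DEBUG messages are printed to the terminal. Near the end of the log you should see the path to a folder whose name contains torchinductor_<your_username>. In that folder you will find the output_code.py file, which contains generated kernel code similar to the following: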
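The read and write counts above translate directly into memory traffic. As a rough worked example, assuming the 10,000-element float32 input from the snippet (4 bytes per element) and ignoring caches:

```python
# Count the bytes moved through memory for the sin(cos(x)) example,
# assuming 10,000 float32 elements (4 bytes each) and no caching.
NUM_ELEMENTS = 10_000
BYTES_PER_ELEMENT = 4  # float32

tensor_bytes = NUM_ELEMENTS * BYTES_PER_ELEMENT

# Eager: cos reads x and writes a; sin reads a and writes b.
eager_bytes = 4 * tensor_bytes  # 2 reads + 2 writes
# Fused: one kernel reads x and writes b; a never leaves the registers.
fused_bytes = 2 * tensor_bytes  # 1 read + 1 write

print(eager_bytes, fused_bytes, eager_bytes / fused_bytes)
# prints: 160000 80000 2.0
```

Under these assumptions fusion halves the memory traffic, which is where the benefit comes from when bandwidth is the bottleneck.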

@pointwise(size_hints=[16384], filename=__file__, triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
   xnumel = 10000
   xoffset = tl.program_id(0) * XBLOCK
   xindex = xoffset + tl.arange(0, XBLOCK)[:]
   xmask = xindex < xnumel
   x0 = xindex
   tmp0 = tl.load(in_ptr0 + (x0), xmask, other=0.0)
   tmp1 = tl.cos(tmp0)
   tmp2 = tl.sin(tmp1)
   tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)

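Note

The snippet above is only an example. Depending on your hardware, the generated code may look different.

You can verify that the fusion of cos and sin actually happened: both operations occur inside a single Triton kernel, and the temporary variables are held in registers, which have very fast access.

For more on Triton's performance, see the Triton project's published materials. Because the kernel code is written in Python, it is fairly easy to understand even if you have not written many CUDA kernels.

Next, let's try a real model, such as resnet50 from PyTorch Hub.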
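To make the structure of the generated kernel easier to follow, here is a pure-Python emulation of the same blocked, masked computation. It is a teaching sketch rather than what Inductor or Triton executes: the function name emulate_fused_kernel and the block size of 1024 are assumptions chosen for illustration, and real kernels process each block in parallel.

```python
import math


def emulate_fused_kernel(in_buf, xblock=1024):
    """Pure-Python emulation of the masked, blocked sin(cos(x)) kernel."""
    xnumel = len(in_buf)
    out_buf = [0.0] * xnumel
    num_programs = (xnumel + xblock - 1) // xblock  # ceiling division
    for program_id in range(num_programs):
        xoffset = program_id * xblock
        for lane in range(xblock):
            xindex = xoffset + lane
            if xindex < xnumel:  # plays the role of xmask
                tmp0 = in_buf[xindex]   # single read from "memory"
                tmp1 = math.cos(tmp0)   # intermediate stays local
                tmp2 = math.sin(tmp1)
                out_buf[xindex] = tmp2  # single write to "memory"
    return out_buf


values = [0.001 * i for i in range(10000)]
result = emulate_fused_kernel(values)
```

The mask matters because 10000 is not a multiple of the block size: the last block covers indices past the end of the buffer, and those lanes must neither load nor store.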
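A debug run can produce several output_code.py files, one per compiled graph. The small helper below, whose name find_output_code is hypothetical, is a sketch for listing them all. It only assumes that the files are named output_code.py and live somewhere under the folder reported in your logs.

```python
from pathlib import Path


def find_output_code(debug_root):
    """Return every output_code.py file below debug_root, sorted by path."""
    return sorted(Path(debug_root).rglob("output_code.py"))
```

Call it with the folder path printed near the end of the log, for example find_output_code(path_from_log), and open the returned files in an editor.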

import torch
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(1,3,64,64))

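Inductor is not the only available backend. Run torch.compiler.list_backends() in a REPL to see all available backends. Try the cudagraphs backend next as inspiration.

Using a pretrained model¶

PyTorch users frequently use pretrained models from transformers or TIMM, and one of the design goals of TorchDynamo and TorchInductor is to work out of the box with any model that people would like to author.

Let's download a pretrained model directly from the HuggingFace Hub and optimize it: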

import torch
from transformers import BertTokenizer, BertModel
# Copy pasted from here https://huggingface.tw/bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
model = torch.compile(model, backend="inductor") # This is the only line of code that we changed
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
output = model(**encoded_input)

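If you remove to(device="cuda:0") from the model and from encoded_input, TorchInductor will generate C++ kernels optimized for running on your CPU instead of Triton kernels. You can inspect both the Triton and the C++ kernels for BERT. They are more complex than the trigonometry example above, but you can skim them in the same way and see whether you understand how PyTorch works.

Similarly, let's try a TIMM example: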

import timm
import torch
model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(64,3,7,7))

Next Steps¶

In this section, we reviewed a few inference examples and developed a basic understanding of how torch.compile works.
