Registering a Dispatched Operator in C++¶

Created: Jul 22, 2020 | Last updated: Jul 22, 2024 | Last verified: Nov 5, 2024

Warning

This tutorial is deprecated as of PyTorch 2.4. For the most up-to-date guide on extending PyTorch with custom operators, see PyTorch Custom Operators.

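The dispatcher is an internal component of PyTorch that is responsible for figuring out what code should actually run when you call a function like torch::add. This can be nontrivial, because PyTorch operations need to handle many cross-cutting concerns that are "layered" on top of one another. Here is a sampling of the things it handles:

Switching between the CPU and CUDA implementations of an operator, depending on the devices of the input tensors.
Switching between the autograd and backend implementations of an operator, depending on whether autograd handling is necessary.
Applying autocasting when necessary for automatic mixed precision.
Applying batching rules when an operator is run under a vmap call.
Tracing the execution of operations, if you are tracing a model for export.

If you find yourself manually writing if statements in your custom operator code to handle these cases, the dispatcher APIs can help organize your code. (Conversely, if your custom operator is very simple and is only used for CPU inference, you probably don't need the dispatcher; just use the basic API.)

In this tutorial, we describe how to structure a custom operator registration so that the dispatcher organizes the various components. We assume that you are familiar with how to register an operator and how to write a custom autograd function.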

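Defining schema and backend implementations¶

The general principle behind the dispatcher is that it divides the implementation of an operator into multiple kernels, each of which implements the functionality for a specific dispatch key, e.g. CPU or CUDA. When you call an operator, the dispatcher determines the highest-priority dispatch key (by looking at both the tensor arguments and some thread-local state) and transfers control to the kernel for that dispatch key. The end effect is that when you call an operator, we first execute the Autograd kernel, and then redispatch to the backend kernel according to the device types of the tensors passed in.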
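
The selection rule described above can be illustrated with a toy model that uses only the C++ standard library. This is a minimal sketch, not PyTorch's implementation: ToyKey, ToyOperator, toy_bit and make_toy_myadd are hypothetical names, the key set is a plain bitmask, and keys without a registered kernel are simply skipped (a simplification of the real fallthrough behavior).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Toy model only: none of these names are PyTorch APIs.
// Keys are ordered from lowest to highest priority.
enum class ToyKey : unsigned { CPU = 0, CUDA = 1, Autograd = 2 };
constexpr std::size_t kNumToyKeys = 3;

// Bit that represents one key inside a key-set bitmask.
constexpr unsigned toy_bit(ToyKey key) {
  return 1u << static_cast<unsigned>(key);
}

using ToyKernel = std::function<std::string()>;

struct ToyOperator {
  std::array<ToyKernel, kNumToyKeys> table;  // one kernel slot per key

  // Register a kernel for a single dispatch key.
  void impl(ToyKey key, ToyKernel kernel) {
    assert(kernel && "kernel must be callable");
    table[static_cast<std::size_t>(key)] = std::move(kernel);
  }

  // Run the kernel of the highest-priority key that is present in key_set
  // and has a kernel registered (assumption: empty slots are skipped).
  std::string call(unsigned key_set) const {
    for (std::size_t i = kNumToyKeys; i-- > 0;) {
      if ((key_set & (1u << i)) != 0 && table[i]) {
        return table[i]();
      }
    }
    return "no kernel";
  }
};

// Builds an operator that only has backend kernels.
ToyOperator make_toy_myadd() {
  ToyOperator op;
  op.impl(ToyKey::CPU, [] { return std::string("cpu kernel"); });
  op.impl(ToyKey::CUDA, [] { return std::string("cuda kernel"); });
  return op;
}
```

In this sketch, calling op.call(toy_bit(ToyKey::CPU)) selects the CPU kernel, and once an Autograd kernel is registered, a key set that contains both CPU and Autograd selects the Autograd kernel first.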

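Let's look at the various parts involved in making this happen. First, we must define the schema for the operator in question. Unlike simple pybind11-style operator registration, we don't provide an implementation of the operator at this point; we only provide a schema string that specifies the type signature of the operator, which all of our other kernels will abide by: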

TORCH_LIBRARY(myops, m) {
  m.def("myadd(Tensor self, Tensor other) -> Tensor");
}

Next, we need to actually provide some implementations of this operator. For concreteness, here is a very simple implementation of addition on CPU:

Tensor myadd_cpu(const Tensor& self_, const Tensor& other_) {
  TORCH_CHECK(self_.sizes() == other_.sizes());
  TORCH_INTERNAL_ASSERT(self_.device().type() == DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(other_.device().type() == DeviceType::CPU);
  Tensor self = self_.contiguous();
  Tensor other = other_.contiguous();
  Tensor result = torch::empty(self.sizes(), self.options());
  const float* self_ptr = self.data_ptr<float>();
  const float* other_ptr = other.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();
  for (int64_t i = 0; i < result.numel(); i++) {
    result_ptr[i] = self_ptr[i] + other_ptr[i];
  }
  return result;
}

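We'd like to register this function as an implementation of myops::myadd. However, the simple way of registering it (def("myadd", myadd_cpu)) would register the kernel to run in all cases, even if the tensor is not a CPU tensor! (Internally, we call these "catch-all" kernels, since they catch all cases.) To ensure that myadd_cpu only runs for CPU tensors, we can use the TORCH_LIBRARY_IMPL macro: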

TORCH_LIBRARY_IMPL(myops, CPU, m) {
  m.impl("myadd", myadd_cpu);
}

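TORCH_LIBRARY_IMPL lets us register implementations of operators on a specific dispatch key (in this case, CPU). Each call to impl associates a CPU kernel with the corresponding operator (which we previously defined in the TORCH_LIBRARY block). If we also have a CUDA implementation myadd_cuda, we can register it in a separate TORCH_LIBRARY_IMPL block: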

TORCH_LIBRARY_IMPL(myops, CUDA, m) {
  m.impl("myadd", myadd_cuda);
}

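These registrations can be split across files or even across library boundaries; for example, you could compile these two TORCH_LIBRARY_IMPL blocks into separate myops_cpu and myops_cuda dynamic libraries. Generally speaking, the structure of your registrations will look like this:

A single TORCH_LIBRARY that lists every custom operator in your namespace in one centralized place.
A TORCH_LIBRARY_IMPL per dispatch key that registers the implementations for that key (e.g., CPU or CUDA). If you like, you can further subdivide TORCH_LIBRARY_IMPL blocks into one block per operator. This is convenient if you have a separate file per operator implementation but don't want to expose the operators in a header; you can simply put the registration in the cpp file that defines the operator.

Note

Did you know that you can also write TORCH_LIBRARY_IMPL blocks for existing core operators in PyTorch? This is how XLA support for PyTorch is implemented: the torch_xla library contains a TORCH_LIBRARY_IMPL that provides implementations for all basic operators on the XLA dispatch key.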

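For operators that do not need autograd¶

Note: This section only applies to PyTorch versions >= 1.10.

In the next section, we discuss how to add autograd support to an operator. For ops that do not need autograd support, however, the following kernel should be registered to improve usability and make your op behave like PyTorch's built-in operators.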

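// Requires: #include <torch/csrc/autograd/autograd_not_implemented_fallback.h>
TORCH_LIBRARY_IMPL(myops, Autograd, m) {
  // `op` is a placeholder for the operator name, e.g. "myadd".
  m.impl(op, torch::autograd::autogradNotImplementedFallback());
}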

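The lines above register an Autograd kernel that appends a dummy NotImplemented node on the forward pass (preserving the requires_grad property of the inputs). On the backward pass, the NotImplemented node raises an error. This helps when debugging larger models, where it was previously hard to pinpoint exactly where the requires_grad property was lost during the forward pass.

In-place or view ops¶

To ensure correctness and the best possible performance, if your op mutates an input in place or returns a tensor that aliases one of the inputs, two additional steps should be taken:

First, register an ADInplaceOrView kernel in addition to the Autograd kernel above. This kernel handles the bookkeeping needed to ensure the correctness of in-place or view operations. Note that this ADInplaceOrView kernel should only be used together with autogradNotImplementedFallback.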

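// Requires: #include <torch/csrc/autograd/autograd_not_implemented_fallback.h>
// `op` is a placeholder for the name of your in-place or view operator.
TORCH_LIBRARY_IMPL(myops, Autograd, m) {
  m.impl(op, torch::autograd::autogradNotImplementedFallback());
}
TORCH_LIBRARY_IMPL(myops, ADInplaceOrView, m) {
  m.impl(op, torch::autograd::autogradNotImplementedInplaceOrViewFallback());
}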

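Second, the Autograd and ADInplaceOrView boxed kernels registered above rely on operator schema information in their logic. If your op mutates an input in place or returns a tensor that aliases one of the inputs, make sure that your schema properly reflects this. See the linked reference for more information on how to annotate the schema.

Adding autograd support¶

At this point, we have an operator with both CPU and CUDA implementations. How can we add autograd support to it? As you might guess, we will register an autograd kernel (similar to what is described in the custom autograd function tutorial)! However, there is a twist: unlike the CPU and CUDA kernels, the autograd kernel needs to redispatch, that is, it needs to call back into the dispatcher to reach the inference kernels, e.g. the CPU or CUDA implementations.

Thus, before we write the autograd kernel, let's write a dispatching function that calls into the dispatcher to find the right kernel for your operator. This function constitutes the public C++ API for your operator. In fact, all of the tensor functions in PyTorch's C++ API call the dispatcher in the same way under the hood. Here is what the dispatching function looks like: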

Tensor myadd(const Tensor& self, const Tensor& other) {
  static auto op = torch::Dispatcher::singleton()
    .findSchemaOrThrow("myops::myadd", "")
    .typed<decltype(myadd)>();
  return op.call(self, other);
}

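Let's break it down:

In the first line, we look up a typed operator handle from the dispatcher that corresponds to the operator we are going to dispatch to. findSchemaOrThrow takes two arguments: the (namespace-qualified) name of the operator, and the overload name of the operator (typically just the empty string). typed casts the dynamically typed handle into a statically typed handle (doing a runtime test to make sure you have given the correct C++ type), so that we can make a normal C++ call on it. We pass it decltype(myadd) because the type of the dispatching function is the same as the type of the underlying kernels registered to the dispatcher.

For performance, this computation is done in a static variable, so that the (slow) lookup only happens once. If you mistyped the name of the operator, the lookup will raise an error the first time you call this function.
In the second line, we simply call the operator handle with all of the arguments passed into the dispatching function. This actually invokes the dispatcher, and control is eventually transferred to whichever kernel is appropriate for this call.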
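
The "look up once, then reuse the handle" pattern can be sketched with the standard library alone. This is an illustrative toy under stated assumptions, not the dispatcher's code: toy_registry, toy_find_or_throw, toy_myadd and toy_lookup_count are hypothetical names, and operators are modeled as plain functions on int.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Toy model only: illustrative names, not PyTorch APIs.
using ToyFn = std::function<int(int, int)>;

int toy_lookup_count = 0;  // counts how often the slow lookup runs

// Name-to-kernel table standing in for the dispatcher's operator registry.
std::map<std::string, ToyFn>& toy_registry() {
  static std::map<std::string, ToyFn> registry{
      {"myops::myadd", [](int a, int b) { return a + b; }}};
  return registry;
}

// "Slow" lookup by name; throws if the name is unknown.
const ToyFn& toy_find_or_throw(const std::string& name) {
  ++toy_lookup_count;
  auto it = toy_registry().find(name);
  if (it == toy_registry().end()) {
    throw std::runtime_error("unknown operator: " + name);
  }
  return it->second;
}

int toy_myadd(int a, int b) {
  // The function-local static is initialized on the first call only;
  // every later call reuses the cached handle. A mistyped name would
  // therefore fail on the first call, not at program start.
  static const ToyFn& op = toy_find_or_throw("myops::myadd");
  assert(op && "handle must be callable");
  return op(a, b);
}
```

Calling toy_myadd repeatedly leaves toy_lookup_count at 1, which is the property the static variable is meant to provide.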

With the dispatching function in hand, we can now write the autograd kernel:

class MyAddFunction : public torch::autograd::Function<MyAddFunction> {
 public:
  static Tensor forward(
      AutogradContext *ctx, torch::Tensor self, torch::Tensor other) {
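    // Exclude autograd from dispatch while redispatching. Newer PyTorch releases
    // deprecate this guard in favor of at::AutoDispatchBelowADInplaceOrView.
    at::AutoNonVariableTypeMode g;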
    return myadd(self, other);
  }

  static tensor_list backward(AutogradContext *ctx, tensor_list grad_outputs) {
    auto grad_output = grad_outputs[0];
    return {grad_output, grad_output};
  }
};

Tensor myadd_autograd(const Tensor& self, const Tensor& other) {
  return MyAddFunction::apply(self, other)[0];
}

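The autograd function is written as usual with torch::autograd::Function, except that instead of writing the implementation directly in forward(), we:

(1) Turn off autograd handling with the at::AutoNonVariableTypeMode RAII guard, and then
(2) Call the dispatching function myadd to call back into the dispatcher.

Without (1), your calls would loop forever (and overflow the stack), because myadd would send you back to this function (the highest-priority dispatch key would still be autograd). With (1), autograd is excluded from the set of dispatch keys under consideration, and we go to the next handler, which will be either CPU or CUDA.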
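
The effect of the guard can be reproduced in a small standard-library-only sketch. All names here (ToyExcludeGuard, tls_excluded, toy_add, toy_add_autograd, toy_add_cpu, toy_trace) are hypothetical, and the thread-local exclusion mask is an assumption made for illustration; it is not a description of PyTorch's internals.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: illustrative names, not PyTorch APIs.
enum ToyKey : unsigned { kCPU = 1u << 0, kAutograd = 1u << 1 };

// Thread-local set of keys that the toy dispatcher must ignore.
thread_local unsigned tls_excluded = 0;

// RAII guard: excludes a key while it is alive, then restores the old state.
class ToyExcludeGuard {
 public:
  explicit ToyExcludeGuard(unsigned key) : saved_(tls_excluded) {
    tls_excluded |= key;
  }
  ~ToyExcludeGuard() { tls_excluded = saved_; }
  ToyExcludeGuard(const ToyExcludeGuard&) = delete;
  ToyExcludeGuard& operator=(const ToyExcludeGuard&) = delete;

 private:
  unsigned saved_;
};

std::vector<std::string> toy_trace;  // records the order in which kernels run

int toy_add(int a, int b, unsigned key_set);  // dispatching function

int toy_add_cpu(int a, int b) {
  toy_trace.push_back("cpu");
  return a + b;
}

int toy_add_autograd(int a, int b, unsigned key_set) {
  toy_trace.push_back("autograd");
  // Without this guard, toy_add would select this kernel again and recurse
  // until the stack overflows.
  ToyExcludeGuard guard(kAutograd);
  return toy_add(a, b, key_set);  // redispatch to the next handler
}

int toy_add(int a, int b, unsigned key_set) {
  const unsigned active = key_set & ~tls_excluded;
  if (active & kAutograd) {
    return toy_add_autograd(a, b, key_set);
  }
  assert((active & kCPU) && "no kernel for this key set");
  return toy_add_cpu(a, b);
}
```

A call such as toy_add(2, 3, kCPU | kAutograd) runs the autograd kernel first and the CPU kernel second, and the exclusion mask is back to its previous value once the guard goes out of scope.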

We can now register this function in the same way we registered the CPU/CUDA functions:

TORCH_LIBRARY_IMPL(myops, Autograd, m) {
  m.impl("myadd", myadd_autograd);
}

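Note

In this example, we register the kernel to Autograd, which installs it as the autograd kernel for all backends. You can also register optimized kernels for specific backends by using the corresponding backend-specific dispatch key (for example, AutogradCPU or AutogradCUDA). To explore these and other dispatch key options in more detail, check out the PythonDispatcher tool provided in torch/_python_dispatcher.py.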



Going beyond autograd¶

In some sense, the dispatcher isn't doing all that much: all it does is implement a glorified if statement, along the lines of this:

class MyAddFunction : ... {
public:
  static Tensor forward(
    AutogradContext *ctx, torch::Tensor self, torch::Tensor other) {

    if (self.device().type() == DeviceType::CPU) {
      return add_cpu(self, other);
    } else if (self.device().type() == DeviceType::CUDA) {
      return add_cuda(self, other);
    } else {
      TORCH_CHECK(0, "Unsupported device ", self.device().type());
    }
  }
  ...
}


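So why use the dispatcher? There are a few reasons:

It is decentralized. You can assemble all of the pieces of an operator (CPU, CUDA, Autograd) without having to write a single, centralized if statement that refers to all of them. Importantly, third parties can register extra implementations for other aspects without having to patch the original definition of the operator. We talk more about extending the dispatcher in Extending dispatcher for a new backend in C++.
It supports more dispatch keys than CPU, CUDA and Autograd. You can see the full list of dispatch keys currently implemented in PyTorch in c10/core/DispatchKey.h. These dispatch keys implement a variety of optional functionality for operators, and if you decide that your custom operator should support this functionality, all you have to do is register a kernel for the appropriate key.
The dispatcher implements support for boxed fallback functions, which can be implemented once and apply to all operators in the system. Boxed fallbacks can be used to provide default behavior for a dispatch key; if you use the dispatcher to implement your operator, you also opt into the fallbacks for all of these operations.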
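
To make the idea of "implemented once, applied to all operators" concrete, here is a minimal standard-library-only sketch for a single dispatch key. It is a toy under stated assumptions: ToyStack, ToyBoxedKernel, ToyDispatcher, toy_log and make_logging_dispatcher are hypothetical names, and "boxed" arguments are modeled as a vector of double rather than PyTorch's real argument stack.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy model only: illustrative names, not PyTorch APIs.
using ToyStack = std::vector<double>;  // type-erased ("boxed") arguments
using ToyBoxedKernel = std::function<void(const std::string&, ToyStack&)>;

struct ToyDispatcher {
  std::map<std::string, ToyBoxedKernel> kernels;  // per-operator kernels
  ToyBoxedKernel fallback;                        // shared by every operator

  // Prefer the operator's own kernel; otherwise use the shared fallback.
  void call(const std::string& op_name, ToyStack& stack) const {
    auto it = kernels.find(op_name);
    if (it != kernels.end()) {
      it->second(op_name, stack);
      return;
    }
    assert(fallback && "no kernel and no fallback registered");
    fallback(op_name, stack);
  }
};

std::vector<std::string> toy_log;  // filled by the logging fallback

ToyDispatcher make_logging_dispatcher() {
  ToyDispatcher d;
  // Written once: works for any operator because it only sees the name
  // and the boxed argument stack.
  d.fallback = [](const std::string& op_name, ToyStack& stack) {
    toy_log.push_back(op_name + "/" + std::to_string(stack.size()));
  };
  // An operator with its own kernel bypasses the fallback.
  d.kernels["myops::special"] = [](const std::string&, ToyStack& stack) {
    stack.clear();
  };
  return d;
}
```

In this sketch, any operator name without its own kernel is handled by the logging fallback, while myops::special keeps its dedicated behavior.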

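Here are some particular dispatch keys that you may need to define an operator for.

Autocast¶

The Autocast dispatch key implements support for automatic mixed precision (AMP). An autocast wrapper kernel typically casts incoming float16 or float32 CUDA tensors to some preferred precision before running the op. For example, matmuls and convolutions on floating-point CUDA tensors usually run faster and use less memory in float16 without impairing convergence. Autocast wrappers only have an effect in autocast-enabled contexts.

Here is an autocast wrapper for a hypothetical custom matmul, along with its registration:
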
// Autocast-specific helper functions
#include <ATen/autocast_mode.h>

Tensor mymatmul_autocast(const Tensor& self, const Tensor& other) {
  c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
  return mymatmul(at::autocast::cached_cast(at::kHalf, self),
                  at::autocast::cached_cast(at::kHalf, other));
}

TORCH_LIBRARY_IMPL(myops, Autocast, m) {
  m.impl("mymatmul", mymatmul_autocast);
}


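cached_cast(kHalf, tensor) casts tensor to float16 if tensor is a float32 CUDA tensor; otherwise, it leaves tensor unchanged (cf. the eligibility policy for natively autocasted ops). This ensures that if the network calls mymatmul on any mixture of float16 and float32 CUDA tensors, mymatmul runs in float16. Meanwhile, calls to mymatmul with non-CUDA, integer-type, or float64 inputs are unaffected. Using cached_cast to follow the native eligibility policy in your own autocast wrapper is recommended, but not required. For example, if you wanted to force float16 execution for all input types, you could use return mymatmul(self.half(), other.half()); instead of cached_cast.

Notice that, like our autograd kernels, we exclude the Autocast key from dispatch before redispatching.

By default, if no autocast wrapper is provided, we fall through directly to the regular operator implementation (no autocasting occurs). (We didn't use myadd for this example, since pointwise addition doesn't need autocasting and should simply fall through.)

When should an autocast wrapper be registered? Unfortunately, there are no cut-and-dried rules for an op's preferred precision. You can get a sense of some native ops' preferred precisions by looking at the cast lists. General guidance:

Ops that do reductions should probably execute in float32,
Any op that does a convolution or gemm under the hood should probably execute in float16, and
Other ops with multiple floating-point tensor inputs should standardize them to a common precision (unless the implementation supports inputs with different precisions).

If your custom op falls into the third category, the promote_type template helps determine the widest floating-point type present among the input tensors, which is the safest choice for the execution type:
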
#include <ATen/autocast_mode.h>

Tensor my_multiple_input_op_autocast(const Tensor& t0, const Tensor& t1) {
  c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
  // The required at::kHalf argument is an optimistic initial guess.
  auto exec_type = at::autocast::promote_type(at::kHalf, t0, t1);
  return my_multiple_input_op(at::autocast::cached_cast(exec_type, t0),
                              at::autocast::cached_cast(exec_type, t1));
}
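
The "start from an optimistic guess, widen if needed" rule used by the wrapper above can be modeled without ATen. The following is a simplified sketch under stated assumptions: ToyDtype, ToyTensor, toy_is_eligible, toy_promote_type and toy_cached_cast are hypothetical stand-ins, and the eligibility rule (CUDA tensors holding Half or Float) is an assumption taken from the description of cached_cast in this section, not a copy of the library's logic.

```cpp
#include <cassert>
#include <initializer_list>

// Toy model only: illustrative stand-ins, not ATen types or functions.
enum class ToyDtype { Int64, Half, Float, Double };

struct ToyTensor {
  ToyDtype dtype;
  bool is_cuda;
};

// Assumed eligibility rule for this sketch: CUDA tensors holding Half or Float.
bool toy_is_eligible(const ToyTensor& t) {
  return t.is_cuda && (t.dtype == ToyDtype::Half || t.dtype == ToyDtype::Float);
}

// Start from an optimistic guess and widen to Float as soon as any eligible
// input is Float. Ineligible inputs do not influence the result.
ToyDtype toy_promote_type(ToyDtype guess,
                          std::initializer_list<ToyTensor> inputs) {
  assert(guess == ToyDtype::Half || guess == ToyDtype::Float);
  ToyDtype current = guess;
  for (const ToyTensor& t : inputs) {
    if (toy_is_eligible(t) && t.dtype == ToyDtype::Float) {
      current = ToyDtype::Float;
    }
  }
  return current;
}

// Cast only eligible tensors; leave everything else untouched.
ToyTensor toy_cached_cast(ToyDtype to, const ToyTensor& t) {
  if (toy_is_eligible(t)) {
    return ToyTensor{to, t.is_cuda};
  }
  return t;
}
```

With these definitions, a Half and a Float CUDA input promote to Float, two Half CUDA inputs stay at Half, and CPU or integer inputs pass through toy_cached_cast unchanged.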


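If your custom op is autograd-enabled, you only need to write and register an autocast wrapper for the same name under which the autograd wrapper is registered. For example, if you wanted an autocast wrapper for the myadd function shown in the autograd section, all you would need is:
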
Tensor myadd_autocast(const Tensor& self, const Tensor& other) {
  c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
  return myadd(at::autocast::cached_cast(<desired dtype>, self),
               at::autocast::cached_cast(<desired dtype>, other));
}

TORCH_LIBRARY_IMPL(myops, Autocast, m) {
  m.impl("myadd", myadd_autocast);
}


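No separate gymnastics are needed to make the backward method autocast compatible. However, the backward method defined in your custom autograd function will run in the same dtype that autocast sets for the forward method, so you should choose a <desired dtype> that is suitable for both your forward and backward methods.

Batched¶

Batched tensors allow you to write your code in a per-example manner and then have it batched automatically when run under a vmap call. The API for writing batching rules is currently under development, but once it is stabilized, you will be able to add vmap support for your operators by registering a kernel at the Batched dispatch key.

Tracer¶

The Tracer dispatch key implements support for recording operator invocations into a trace when you run torch.jit.trace. We intend to provide a boxed fallback that implements tracing for arbitrary operations; see issue #41478 to track progress.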
