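Automatic Mixed Precision package - torch.amp

torch.amp provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use a lower-precision floating-point datatype (lower_precision_fp): torch.float16 (half) or torch.bfloat16. Some ops, like linear layers and convolutions, are much faster in lower_precision_fp. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype.

Ordinarily, "automatic mixed precision training" with datatype torch.float16 uses torch.autocast and torch.amp.GradScaler together, as shown in the Automatic Mixed Precision examples and the Automatic Mixed Precision recipe. However, torch.autocast and torch.GradScaler are modular, and may be used separately if desired. As shown in the CPU example section of torch.autocast, "automatic mixed precision training/inference" on CPU with datatype torch.bfloat16 uses only torch.autocast.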

Warning

torch.cuda.amp.autocast(args...) and torch.cpu.amp.autocast(args...) will be deprecated. Please use torch.autocast("cuda", args...) or torch.autocast("cpu", args...) instead. torch.cuda.amp.GradScaler(args...) and torch.cpu.amp.GradScaler(args...) will be deprecated. Please use torch.GradScaler("cuda", args...) or torch.GradScaler("cpu", args...) instead.

torch.autocast and torch.cpu.amp.autocast are new in version 1.10.

Autocasting
Gradient Scaling
Autocast Op Reference

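Autocasting

torch.amp.autocast_mode.is_autocast_available(device_type)

Return a bool indicating whether autocast is available on device_type.

Parameters: device_type (str) – Device type to use. Possible values are: 'cuda', 'cpu', 'xpu' and so on. The type is the same as the type attribute of a torch.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.
Return type: bool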

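class torch.autocast(device_type, dtype=None, enabled=True, cache_enabled=None)

Instances of autocast serve as context managers or decorators that allow regions of your script to run in mixed precision.

In these regions, ops run in an op-specific dtype chosen by autocast to improve performance while maintaining accuracy. See the Autocast Op Reference for details.

When entering an autocast-enabled region, Tensors may be any type. You should not call half() or bfloat16() on your model(s) or inputs when using autocasting.

autocast should wrap only the forward pass(es) of your network, including the loss computation(s). Backward passes under autocast are not recommended. Backward ops run in the same type that autocast used for the corresponding forward ops.

Example for CUDA devices: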

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with torch.autocast(device_type="cuda"):
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before backward()
    loss.backward()
    optimizer.step()

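See the Automatic Mixed Precision examples for usage (along with gradient scaling) in more complex scenarios (e.g., gradient penalty, multiple models/losses, custom autograd functions).

autocast can also be used as a decorator, e.g., on the forward method of your model: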

class AutocastModel(nn.Module):
    ...
    @torch.autocast(device_type="cuda")
    def forward(self, input):
        ...
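
The decorator form can be tried without a GPU. The following is a minimal sketch on CPU with torch.bfloat16; TinyModel and its layer sizes are hypothetical and chosen only for illustration.

```python
import torch
from torch import nn


class TinyModel(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    # Every call to forward runs inside an autocast-enabled region.
    @torch.autocast(device_type="cpu", dtype=torch.bfloat16)
    def forward(self, x):
        return self.fc(x)


model = TinyModel()
x = torch.randn(3, 4)  # float32 input; no manual cast needed
out = model(x)

# The parameters stay in float32; only the op's execution dtype changes.
print(out.dtype, model.fc.weight.dtype)
```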

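Floating-point Tensors produced in an autocast-enabled region may be float16. After returning to an autocast-disabled region, using them with floating-point Tensors of different dtypes may cause type mismatch errors. If so, cast the Tensor(s) produced in the autocast region back to float32 (or another dtype if desired). If a Tensor from the autocast region is already float32, the cast is a no-op and incurs no additional overhead.

CUDA example: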

# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with torch.autocast(device_type="cuda"):
    # torch.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = torch.mm(a_float32, b_float32)
    # Also handles mixed input types
    f_float16 = torch.mm(d_float32, e_float16)

# After exiting autocast, calls f_float16.float() to use with d_float32
g_float32 = torch.mm(d_float32, f_float16.float())
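
The same pattern can be sketched on CPU, where the lower-precision dtype is torch.bfloat16 rather than float16. This is an illustrative analogue of the CUDA example above, not part of the original reference; the tensor names are assumptions.

```python
import torch

a_float32 = torch.rand(8, 8)
b_float32 = torch.rand(8, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # torch.mm is on the CPU list of ops that autocast to bfloat16.
    c_bfloat16 = torch.mm(a_float32, b_float32)

# Outside the region, cast back before mixing with float32 tensors.
d_float32 = torch.mm(a_float32, c_bfloat16.float())

# Calling .float() on a tensor that is already float32 does not copy the data.
already_float32 = a_float32.float()
```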

CPU training example:

# Creates model and optimizer in default precision
model = Net()
optimizer = optim.SGD(model.parameters(), ...)

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            output = model(input)
            loss = loss_fn(output, target)

        loss.backward()
        optimizer.step()

CPU inference example:

# Creates model in default precision
model = Net().eval()

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    for input in data:
        # Runs the forward pass with autocasting.
        output = model(input)

CPU inference example with JIT trace:

class TestModel(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, num_classes)
    def forward(self, x):
        return self.fc1(x)

input_size = 2
num_classes = 2
model = TestModel(input_size, num_classes).eval()

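# For now, we suggest disabling the JIT autocast pass because of this issue:
# https://github.com/pytorch/pytorch/issues/75956
torch._C._jit_set_autocast_mode(False)

# torch.cpu.amp.autocast is deprecated; torch.autocast("cpu", ...) is the replacement.
with torch.autocast(device_type="cpu", cache_enabled=False):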
    model = torch.jit.trace(model, torch.randn(1, input_size))
model = torch.jit.freeze(model)
# Models Run
for _ in range(3):
    model(torch.randn(1, input_size))

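Type mismatch errors inside an autocast-enabled region are a bug; if this is what you observe, please file an issue.

autocast(enabled=False) subregions can be nested in autocast-enabled regions. Locally disabling autocast can be useful, for example, if you want to force a subregion to run in a particular dtype. Disabling autocast gives you explicit control over the execution type. In the subregion, inputs from the surrounding region should be cast to that dtype before use: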

# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with torch.autocast(device_type="cuda"):
    e_float16 = torch.mm(a_float32, b_float32)
    with torch.autocast(device_type="cuda", enabled=False):
        # Calls e_float16.float() to ensure float32 execution
        # (necessary because e_float16 was created in an autocasted region)
        f_float32 = torch.mm(c_float32, e_float16.float())

    # No manual casts are required when re-entering the autocast-enabled region.
    # torch.mm again runs in float16 and produces float16 output, regardless of input types.
    g_float16 = torch.mm(d_float32, f_float32)
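
For readers without a GPU, the nesting behavior can be sketched on CPU with torch.bfloat16. This is an illustrative analogue of the CUDA example above under that assumption; the short tensor names are not from the original.

```python
import torch

a = torch.rand(8, 8)
b = torch.rand(8, 8)
c = torch.rand(8, 8)
d = torch.rand(8, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    e = torch.mm(a, b)  # runs in bfloat16
    with torch.autocast(device_type="cpu", enabled=False):
        # Autocast is off here, so cast explicitly to get float32 execution.
        f = torch.mm(c, e.float())
    # Back in the enabled region, the float32 input is autocast again.
    g = torch.mm(d, f)

print(e.dtype, f.dtype, g.dtype)
```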

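The autocast state is thread-local. If you want it enabled in a new thread, the context manager or decorator must be invoked in that thread. This affects torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel when used with more than one GPU per process (see Working with Multiple GPUs).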
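
Thread-locality can be observed with plain Python threads. The sketch below assumes CPU autocast with torch.bfloat16 and uses hypothetical worker functions; it records the dtype each thread sees for the same torch.mm call.

```python
import threading

import torch

a = torch.rand(4, 4)
b = torch.rand(4, 4)
results = {}


def worker_without_autocast():
    # The new thread does not inherit the caller's autocast state.
    results["plain_thread"] = torch.mm(a, b).dtype


def worker_with_autocast():
    # Autocast must be entered again inside the thread.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        results["autocast_thread"] = torch.mm(a, b).dtype


with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    results["main"] = torch.mm(a, b).dtype
    threads = [
        threading.Thread(target=worker_without_autocast),
        threading.Thread(target=worker_with_autocast),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print(results)
```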

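Parameters

device_type (str, required) – Device type to use. Possible values are: 'cuda', 'cpu', 'xpu' and 'hpu'. The type is the same as the type attribute of a torch.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.
enabled (bool, optional) – Whether autocasting should be enabled in the region. Default: True
dtype (torch_dtype, optional) – Data type for ops run in autocast. If dtype is None, the default value given by get_autocast_dtype() is used (torch.float16 for CUDA and torch.bfloat16 for CPU). Default: None
cache_enabled (bool, optional) – Whether the weight cache inside autocast should be enabled. Default: True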
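
One common use of the enabled parameter is to switch mixed precision on or off with a single runtime flag, leaving the rest of the code unchanged. A minimal sketch on CPU, where matmul_dtype and use_amp are hypothetical names:

```python
import torch

a = torch.rand(4, 4)
b = torch.rand(4, 4)


def matmul_dtype(use_amp: bool) -> torch.dtype:
    # With enabled=False the context manager leaves ops untouched.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=use_amp):
        return torch.mm(a, b).dtype


dtype_on = matmul_dtype(True)
dtype_off = matmul_dtype(False)
print(dtype_on, dtype_off)
```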

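torch.amp.custom_fwd(fwd=None, *, device_type, cast_inputs=None)

Create a helper decorator for forward methods of custom autograd functions.

Autograd functions are subclasses of torch.autograd.Function. See the example page for more detail.

Parameters

device_type (str) – Device type to use: 'cuda', 'cpu', 'xpu' and so on. The type is the same as the type attribute of a torch.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.
cast_inputs (torch.dtype or None, optional, default=None) – If not None, when forward runs in an autocast-enabled region, casts incoming floating-point Tensors to the target dtype (non-floating-point Tensors are not affected), then executes forward with autocast disabled. If None, forward's internal ops execute with the current autocast state.

Note

If the decorated forward is called outside an autocast-enabled region, custom_fwd is a no-op and cast_inputs has no effect.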

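torch.amp.custom_bwd(bwd=None, *, device_type)

Create a helper decorator for backward methods of custom autograd functions.

Autograd functions are subclasses of torch.autograd.Function. The decorator ensures that backward executes with the same autocast state as forward. See the example page for more detail.

Parameters: device_type (str) – Device type to use: 'cuda', 'cpu', 'xpu' and so on. The type is the same as the type attribute of a torch.device. Thus, you may obtain the device type of a tensor using Tensor.device.type.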

class torch.cuda.amp.autocast(enabled=True, dtype=torch.float16, cache_enabled=True)

See torch.autocast.

torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast("cuda", args...) instead.

torch.cuda.amp.custom_fwd(fwd=None, *, cast_inputs=None): torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.

torch.cuda.amp.custom_bwd(bwd): torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.

class torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16, cache_enabled=True)

See torch.autocast. torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast("cpu", args...) instead.

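Gradient Scaling

If the forward pass for a particular op has float16 inputs, the backward pass for that op will produce float16 gradients. Gradient values with small magnitudes may not be representable in float16. These values will flush to zero ("underflow"), so the update for the corresponding parameters will be lost.

To prevent underflow, "gradient scaling" multiplies the network's loss(es) by a scale factor and invokes a backward pass on the scaled loss(es). Gradients flowing backward through the network are then scaled by the same factor. In other words, gradient values have a larger magnitude, so they do not flush to zero.

Each parameter's gradient (.grad attribute) should be unscaled before the optimizer updates the parameters, so the scale factor does not interfere with the learning rate.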
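
The underflow and its remedy can be reproduced with a single number. This sketch is illustrative arithmetic only (it does not use GradScaler); the value 1e-8 and the scale 65536 are assumptions picked so that the effect is visible.

```python
import torch

scale = 65536.0
tiny_grad = torch.tensor(1e-8, dtype=torch.float32)

# Cast directly to float16: the value is below the smallest float16
# subnormal and flushes to zero.
unscaled_fp16 = tiny_grad.to(torch.float16)

# Scale first, then cast: the value is now representable in float16.
scaled_fp16 = (tiny_grad * scale).to(torch.float16)

# Unscale in float32 before the optimizer would use the gradient.
recovered = scaled_fp16.to(torch.float32) / scale

print(unscaled_fp16.item(), scaled_fp16.item(), recovered.item())
```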

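Note

AMP/fp16 may not work for every model! For example, most bf16-pretrained models cannot operate in the fp16 numerical range, whose maximum is 65504, and will cause gradients to overflow instead of underflow. In this case, the scale factor may decrease below 1 in an attempt to bring gradients to a number representable in the fp16 dynamic range. While one might expect the scale to always stay above 1, GradScaler does not make this guarantee, so that performance can be maintained. If you encounter NaNs in your loss or gradients when running with AMP/fp16, verify that your model is compatible.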
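
The range limit mentioned in this note is easy to check directly. A small sketch, where the value 70000 is an arbitrary number just above the fp16 maximum:

```python
import torch

fp16_max = torch.finfo(torch.float16).max

big = torch.tensor(70000.0)
as_fp16 = big.to(torch.float16)   # exceeds the fp16 range and becomes inf
as_bf16 = big.to(torch.bfloat16)  # bf16 has a wider range, so it stays finite

print(fp16_max, as_fp16.item(), as_bf16.item())
```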

class torch.cuda.amp.GradScaler(init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True)

See torch.amp.GradScaler. torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler("cuda", args...) instead.

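Autocast Op Reference

Op Eligibility

Ops that run in float64 or non-floating-point dtypes are not eligible, and will run in these types whether or not autocast is enabled.

Only out-of-place ops and Tensor methods are eligible. In-place variants and calls that explicitly supply an out=... Tensor are allowed in autocast-enabled regions, but will not go through autocasting. For example, in an autocast-enabled region a.addmm(b, c) can autocast, but a.addmm_(b, c) and a.addmm(b, c, out=d) cannot. For best performance and stability, prefer out-of-place ops in autocast-enabled regions.

Ops called with an explicit dtype=... argument are not eligible, and will produce output that respects the dtype argument.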
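
Some of these eligibility rules can be observed directly. The sketch below uses CPU autocast with torch.bfloat16 (an assumption, so that it runs without a GPU) and compares the out-of-place, in-place, and out= variants of addmm, plus a float64 matrix multiply.

```python
import torch

a = torch.rand(4, 4)
b = torch.rand(4, 4)
c = torch.rand(4, 4)
d = torch.empty(4, 4)  # preallocated float32 output
a64 = a.double()

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out_of_place = a.addmm(b, c)            # eligible: autocast to bfloat16
    in_place = a.clone().addmm_(b, c)       # in-place: not autocast
    with_out = torch.addmm(a, b, c, out=d)  # out=...: not autocast
    double_result = torch.mm(a64, a64)      # float64: not eligible

print(out_of_place.dtype, in_place.dtype, with_out.dtype, double_result.dtype)
```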

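CUDA Op-Specific Behavior

The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting whether they are invoked as part of a torch.nn.Module, as a function, or as a torch.Tensor method. If functions are exposed in multiple namespaces, they go through autocasting regardless of the namespace.

Ops not listed below do not go through autocasting. They run in the type defined by their inputs. However, autocasting may still change the type in which unlisted ops run if they are downstream from autocasted ops.

If an op is unlisted, we assume it is numerically stable in float16. If you believe an unlisted op is numerically unstable in float16, please file an issue.

CUDA Ops that can autocast to `float16`

__matmul__, addbmm, addmm, addmv, addr, baddbmm, bmm, chain_matmul, multi_dot, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, GRUCell, linear, LSTMCell, matmul, mm, mv, prelu, RNNCell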

CUDA Ops that can autocast to `float32`

__pow__, __rdiv__, __rpow__, __rtruediv__, acos, asin, binary_cross_entropy_with_logits, cosh, cosine_embedding_loss, cdist, cosine_similarity, cross_entropy, cumprod, cumsum, dist, erfinv, exp, expm1, group_norm, hinge_embedding_loss, kl_div, l1_loss, layer_norm, log, log_softmax, log10, log1p, log2, margin_ranking_loss, mse_loss, multilabel_margin_loss, multi_margin_loss, nll_loss, norm, normalize, pdist, poisson_nll_loss, pow, prod, reciprocal, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softmax, softmin, softplus, sum, renorm, tan, triplet_margin_loss

CUDA Ops that promote to the widest input type

These ops do not require a particular dtype for stability, but take multiple inputs and require that the inputs' dtypes match. If all of the inputs are float16, the op runs in float16. If any of the inputs is float32, autocast casts all inputs to float32 and runs the op in float32.

addcdiv, addcmul, atan2, bilinear, cross, dot, grid_sample, index_put, scatter_add, tensordot

Some ops not listed here (e.g., binary ops like add) natively promote inputs without autocasting's intervention. If inputs are a mixture of float16 and float32, these ops run in float32 and produce float32 output, regardless of whether autocast is enabled.

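Prefer `binary_cross_entropy_with_logits` over `binary_cross_entropy`

The backward passes of torch.nn.functional.binary_cross_entropy() (and torch.nn.BCELoss, which wraps it) can produce gradients that are not representable in float16. In autocast-enabled regions, the forward input may be float16, which means the backward gradient must be representable in float16 (autocasting float16 forward inputs to float32 does not help, because that cast must be reversed in backward). Therefore, binary_cross_entropy and BCELoss raise an error in autocast-enabled regions.

Many models use a sigmoid layer right before the binary cross entropy layer. In this case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits() or torch.nn.BCEWithLogitsLoss. binary_cross_entropy_with_logits and BCEWithLogitsLoss are safe to autocast.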
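
The replacement is a drop-in change because the fused function computes the same quantity as sigmoid followed by binary cross entropy. The sketch below checks that equivalence in plain float32, outside any autocast region; the random logits and targets are illustrative data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 1)                    # raw model outputs, before sigmoid
target = torch.randint(0, 2, (8, 1)).float()  # binary labels as floats

# Two-step form: sigmoid layer followed by binary cross entropy.
loss_two_step = F.binary_cross_entropy(torch.sigmoid(logits), target)

# Fused form recommended for autocast-enabled regions.
loss_fused = F.binary_cross_entropy_with_logits(logits, target)

print(loss_two_step.item(), loss_fused.item())
```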

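XPU Op-Specific Behavior (Experimental)

The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting whether they are invoked as part of a torch.nn.Module, as a function, or as a torch.Tensor method. If functions are exposed in multiple namespaces, they go through autocasting regardless of the namespace.

Ops not listed below do not go through autocasting. They run in the type defined by their inputs. However, autocasting may still change the type in which unlisted ops run if they are downstream from autocasted ops.

If an op is unlisted, we assume it is numerically stable in float16. If you believe an unlisted op is numerically unstable in float16, please file an issue.

XPU Ops that can autocast to `float16`

addbmm, addmm, addmv, addr, baddbmm, bmm, chain_matmul, multi_dot, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, GRUCell, linear, LSTMCell, matmul, mm, mv, RNNCell

XPU Ops that can autocast to `float32`

__pow__, __rdiv__, __rpow__, __rtruediv__, binary_cross_entropy_with_logits, cosine_embedding_loss, cosine_similarity, cumsum, dist, exp, group_norm, hinge_embedding_loss, kl_div, l1_loss, layer_norm, log, log_softmax, margin_ranking_loss, nll_loss, normalize, poisson_nll_loss, pow, reciprocal, rsqrt, soft_margin_loss, softmax, softmin, sum, triplet_margin_loss

XPU Ops that promote to the widest input type

These ops do not require a particular dtype for stability, but take multiple inputs and require that the inputs' dtypes match. If all of the inputs are float16, the op runs in float16. If any of the inputs is float32, autocast casts all inputs to float32 and runs the op in float32.

bilinear, cross, grid_sample, index_put, scatter_add, tensordot

Some ops not listed here (e.g., binary ops like add) natively promote inputs without autocasting's intervention. If inputs are a mixture of float16 and float32, these ops run in float32 and produce float32 output, regardless of whether autocast is enabled.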

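CPU Op-Specific Behavior

The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting whether they are invoked as part of a torch.nn.Module, as a function, or as a torch.Tensor method. If functions are exposed in multiple namespaces, they go through autocasting regardless of the namespace.

Ops not listed below do not go through autocasting. They run in the type defined by their inputs. However, autocasting may still change the type in which unlisted ops run if they are downstream from autocasted ops.

If an op is unlisted, we assume it is numerically stable in bfloat16. If you believe an unlisted op is numerically unstable in bfloat16, please file an issue.

CPU Ops that can autocast to `bfloat16`

conv1d, conv2d, conv3d, bmm, mm, baddbmm, addmm, addbmm, linear, matmul, _convolution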

CPU Ops that can autocast to `float32`

conv_transpose1d, conv_transpose2d, conv_transpose3d, avg_pool3d, binary_cross_entropy, grid_sampler, grid_sampler_2d, _grid_sampler_2d_cpu_fallback, grid_sampler_3d, polar, prod, quantile, nanquantile, stft, cdist, trace, view_as_complex, cholesky, cholesky_inverse, cholesky_solve, inverse, lu_solve, orgqr, ormqr, pinverse, max_pool3d, max_unpool2d, max_unpool3d, adaptive_avg_pool3d, reflection_pad1d, reflection_pad2d, replication_pad1d, replication_pad2d, replication_pad3d, mse_loss, ctc_loss, kl_div, multilabel_margin_loss, fft_fft, fft_ifft, fft_fft2, fft_ifft2, fft_fftn, fft_ifftn, fft_rfft, fft_irfft, fft_rfft2, fft_irfft2, fft_rfftn, fft_irfftn, fft_hfft, fft_ihfft, linalg_matrix_norm, linalg_cond, linalg_matrix_rank, linalg_solve, linalg_cholesky, linalg_svdvals, linalg_eigvals, linalg_eigvalsh, linalg_inv, linalg_householder_product, linalg_tensorinv, linalg_tensorsolve, fake_quantize_per_tensor_affine, eig, geqrf, lstsq, _lu_with_info, qr, solve, svd, symeig, triangular_solve, fractional_max_pool2d, fractional_max_pool3d, adaptive_max_pool3d, multilabel_margin_loss_forward, linalg_qr, linalg_cholesky_ex, linalg_svd, linalg_eig, linalg_eigh, linalg_lstsq, linalg_inv_ex

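CPU Ops that promote to the widest input type

These ops do not require a particular dtype for stability, but take multiple inputs and require that the inputs' dtypes match. If all of the inputs are bfloat16, the op runs in bfloat16. If any of the inputs is float32, autocast casts all inputs to float32 and runs the op in float32.

cat, stack, index_copy

Some ops not listed here (e.g., binary ops like add) natively promote inputs without autocasting's intervention. If inputs are a mixture of bfloat16 and float32, these ops run in float32 and produce float32 output, regardless of whether autocast is enabled.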
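
Both kinds of promotion can be checked with mixed bfloat16 and float32 inputs. A minimal sketch with illustrative tensor names:

```python
import torch

x_bf16 = torch.rand(2, 3).to(torch.bfloat16)
y_f32 = torch.rand(2, 3)

# add promotes natively, with or without autocast.
sum_plain = x_bf16 + y_f32

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    sum_autocast = x_bf16 + y_f32
    # cat is on the CPU "promote to the widest input type" list.
    cat_autocast = torch.cat([x_bf16, y_f32])

print(sum_plain.dtype, sum_autocast.dtype, cat_autocast.dtype)
```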
