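Quantization

Warning

Quantization is in beta and subject to change.

Introduction to Quantization

Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. A quantized model executes some or all of its operations on tensors with reduced precision rather than full precision (floating point) values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization which, compared to typical FP32 models, allows for a 4x reduction in model size and a 4x reduction in memory bandwidth requirements. On hardware that supports it, INT8 computation is typically 2 to 4 times faster than FP32 computation. Quantization is primarily a technique to speed up inference, and quantized operators support only the forward pass.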
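
The 4x figure follows from the storage width alone: an FP32 value occupies 4 bytes and an INT8 value occupies 1 byte. The following is a back-of-the-envelope sketch in plain Python, assuming a hypothetical model with 25 million parameters and ignoring the small overhead of the stored quantization parameters.

```python
# Rough storage estimate; the parameter count is a made-up illustration.
BYTES_FP32 = 4
BYTES_INT8 = 1

def model_size_mb(num_params, bytes_per_param):
    """Return the parameter storage in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6

num_params = 25_000_000  # hypothetical model size
fp32_mb = model_size_mb(num_params, BYTES_FP32)
int8_mb = model_size_mb(num_params, BYTES_INT8)
reduction = fp32_mb / int8_mb
print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB, reduction: {reduction:.0f}x")
```

The same ratio applies to the bytes moved per weight read, which is where the memory bandwidth saving comes from.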

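PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32 and then converted to INT8. In addition, PyTorch supports quantization aware training, which models quantization errors in both the forward and backward passes using fake-quantization modules. Note that the entire computation is carried out in floating point. At the end of quantization aware training, PyTorch provides conversion functions to convert the trained model into lower precision.

At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs provide typical workflows for converting an FP32 model to lower precision with minimal accuracy loss.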

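Quantization API Summary

PyTorch provides three different modes of quantization: Eager Mode Quantization, FX Graph Mode Quantization (maintenance) and PyTorch 2 Export Quantization.

Eager Mode Quantization is a beta feature. Users need to do fusion and specify where quantization and dequantization happen manually, and it only supports modules, not functionals.

FX Graph Mode Quantization is an automated quantization workflow in PyTorch. It is currently a prototype feature and is in maintenance mode now that PyTorch 2 Export Quantization is available. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although users might need to refactor the model to make it compatible with FX Graph Mode Quantization (that is, symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. It is being integrated into domain libraries such as torchvision, so users can quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we provide general guidelines, but to actually make it work users might need to be familiar with torch.fx, especially with how to make a model symbolically traceable.

PyTorch 2 Export Quantization is the new full graph mode quantization workflow, released as a prototype feature in PyTorch 2.1. With PyTorch 2, we are moving to a better solution for full program capture (torch.export), since it can capture a higher percentage of models (88.8% on 14K models) than torch.fx.symbolic_trace (72.7% on 14K models), the program capture solution used by FX Graph Mode Quantization. torch.export still has limitations around some Python constructs and requires user involvement to support dynamism in the exported model, but overall it is an improvement over the previous program capture solution. PyTorch 2 Export Quantization is built for models captured by torch.export, with the flexibility and productivity of both modeling users and backend developers in mind. The main features are: (1) a programmable API for configuring how a model is quantized, which can scale to many more use cases; (2) a simplified user experience, in which modeling users and backend developers interact with a single object (Quantizer) to express both the user's intention about how to quantize a model and what the backend supports; (3) an optional reference quantized model representation that expresses quantized computation with integer operations and maps more closely to the quantized computation that happens in hardware.

New users of quantization are encouraged to try out PyTorch 2 Export Quantization first; if it does not work well, they can try Eager Mode Quantization.

The following table compares the differences between Eager Mode Quantization, FX Graph Mode Quantization and PyTorch 2 Export Quantization: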

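	Eager Mode Quantization	FX Graph Mode Quantization	PyTorch 2 Export Quantization
Release Status	beta	prototype (maintenance)	prototype
Operator Fusion	Manual	Automatic	Automatic
Quant/DeQuant Placement	Manual	Automatic	Automatic
Quantizing Modules	Supported	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic	Supported
Support for Customization	Limited Support	Fully Supported	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Defined by Backend Specific Quantizer
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactors to make the model compatible with FX Graph Mode Quantization)	`torch.fx.GraphModule` (captured by `torch.export`)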

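There are three types of quantization supported:

dynamic quantization (weights quantized, with activations read/stored in floating point and quantized for compute)
static quantization (weights quantized, activations quantized, calibration required post training)
static quantization aware training (weights quantized, activations quantized, quantization numerics modeled during training)

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.

Operator coverage varies between dynamic and static quantization and is captured in the table below.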

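	Static Quantization	Dynamic Quantization
nn.Linear	Y	Y
nn.Conv1d/2d/3d	Y	N
nn.LSTM	Y (through custom modules)	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y (activations are in fp32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Y (through custom modules)	Not supported
Activations	Broadly supported	Un-changed, computations stay in fp32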

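Eager Mode Quantization

For a general introduction to the quantization flow, including different types of quantization, please take a look at General Quantization Flow.

Post Training Dynamic Quantization

This is the simplest form of quantization to apply: the weights are quantized ahead of time, while the activations are dynamically quantized during inference. It is suited to situations where model execution time is dominated by loading weights from memory rather than by computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch sizes.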
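
To make the "weights ahead of time, activations on the fly" split concrete, here is a minimal pure-Python sketch of a single dynamically quantized dot product. It assumes symmetric quantization (zero_point fixed at 0) for brevity, and the helper names `quantize_symmetric` and `dynamic_linear` are hypothetical; it illustrates the idea and is not how the PyTorch kernels are implemented.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers with a single scale (zero_point = 0)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

# Ahead of time: the weight row is quantized once and stored as integers.
weight_fp32 = [0.5, -1.0, 0.25, 0.75]
weight_q, weight_scale = quantize_symmetric(weight_fp32)

def dynamic_linear(activation_fp32):
    # At inference time: the activation range is measured on the fly.
    act_q, act_scale = quantize_symmetric(activation_fp32)
    # Integer multiply-accumulate, then one rescale back to floating point.
    acc = sum(a * w for a, w in zip(act_q, weight_q))
    return acc * act_scale * weight_scale

x = [1.0, 2.0, -0.5, 0.1]
y_ref = sum(a * w for a, w in zip(x, weight_fp32))
y_dyn = dynamic_linear(x)
print(f"fp32: {y_ref:.4f}, dynamic int8: {y_dyn:.4f}")
```

The input and output stay in floating point, which is why no calibration dataset is needed: the activation scale is recomputed for every call.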

Diagram

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# dynamically quantized model
# linear and LSTM weights are in int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8

PTDQ API Example

import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)

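To learn more about dynamic quantization, please see our dynamic quantization tutorial.

Post Training Static Quantization

Post Training Static Quantization (PTQ static) quantizes the weights and activations of the model. It fuses activations into preceding layers where possible. It requires calibration with a representative dataset to determine optimal quantization parameters for activations. Post Training Static Quantization is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case.

We may need to modify the model before applying post training static quantization. Please see Model Preparation for Eager Mode Static Quantization.

Diagram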

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                    /
    linear_weight_fp32

# statically quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                    /
  linear_weight_int8

PTSQ API Example

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

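To learn more about static quantization, please see the static quantization tutorial.

Quantization Aware Training for Static Quantization

Quantization Aware Training (QAT) models the effects of quantization during training, allowing for higher accuracy than other quantization methods. QAT can be applied to static, dynamic, or weight-only quantization. During training, all calculations are done in floating point, with fake_quant modules modeling the effects of quantization by clamping and rounding to simulate INT8. After model conversion, weights and activations are quantized, and activations are fused into the preceding layer where possible. QAT is commonly used with CNNs and yields higher accuracy than post training static quantization.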
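
The "clamping and rounding" performed by a fake_quant module can be sketched in a few lines of plain Python. The function below is a hypothetical illustration for a single scalar, with made-up quantization parameters; it quantizes and immediately dequantizes, so the value stays a float but is restricted to the values an INT8 tensor could represent.

```python
def fake_quantize(x, scale, zero_point, quant_min=-128, quant_max=127):
    """Quantize then immediately dequantize, so the result stays a float."""
    q = round(x / scale + zero_point)
    q = max(quant_min, min(quant_max, q))  # clamp to the integer range
    return (q - zero_point) * scale

scale, zero_point = 0.1, 0  # hypothetical quantization parameters
snapped = fake_quantize(0.26, scale, zero_point)    # rounded to the nearest grid point
clamped = fake_quantize(100.0, scale, zero_point)   # limited to quant_max * scale
print(snapped, clamped)
```

Because the output is still floating point, the rest of the training computation is unchanged; the model simply learns with the rounding and clamping error present.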

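We may need to modify the model before applying quantization aware training. Please see Model Preparation for Eager Mode Static Quantization.

Diagram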

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# model with fake_quants for modeling quantization numerics during training
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8

QAT API Example

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model needs to be set to train mode for QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown)
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)

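To learn more about quantization aware training, please see the QAT tutorial.

Model Preparation for Eager Mode Static Quantization

It is currently necessary to make some modifications to the model definition prior to Eager mode quantization, because quantization currently works on a module by module basis. Specifically, for all quantization techniques, the user needs to:

Convert any operations that require output requantization (and thus have additional parameters) from functionals to module form (for example, using torch.nn.ReLU instead of torch.nn.functional.relu).
Specify which parts of the model need to be quantized, either by assigning .qconfig attributes on submodules or by specifying qconfig_mapping. For example, setting model.conv1.qconfig = None means that the model.conv1 layer will not be quantized, and setting model.linear1.qconfig = custom_qconfig means that the quantization settings for model.linear1 will use custom_qconfig instead of the global qconfig.

For static quantization techniques which quantize activations, the user needs to do the following in addition:

Specify where activations are quantized and de-quantized. This is done using QuantStub and DeQuantStub modules.
Use FloatFunctional to wrap tensor operations that require special handling for quantization into modules. Examples are operations like add and cat, which require special handling to determine output quantization parameters.
Fuse modules: combine operations/modules into a single module to obtain higher accuracy and performance. This is done using the fuse_modules() API, which takes in lists of modules to be fused. We currently support the following fusions: [Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu].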
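
As an illustration of why a [Conv, BatchNorm] pair can be fused for inference, the sketch below folds the BatchNorm statistics into the weight and bias of the preceding layer. It is a pure-Python toy for a single channel with a scalar weight (a 1x1 conv on one value), all parameter values are hypothetical, and it shows the standard folding identity rather than the actual fuse_modules() implementation.

```python
import math

def fold_batchnorm(w, b, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold gamma * (w*x + b - mean) / sqrt(var + eps) + beta into w', b'."""
    factor = gamma / math.sqrt(running_var + eps)
    return w * factor, (b - running_mean) * factor + beta

# Hypothetical parameters of a 1-channel 1x1 conv followed by BatchNorm.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0

def conv_then_bn(x, eps=1e-5):
    """Reference: apply the conv and the BatchNorm as two separate steps."""
    return gamma * ((w * x + b) - mean) / math.sqrt(var + eps) + beta

w_fused, b_fused = fold_batchnorm(w, b, gamma, beta, mean, var)
x = 2.0
unfused_out = conv_then_bn(x)
fused_out = w_fused * x + b_fused
print(unfused_out, fused_out)
```

After folding, only one layer remains to be observed and quantized, which removes one quantization step and its rounding error from the chain.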

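(Prototype - maintenance mode) FX Graph Mode Quantization

There are multiple quantization types in post training quantization (weight only, dynamic and static), and the configuration is done through qconfig_mapping (an argument of the prepare_fx function).

FXPTQ API Example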

import torch
from torch.ao.quantization import (
  get_default_qconfig_mapping,
  get_default_qat_qconfig_mapping,
  QConfigMapping,
)
import torch.ao.quantization.quantize_fx as quantize_fx
import copy

model_fp = UserModel()

#
# post training dynamic/weight_only quantization
#

# we need to deepcopy if we still want to keep model_fp unchanged after quantization since quantization apis change the input model
model_to_quantize = copy.deepcopy(model_fp)
model_to_quantize.eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_dynamic_qconfig)
# a tuple of one or more example inputs are needed to trace the model
# (note the trailing comma: a single input still has to be wrapped in a tuple)
example_inputs = (input_fp32,)
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# post training static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qconfig_mapping("qnnpack")
model_to_quantize.eval()
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# calibrate (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# quantization aware training for static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qat_qconfig_mapping("qnnpack")
model_to_quantize.train()
# prepare
model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_mapping, example_inputs)
# training loop (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# fusion
#
model_to_quantize = copy.deepcopy(model_fp)
model_fused = quantize_fx.fuse_fx(model_to_quantize)

Please follow the tutorials below to learn more about FX Graph Mode Quantization.

(Prototype) PyTorch 2 Export Quantization

API Example

import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.export import export_for_training
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.linear(x)

# initialize a floating point model
float_model = M().eval()

# define calibration function
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)

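# Step 1. program capture
# NOTE: this API will be updated to torch.export API in the future, but the captured
# result should mostly stay the same
# example inputs are passed as a tuple of positional arguments
example_inputs = (torch.randn(1, 5),)
m = export_for_training(float_model, example_inputs).module()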
# we get a model with aten ops

# Step 2. quantization
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they
# want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
# or prepare_qat_pt2e for Quantization Aware Training
m = prepare_pt2e(m, quantizer)

# run calibration
# calibrate(m, sample_inference_data)
m = convert_pt2e(m)

# Step 3. lowering
# lower to target backend

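Please follow these tutorials to get started on PyTorch 2 Export Quantization:

Modeling Users:

Backend Developers (please check out all Modeling Users docs as well):

How to Write a Quantizer for PyTorch 2 Export Quantization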

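Quantization Stack

Quantization is the process of converting a floating point model to a quantized model. At a high level, the quantization stack can therefore be split into two parts: 1) the building blocks or abstractions for a quantized model, and 2) the building blocks or abstractions for the quantization flow that converts a floating point model to a quantized model.

Quantized Model

Quantized Tensor

In order to do quantization in PyTorch, we need to be able to represent quantized data in tensors. A quantized tensor stores quantized data (represented as int8/uint8/int32) along with quantization parameters such as scale and zero_point. Quantized tensors support many useful operations that make quantized arithmetic easy, and they allow data to be serialized in a quantized format.

PyTorch supports both per tensor and per channel symmetric and asymmetric quantization. Per tensor means that all the values within the tensor are quantized the same way with the same quantization parameters. Per channel means that along one dimension, typically the channel dimension of the tensor, each slice of values is quantized with its own quantization parameters. This reduces the error in converting tensors to quantized values, since outlier values only affect the channel they are in rather than the entire tensor.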
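
The effect of an outlier on per tensor versus per channel quantization can be seen in a small pure-Python sketch. The two channels and all helper names below are hypothetical, and symmetric int8 quantization with a max-abs scale is assumed for simplicity.

```python
def quant_dequant(values, scale):
    """Symmetric int8 round trip with a given scale."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def max_abs_scale(values):
    """Scale that maps the largest magnitude to 127."""
    return max(abs(v) for v in values) / 127

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Two hypothetical output channels; the second one contains an outlier.
channels = [[0.01, -0.02, 0.015, 0.005], [0.5, -12.0, 3.0, 1.0]]

# Per tensor: one scale shared by every value in the tensor.
shared_scale = max_abs_scale([v for ch in channels for v in ch])
per_tensor_err = mean_abs_error(channels[0], quant_dequant(channels[0], shared_scale))

# Per channel: each channel gets its own scale.
own_scale = max_abs_scale(channels[0])
per_channel_err = mean_abs_error(channels[0], quant_dequant(channels[0], own_scale))

print(f"channel 0 error, per tensor:  {per_tensor_err:.6f}")
print(f"channel 0 error, per channel: {per_channel_err:.6f}")
```

With the shared scale, the outlier in the second channel makes the step size so coarse that every value in the first channel rounds to zero; with its own scale the first channel is reproduced almost exactly.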

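The mapping from a floating point tensor to a quantized tensor is performed as follows:

$$Q(x, \text{scale}, \text{zero\_point}) = \text{round}\left(\frac{x}{\text{scale}} + \text{zero\_point}\right)$$

Note that we ensure that zero in floating point is represented with no error after quantization, so that operations like padding do not cause additional quantization error.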
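
A minimal sketch of the affine mapping round(x / scale + zero_point) in plain Python shows why zero survives the round trip exactly: 0.0 maps to zero_point, and zero_point maps back to 0.0. The scale and zero_point values are hypothetical, and an unsigned 8-bit range is assumed.

```python
def quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    """Affine mapping from a float to an unsigned 8-bit integer."""
    q = round(x / scale + zero_point)
    return max(quant_min, min(quant_max, q))

def dequantize(q, scale, zero_point):
    """Inverse mapping from the stored integer back to a float."""
    return (q - zero_point) * scale

scale, zero_point = 0.05, 128  # hypothetical quantization parameters

q_zero = quantize(0.0, scale, zero_point)
zero_restored = dequantize(q_zero, scale, zero_point)
print(q_zero, zero_restored)

q = quantize(1.23, scale, zero_point)
x_restored = dequantize(q, scale, zero_point)
print(q, x_restored)
```

Other values, such as 1.23 here, come back as the nearest representable grid point, so their error is at most half of scale as long as they fall inside the clamped range.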

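Here are a few key attributes of a quantized tensor:

QScheme (torch.qscheme): an enum that specifies the way we quantize the tensor
- torch.per_tensor_affine
- torch.per_tensor_symmetric
- torch.per_channel_affine
- torch.per_channel_symmetric
dtype (torch.dtype): data type of the quantized tensor
- torch.quint8
- torch.qint8
- torch.qint32
- torch.float16
quantization parameters (vary based on QScheme): parameters for the chosen way of quantization
- torch.per_tensor_affine would have quantization parameters of
  - scale (float)
  - zero_point (int)
- torch.per_channel_affine would have quantization parameters of
  - per_channel_scales (list of float)
  - per_channel_zero_points (list of int)
  - axis (int)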

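Quantize and Dequantize

The input and output of a model are floating point tensors, but activations in the quantized model are quantized, so we need operators to convert between floating point and quantized tensors.

Quantize (float -> quantized)
- torch.quantize_per_tensor(x, scale, zero_point, dtype)
- torch.quantize_per_channel(x, scales, zero_points, axis, dtype)
- torch.quantize_per_tensor_dynamic(x, dtype, reduce_range)
- to(torch.float16)
Dequantize (quantized -> float)
- quantized_tensor.dequantize() - calling dequantize on a torch.float16 tensor will convert the tensor back to torch.float
- torch.dequantize(x)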
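
A short round trip through two of these operators might look like the sketch below. The scale and zero_point values are arbitrary choices for illustration; in a real workflow they would come from an observer.

```python
import torch

x = torch.tensor([-1.0, 0.0, 0.5, 1.0])

# float -> quantized, with hypothetical scale and zero_point
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)
print(xq.int_repr())                    # the stored uint8 values
print(xq.q_scale(), xq.q_zero_point())  # the quantization parameters kept with the tensor

# quantized -> float
x_restored = xq.dequantize()
print(x_restored)
```

The quantized tensor carries its own scale and zero_point, so dequantize() needs no extra arguments.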

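Quantized Operators/Modules

A quantized operator is an operator that takes quantized tensors as inputs and outputs a quantized tensor.
Quantized modules are PyTorch modules that perform quantized operations. They are typically defined for weighted operations like linear and conv.

Quantized Engine

When a quantized model is executed, the qengine (torch.backends.quantized.engine) specifies which backend is to be used for execution. It is important to ensure that the qengine is compatible with the quantized model in terms of the value ranges of quantized activations and weights.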

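Quantization Flow

Observer and FakeQuantize

An Observer is a PyTorch module used to:
- collect tensor statistics, such as the min and max values, of the tensors passing through it
- calculate quantization parameters based on the collected tensor statistics
FakeQuantize is a PyTorch module used to:
- simulate quantization (performing quantize/dequantize) for a tensor in the network
- calculate quantization parameters based on the statistics collected by an observer, or learn the quantization parameters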
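
The two responsibilities of an observer can be sketched as a small pure-Python class. `MinMaxObserverSketch` is a hypothetical toy, not the PyTorch observer class; it assumes an unsigned 8-bit affine scheme and derives scale and zero_point from the running min and max.

```python
class MinMaxObserverSketch:
    """Toy observer: tracks the running min/max and derives affine parameters."""

    def __init__(self, quant_min=0, quant_max=255):
        self.quant_min, self.quant_max = quant_min, quant_max
        self.min_val, self.max_val = float("inf"), float("-inf")

    def observe(self, values):
        # Step 1: collect statistics from every tensor that passes through.
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def calculate_qparams(self):
        # Step 2: turn the statistics into quantization parameters.
        # The range is extended to include 0.0 so zero is exactly representable.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / (self.quant_max - self.quant_min) or 1.0
        zero_point = self.quant_min - round(lo / scale)
        return scale, max(self.quant_min, min(self.quant_max, zero_point))

observer = MinMaxObserverSketch()
for batch in ([0.1, 0.9, -0.3], [1.5, -0.5, 0.2]):  # hypothetical calibration batches
    observer.observe(batch)

scale, zero_point = observer.calculate_qparams()
print(scale, zero_point)
```

A FakeQuantize module would apply the resulting scale and zero_point to the tensor on every forward pass instead of only recording them.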

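QConfig

QConfig is a namedtuple of Observer or FakeQuantize module classes that can be configured with qscheme, dtype and so on. It is used to configure how an operator should be observed.
- Quantization configuration for an operator/module
  - different types of Observer/FakeQuantize
  - dtype
  - qscheme
  - quant_min/quant_max: can be used to simulate lower precision tensors
- Currently supports configuration for activation and weight
- We insert input/weight/output observers based on the qconfig that is configured for a given operator or module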

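General Quantization Flow

In general, the flow is the following:

prepare
- insert Observer/FakeQuantize modules based on the user specified qconfig
calibrate/train (depending on post training quantization or quantization aware training)
- allow Observers to collect statistics or FakeQuantize modules to learn the quantization parameters
convert
- convert a calibrated/trained model to a quantized model

There are different modes of quantization, which can be classified in two ways.

In terms of where we apply the quantization flow, we have:

Post Training Quantization (apply quantization after training; quantization parameters are calculated based on sample calibration data)
Quantization Aware Training (simulate quantization during training so that the quantization parameters can be learned together with the model using training data)

In terms of how we quantize the operators, we can have:

Weight Only Quantization (only the weight is statically quantized)
Dynamic Quantization (the weight is statically quantized, the activation is dynamically quantized)
Static Quantization (both weight and activations are statically quantized)

We can mix different ways of quantizing operators in the same quantization flow. For example, we can have post training quantization that has both statically and dynamically quantized operators.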

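Quantization Support Matrix

Quantization Mode Support

	Quantization Mode		Dataset Requirement	Works Best For	Accuracy	Notes
Post Training Quantization	Dynamic/Weight Only Quantization	activation dynamically quantized (fp16, int8) or not quantized, weight statically quantized (fp16, int8, int4)	None	LSTM, MLP, Embedding, Transformer	good	Easy to use, close to static quantization when performance is compute or memory bound due to weights
Post Training Quantization	Static Quantization	activation and weights statically quantized (int8)	calibration dataset	CNN	good	Provides best perf, may have big impact on accuracy, good for hardware that only supports int8 computation
Quantization Aware Training	Dynamic Quantization	activation and weight are fake quantized	fine-tuning dataset	MLP, Embedding	best	
Quantization Aware Training	Static Quantization	activation and weight are fake quantized	fine-tuning dataset	CNN, MLP, Embedding	best	Typically used when static quantization leads to bad accuracy, and used to close the accuracy gap

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.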

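Quantization Flow Support

Of the quantization modes described above, this section compares two: Eager Mode Quantization and FX Graph Mode Quantization.

Eager Mode Quantization is a beta feature. Users need to do fusion and specify where quantization and dequantization happen manually, and it only supports modules, not functionals.

FX Graph Mode Quantization is an automated quantization framework in PyTorch, and currently it is a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although users might need to refactor the model to make it compatible with FX Graph Mode Quantization (that is, symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. It is being integrated into domain libraries such as torchvision, so users can quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we provide general guidelines, but to actually make it work users might need to be familiar with torch.fx, especially with how to make a model symbolically traceable.

New users of quantization are encouraged to try out FX Graph Mode Quantization first; if it does not work, they may try to follow the guidelines for using FX Graph Mode Quantization or fall back to Eager Mode Quantization.

The following table compares the differences between Eager Mode Quantization and FX Graph Mode Quantization:

	Eager Mode Quantization	FX Graph Mode Quantization
Release Status	beta	prototype
Operator Fusion	Manual	Automatic
Quant/DeQuant Placement	Manual	Automatic
Quantizing Modules	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic
Support for Customization	Limited Support	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactors to make the model compatible with FX Graph Mode Quantization)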

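Backend/Hardware Support

Hardware	Kernel Library	Eager Mode Quantization	FX Graph Mode Quantization	Quantization Mode Support
server CPU	fbgemm/onednn	Supported	Supported	All Supported
mobile CPU	qnnpack/xnnpack	Supported	Supported	All Supported
server GPU	TensorRT (early prototype)	Not supported, since this requires a graph	Supported	Static Quantization

Today, PyTorch supports the following backends for running quantized operators efficiently:

x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via x86 optimized by fbgemm and onednn (see the details at RFC)
ARM CPUs (typically found in mobile/embedded devices), via qnnpack
(early prototype) support for NVidia GPU via TensorRT through fx2trt (to be open sourced)

Note for native CPU backends

We expose both x86 and qnnpack with the same native PyTorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of x86 or qnnpack is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting torch.backends.quantized.engine to x86 or qnnpack.

When preparing a quantized model, it is necessary to ensure that the qconfig and the engine used for quantized computations match the backend on which the model will be executed. The qconfig controls the type of observers used during the quantization passes. The qengine controls whether the x86 or qnnpack specific packing function is used when packing weights for linear and convolution functions and modules. For example:

Default settings for x86: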

# set the qconfig for PTQ
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
qconfig = torch.ao.quantization.get_default_qconfig('x86')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'x86'

Default settings for qnnpack:

# set the qconfig for PTQ
qconfig = torch.ao.quantization.get_default_qconfig('qnnpack')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('qnnpack')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'qnnpack'

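Operator Support

Operator coverage varies between dynamic and static quantization and is captured in the table below. Note that for FX Graph Mode Quantization, the corresponding functionals are also supported.

	Static Quantization	Dynamic Quantization
nn.Linear	Y	Y
nn.Conv1d/2d/3d	Y	N
nn.LSTM	N	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y (activations are in fp32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Not supported	Not supported
Activations	Broadly supported	Un-changed, computations stay in fp32

Note: this will be updated with some information generated from native backend_config_dict soon.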

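Quantization API Reference

The Quantization API Reference contains documentation of quantization APIs, such as quantization passes, quantized tensor operations, and supported quantized modules and functions.

Quantization Backend Configuration

The Quantization Backend Configuration contains documentation on how to configure the quantization workflows for various backends.

Quantization Accuracy Debugging

The Quantization Accuracy Debugging contains documentation on how to debug quantization accuracy.

Quantization Customizations

While default implementations of observers that select the scale factor and zero point based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model, or configured differently for different parts of the model.

We also provide support for per channel quantization for conv1d(), conv2d(), conv3d() and linear().

Quantization workflows work by adding (e.g. adding observers as .observer submodule) or replacing (e.g. converting nn.Conv2d to nn.quantized.Conv2d) submodules in the model's module hierarchy. This means that the model stays a regular nn.Module-based instance throughout the process and thus can work with the rest of the PyTorch APIs.

Quantization Custom Module API

Both Eager mode and FX graph mode quantization APIs provide a hook for the user to specify a module quantized in a custom way, with user defined logic for observation and quantization. The user needs to specify:

1. The Python type of the source fp32 module (existing in the model).
2. The Python type of the observed module (provided by the user). This module needs to define a from_float function which defines how the observed module is created from the original fp32 module.
3. The Python type of the quantized module (provided by the user). This module needs to define a from_observed function which defines how the quantized module is created from the observed module.
4. A configuration describing (1), (2), (3) above, passed to the quantization APIs.

The framework will then do the following:

1. During the prepare module swaps, it will convert every module of the type specified in (1) to the type specified in (2), using the from_float function of the class in (2).
2. During the convert module swaps, it will convert every module of the type specified in (2) to the type specified in (3), using the from_observed function of the class in (3).

Currently, there is a requirement that ObservedCustomModule will have a single Tensor output, and an observer will be added by the framework (not by the user) on that output. The observer will be stored under the activation_post_process key as an attribute of the custom module instance. These restrictions may be relaxed in the future.

Custom API Example: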

import torch
import torch.ao.nn.quantized as nnq
from torch.ao.quantization import QConfigMapping
import torch.ao.quantization.quantize_fx

# original fp32 module to replace
class CustomModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(3, 3)

    def forward(self, x):
        return self.linear(x)

# custom observed module, provided by user
class ObservedCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_float(cls, float_module):
        assert hasattr(float_module, 'qconfig')
        observed = cls(float_module.linear)
        observed.qconfig = float_module.qconfig
        return observed

# custom quantized module, provided by user
class StaticQuantCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_observed(cls, observed_module):
        assert hasattr(observed_module, 'qconfig')
        assert hasattr(observed_module, 'activation_post_process')
        observed_module.linear.activation_post_process = \
            observed_module.activation_post_process
        quantized = cls(nnq.Linear.from_float(observed_module.linear))
        return quantized

#
# example API call (Eager mode quantization)
#

m = torch.nn.Sequential(CustomModule()).eval()
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        CustomModule: ObservedCustomModule
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        ObservedCustomModule: StaticQuantCustomModule
    }
}
m.qconfig = torch.ao.quantization.default_qconfig
mp = torch.ao.quantization.prepare(
    m, prepare_custom_config_dict=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.convert(
    mp, convert_custom_config_dict=convert_custom_config_dict)
#
# example API call (FX graph mode quantization)
#
m = torch.nn.Sequential(CustomModule()).eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_qconfig)
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        "static": {
            CustomModule: ObservedCustomModule,
        }
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        "static": {
            ObservedCustomModule: StaticQuantCustomModule,
        }
    }
}
mp = torch.ao.quantization.quantize_fx.prepare_fx(
    m, qconfig_mapping, (torch.randn(3, 3),), prepare_custom_config=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.quantize_fx.convert_fx(
    mp, convert_custom_config=convert_custom_config_dict)

Best Practices

1. If you are using the x86 backend, we need to use 7 bits instead of 8 bits. Make sure you reduce the range for quant_min and quant_max: if dtype is torch.quint8, set a custom quant_min of 0 and quant_max of 127 (255 / 2); if dtype is torch.qint8, set a custom quant_min of -64 (-128 / 2) and quant_max of 63 (127 / 2). These values are already set correctly if you call the torch.ao.quantization.get_default_qconfig(backend) or torch.ao.quantization.get_default_qat_qconfig(backend) function to get the default qconfig for the x86 or qnnpack backend.
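
The reduced ranges quoted above are simply the full 8-bit ranges halved with integer (floor) division. The helper below is a hypothetical illustration of that arithmetic, not a PyTorch API.

```python
def reduced_range(quant_min, quant_max):
    """Halve an 8-bit integer range so that it fits in 7 bits."""
    return quant_min // 2, quant_max // 2

quint8_range = reduced_range(0, 255)     # full range of torch.quint8
qint8_range = reduced_range(-128, 127)   # full range of torch.qint8
print(quint8_range, qint8_range)
```

Both results span 128 integer levels, which is what "7 bits" refers to.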

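2. If the onednn backend is chosen, 8 bits will be used for activations in the default qconfig mapping torch.ao.quantization.get_default_qconfig_mapping('onednn') and the default qconfig torch.ao.quantization.get_default_qconfig('onednn'). It is recommended for use on CPUs with Vector Neural Network Instruction (VNNI) support. Otherwise, set reduce_range to True on the activation observer to get better accuracy on CPUs without VNNI support.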

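Frequently Asked Questions

How can I do quantized inference on GPU?

We don't have official GPU support yet, but this is an area of active development. You can find more information here.

Where can I get ONNX support for my quantized model?

If you get errors exporting the model (using APIs under torch.onnx), you may open an issue in the PyTorch repository. Prefix the issue title with [ONNX] and tag the issue as module: onnx.

If you encounter issues with ONNX Runtime, open an issue at GitHub - microsoft/onnxruntime.

How can I use quantization with LSTMs?

LSTM is supported through our custom module API in both Eager mode and FX graph mode quantization. Examples can be found at Eager Mode: pytorch/test_quantized_op.py TestQuantizedOps.test_custom_module_lstm and FX Graph Mode: pytorch/test_quantize_fx.py TestQuantizeFx.test_static_lstm.

Common Errors

Passing a non-quantized Tensor into a quantized kernel

If you see an error similar to:

RuntimeError: Could not run 'quantized::some_operator' with arguments from the 'CPU' backend...

This means that you are trying to pass a non-quantized Tensor to a quantized kernel. A common workaround is to use torch.ao.quantization.QuantStub to quantize the tensor. This needs to be done manually in Eager mode quantization. An end-to-end example: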

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv(x)
        return x

Passing a quantized Tensor into a non-quantized kernel