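Quantization¶

Warning

Quantization is in beta and subject to change.

Introduction to Quantization¶

Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. A quantized model executes some or all of its operations on tensors with reduced precision rather than full-precision (floating point) values. This allows for a more compact model representation and the use of high-performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization, which, compared to typical FP32 models, allows for a 4x reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster than FP32 compute. Quantization is primarily a technique to speed up inference, and only the forward pass is supported for quantized operators.

PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32 and then converted to INT8. In addition, PyTorch supports quantization aware training, which models quantization errors in both the forward and backward passes using fake-quantization modules. Note that during quantization aware training the entire computation is carried out in floating point. At the end of quantization aware training, PyTorch provides conversion functions to convert the trained model into lower precision.

At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are also provided that incorporate typical workflows for converting an FP32 model to lower precision with minimal accuracy loss.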

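Quantization API Summary¶

PyTorch provides three different modes of quantization: Eager Mode Quantization, FX Graph Mode Quantization (maintenance) and PyTorch 2 Export Quantization.

Eager Mode Quantization is a beta feature. The user needs to do fusion and specify where quantization and dequantization happen manually, and it only supports modules, not functionals.

FX Graph Mode Quantization is an automated quantization workflow in PyTorch. It is currently a prototype feature and is in maintenance mode now that PyTorch 2 Export Quantization is available. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. We will integrate it into domain libraries like torchvision, and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we will provide general guidelines, but to actually make it work, users might need to be familiar with torch.fx, especially with how to make a model symbolically traceable.

PyTorch 2 Export Quantization is the new full graph mode quantization workflow, released as a prototype feature in PyTorch 2.1. With PyTorch 2, we are moving to a better solution for full program capture (torch.export), since it can capture a higher percentage of models (88.8% on 14K models) than torch.fx.symbolic_trace (72.7% on 14K models), the program capture solution used by FX Graph Mode Quantization. torch.export still has limitations around some Python constructs and requires user involvement to support dynamism in the exported model, but overall it is an improvement over the previous program capture solution. PyTorch 2 Export Quantization is built for models captured by torch.export, with the flexibility and productivity of both modeling users and backend developers in mind. Its main features are: (1) a programmable API for configuring how a model is quantized that can scale to many more use cases; (2) a simplified user experience for modeling users and backend developers, since they only need to interact with a single object (the Quantizer) to express the user's intention about how to quantize a model and what the backend supports; (3) an optional reference quantized model representation that can represent quantized computation with integer operations that map closer to the actual quantized computations that happen in hardware.

New users of quantization are encouraged to try out PyTorch 2 Export Quantization first; if it does not work well, they can try Eager Mode Quantization.

The following table compares the differences between Eager Mode Quantization, FX Graph Mode Quantization and PyTorch 2 Export Quantization: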

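	Eager Mode Quantization	FX Graph Mode Quantization	PyTorch 2 Export Quantization
Release Status	beta	prototype (maintenance)	prototype
Operator Fusion	Manual	Automatic	Automatic
Quant/DeQuant Placement	Manual	Automatic	Automatic
Quantizing Modules	Supported	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic	Supported
Support for Customization	Limited Support	Fully Supported	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Defined by the backend-specific Quantizer
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactoring to make the model compatible with FX Graph Mode Quantization)	`torch.fx.GraphModule` (captured by `torch.export`)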

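Three types of quantization are supported:

1. dynamic quantization (weights quantized, with activations read/stored in floating point and quantized for compute)
2. static quantization (weights quantized, activations quantized, calibration required after training)
3. static quantization aware training (weights quantized, activations quantized, quantization numerics modeled during training)

For a more comprehensive overview of the tradeoffs between these quantization types, please see our blog post Introduction to Quantization on PyTorch.

Operator coverage differs between dynamic and static quantization and is listed in the table below.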

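	Static Quantization	Dynamic Quantization
nn.Linear	Y	Y
nn.Conv1d/2d/3d	Y	N
nn.LSTM	Y (through custom modules)	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y (activations are in fp32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Y (through custom modules)	Not supported
Activations	Broadly supported	Unchanged, computations stay in fp32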

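Eager Mode Quantization¶

For a general introduction to the quantization flow, including the different types of quantization, please see General Quantization Flow.

Post Training Dynamic Quantization¶

This is the simplest form of quantization to apply: the weights are quantized ahead of time, but the activations are dynamically quantized during inference. It is used in situations where the model execution time is dominated by loading weights from memory rather than by computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch sizes.

Diagram: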

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# dynamically quantized model
# linear and LSTM weights are in int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8

PTDQ API Example:

import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)

To learn more about dynamic quantization, please see our dynamic quantization tutorial.
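To make the size reduction concrete, the following is a minimal sketch that compares the serialized size of a float model with its dynamically quantized counterpart. The toy model, its layer sizes, and the helper `serialized_size_bytes` are illustrative assumptions, not part of the PyTorch API; the sketch also assumes a quantized CPU engine is available in the local build.

```python
import io

import torch


def serialized_size_bytes(model: torch.nn.Module) -> int:
    """Hypothetical helper: size of the model's state_dict when saved to memory."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Toy float model; the layer sizes are arbitrary.
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 256),
).eval()

# Quantize the weights of all Linear layers to int8.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

size_fp32 = serialized_size_bytes(model_fp32)
size_int8 = serialized_size_bytes(model_int8)
print(f"fp32: {size_fp32} bytes, int8: {size_int8} bytes")

# Inputs and outputs of the quantized model are still floating point tensors.
out = model_int8(torch.randn(8, 256))
```

The ratio will typically be somewhat below 4x, since biases stay in floating point and the serialized file carries some fixed overhead.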

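Post Training Static Quantization¶

Post Training Static Quantization (PTQ static) quantizes the weights and activations of the model. It fuses activations into preceding layers where possible. It requires calibration with a representative dataset to determine optimal quantization parameters for activations. Post Training Static Quantization is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case.

We may need to modify the model before applying post training static quantization. Please see Model Preparation for Eager Mode Static Quantization.

Diagram: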

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                    /
    linear_weight_fp32

# statically quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                    /
  linear_weight_int8

PTSQ API Example:

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

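To learn more about static quantization, please see the static quantization tutorial.

Quantization Aware Training for Static Quantization¶

Quantization Aware Training (QAT) models the effects of quantization during training, allowing for higher accuracy compared to other quantization methods. We can do QAT for static, dynamic or weight-only quantization. During training, all calculations are done in floating point, with fake_quant modules modeling the effects of quantization by clamping and rounding to simulate the effects of INT8. After model conversion, weights and activations are quantized, and activations are fused into the preceding layer where possible. It is commonly used with CNNs and yields higher accuracy than post training static quantization.

We may need to modify the model before applying quantization aware training for static quantization. Please see Model Preparation for Eager Mode Static Quantization.

Diagram: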

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# model with fake_quants for modeling quantization numerics during training
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8

QAT API Example:

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model needs to be set to train mode for QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown); `training_loop` stands for
# user-provided training code and is not defined in this example
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)

To learn more about quantization aware training, please see the QAT tutorial.

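Model Preparation for Eager Mode Static Quantization¶

It is currently necessary to make some modifications to the model definition prior to Eager Mode Quantization. This is because quantization currently works on a module-by-module basis. Specifically, for all quantization techniques, the user needs to:

1. Convert any operations that require output requantization (and thus have additional parameters) from functionals to module form (for example, use torch.nn.ReLU instead of torch.nn.functional.relu).
2. Specify which parts of the model need to be quantized, either by assigning .qconfig attributes on submodules or by specifying qconfig_mapping. For example, setting model.conv1.qconfig = None means that the model.conv1 layer will not be quantized, and setting model.linear1.qconfig = custom_qconfig means that the quantization settings for model.linear1 will use custom_qconfig instead of the global qconfig.

For static quantization techniques, which quantize activations, the user additionally needs to do the following:

1. Specify where activations are quantized and dequantized. This is done using QuantStub and DeQuantStub modules.
2. Use FloatFunctional to wrap tensor operations that require special handling for quantization into modules. Examples are operations like add and cat, which require special handling to determine output quantization parameters.
3. Fuse modules: combine operations/modules into a single module to obtain higher accuracy and performance. This is done using the fuse_modules() API, which takes in lists of modules to be fused. We currently support the following fusions: [Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu]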
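The preparation steps above can be sketched on a small residual block. The class `ResidualBlock` and its layer sizes are hypothetical and chosen only for illustration; the sketch stops after fusion and runs the model in floating point, so it shows the model structure rather than a full quantization run.

```python
import torch
from torch.ao.nn.quantized import FloatFunctional


class ResidualBlock(torch.nn.Module):
    """Hypothetical block written in a form that eager mode can quantize."""

    def __init__(self):
        super().__init__()
        # Explicit markers for where activations get quantized/dequantized.
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        # Module form instead of torch.nn.functional.relu.
        self.relu = torch.nn.ReLU()
        # Module wrapper that replaces the `+` operator.
        self.skip_add = FloatFunctional()
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        y = self.relu(self.conv(x))
        out = self.skip_add.add(x, y)  # instead of `x + y`
        return self.dequant(out)


block = ResidualBlock().eval()
# Global qconfig; setting a submodule's qconfig to None would exclude it.
block.qconfig = torch.ao.quantization.default_qconfig
# Fuse conv + relu into one module; returns a fused copy by default.
fused = torch.ao.quantization.fuse_modules(block, [["conv", "relu"]])

x = torch.randn(2, 1, 4, 4)
out = fused(x)  # still floating point: the stubs are pass-through before convert
```

Before `prepare` and `convert` are applied, the stubs and `FloatFunctional` behave like their float equivalents, so the refactored model can be validated against the original float model first.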

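(Prototype - maintenance mode) FX Graph Mode Quantization¶

There are multiple quantization types in post training quantization (weight only, dynamic and static), and the configuration is done through qconfig_mapping (an argument of the prepare_fx function).

FXPTQ API Example: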

import torch
from torch.ao.quantization import (
  get_default_qconfig_mapping,
  get_default_qat_qconfig_mapping,
  QConfigMapping,
)
import torch.ao.quantization.quantize_fx as quantize_fx
import copy

model_fp = UserModel()

#
# post training dynamic/weight_only quantization
#

# we need to deepcopy if we still want to keep model_fp unchanged after quantization since quantization apis change the input model
model_to_quantize = copy.deepcopy(model_fp)
model_to_quantize.eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_dynamic_qconfig)
# a tuple of one or more example inputs is needed to trace the model
# (note the trailing comma: `(input_fp32)` alone would not be a tuple)
example_inputs = (input_fp32,)
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# post training static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qconfig_mapping("qnnpack")
model_to_quantize.eval()
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# calibrate (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# quantization aware training for static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qat_qconfig_mapping("qnnpack")
model_to_quantize.train()
# prepare
model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_mapping, example_inputs)
# training loop (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# fusion
#
model_to_quantize = copy.deepcopy(model_fp)
model_fused = quantize_fx.fuse_fx(model_to_quantize)

Please follow the tutorials below to learn more about FX Graph Mode Quantization:

(Prototype) PyTorch 2 Export Quantization¶

API Example:

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch._export import capture_pre_autograd_graph
# XNNPACKQuantizer lives in the xnnpack_quantizer submodule
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.linear(x)

# initialize a floating point model
float_model = M().eval()

# define calibration function
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)

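# Step 1. program capture
# NOTE: this API will be updated to torch.export API in the future, but the captured
# result should mostly stay the same
# example inputs are passed as a tuple; the shape matches Linear(5, 10) above
example_inputs = (torch.randn(1, 5),)
m = capture_pre_autograd_graph(float_model, example_inputs)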
# we get a model with aten ops

# Step 2. quantization
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they
# want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
# or prepare_qat_pt2e for Quantization Aware Training
m = prepare_pt2e(m, quantizer)

# run calibration
# calibrate(m, sample_inference_data)
m = convert_pt2e(m)

# Step 3. lowering
# lower to target backend

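Please follow these tutorials to get started with PyTorch 2 Export Quantization:

Modeling Users:

Backend Developers (please check out all Modeling Users docs as well):

How to Write a Quantizer for PyTorch 2 Export Quantization

Quantization Stack¶

Quantization is the process of converting a floating point model to a quantized model. So at a high level the quantization stack can be split into two parts: 1) the building blocks or abstractions for a quantized model, and 2) the building blocks or abstractions for the quantization flow that converts a floating point model to a quantized model.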

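Quantized Model¶

Quantized Tensor¶

In order to do quantization in PyTorch, we need to be able to represent quantized data in Tensors. A Quantized Tensor allows for storing quantized data (represented as int8/uint8/int32) along with quantization parameters like scale and zero_point. Quantized Tensors allow for many useful operations that make quantized arithmetic easy, in addition to allowing for serialization of data in a quantized format.

PyTorch supports both per tensor and per channel symmetric and asymmetric quantization. Per tensor means that all the values within the tensor are quantized the same way with the same quantization parameters. Per channel means that, along one dimension (typically the channel dimension of a tensor), each slice of the tensor is quantized with its own quantization parameters. This allows for less error when converting tensors to quantized values, since outlier values only impact the channel they are in instead of the entire Tensor.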

The mapping is performed by converting the floating point tensors using:

$Q(x, \text{scale}, \text{zero\_point}) = \text{round}\left(\frac{x}{\text{scale}} + \text{zero\_point}\right)$

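Note that we ensure that zero in floating point is represented with no error after quantization, thereby ensuring that operations like padding do not cause additional quantization error.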
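As a worked illustration of this mapping, here is a minimal pure-Python sketch of affine quantization, assuming the form round(x / scale + zero_point) followed by clamping to the integer range. The helper names `choose_qparams`, `quantize` and `dequantize` are hypothetical and only mimic what observers and quantize ops do; they are not PyTorch functions, and the sketch assumes the observed range is not degenerate (scale > 0).

```python
def choose_qparams(x_min, x_max, quant_min=0, quant_max=255):
    """Pick scale and zero_point so [x_min, x_max] maps onto [quant_min, quant_max]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (quant_max - quant_min)
    zero_point = quant_min - round(x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    """Map a float to an integer in [quant_min, quant_max]."""
    q = round(x / scale + zero_point)
    return max(quant_min, min(quant_max, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to a float."""
    return (q - zero_point) * scale


# Values observed in the range [-1.0, 3.0], quantized to unsigned 8 bits.
scale, zero_point = choose_qparams(-1.0, 3.0)

q_zero = quantize(0.0, scale, zero_point)
q_one = quantize(1.0, scale, zero_point)

print(scale, zero_point)
print(q_zero, dequantize(q_zero, scale, zero_point))
print(q_one, dequantize(q_one, scale, zero_point))
```

Because the zero point is itself an integer in the quantized range, 0.0 round-trips exactly, while other values carry an error of at most half a quantization step.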

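Here are a few key attributes of a quantized Tensor:

QScheme (torch.qscheme): an enum that specifies the way we quantize the Tensor
- torch.per_tensor_affine
- torch.per_tensor_symmetric
- torch.per_channel_affine
- torch.per_channel_symmetric
dtype (torch.dtype): data type of the quantized Tensor
- torch.quint8
- torch.qint8
- torch.qint32
- torch.float16
quantization parameters (vary based on QScheme): parameters for the chosen way of quantization
- torch.per_tensor_affine has the following quantization parameters
  - scale (float)
  - zero_point (int)
- torch.per_channel_affine has the following quantization parameters
  - per_channel_scales (list of float)
  - per_channel_zero_points (list of int)
  - axis (int)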
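These attributes can be inspected directly on a quantized Tensor. The sketch below builds a per-channel quantized tensor with hand-picked scales and zero points (arbitrary illustrative values, chosen so that every input is exactly representable) and reads its attributes back.

```python
import torch

# Two channels (rows) with very different value ranges.
x = torch.tensor([[-1.0, 0.0, 1.0],
                  [-8.0, 0.0, 8.0]])
scales = torch.tensor([0.5, 4.0])     # one scale per channel
zero_points = torch.tensor([0, 0])    # one zero point per channel (int64)

q = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.qint8)

print(q.qscheme())                    # the quantization scheme of the tensor
print(q.dtype)                        # the quantized data type
print(q.q_per_channel_scales())       # per-channel scales
print(q.q_per_channel_zero_points())  # per-channel zero points
print(q.q_per_channel_axis())         # the axis the parameters apply to
print(q.int_repr())                   # the underlying integer values
```

With a single per-tensor scale, the first row would have to share the coarse step size needed for the second row; per-channel parameters avoid that.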

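Quantize and Dequantize¶

The input and output of a model are floating point Tensors, but activations in the quantized model are quantized, so we need operators to convert between floating point and quantized Tensors.

Quantize (float -> quantized)
- torch.quantize_per_tensor(x, scale, zero_point, dtype)
- torch.quantize_per_channel(x, scales, zero_points, axis, dtype)
- torch.quantize_per_tensor_dynamic(x, dtype, reduce_range)
- to(torch.float16)
Dequantize (quantized -> float)
- quantized_tensor.dequantize() - calling dequantize on a torch.float16 Tensor will convert the Tensor back to torch.float
- torch.dequantize(x)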
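A short round trip through these operators might look like the following sketch. The scale and zero point are arbitrary illustrative values chosen so that each input is exactly representable; with real data some rounding error is expected.

```python
import torch

x = torch.tensor([-1.0, 0.0, 1.0, 2.0])

# Static parameters: the caller supplies scale and zero_point.
q = torch.quantize_per_tensor(x, scale=0.5, zero_point=2, dtype=torch.quint8)
print(q.int_repr())        # underlying uint8 values
x_roundtrip = q.dequantize()

# Dynamic parameters: scale and zero_point are derived from the data itself.
q_dyn = torch.quantize_per_tensor_dynamic(x, torch.quint8, False)
print(q_dyn.q_scale(), q_dyn.q_zero_point())
x_dyn_roundtrip = torch.dequantize(q_dyn)
```

The dynamic variant is what dynamic quantization relies on at inference time, since activation ranges are not known ahead of time.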

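Quantized Operators/Modules¶

Quantized operators are operators that take quantized Tensors as inputs and output quantized Tensors.
Quantized modules are PyTorch modules that perform quantized operations. They are typically defined for weighted operations like linear and conv.

Quantized Engine¶

When a quantized model is executed, the qengine (torch.backends.quantized.engine) specifies which backend is to be used for execution. It is important to ensure that the qengine is compatible with the quantized model in terms of the value range of quantized activations and weights.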

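Quantization Flow¶

Observer and FakeQuantize¶

Observers are PyTorch modules used to:
- collect tensor statistics, such as the min and max values of the Tensors passing through the observer
- calculate quantization parameters based on the collected tensor statistics
FakeQuantize modules are PyTorch modules used to:
- simulate quantization (performing quantize/dequantize) for a Tensor in the network
- calculate quantization parameters based on the statistics collected from the observer, or learn the quantization parameters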
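QConfig¶

QConfig is a namedtuple of Observer or FakeQuantize module classes that can be configured with qscheme, dtype, etc. It is used to configure how an operator should be observed.
- Quantization configuration for an operator/module
  - different types of Observer/FakeQuantize
  - dtype
  - qscheme
  - quant_min/quant_max: can be used to simulate lower precision Tensors
- Currently supports configuration for activation and weight
- We insert input/weight/output observers based on the qconfig that is configured for a given operator or module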
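General Quantization Flow¶

In general, the flow is the following:

prepare
- insert Observer/FakeQuantize modules based on the user-specified qconfig
calibrate/train (depending on post training quantization or quantization aware training)
- allow Observers to collect statistics or FakeQuantize modules to learn the quantization parameters
convert
- convert a calibrated/trained model to a quantized model

There are different modes of quantization; they can be classified in two ways.

In terms of where we apply the quantization flow, we have:

1. Post Training Quantization (apply quantization after training; quantization parameters are calculated based on sample calibration data)
2. Quantization Aware Training (simulate quantization during training so that the quantization parameters can be learned together with the model using training data)

In terms of how we quantize the operators, we can have:

1. Weight Only Quantization (only the weight is statically quantized)
2. Dynamic Quantization (the weight is statically quantized, the activation is dynamically quantized)
3. Static Quantization (both weight and activations are statically quantized)

We can mix different ways of quantizing operators in the same quantization flow. For example, we can have post training quantization that has both statically and dynamically quantized operators.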

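Quantization Support Matrix¶

Quantization Mode Support¶

	Quantization Mode		Dataset Requirement	Works Best For	Accuracy	Notes
Post Training Quantization	Dynamic/Weight Only Quantization	activation dynamically quantized (fp16, int8) or not quantized, weight statically quantized (fp16, int8, int4)	None	LSTM, MLP, Embedding, Transformer	good	Easy to use, close to static quantization when performance is compute or memory bound due to weights
Post Training Quantization	Static Quantization	activation and weights statically quantized (int8)	calibration dataset	CNN	good	Provides best performance, may have a big impact on accuracy, good for hardware that only supports int8 computation
Quantization Aware Training	Dynamic Quantization	activation and weight are fake quantized	fine-tuning dataset	MLP, Embedding	best	Limited support for now
Quantization Aware Training	Static Quantization	activation and weight are fake quantized	fine-tuning dataset	CNN, MLP, Embedding	best	Typically used when static quantization leads to bad accuracy, and used to close the accuracy gap

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.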

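Quantization Flow Support¶

PyTorch provides two modes of quantization: Eager Mode Quantization and FX Graph Mode Quantization.

Eager Mode Quantization is a beta feature. The user needs to do fusion and specify where quantization and dequantization happen manually, and it only supports modules, not functionals.

FX Graph Mode Quantization is an automated quantization framework in PyTorch, and currently it is a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. We will integrate it into domain libraries like torchvision, and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we will provide general guidelines, but to actually make it work, users might need to be familiar with torch.fx, especially with how to make a model symbolically traceable.

New users of quantization are encouraged to try out FX Graph Mode Quantization first. If it does not work, they may try to follow the guidelines for using FX Graph Mode Quantization or fall back to Eager Mode Quantization.

The following table compares the differences between Eager Mode Quantization and FX Graph Mode Quantization:

	Eager Mode Quantization	FX Graph Mode Quantization
Release Status	beta	prototype
Operator Fusion	Manual	Automatic
Quant/DeQuant Placement	Manual	Automatic
Quantizing Modules	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic
Support for Customization	Limited Support	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactoring to make the model compatible with FX Graph Mode Quantization)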

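Backend/Hardware Support¶

Hardware	Kernel Library	Eager Mode Quantization	FX Graph Mode Quantization	Quantization Mode Support
Server CPU	fbgemm/onednn	Supported		All Supported
Mobile CPU	qnnpack/xnnpack	Supported		All Supported
Server GPU	TensorRT (early prototype)	Not supported, since it requires a graph	Supported	Static Quantization

Today, PyTorch supports the following backends for running quantized operators efficiently:

- x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via x86, optimized by fbgemm and onednn (see the details in the RFC)
- ARM CPUs (typically found in mobile/embedded devices), via qnnpack
- (early prototype) NVIDIA GPUs via TensorRT through fx2trt (to be open sourced)

Note for native CPU backends¶

We expose both x86 and qnnpack with the same native PyTorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of x86 and qnnpack is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting torch.backends.quantized.engine to x86 or qnnpack.

When preparing a quantized model, it is necessary to ensure that the qconfig and the engine used for quantized computations match the backend on which the model will be executed. The qconfig controls the type of observers used during the quantization passes. The qengine controls whether x86- or qnnpack-specific packing functions are used when packing weights for linear and convolution functions and modules. For example:

Default settings for x86: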

# set the qconfig for PTQ
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
qconfig = torch.ao.quantization.get_default_qconfig('x86')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'x86'

Default settings for qnnpack:

# set the qconfig for PTQ
qconfig = torch.ao.quantization.get_default_qconfig('qnnpack')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('qnnpack')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'qnnpack'
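Since the available engines depend on how PyTorch was built, it can help to check them before choosing one. The sketch below lists the engines of the local build and selects the first available one from a preference order; the preference order itself is an assumption made for illustration.

```python
import torch

# Engines compiled into this PyTorch build.
supported = torch.backends.quantized.supported_engines
print("supported engines:", supported)
print("current engine:", torch.backends.quantized.engine)

# Pick the first preferred engine that this build actually provides.
for name in ("x86", "qnnpack"):  # illustrative preference order
    if name in supported:
        torch.backends.quantized.engine = name
        break

print("selected engine:", torch.backends.quantized.engine)
```

The qconfig should then be obtained for the same backend name, so that observers and weight packing agree.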

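Operator Support¶

Operator coverage differs between dynamic and static quantization and is captured in the table below. Note that for FX Graph Mode Quantization, the corresponding functionals are also supported.

	Static Quantization	Dynamic Quantization
nn.Linear	Y	Y
nn.Conv1d/2d/3d	Y	N
nn.LSTM	N	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y (activations are in fp32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Not supported	Not supported
Activations	Broadly supported	Unchanged, computations stay in fp32

Note: this will be updated soon with some information generated from the native backend_config_dict.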

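Quantization API Reference¶

The Quantization API Reference contains documentation of quantization APIs, such as quantization passes, quantized tensor operations, and supported quantized modules and functions.

Quantization Backend Configuration¶

The Quantization Backend Configuration contains documentation on how to configure the quantization workflows for various backends.

Quantization Accuracy Debugging¶

The Quantization Accuracy Debugging contains documentation on how to debug quantization accuracy.

Quantization Customizations¶

While default implementations of observers that select the scale factor and bias based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model or configured differently for different parts of the model.

We also provide support for per channel quantization for conv1d(), conv2d(), conv3d() and linear().

Quantization workflows work by adding (e.g. adding observers as .observer submodules) or replacing (e.g. converting nn.Conv2d to nn.quantized.Conv2d) submodules in the model's module hierarchy. This means that the model stays a regular nn.Module-based instance throughout the process and thus can work with the rest of the PyTorch APIs.

Quantization Custom Module API¶

Both the Eager Mode and FX Graph Mode Quantization APIs provide a hook for the user to specify modules quantized in a custom way, with user-defined logic for observation and quantization. The user needs to specify:

1. The Python type of the source fp32 module (existing in the model)
2. The Python type of the observed module (provided by the user). This module needs to define a from_float function which defines how the observed module is created from the original fp32 module.
3. The Python type of the quantized module (provided by the user). This module needs to define a from_observed function which defines how the quantized module is created from the observed module.
4. A configuration describing (1), (2), (3) above, passed to the quantization APIs.

The framework will then do the following:

1. During the prepare module swaps, it will convert every module of the type specified in (1) to the type specified in (2), using the from_float function of the class in (2).
2. During the convert module swaps, it will convert every module of the type specified in (2) to the type specified in (3), using the from_observed function of the class in (3).

Currently, there is a requirement that ObservedCustomModule has a single Tensor output, and an observer will be added by the framework (not by the user) on that output. The observer will be stored under the activation_post_process key as an attribute of the custom module instance. Relaxing these restrictions may be done at a future time.

Custom API Example: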

import torch
import torch.ao.nn.quantized as nnq
from torch.ao.quantization import QConfigMapping
import torch.ao.quantization.quantize_fx

# original fp32 module to replace
class CustomModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(3, 3)

    def forward(self, x):
        return self.linear(x)

# custom observed module, provided by user
class ObservedCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_float(cls, float_module):
        assert hasattr(float_module, 'qconfig')
        observed = cls(float_module.linear)
        observed.qconfig = float_module.qconfig
        return observed

# custom quantized module, provided by user
class StaticQuantCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_observed(cls, observed_module):
        assert hasattr(observed_module, 'qconfig')
        assert hasattr(observed_module, 'activation_post_process')
        observed_module.linear.activation_post_process = \
            observed_module.activation_post_process
        quantized = cls(nnq.Linear.from_float(observed_module.linear))
        return quantized

#
# example API call (Eager mode quantization)
#

m = torch.nn.Sequential(CustomModule()).eval()
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        CustomModule: ObservedCustomModule
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        ObservedCustomModule: StaticQuantCustomModule
    }
}
m.qconfig = torch.ao.quantization.default_qconfig
mp = torch.ao.quantization.prepare(
    m, prepare_custom_config_dict=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.convert(
    mp, convert_custom_config_dict=convert_custom_config_dict)
#
# example API call (FX graph mode quantization)
#
m = torch.nn.Sequential(CustomModule()).eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_qconfig)
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        "static": {
            CustomModule: ObservedCustomModule,
        }
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        "static": {
            ObservedCustomModule: StaticQuantCustomModule,
        }
    }
}
# example_inputs is expected to be a tuple of positional inputs
mp = torch.ao.quantization.quantize_fx.prepare_fx(
    m, qconfig_mapping, (torch.randn(3, 3),), prepare_custom_config=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.quantize_fx.convert_fx(
    mp, convert_custom_config=convert_custom_config_dict)

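Best Practices¶

1. If you are using the x86 backend, we need to use 7 bits instead of 8 bits. Make sure you reduce the range for quant_min and quant_max: for example, if dtype is torch.quint8, make sure to set a custom quant_min of 0 and quant_max of 127 (255 / 2); if dtype is torch.qint8, make sure to set a custom quant_min of -64 (-128 / 2) and quant_max of 63 (127 / 2). We already set these values correctly if you call torch.ao.quantization.get_default_qconfig(backend) or torch.ao.quantization.get_default_qat_qconfig(backend) to get the default qconfig for the x86 or qnnpack backend.

2. If the onednn backend is selected, 8 bits will be used for activations in the default qconfig mapping torch.ao.quantization.get_default_qconfig_mapping('onednn') and the default qconfig torch.ao.quantization.get_default_qconfig('onednn'). It is recommended to use this on CPUs with Vector Neural Network Instruction (VNNI) support. Otherwise, set the reduce_range of the activation observer to True to get better accuracy on CPUs without VNNI support.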
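The effect of reducing the range can be checked on an observer directly. The sketch below uses `MinMaxObserver` as an illustrative choice and reads back the integer range it will quantize into; warnings are silenced because `reduce_range` may emit a notice suggesting `quant_min`/`quant_max` instead.

```python
import warnings

import torch
from torch.ao.quantization.observer import MinMaxObserver

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Full 8-bit unsigned range.
    full = MinMaxObserver(dtype=torch.quint8)
    # Reduced ranges, as recommended above for backends that need 7 bits.
    reduced_unsigned = MinMaxObserver(dtype=torch.quint8, reduce_range=True)
    reduced_signed = MinMaxObserver(dtype=torch.qint8, reduce_range=True)

print(full.quant_min, full.quant_max)
print(reduced_unsigned.quant_min, reduced_unsigned.quant_max)
print(reduced_signed.quant_min, reduced_signed.quant_max)
```

The reduced ranges match the values quoted in the first item: 0 to 127 for torch.quint8 and -64 to 63 for torch.qint8.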

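Frequently Asked Questions¶

How can I do quantized inference on GPU?

We don't have official GPU support yet, but this is an area of active development; you can find more information here.

Where can I get ONNX support for my quantized model?

If you get errors exporting the model (using APIs under torch.onnx), you may open an issue in the PyTorch repository. Prefix the issue title with [ONNX] and tag the issue as module: onnx.

If you encounter issues with ONNX Runtime, open an issue at GitHub - microsoft/onnxruntime.

How can I use quantization with LSTMs?

LSTM is supported through our custom module API in both Eager Mode and FX Graph Mode Quantization. Examples can be found at: Eager Mode: pytorch/test_quantized_op.py TestQuantizedOps.test_custom_module_lstm; FX Graph Mode: pytorch/test_quantize_fx.py TestQuantizeFx.test_static_lstm.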

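Common Errors¶

Passing a non-quantized Tensor into a quantized kernel¶

If you see an error similar to:

RuntimeError: Could not run 'quantized::some_operator' with arguments from the 'CPU' backend...

this means that you are trying to pass a non-quantized Tensor to a quantized kernel. A common workaround is to use torch.ao.quantization.QuantStub to quantize the tensor. In Eager Mode Quantization this needs to be done manually. An end-to-end example: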

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv(x)
        return x

Passing a quantized Tensor into a non-quantized kernel¶