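Introduction to LLMs in ExecuTorch

Welcome to the LLM manual! This manual provides a practical example of how to use ExecuTorch to onboard your own large language models (LLMs). Our primary goal is to offer a clear and concise guide to integrating ExecuTorch with your own LLMs.

Note that this project is intended as a demonstration, not a fully functional example with optimal performance. Some components, such as the sampler and tokenizer, are therefore provided in their most minimal form, purely for demonstration. As a result, the output produced by the model may vary and will not always be optimal.

We encourage you to use this project as a starting point and adapt it to your needs, including creating your own versions of the tokenizer, sampler, acceleration backends, and other components. We hope this project serves as a useful guide on your journey with LLMs and ExecuTorch.

To deploy Llama with optimal performance, see the Llama guide.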

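Table of Contents

Prerequisites
Hello World Example
Quantization
Using Mobile Acceleration
Debugging and Profiling
How to Use Custom Kernels
How to Build Mobile Apps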

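Prerequisites

To follow this guide, you'll need to clone the ExecuTorch repository and install its dependencies. ExecuTorch recommends Python 3.10 and Conda for managing your environment. Conda is not required, but be aware that you may need to replace python/pip with python3/pip3 depending on your environment.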

conda

See the miniconda installation instructions to install miniconda.

# Create a directory for this example.
mkdir et-nanogpt
cd et-nanogpt

# Clone the ExecuTorch repository.
mkdir third-party
git clone -b release/0.6 https://github.com/pytorch/executorch.git third-party/executorch && cd third-party/executorch

# Create either a Python virtual environment:
python3 -m venv .venv && source .venv/bin/activate && pip install --upgrade pip

# Or a Conda environment:
conda create -yn executorch python=3.10.0 && conda activate executorch

# Install requirements
./install_executorch.sh

cd ../..

pyenv-virtualenv

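See the pyenv-virtualenv installation guide to install pyenv-virtualenv.

Importantly, if pyenv is installed through brew, it does not automatically enable pyenv in the terminal, which leads to errors. Run the following commands to enable it. See the pyenv-virtualenv installation guide above for how to add these to your .bashrc or .zshrc so that you don't have to run them manually.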

eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

# Create a directory for this example.
mkdir et-nanogpt
cd et-nanogpt

pyenv install -s 3.10
pyenv virtualenv 3.10 executorch
pyenv activate executorch

# Clone the ExecuTorch repository.
git clone -b release/0.6 https://github.com/pytorch/executorch.git third-party/executorch && cd third-party/executorch

# Install requirements.
PYTHON_EXECUTABLE=python ./install_executorch.sh

cd ../..

For more information, see Setting Up ExecuTorch.

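Running a Large Language Model Locally

This example uses Karpathy's nanoGPT, a minimal implementation of GPT-2 124M. The guide applies to other language models as well, since ExecuTorch is model-agnostic.

There are two steps to running a model with ExecuTorch:

Export the model. This step preprocesses it into a format suitable for runtime execution.
At runtime, load the model file and run it with the ExecuTorch runtime.

The export step happens ahead of time, typically as part of the application build or when the model changes. The resulting .pte file is distributed with the application. At runtime, the application loads the .pte file and passes it to the ExecuTorch runtime.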

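Step 1. Exporting to ExecuTorch

Exporting takes a PyTorch model and converts it into a format that can run efficiently on consumer devices.

For this example, you will need the nanoGPT model and the corresponding tokenizer vocabulary.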

curl

curl https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py -O
curl https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json -O

wget

wget https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py
wget https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json

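Converting the model into a format optimized for standalone execution takes two steps. First, use PyTorch's export function to convert the PyTorch model into a platform-independent intermediate representation. Then use ExecuTorch's to_edge and to_executorch methods to prepare the model for on-device execution. This creates a .pte file that a desktop or mobile application can load at runtime.

Create a file called export_nanogpt.py with the following contents: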

# export_nanogpt.py

import torch

from executorch.exir import EdgeCompileConfig, to_edge
from torch.nn.attention import sdpa_kernel, SDPBackend
from torch.export import export, export_for_training

from model import GPT

# Load the model.
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), )

# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we set the range of 0th model input's 1st dimension as
# [0, model.config.block_size].
# See https://pytorch.com.tw/executorch/0.6/concepts#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
edge_config = EdgeCompileConfig(_check_ir_validity=False)
edge_manager = to_edge(traced_model,  compile_config=edge_config)
et_program = edge_manager.to_executorch()

# Save the ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)

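To export, run the script with python export_nanogpt.py (or python3, as appropriate for your environment). It will generate a nanogpt.pte file in the current directory.

For more information, see Exporting to ExecuTorch and torch.export.

Step 2. Invoking the Runtime

ExecuTorch provides a set of runtime APIs and types to load and run models.

Create a file called main.cpp with the following contents: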

// main.cpp

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

#include "basic_sampler.h"
#include "basic_tokenizer.h"

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>
#include <executorch/runtime/core/evalue.h>
#include <executorch/runtime/core/exec_aten/exec_aten.h>
#include <executorch/runtime/core/result.h>

using executorch::aten::ScalarType;
using executorch::aten::Tensor;
using executorch::extension::from_blob;
using executorch::extension::Module;
using executorch::runtime::EValue;
using executorch::runtime::Result;

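The model inputs and outputs take the form of tensors. A tensor can be thought of as a multi-dimensional array. The ExecuTorch EValue class provides a wrapper around tensors and other ExecuTorch data types.

Since the LLM generates one token at a time, the driver code needs to invoke the model repeatedly, building the output token by token. Each generated token is passed as input for the next run.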

// main.cpp

// The value of the gpt2 `<|endoftext|>` token.
#define ENDOFTEXT_TOKEN 50256

std::string generate(
    Module& llm_model,
    std::string& prompt,
    BasicTokenizer& tokenizer,
    BasicSampler& sampler,
    size_t max_input_length,
    size_t max_output_length) {
  // Convert the input text into a list of integers (tokens) that represents it,
  // using the string-to-token mapping that the model was trained on. Each token
  // is an integer that represents a word or part of a word.
  std::vector<int64_t> input_tokens = tokenizer.encode(prompt);
  std::vector<int64_t> output_tokens;

  for (auto i = 0u; i < max_output_length; i++) {
    // Convert the input_tokens from a vector of int64_t to EValue. EValue is a
    // unified data type in the ExecuTorch runtime.
    auto inputs = from_blob(
        input_tokens.data(),
        {1, static_cast<int>(input_tokens.size())},
        ScalarType::Long);

    // Run the model. It will return a tensor of logits (unnormalized scores).
    auto logits_evalue = llm_model.forward(inputs);

    // Convert the output logits from EValue to std::vector, which is what the
    // sampler expects.
    Tensor logits_tensor = logits_evalue.get()[0].toTensor();
    std::vector<float> logits(
        logits_tensor.data_ptr<float>(),
        logits_tensor.data_ptr<float>() + logits_tensor.numel());

    // Sample the next token from the logits.
    int64_t next_token = sampler.sample(logits);

    // Break if we reached the end of the text.
    if (next_token == ENDOFTEXT_TOKEN) {
      break;
    }

    // Add the next token to the output.
    output_tokens.push_back(next_token);

    std::cout << tokenizer.decode({next_token});
    std::cout.flush();

    // Update next input.
    input_tokens.push_back(next_token);
    if (input_tokens.size() > max_input_length) {
      input_tokens.erase(input_tokens.begin());
    }
  }

  std::cout << std::endl;

  // Convert the output tokens into a human-readable string.
  std::string output_string = tokenizer.decode(output_tokens);
  return output_string;
}
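
For readers who find the control flow easier to follow in Python, the loop in generate() can be sketched as below. This is only an illustration: next_token_fn and toy_model are hypothetical stand-ins for "run the model, then sample" and are not part of ExecuTorch. The sketch mirrors the same behaviour as the C++ code, including dropping the oldest token once the input exceeds the maximum input length.

```python
from typing import Callable, List

ENDOFTEXT_TOKEN = 50256  # gpt2 `<|endoftext|>` id, as in main.cpp


def generate_tokens(
    next_token_fn: Callable[[List[int]], int],
    prompt_tokens: List[int],
    max_input_length: int,
    max_output_length: int,
) -> List[int]:
    """Autoregressive loop mirroring generate() in main.cpp.

    next_token_fn is a hypothetical stand-in for "run the model, then sample".
    """
    input_tokens = list(prompt_tokens)
    output_tokens: List[int] = []
    for _ in range(max_output_length):
        next_token = next_token_fn(input_tokens)
        if next_token == ENDOFTEXT_TOKEN:
            break
        output_tokens.append(next_token)
        # Feed the new token back in, dropping the oldest token once the
        # input is longer than the allowed context.
        input_tokens.append(next_token)
        if len(input_tokens) > max_input_length:
            input_tokens.pop(0)
    return output_tokens


def toy_model(tokens: List[int]) -> int:
    """Hypothetical model: emits last token + 1 and stops once it sees 5."""
    return ENDOFTEXT_TOKEN if tokens[-1] >= 5 else tokens[-1] + 1


print(generate_tokens(toy_model, [1], max_input_length=4, max_output_length=30))
```

The toy model stops by returning the end-of-text token, which ends the loop before max_output_length is reached.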

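The Module class handles loading the .pte file and preparing for execution.

The tokenizer is responsible for converting from a human-readable string representation of the prompt to the numerical form expected by the model. To do this, the tokenizer associates short substrings with a given token ID. The tokens can be thought of as representing words or parts of words, though, in practice, they may be arbitrary sequences of characters.

The tokenizer loads the vocabulary from a file, which contains the mapping between each token ID and the text it represents. Call tokenizer.encode() and tokenizer.decode() to convert between string and token representations.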
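
To make the encode/decode round trip concrete, here is a minimal sketch of a vocabulary-based tokenizer in Python. It is a simplification under stated assumptions: the VOCAB table is a hypothetical miniature (the real vocab.json has about 50,000 entries), and the greedy longest-match rule is chosen for brevity. It is not the implementation in basic_tokenizer.h, nor GPT-2's byte-pair encoding.

```python
from typing import Dict, List

# Hypothetical miniature vocabulary mapping token text to token ID.
VOCAB: Dict[str, int] = {
    "Hello": 0, " world": 1, "!": 2, "H": 3, "e": 4, "l": 5, "o": 6,
}
ID_TO_TEXT: Dict[int, str] = {token_id: text for text, token_id in VOCAB.items()}


def encode(text: str) -> List[int]:
    """Greedy longest-match: take the longest vocab entry that prefixes the rest."""
    tokens: List[int] = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                tokens.append(VOCAB[piece])
                pos = end
                break
        else:
            # No vocabulary entry matches at this position.
            raise ValueError(f"no token covers {text[pos]!r}")
    return tokens


def decode(tokens: List[int]) -> str:
    """Concatenate the text of each token ID."""
    return "".join(ID_TO_TEXT[token] for token in tokens)


ids = encode("Hello world!")
print(ids, repr(decode(ids)))
```

Decoding is just concatenation, which is why printing one decoded token at a time, as generate() does, reproduces the full output string.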

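The sampler is responsible for selecting the next token based on the logits output by the model, that is, the unnormalized scores that a softmax would turn into probabilities. The LLM returns a logit value for each possible next token. The sampler chooses which token to use according to some strategy. The simplest approach, used here, is to take the token with the highest logit value.

Samplers can provide configurable options, such as a configurable amount of randomness in the output selection, penalties for repeated tokens, and biases to prioritize or de-prioritize specific tokens.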
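
The two ends of that spectrum can be sketched in a few lines of Python. The function names below are illustrative and are not the BasicSampler API. sample_greedy corresponds to the highest-logit strategy described above, while sample_with_temperature shows one common way to add configurable randomness, assuming a softmax over temperature-scaled logits.

```python
import math
import random
from typing import List, Optional


def sample_greedy(logits: List[float]) -> int:
    """Return the index of the largest logit (first one wins on ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_with_temperature(
    logits: List[float],
    temperature: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Draw an index from softmax(logits / temperature); temperature must be > 0.

    Lower temperatures concentrate probability on the top logit, higher
    temperatures flatten the distribution.
    """
    if rng is None:
        rng = random.Random()
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


print(sample_greedy([0.1, 2.5, -1.0]))
```

Passing a seeded random.Random makes the randomized sampler reproducible, which is useful when comparing outputs between runs.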

// main.cpp

int main() {
  // Set up the prompt. This provides the seed text for the model to elaborate.
  std::cout << "Enter model prompt: ";
  std::string prompt;
  std::getline(std::cin, prompt);

  // The tokenizer is used to convert between tokens (used by the model) and
  // human-readable strings.
  BasicTokenizer tokenizer("vocab.json");

  // The sampler is used to sample the next token from the logits.
  BasicSampler sampler = BasicSampler();

  // Load the exported nanoGPT program, which was generated via the previous
  // steps.
  Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors);

  const auto max_input_tokens = 1024;
  const auto max_output_tokens = 30;
  std::cout << prompt;
  generate(
      model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens);
}

Finally, download the following files into the same directory as main.cpp:

curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_sampler.h
curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_tokenizer.h

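To learn more, see the Runtime APIs Tutorial.

Building and Running

ExecuTorch uses the CMake build system. To compile and link against the ExecuTorch runtime, include the ExecuTorch project via add_subdirectory and link against executorch and additional dependencies.

Create a file named CMakeLists.txt with the following content: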

# CMakeLists.txt

cmake_minimum_required(VERSION 3.19)
project(nanogpt_runner)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

# Set options for executorch build.
option(EXECUTORCH_ENABLE_LOGGING "" ON)
option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON)
option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON)
option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON)
option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON)

# Include the executorch subdirectory.
add_subdirectory(
  ${CMAKE_CURRENT_SOURCE_DIR}/third-party/executorch
  ${CMAKE_BINARY_DIR}/executorch
)

add_executable(nanogpt_runner main.cpp)
target_link_libraries(
  nanogpt_runner
  PRIVATE executorch
          extension_module_static # Provides the Module class
          extension_tensor # Provides the TensorPtr class
          optimized_native_cpu_ops_lib # Provides baseline cross-platform
                                       # kernels
)

At this point, the working directory should contain the following files:

CMakeLists.txt
main.cpp
basic_tokenizer.h
basic_sampler.h
export_nanogpt.py
model.py
vocab.json
nanogpt.pte

If all of these are present, you can now build and run:

(mkdir cmake-out && cd cmake-out && cmake ..)
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner

You should see the message:

Enter model prompt:

Type some seed text for the model and press Enter. Here we use "Hello world!" as an example prompt:

Enter model prompt: Hello world!
Hello world!

I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in

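At this point, it is likely to run very slowly. This is because ExecuTorch hasn't been told to optimize for specific hardware (delegation), and because it is doing all of the calculations in 32-bit floating point (no quantization).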

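Delegation

While ExecuTorch provides a portable, cross-platform implementation for all operators, it also provides specialized backends for a number of different targets. These include, but are not limited to, x86 and ARM CPU acceleration via the XNNPACK backend, Apple acceleration via the Core ML backend and Metal Performance Shader (MPS) backend, and GPU acceleration via the Vulkan backend.

Because optimizations are specific to a given backend, each .pte file is specific to the backend(s) targeted at export. To support multiple devices, such as XNNPACK acceleration for Android and Core ML for iOS, export a separate .pte file for each backend.

To delegate a model to a specific backend during export, ExecuTorch uses the to_edge_transform_and_lower() function. This function takes the exported program from torch.export and a backend-specific partitioner object. The partitioner identifies parts of the computation graph that can be optimized by the target backend. Within to_edge_transform_and_lower(), the exported program is converted to an Edge dialect program. The partitioner then delegates compatible graph sections to the backend for acceleration and optimization. Any graph sections that are not delegated are executed by ExecuTorch's default operator implementations.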
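
As a rough mental model of what a partitioner does, the sketch below splits a flattened list of operator names into runs that a backend supports and runs that it does not. It is a conceptual illustration only: real ExecuTorch partitioners work on graphs rather than flat lists and apply backend-specific constraints, and the operator names and supported set used here are assumptions chosen for the example.

```python
from typing import List, Set, Tuple


def partition(ops: List[str], supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops into ("delegated" | "portable", [ops]) segments."""
    segments: List[Tuple[str, List[str]]] = []
    for op in ops:
        kind = "delegated" if op in supported else "portable"
        if segments and segments[-1][0] == kind:
            # Same kind as the previous op: extend the current segment.
            segments[-1][1].append(op)
        else:
            segments.append((kind, [op]))
    return segments


# Hypothetical operator sequence and backend capability set.
ops = ["view_copy", "addmm", "where", "softmax", "add"]
supported = {"view_copy", "addmm", "softmax", "add"}
for kind, segment in partition(ops, supported):
    print(kind, segment)
```

Each "delegated" segment corresponds loosely to one subgraph handed to the backend, and each "portable" segment to operators that fall back to the default implementations.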

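To delegate the exported model to a specific backend, first import its partitioner and edge compile config from the ExecuTorch codebase, then call to_edge_transform_and_lower.

Here's an example of how to delegate nanoGPT to XNNPACK (if you're deploying to an Android phone, for instance):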

# export_nanogpt.py

# Load partitioner for Xnnpack backend
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Model to be delegated to specific backend should use specific edge compile config
from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower

import torch
from torch.export import export
from torch.nn.attention import sdpa_kernel, SDPBackend
from torch.export import export_for_training

from model import GPT

# Load the nanoGPT model.
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (
        torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
    )

# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we set the range of 0th model input's 1st dimension as
# [0, model.config.block_size - 1].
# See https://pytorch.com.tw/executorch/0.6/concepts.html#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
# To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config
edge_config = get_xnnpack_edge_compile_config()
# Convert to an edge program, then delegate the exported model to the Xnnpack
# backend by calling `to_edge_transform_and_lower` with the Xnnpack partitioner.
edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config)
et_program = edge_manager.to_executorch()

# Save the Xnnpack-delegated ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)

Additionally, update CMakeLists.txt to build the XNNPACK backend and link it into the ExecuTorch runner.

cmake_minimum_required(VERSION 3.19)
project(nanogpt_runner)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

# Set options for executorch build.
option(EXECUTORCH_ENABLE_LOGGING "" ON)
option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON)
option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON)
option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON)
option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON)
option(EXECUTORCH_BUILD_XNNPACK "" ON) # Build with Xnnpack backend

# Include the executorch subdirectory.
add_subdirectory(
  ${CMAKE_CURRENT_SOURCE_DIR}/third-party/executorch
  ${CMAKE_BINARY_DIR}/executorch
)

add_executable(nanogpt_runner main.cpp)
target_link_libraries(
  nanogpt_runner
  PRIVATE executorch
          extension_module_static # Provides the Module class
          extension_tensor # Provides the TensorPtr class
          optimized_native_cpu_ops_lib # Provides baseline cross-platform
                                       # kernels
          xnnpack_backend # Provides the XNNPACK CPU acceleration backend
)

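Keep the rest of the code the same. For more details, refer to Exporting to ExecuTorch and Invoking the Runtime.

At this point, the working directory should contain the following files: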

CMakeLists.txt
main.cpp
basic_tokenizer.h
basic_sampler.h
export_nanogpt.py
model.py
vocab.json

If all of these are present, you can now export the XNNPACK-delegated .pte model:

python export_nanogpt.py

It will generate nanogpt.pte in the same working directory.

Then build and run the model:

(rm -rf cmake-out && mkdir cmake-out && cd cmake-out && cmake ..)
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner

You should see the message:

Enter model prompt:

Type some seed text for the model and press Enter. Here we use "Hello world!" as an example prompt:

Enter model prompt: Hello world!
Hello world!

I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in

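The delegated model should be noticeably faster than the non-delegated model.

For more information regarding backend delegation, see the ExecuTorch guides for the XNNPACK Backend, Core ML Backend, and Qualcomm AI Engine Direct Backend.

Quantization

Quantization refers to a set of techniques for running calculations and storing tensors using lower-precision types. Compared to 32-bit floating point, using 8-bit integers can provide both a significant speedup and a reduction in memory usage. There are many approaches to quantizing a model, varying in the amount of preprocessing required, the data types used, and the impact on model accuracy and performance.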
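
To illustrate the basic idea, the following sketch applies symmetric per-tensor 8-bit quantization to a short list of floats in plain Python. It is a worked example of the general technique under simple assumptions (one scale for the whole tensor, range [-127, 127]) and is not the scheme the XNNPACK quantizer applies.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(value) for value in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(value / scale))) for value in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


original = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(original)
restored = dequantize(quantized, scale)
print(quantized, scale)
print(restored)
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is the precision traded for storing one byte per value instead of four.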

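Because compute and memory are highly constrained on mobile devices, some form of quantization is necessary to ship large models on consumer electronics. In particular, large language models, such as Llama2, may require quantizing model weights to 4 bits or less.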
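
A back-of-the-envelope calculation shows why bit width matters. The sketch below estimates weight storage for the 124M-parameter GPT-2 model used in this manual at different precisions. It is an approximation that counts weights only and ignores quantization scales, activations, and file-format overhead.

```python
def weight_megabytes(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in MB (10**6 bytes), ignoring scales and metadata."""
    return num_params * bits_per_weight / 8 / 1e6


NUM_PARAMS = 124_000_000  # GPT-2 124M, the model used in this manual
for bits in (32, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_megabytes(NUM_PARAMS, bits):.0f} MB")
```

Under these assumptions the weights shrink from roughly 496 MB at 32 bits to 124 MB at 8 bits and 62 MB at 4 bits.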

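Leveraging quantization requires transforming the model before export. PyTorch provides the pt2e (PyTorch 2 Export) API for this purpose. This example targets CPU acceleration using the XNNPACK delegate, so it needs to use the XNNPACK-specific quantizer. Targeting a different backend requires the corresponding quantizer.

To use 8-bit integer dynamic quantization with the XNNPACK delegate, call prepare_pt2e, calibrate the model by running it with a representative input, and then call convert_pt2e. This updates the computational graph to use quantized operators where available.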

# export_nanogpt.py

from executorch.backends.transforms.duplicate_dynamic_quant_chain import (
    DuplicateDynamicQuantChainPass,
)
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    get_symmetric_quantization_config,
    XNNPACKQuantizer,
)
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Use dynamic, per-channel quantization.
xnnpack_quant_config = get_symmetric_quantization_config(
    is_per_channel=True, is_dynamic=True
)
xnnpack_quantizer = XNNPACKQuantizer()
xnnpack_quantizer.set_global(xnnpack_quant_config)

m = export_for_training(model, example_inputs).module()

# Annotate the model for quantization. This prepares the model for calibration.
m = prepare_pt2e(m, xnnpack_quantizer)

# Calibrate the model using representative inputs. This allows the quantization
# logic to determine the expected range of values in each tensor.
m(*example_inputs)

# Perform the actual quantization.
m = convert_pt2e(m, fold_quantize=False)
DuplicateDynamicQuantChainPass()(m)

traced_model = export(m, example_inputs)

Additionally, add or update the to_edge_transform_and_lower() call to use XnnpackPartitioner. This instructs ExecuTorch to optimize the model for CPU execution via the XNNPACK backend.

from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

edge_config = get_xnnpack_edge_compile_config()
# Convert to edge dialect and lower to XNNPack.
edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config)
et_program = edge_manager.to_executorch()

with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)

Then run:

python export_nanogpt.py
./cmake-out/nanogpt_runner

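For more information, see Quantization in ExecuTorch.

Profiling and Debugging

After lowering a model by calling to_edge_transform_and_lower(), you may want to see what got delegated and what didn't. ExecuTorch provides utility methods to give insight into the delegation. You can use this information to inspect the underlying computation and diagnose potential performance issues. Model authors can use this information to structure the model in a way that is compatible with the target backend.

Visualizing the Delegation

The get_delegation_info() method provides a summary of what happened to the model after the to_edge_transform_and_lower() call: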

from executorch.devtools.backend_debug import get_delegation_info
from tabulate import tabulate

# ... After call to to_edge_transform_and_lower(), but before to_executorch()
graph_module = edge_manager.exported_program().graph_module
delegation_info = get_delegation_info(graph_module)
print(delegation_info.get_summary())
df = delegation_info.get_operator_delegation_dataframe()
print(tabulate(df, headers="keys", tablefmt="fancy_grid"))

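For nanoGPT targeting the XNNPACK backend, you might see the following (note that the numbers below are for illustration purposes only and actual values may vary):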

Total  delegated  subgraphs:  145
Number  of  delegated  nodes:  350
Number  of  non-delegated  nodes:  760

	Operator type	# in delegated graphs	# in non-delegated graphs
0	aten__softmax_default	12	0
1	aten_add_tensor	37	0
2	aten_addmm_default	48	0
3	aten_any_dim	0	12
	…
25	aten_view_copy_default	96	122
	…
30	Total	350	760
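
As a quick sanity check on a summary like this, the share of delegated nodes can be computed from the counts. The sketch below simply redoes the arithmetic in plain Python with the illustrative numbers copied from the table; the OP_COUNTS dictionary is an assumption for the example and does not use the DelegationInfo object.

```python
from typing import Dict, Tuple

# (delegated, non-delegated) occurrence counts copied from the illustrative table.
OP_COUNTS: Dict[str, Tuple[int, int]] = {
    "aten__softmax_default": (12, 0),
    "aten_add_tensor": (37, 0),
    "aten_addmm_default": (48, 0),
    "aten_any_dim": (0, 12),
    "aten_view_copy_default": (96, 122),
}


def delegated_fraction(delegated: int, non_delegated: int) -> float:
    """Fraction of occurrences that ended up in delegated graphs."""
    total = delegated + non_delegated
    return delegated / total if total else 0.0


for name, (delegated, non_delegated) in OP_COUNTS.items():
    print(f"{name}: {delegated_fraction(delegated, non_delegated):.1%} delegated")
print(f"overall: {delegated_fraction(350, 760):.1%} delegated")
```

With these illustrative totals, a little under a third of the nodes are delegated, so the operators that remain non-delegated are natural candidates to investigate first.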

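From the table, the operator aten_view_copy_default appears 96 times in delegated graphs and 122 times in non-delegated graphs. To see more detail, use the format_delegated_graph() method to get a formatted string of the whole graph, or use print_delegated_graph() to print it directly.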

from executorch.exir.backend.utils import format_delegated_graph
graph_module = edge_manager.exported_program().graph_module
print(format_delegated_graph(graph_module))

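For large models, this may generate a large amount of output. Consider using "Control+F" or "Command+F" to locate the operator you're interested in (e.g. "aten_view_copy_default"). Observe which instances are not under lowered graphs.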
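
The same search can be scripted. The sketch below scans a text dump for lines that define a given operator and reports whether each one sits under a lowered graph. It relies on a heuristic assumption about the printed layout (lowered-graph lines are indented beneath a "lowered graph():" line), and the EXCERPT string is a hypothetical, abbreviated sample rather than real tool output.

```python
from typing import List, Tuple


def find_operator(graph_text: str, op_name: str) -> List[Tuple[int, bool]]:
    """Return (line_number, inside_lowered_graph) for each line defining op_name.

    Heuristic assumption: lines that belong to a lowered graph are indented
    below a "lowered graph():" line.
    """
    hits: List[Tuple[int, bool]] = []
    in_lowered = False
    for number, line in enumerate(graph_text.splitlines(), start=1):
        if line.strip() == "lowered graph():":
            in_lowered = True
            continue
        if in_lowered and line and not line.startswith((" ", "\t")):
            in_lowered = False  # back at the top-level graph
        # Only look at the left-hand side, where the node name is defined.
        if f"%{op_name}" in line.split("=")[0]:
            hits.append((number, in_lowered))
    return hits


# Hypothetical, abbreviated excerpt used only to exercise the function.
EXCERPT = """\
%aten_where_self_22 : [num_users=1] = call_function[target=aten.where.self]
%lowered_module_144 : [num_users=1] = get_attr[target=lowered_module_144]
backend_id: XnnpackBackend
lowered graph():
    %aten_view_copy_default : [num_users=1] = call_function[target=aten.view_copy.default]
    %aten_addmm_default : [num_users=1] = call_function[target=aten.addmm.default]
    return [aten_addmm_default]
"""

print(find_operator(EXCERPT, "aten_view_copy_default"))
print(find_operator(EXCERPT, "aten_where_self"))
```

Entries reported with False are the instances that were not delegated, which are the ones worth examining.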

In the fragment of the nanoGPT output below, observe that a transformer module has been delegated to XNNPACK while the where operator has not.

%aten_where_self_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.where.self](args = (%aten_logical_not_default_33, %scalar_tensor_23, %scalar_tensor_22), kwargs = {})
%lowered_module_144 : [num_users=1] = get_attr[target=lowered_module_144]
backend_id: XnnpackBackend
lowered graph():
    %p_transformer_h_0_attn_c_attn_weight : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_weight]
    %p_transformer_h_0_attn_c_attn_bias : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_bias]
    %getitem : [num_users=1] = placeholder[target=getitem]
    %sym_size : [num_users=2] = placeholder[target=sym_size]
    %aten_view_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%getitem, [%sym_size, 768]), kwargs = {})
    %aten_permute_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.permute_copy.default](args = (%p_transformer_h_0_attn_c_attn_weight, [1, 0]), kwargs = {})
    %aten_addmm_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.addmm.default](args = (%p_transformer_h_0_attn_c_attn_bias, %aten_view_copy_default, %aten_permute_copy_default), kwargs = {})
    %aten_view_copy_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%aten_addmm_default, [1, %sym_size, 2304]), kwargs = {})
    return [aten_view_copy_default_1]

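Performance Analysis

With the ExecuTorch Developer Tools, users can profile model execution, obtaining timing information for each operator in the model.

Prerequisites

ETRecord Generation (Optional)

An ETRecord is an artifact generated at export time that contains model graphs and source-level metadata linking the ExecuTorch program to the original PyTorch model. You can view all profiling events without an ETRecord, but with one you can also link each event to the type of operator being executed, the module hierarchy, and stack traces of the original PyTorch source code. For more information, see the ETRecord docs.

In your export script, after calling to_edge() and to_executorch(), call generate_etrecord() with the EdgeProgramManager from to_edge() and the ExecuTorchProgramManager from to_executorch(). Make sure to copy the EdgeProgramManager, as the call to to_edge_transform_and_lower() mutates the graph in-place.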

# export_nanogpt.py

import copy
from executorch.devtools import generate_etrecord

# Make the deep copy immediately after to_edge()
edge_manager_copy = copy.deepcopy(edge_manager)

# ...
# Generate ETRecord right after to_executorch()
etrecord_path = "etrecord.bin"
generate_etrecord(etrecord_path, edge_manager_copy, et_program)

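Run the export script and the ETRecord will be generated as etrecord.bin.

ETDump Generation

An ETDump is an artifact generated at runtime containing a trace of the model execution. For more information, see the ETDump docs.

Include the ETDump header and namespace in your code.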

// main.cpp

#include <executorch/devtools/etdump/etdump_flatcc.h>

using executorch::etdump::ETDumpGen;
using torch::executor::etdump_result;

Create an instance of the ETDumpGen class and pass it to the Module constructor.

std::unique_ptr<ETDumpGen> etdump_gen_ = std::make_unique<ETDumpGen>();
Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors, std::move(etdump_gen_));

After calling generate(), save the ETDump to a file. You can capture multiple model runs in a single trace, if desired.

ETDumpGen* etdump_gen = static_cast<ETDumpGen*>(model.event_tracer());

ET_LOG(Info, "ETDump size: %zu blocks", etdump_gen->get_num_blocks());
etdump_result result = etdump_gen->get_etdump_data();
if (result.buf != nullptr && result.size > 0) {
    // On a device with a file system, users can just write it to a file.
    // Open in binary mode, since the ETDump is a binary artifact.
    FILE* f = fopen("etdump.etdp", "wb");
    if (f != nullptr) {
      fwrite((uint8_t*)result.buf, 1, result.size, f);
      fclose(f);
    }
    free(result.buf);
}

Additionally, update CMakeLists.txt to build with the Developer Tools and enable events to be traced and logged into ETDump.

option(EXECUTORCH_ENABLE_EVENT_TRACER "" ON)
option(EXECUTORCH_BUILD_DEVTOOLS "" ON)

# ...

target_link_libraries(
    # ... omit existing ones
    etdump) # Provides event tracing and logging

target_compile_options(executorch PUBLIC -DET_EVENT_TRACER_ENABLED)
target_compile_options(portable_ops_lib PUBLIC -DET_EVENT_TRACER_ENABLED)

Build and run the runner; you will see a file named "etdump.etdp" generated. (Note that this time we build in release mode to get around a flatccrt build limitation.)

(rm -rf cmake-out && mkdir cmake-out && cd cmake-out && cmake -DCMAKE_BUILD_TYPE=Release ..)
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner

Analyze with the Inspector API

Once you've collected the debug artifacts, an ETDump and optionally an ETRecord, you can use the Inspector API to view performance information.

from executorch.devtools import Inspector

inspector = Inspector(etdump_path="etdump.etdp")
# If you also generated an ETRecord, then pass that in as well: `inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")`

with open("inspector_out.txt", "w") as file:
    inspector.print_data_tabular(file)

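This prints the performance data in tabular format to "inspector_out.txt", with each row being a profiling event.

To learn more about the Inspector and the rich functionality it provides, see the Inspector API Reference.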

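Custom Kernels

With the ExecuTorch custom operator APIs, custom operator and kernel authors can easily bring their kernels into PyTorch/ExecuTorch.

There are three steps to using custom kernels in ExecuTorch:

Write the custom kernel using ExecuTorch types.
Compile and link the custom kernel to both the AOT Python environment and the runtime binary.
Apply a source-to-source transformation that swaps an operator for the custom op.

For more information, see PyTorch Custom Operators and ExecuTorch Kernel Registration.

How to Build Mobile Apps

See the instructions for building and running LLMs using ExecuTorch on iOS and Android.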