Building and Running Llama 3 8B Instruct with the Qualcomm AI Engine Direct Backend

This tutorial demonstrates how to export Llama 3 8B Instruct for the Qualcomm AI Engine Direct backend and run the model on a Qualcomm device.

Prerequisites

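If you have not yet set up the ExecuTorch repo and development environment, follow Setting up ExecuTorch.
Read the Building and Running ExecuTorch with Qualcomm AI Engine Direct Backend page to understand how to export and run a model with the Qualcomm AI Engine Direct backend on a Qualcomm device.
Follow the README for executorch llama to learn how to run a llama model on mobile devices via ExecuTorch.
A Qualcomm device with 16GB RAM
- We are continuing to optimize memory usage so that devices with less memory can be supported.
Qualcomm AI Engine Direct SDK version 2.28.0 or above.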
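The SDK version requirement can be checked with a small version comparison. This is a minimal sketch: version_ge is a hypothetical helper, it relies on the version-aware ordering of sort -V (available in GNU coreutils), and how you obtain the installed SDK version string is left to you.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# It sorts the two versions with version-aware ordering and checks that the
# required version ($2) comes first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_ge "$installed_version" 2.28.0 || echo "QNN SDK too old"`, where installed_version is a placeholder for the version string of your SDK.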

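Instructions

Step 1: Prepare the model checkpoint and the optimized matrix from SpinQuant

For the Llama 3 tokenizer and checkpoint, refer to https://github.com/meta-llama/llama-models/blob/main/README.md for further instructions on how to download tokenizer.model, consolidated.00.pth, and params.json.
To get the optimized matrix, refer to SpinQuant on GitHub. You can download the optimized rotation matrices in the “Quantized Models” section. Choose LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0.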
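Before moving on to the export, it can help to confirm that the three downloaded files are present. The sketch below assumes they all sit in one directory, which the download instructions may or may not produce; check_llama3_files is a hypothetical helper, not part of ExecuTorch.

```shell
# Hypothetical helper: verify that the Llama 3 files named in Step 1 exist
# in the directory given as $1. Returns non-zero if any file is missing.
check_llama3_files() {
  dir="$1"
  missing=0
  for f in tokenizer.model consolidated.00.pth params.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Call it as `check_llama3_files <path_to_download_dir>` and fix any reported paths before running the export.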

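Step 2: Export to ExecuTorch with the Qualcomm AI Engine Direct backend

Deploying large language models such as Llama 3 on-device presents the following challenges:

The model is too large to fit entirely in device memory for inference.
Model loading and inference take a long time.
Quantization is difficult.

To address these challenges, we have implemented the following solutions:

Use --pt2e_quantize qnn_16a4w to quantize activations to 16 bits and weights to 4 bits, which reduces the on-disk model size and eases memory pressure during inference.
Use --num_sharding 8 to split the model into multiple shards.
Perform graph transformations to convert or decompose operations into forms better suited to the accelerator.
Use --optimized_rotation_path <path_to_optimized_matrix> to apply R1 and R2 from SpinQuant to improve accuracy.
Use --calibration_data "<|start_header_id|>system<|end_header_id|..." to ensure that, when quantizing Llama 3 8B Instruct, calibration includes the special tokens from the prompt template. For more details on the prompt template, refer to the model card of meta llama3 instruct.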
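As a back-of-the-envelope illustration of why 4-bit weights and sharding matter, the sketch below estimates raw weight storage. It assumes a round 8 billion parameters and an even split across shards, and it ignores quantization scales, any tensors kept at other precisions, activations, and the KV cache, so the numbers are order-of-magnitude only. weight_bytes is a hypothetical helper.

```shell
# Hypothetical helper: raw storage in bytes for $1 weights at $2 bits each.
weight_bytes() {
  echo $(( $1 * $2 / 8 ))
}

params=8000000000                                   # assumed round figure for an 8B model
echo "16-bit weights: $(weight_bytes "$params" 16) bytes"
echo " 4-bit weights: $(weight_bytes "$params" 4) bytes"
echo "per shard (8):  $(( $(weight_bytes "$params" 4) / 8 )) bytes"
```

Under these assumptions the weights shrink from about 16 GB to about 4 GB, and each of the 8 shards holds roughly 0.5 GB.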

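To export Llama 3 8B Instruct with the Qualcomm AI Engine Direct backend, make sure the following conditions are met:

The host machine has more than 100GB of memory (RAM + swap space).
The entire process takes a few hours.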
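On a Linux host, the combined RAM and swap figure can be read from /proc/meminfo. This is a minimal sketch under that assumption; total_mem_gib is a hypothetical helper, and it reports GiB, which is slightly larger than the GB unit used above.

```shell
# Hypothetical helper: sum MemTotal and SwapTotal (both reported in kB) from a
# meminfo-style file and print the total in GiB, rounded down.
# Defaults to /proc/meminfo; a different path can be passed as $1.
total_mem_gib() {
  meminfo="${1:-/proc/meminfo}"
  awk '/^(MemTotal|SwapTotal):/ { kb += $2 }
       END { printf "%d\n", kb / 1024 / 1024 }' "$meminfo"
}
```

Running `total_mem_gib` with no arguments prints the host total; if it is well below 100, consider adding swap space before starting the export.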

# Please note that calibration_data must include the prompt template for special tokens.
python -m examples.models.llama.export_llama -t <path_to_tokenizer.model> -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
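The calibration string follows the single-turn Llama 3 Instruct prompt template, and the same template is passed to --prompt in step 3.4. A minimal sketch of a helper that builds it from a system message and a user message is shown below; llama3_prompt is a hypothetical name. It emits the two-character sequence \n literally, matching how the commands in this tutorial pass the prompt inside double quotes.

```shell
# Hypothetical helper: build a single-turn Llama 3 Instruct prompt.
# $1: system message, $2: user message.
# The single-quoted pieces keep \n as a literal backslash followed by n,
# and printf's %s does not interpret it.
llama3_prompt() {
  printf '%s%s%s%s%s' \
    '<|start_header_id|>system<|end_header_id|>\n\n' "$1" \
    '<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n' "$2" \
    '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
}

# Example: reproduce the calibration string used in this tutorial.
CALIBRATION_DATA="$(llama3_prompt 'You are a funny chatbot.' 'Could you tell me about Facebook?')"
```

The resulting variable can then be passed as `--calibration_data "${CALIBRATION_DATA}"`, which keeps the export and runtime prompts in sync.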

Step 3: Invoke the runtime on an Android smartphone with a Qualcomm SoC

Build executorch with the Qualcomm AI Engine Direct backend for Android

cmake \
    -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DQNN_SDK_ROOT=${QNN_SDK_ROOT} \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out .

cmake --build cmake-android-out -j16 --target install --config Release
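Both cmake configurations in this step read ANDROID_NDK_ROOT from the environment, and the first also reads QNN_SDK_ROOT. If either is unset, the failure tends to surface late and indirectly, so a small guard can help. This is a minimal sketch; require_env is a hypothetical helper.

```shell
# Hypothetical helper: fail early if any of the named environment variables
# is unset or empty. Pass variable names (without $) as arguments.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}
```

Running `require_env ANDROID_NDK_ROOT QNN_SDK_ROOT` before the cmake commands stops with a clear message when a variable is missing.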

Build the llama runner for Android

    cmake \
        -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}"/build/cmake/android.toolchain.cmake  \
        -DANDROID_ABI=arm64-v8a \
        -DCMAKE_INSTALL_PREFIX=cmake-android-out \
        -DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
        -DEXECUTORCH_BUILD_QNN=ON \
        -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
        -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
        -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
        -Bcmake-android-out/examples/models/llama examples/models/llama

    cmake --build cmake-android-out/examples/models/llama -j16 --config Release

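Run on Android via adb shell

Prerequisite: Make sure you have enabled USB debugging in the developer options on your phone.

3.1 Connect your Android phone

3.2 Push the required QNN libraries to the device.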

# Make sure you have write permission on the path below.
DEVICE_DIR=/data/local/tmp/llama
adb shell mkdir -p ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV69Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v69/unsigned/libQnnHtpV69Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so ${DEVICE_DIR}
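The commands above push the Stub and Skel libraries for three Hexagon versions (V69, V73, V75). If you already know which version your SoC uses, the list can be narrowed. The sketch below assumes a device needs only the pair matching its own Hexagon version, and it derives the paths from the file-name pattern above; qnn_htp_libs is a hypothetical helper.

```shell
# Hypothetical helper: print the SDK-relative paths of the QNN libraries for
# one Hexagon version ($1, e.g. 69, 73, or 75), following the naming pattern
# of the adb push commands in this step.
qnn_htp_libs() {
  printf '%s\n' \
    "lib/aarch64-android/libQnnHtp.so" \
    "lib/aarch64-android/libQnnSystem.so" \
    "lib/aarch64-android/libQnnHtpV${1}Stub.so" \
    "lib/hexagon-v${1}/unsigned/libQnnHtpV${1}Skel.so"
}

# Possible usage, assuming QNN_SDK_ROOT and DEVICE_DIR are set as above:
#   qnn_htp_libs 75 | while read -r f; do
#     adb push "${QNN_SDK_ROOT}/$f" "${DEVICE_DIR}"
#   done
```

When in doubt about the version, pushing all three sets as shown above remains the safer choice.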

3.3 Upload the model, tokenizer, and llama runner binary to the phone

adb push <model.pte> ${DEVICE_DIR}
adb push <tokenizer.model> ${DEVICE_DIR}
adb push cmake-android-out/lib/libqnn_executorch_backend.so ${DEVICE_DIR}
adb push cmake-android-out/examples/models/llama/llama_main ${DEVICE_DIR}

3.4 Run the model

adb shell "cd ${DEVICE_DIR} && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\" --seq_len 128"

You should see output similar to the following:

<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello! I'd be delighted to chat with you about Facebook. Facebook is a social media platform that was created in 2004 by Mark Zuckerberg and his colleagues while he was a student at Harvard University. It was initially called "Facemaker" but later changed to Facebook, which is a combination of the words "face" and "book". The platform was initially intended for people to share their thoughts and share information with their friends, but it quickly grew to become one of the

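What is coming?

Improve the performance of Llama 3 Instruct
Reduce memory pressure during inference to support 12GB Qualcomm devices
Support more LLMs

FAQ

If you encounter any issues while reproducing this tutorial, please file a GitHub issue on the ExecuTorch repo and use the #qcom_aisw tag.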