注意

點選此處下載完整的示例程式碼

PyTorch 中 `non_blocking` 和 `pin_memory()` 的良好使用指南¶

建立日期：2024 年 7 月 31 日 | 最後更新：2025 年 3 月 18 日 | 最後驗證：2024 年 11 月 5 日

引言¶

在許多 PyTorch 應用中，將資料從 CPU 傳輸到 GPU 是基礎操作。對於使用者來說，理解裝置之間資料傳輸最有效的工具和選項至關重要。本教程探討了 PyTorch 中兩種關鍵的裝置間資料傳輸方法：pin_memory() 和使用 non_blocking=True 選項的 to()。

您將學到什麼¶

透過非同步傳輸和記憶體固定可以最佳化張量從 CPU 到 GPU 的傳輸。然而，有一些重要的注意事項：

使用 tensor.pin_memory().to(device, non_blocking=True) 可能比直接使用 tensor.to(device) 慢一倍。
通常，tensor.to(device, non_blocking=True) 是提高傳輸速度的有效選擇。
雖然 cpu_tensor.to("cuda", non_blocking=True).mean() 可以正確執行，但嘗試使用 cuda_tensor.to("cpu", non_blocking=True).mean() 將導致錯誤的輸出。

前言¶

本教程中報告的效能取決於用於構建教程的系統。儘管結論適用於不同的系統，但具體觀察結果可能會因可用硬體而略有差異，尤其是在較舊的硬體上。本教程的主要目標是提供一個理論框架，以理解 CPU 到 GPU 的資料傳輸。然而，任何設計決策都應根據具體情況量身定製，並以基準測試的吞吐量測量以及手頭任務的具體要求為指導。

import torch

assert torch.cuda.is_available(), "A cuda device is required to run this tutorial"

本教程需要安裝 tensordict。如果你的環境中還沒有 tensordict，請在單獨的單元格中執行以下命令進行安裝

# Install tensordict with the following command
!pip3 install tensordict

我們首先概述圍繞這些概念的理論，然後轉向具體的功能測試示例。

背景知識¶

記憶體管理基礎知識¶

當在 PyTorch 中建立一個 CPU 張量時，張量的內容需要被放入記憶體。這裡談論的記憶體是一個相當複雜的概念，值得仔細研究。我們區分由記憶體管理單元處理的兩種型別的記憶體：RAM（為簡單起見）和磁碟上的交換空間（可能是硬碟或不是）。硬碟和 RAM（物理記憶體）中的可用空間共同構成了虛擬記憶體，它是可用總資源的抽象。簡而言之，虛擬記憶體使得可用空間比孤立的 RAM 更大，並建立了主記憶體比實際更大的幻覺。

在正常情況下，常規的 CPU 張量是可分頁的（pageable），這意味著它被分成稱為頁（pages）的塊，這些塊可以存在於虛擬記憶體的任何地方（無論是 RAM 還是磁碟）。正如前面提到的，這有一個優勢，就是記憶體看起來比實際的主記憶體更大。

通常，當程式訪問一個不在 RAM 中的頁時，會發生“頁錯誤”（page fault），然後作業系統 (OS) 會將該頁帶回 RAM（稱為“換入”或“頁入”）。反過來，作業系統可能不得不“換出”或“頁出”另一頁，以便為新頁騰出空間。

與可分頁記憶體相反，固定（pinned）記憶體（也稱為頁鎖定（page-locked）或不可分頁（non-pageable）記憶體）是一種不能被換出到磁碟的記憶體型別。它允許更快、更可預測的訪問時間，但缺點是它比可分頁記憶體（即主記憶體）更有限。

CUDA 與（不可）分頁記憶體¶

為了理解 CUDA 如何將張量從 CPU 複製到 CUDA，我們考慮以上兩種情況：

如果記憶體是頁鎖定的，裝置可以直接訪問主記憶體中的資料。記憶體地址是明確定義的，需要讀取這些資料的函式可以顯著加速。
如果記憶體是可分頁的，所有頁必須先被帶到主記憶體，然後才能傳送到 GPU。此操作可能需要時間，並且不如在頁鎖定張量上執行時可預測。

更精確地說，當 CUDA 將可分頁資料從 CPU 傳送到 GPU 時，它必須首先建立該資料的頁鎖定副本，然後才能進行傳輸。

使用 `non_blocking=True` 的非同步 vs. 同步操作 (CUDA `cudaMemcpyAsync`)¶

在執行從主機（例如 CPU）到裝置（例如 GPU）的複製時，CUDA 工具包提供了相對於主機同步或非同步執行這些操作的模式。

實際上，在呼叫 to() 時，PyTorch 總是會呼叫 cudaMemcpyAsync。如果 non_blocking=False（預設值），則在每次 cudaMemcpyAsync 呼叫後都會呼叫 cudaStreamSynchronize，這使得對 to() 的呼叫在主執行緒中是阻塞的。如果 non_blocking=True，則不會觸發同步，並且主機上的主執行緒不會被阻塞。因此，從主機的角度來看，可以同時將多個張量傳送到裝置，因為執行緒無需等待一個傳輸完成即可啟動另一個。

注意

通常，傳輸在裝置端是阻塞的（即使在主機端不是）：裝置上的複製不能在執行另一個操作時發生。然而，在某些高階場景中，複製和核心執行可以在 GPU 端同時進行。如下例所示，必須滿足三個要求才能啟用此功能：

裝置必須至少有一個空閒的 DMA（直接記憶體訪問）引擎。Volterra、Tesla 或 H100 等現代 GPU 架構擁有多個 DMA 引擎。
傳輸必須在單獨的非預設 cuda 流上完成。在 PyTorch 中，可以使用 Stream 處理 cuda 流。
源資料必須位於固定記憶體中。

我們透過對以下指令碼執行效能分析來演示這一點。

import contextlib

from torch.cuda import Stream


s = Stream()

torch.manual_seed(42)
t1_cpu_pinned = torch.randn(1024**2 * 5, pin_memory=True)
t2_cpu_paged = torch.randn(1024**2 * 5, pin_memory=False)
t3_cuda = torch.randn(1024**2 * 5, device="cuda:0")

assert torch.cuda.is_available()
device = torch.device("cuda", torch.cuda.current_device())


# The function we want to profile
def inner(pinned: bool, streamed: bool):
    with torch.cuda.stream(s) if streamed else contextlib.nullcontext():
        if pinned:
            t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)
        else:
            t2_cuda = t2_cpu_paged.to(device, non_blocking=True)
        t_star_cuda_h2d_event = s.record_event()
    # This operation can be executed during the CPU to GPU copy if and only if the tensor is pinned and the copy is
    #  done in the other stream
    t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda
    t3_cuda_h2d_event = torch.cuda.current_stream().record_event()
    t_star_cuda_h2d_event.synchronize()
    t3_cuda_h2d_event.synchronize()


# Our profiler: profiles the `inner` function and stores the results in a .json file
def benchmark_with_profiler(
    pinned,
    streamed,
) -> None:
    torch._C._profiler._set_cuda_sync_enabled_val(True)
    wait, warmup, active = 1, 1, 2
    num_steps = wait + warmup + active
    rank = 0
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(
            wait=wait, warmup=warmup, active=active, repeat=1, skip_first=1
        ),
    ) as prof:
        for step_idx in range(1, num_steps + 1):
            inner(streamed=streamed, pinned=pinned)
            if rank is None or rank == 0:
                prof.step()
    prof.export_chrome_trace(f"trace_streamed{int(streamed)}_pinned{int(pinned)}.json")

在 chrome (chrome://tracing) 中載入這些效能分析跟蹤，結果如下：首先，讓我們看看如果可分頁張量在主流中傳送到 GPU 後執行 t3_cuda 的算術操作時會發生什麼

benchmark_with_profiler(streamed=False, pinned=False)

使用固定張量並不會顯著改變跟蹤，兩個操作仍然是連續執行的

benchmark_with_profiler(streamed=False, pinned=True)

在單獨的流上將可分頁張量傳送到 GPU 也是一個阻塞操作

benchmark_with_profiler(streamed=True, pinned=False)

只有固定張量複製到 GPU 在單獨流上進行時，才能與在主流上執行的另一個 cuda kernel 重疊

benchmark_with_profiler(streamed=True, pinned=True)

PyTorch 視角¶

`pin_memory()`¶

PyTorch 提供了透過 pin_memory() 方法和建構函式引數建立張量併發送到頁鎖定記憶體的可能性。在 CUDA 初始化了的機器上，CPU 張量可以透過 pin_memory() 方法轉換為固定記憶體。重要的是，pin_memory 在主機的主執行緒上是阻塞的：它會等待張量複製到頁鎖定記憶體後再執行下一個操作。新張量可以使用 zeros()、ones() 等函式和其它建構函式直接在固定記憶體中建立。

讓我們檢查一下固定記憶體併發送張量到 CUDA 的速度

import torch
import gc
from torch.utils.benchmark import Timer
import matplotlib.pyplot as plt


def timer(cmd):
    median = (
        Timer(cmd, globals=globals())
        .adaptive_autorange(min_run_time=1.0, max_run_time=20.0)
        .median
        * 1000
    )
    print(f"{cmd}: {median: 4.4f} ms")
    return median


# A tensor in pageable memory
pageable_tensor = torch.randn(1_000_000)

# A tensor in page-locked (pinned) memory
pinned_tensor = torch.randn(1_000_000, pin_memory=True)

# Runtimes:
pageable_to_device = timer("pageable_tensor.to('cuda:0')")
pinned_to_device = timer("pinned_tensor.to('cuda:0')")
pin_mem = timer("pageable_tensor.pin_memory()")
pin_mem_to_device = timer("pageable_tensor.pin_memory().to('cuda:0')")

# Ratios:
r1 = pinned_to_device / pageable_to_device
r2 = pin_mem_to_device / pageable_to_device

# Create a figure with the results
fig, ax = plt.subplots()

xlabels = [0, 1, 2]
bar_labels = [
    "pageable_tensor.to(device) (1x)",
    f"pinned_tensor.to(device) ({r1:4.2f}x)",
    f"pageable_tensor.pin_memory().to(device) ({r2:4.2f}x)"
    f"\npin_memory()={100*pin_mem/pin_mem_to_device:.2f}% of runtime.",
]
values = [pageable_to_device, pinned_to_device, pin_mem_to_device]
colors = ["tab:blue", "tab:red", "tab:orange"]
ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime (pin-memory)")
ax.set_xticks([])
ax.legend()

plt.show()

# Clear tensors
del pageable_tensor, pinned_tensor
_ = gc.collect()

我們可以觀察到，將固定記憶體張量轉換為 GPU 確實比可分頁張量快得多，因為在底層，可分頁張量在傳送到 GPU 之前必須先複製到固定記憶體中。

然而，與一些普遍的看法相反，在將可分頁張量轉換為 GPU 之前呼叫 pin_memory() 不會帶來任何顯著的速度提升，相反，這個呼叫通常比直接執行傳輸更慢。這是有道理的，因為我們實際上是在要求 Python 執行一個 CUDA 在從主機複製資料到裝置之前無論如何都會執行的操作。

注意

pin_memory 的 PyTorch 實現依賴於透過 cudaHostAlloc 在固定記憶體中建立一個全新的儲存，在極少數情況下，它可能比 cudaMemcpy 分塊傳輸資料更快。同樣，觀察結果可能因可用硬體、要傳送的張量大小或可用 RAM 量而異。

`non_blocking=True`¶

如前所述，許多 PyTorch 操作可以透過 non_blocking 引數選擇相對於主機非同步執行。

在這裡，為了準確地說明使用 non_blocking 的好處，我們將設計一個稍微複雜的實驗，因為我們想評估在呼叫和不呼叫 non_blocking 的情況下，將多個張量傳送到 GPU 的速度有多快。

# A simple loop that copies all tensors to cuda
def copy_to_device(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0"))
    return result


# A loop that copies all tensors to cuda asynchronously
def copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0", non_blocking=True))
    # We need to synchronize
    torch.cuda.synchronize()
    return result


# Create a list of tensors
tensors = [torch.randn(1000) for _ in range(1000)]
to_device = timer("copy_to_device(*tensors)")
to_device_nonblocking = timer("copy_to_device_nonblocking(*tensors)")

# Ratio
r1 = to_device_nonblocking / to_device

# Plot the results
fig, ax = plt.subplots()

xlabels = [0, 1]
bar_labels = [f"to(device) (1x)", f"to(device, non_blocking=True) ({r1:4.2f}x)"]
colors = ["tab:blue", "tab:red"]
values = [to_device, to_device_nonblocking]

ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime (non-blocking)")
ax.set_xticks([])
ax.legend()

plt.show()

為了更好地瞭解這裡發生的事情，讓我們對這兩個函式進行效能分析

from torch.profiler import profile, ProfilerActivity


def profile_mem(cmd):
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        exec(cmd)
    print(cmd)
    print(prof.key_averages().table(row_limit=10))

首先，讓我們看看使用常規 to(device) 時的呼叫棧

print("Call to `to(device)`", profile_mem("copy_to_device(*tensors)"))

現在是 non_blocking 版本

print(
    "Call to `to(device, non_blocking=True)`",
    profile_mem("copy_to_device_nonblocking(*tensors)"),
)

毫無疑問，使用 non_blocking=True 時結果更好，因為所有傳輸都在主機端同時啟動，並且只進行一次同步。

收益將取決於張量的數量和大小以及所使用的硬體。

注意

有趣的是，阻塞的 to("cuda") 實際上執行了與 non_blocking=True 版本相同的非同步裝置轉換操作 (cudaMemcpyAsync)，只是在每次複製後都有一個同步點。

協同效應¶

既然我們已經指出，將已在固定記憶體中的張量傳輸到 GPU 比從可分頁記憶體傳輸更快，並且我們知道非同步執行這些傳輸也比同步執行更快，那麼我們可以對這些方法的組合進行基準測試。首先，讓我們編寫幾個新函式，它們將對每個張量呼叫 pin_memory 和 to(device)

def pin_copy_to_device(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0"))
    return result


def pin_copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0", non_blocking=True))
    # We need to synchronize
    torch.cuda.synchronize()
    return result

對於較大的張量批次，使用 pin_memory() 的好處更加明顯

tensors = [torch.randn(1_000_000) for _ in range(1000)]
page_copy = timer("copy_to_device(*tensors)")
page_copy_nb = timer("copy_to_device_nonblocking(*tensors)")

tensors_pinned = [torch.randn(1_000_000, pin_memory=True) for _ in range(1000)]
pinned_copy = timer("copy_to_device(*tensors_pinned)")
pinned_copy_nb = timer("copy_to_device_nonblocking(*tensors_pinned)")

pin_and_copy = timer("pin_copy_to_device(*tensors)")
pin_and_copy_nb = timer("pin_copy_to_device_nonblocking(*tensors)")

# Plot
strategies = ("pageable copy", "pinned copy", "pin and copy")
blocking = {
    "blocking": [page_copy, pinned_copy, pin_and_copy],
    "non-blocking": [page_copy_nb, pinned_copy_nb, pin_and_copy_nb],
}

x = torch.arange(3)
width = 0.25
multiplier = 0


fig, ax = plt.subplots(layout="constrained")

for attribute, runtimes in blocking.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, runtimes, width, label=attribute)
    ax.bar_label(rects, padding=3, fmt="%.2f")
    multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel("Runtime (ms)")
ax.set_title("Runtime (pin-mem and non-blocking)")
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(strategies)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
ax.legend(loc="upper left", ncols=3)

plt.show()

del tensors, tensors_pinned
_ = gc.collect()

其他複製方向 (GPU -> CPU, CPU -> MPS)¶

到目前為止，我們一直假設從 CPU 到 GPU 的非同步複製是安全的。這通常是正確的，因為 CUDA 會自動處理同步，以確保在讀取時訪問的資料是有效的，特別是在張量位於可分頁記憶體中時。

然而，在其他情況下，我們不能做出同樣的假設：當張量被放置在固定記憶體中時，在呼叫主機到裝置傳輸後修改原始副本可能會損壞 GPU 上接收到的資料。類似地，當傳輸方向相反，從 GPU 到 CPU，或從任何非 CPU 或 GPU 的裝置到任何非 CUDA 處理的 GPU 裝置（例如 MPS）時，如果沒有顯式同步，則無法保證在 GPU 上讀取的資料是有效的。

在這些場景中，這些傳輸不能保證在資料訪問時複製已完成。因此，主機上的資料可能不完整或不正確，實際上變成了無效資料（garbage）。

我們首先用一個固定記憶體張量來演示這一點

DELAY = 100000000
try:
    i = -1
    for i in range(100):
        # Create a tensor in pin-memory
        cpu_tensor = torch.ones(1024, 1024, pin_memory=True)
        torch.cuda.synchronize()
        # Send the tensor to CUDA
        cuda_tensor = cpu_tensor.to("cuda", non_blocking=True)
        torch.cuda._sleep(DELAY)
        # Corrupt the original tensor
        cpu_tensor.zero_()
        assert (cuda_tensor == 1).all()
    print("No test failed with non_blocking and pinned tensor")
except AssertionError:
    print(f"{i}th test failed with non_blocking and pinned tensor. Skipping remaining tests")

使用可分頁張量總是有效的

i = -1
for i in range(100):
    # Create a tensor in pageable memory
    cpu_tensor = torch.ones(1024, 1024)
    torch.cuda.synchronize()
    # Send the tensor to CUDA
    cuda_tensor = cpu_tensor.to("cuda", non_blocking=True)
    torch.cuda._sleep(DELAY)
    # Corrupt the original tensor
    cpu_tensor.zero_()
    assert (cuda_tensor == 1).all()
print("No test failed with non_blocking and pageable tensor")

現在讓我們演示一下，CUDA 到 CPU 的傳輸如果沒有同步，也無法產生可靠的輸出

tensor = (
    torch.arange(1, 1_000_000, dtype=torch.double, device="cuda")
    .expand(100, 999999)
    .clone()
)
torch.testing.assert_close(
    tensor.mean(), torch.tensor(500_000, dtype=torch.double, device="cuda")
), tensor.mean()
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.testing.assert_close(
            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)
        )
    print("No test failed with non_blocking")
except AssertionError:
    print(f"{i}th test failed with non_blocking. Skipping remaining tests")
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.cuda.synchronize()
        torch.testing.assert_close(
            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)
        )
    print("No test failed with synchronize")
except AssertionError:
    print(f"One test failed with synchronize: {i}th assertion!")

通常，只有當目標是支援 CUDA 的裝置且原始張量位於可分頁記憶體中時，非同步複製到裝置才可以在沒有顯式同步的情況下是安全的。

總而言之，使用 non_blocking=True 從 CPU 複製資料到 GPU 是安全的，但對於任何其他方向，仍然可以使用 non_blocking=True，但使用者必須確保在訪問資料之前執行裝置同步。

實用建議¶

現在我們可以根據我們的觀察總結一些初步建議

總的來說，無論原始張量是否在固定記憶體中，non_blocking=True 都能提供良好的吞吐量。如果張量已經在固定記憶體中，傳輸可以加速，但從 Python 主執行緒手動將其傳送到固定記憶體是一個主機上的阻塞操作，因此會抵消使用 non_blocking=True 的大部分好處（因為 CUDA 無論如何都會執行 pin_memory 傳輸）。

現在你可能會合理地問，pin_memory() 方法有什麼用。在下一節中，我們將進一步探討如何使用它來進一步加速資料傳輸。

額外考量¶

眾所周知，PyTorch 提供了一個 DataLoader 類，其建構函式接受 pin_memory 引數。考慮到我們之前關於 pin_memory 的討論，你可能想知道如果記憶體固定本身是阻塞的，DataLoader 是如何加速資料傳輸的。

關鍵在於 DataLoader 使用了一個單獨的執行緒來處理從可分頁記憶體到固定記憶體的資料傳輸，從而防止了主執行緒的任何阻塞。

為了說明這一點，我們將使用來自同名庫的 TensorDict 原語。呼叫 to() 時，預設行為是非同步將張量傳送到裝置，然後進行一次 torch.device.synchronize() 呼叫。

此外，TensorDict.to() 包含一個 non_blocking_pin 選項，該選項會啟動多個執行緒在繼續執行 to(device) 之前執行 pin_memory()。這種方法可以進一步加速資料傳輸，如下例所示。

from tensordict import TensorDict
import torch
from torch.utils.benchmark import Timer
import matplotlib.pyplot as plt

# Create the dataset
td = TensorDict({str(i): torch.randn(1_000_000) for i in range(1000)})

# Runtimes
copy_blocking = timer("td.to('cuda:0', non_blocking=False)")
copy_non_blocking = timer("td.to('cuda:0')")
copy_pin_nb = timer("td.to('cuda:0', non_blocking_pin=True, num_threads=0)")
copy_pin_multithread_nb = timer("td.to('cuda:0', non_blocking_pin=True, num_threads=4)")

# Rations
r1 = copy_non_blocking / copy_blocking
r2 = copy_pin_nb / copy_blocking
r3 = copy_pin_multithread_nb / copy_blocking

# Figure
fig, ax = plt.subplots()

xlabels = [0, 1, 2, 3]
bar_labels = [
    "Blocking copy (1x)",
    f"Non-blocking copy ({r1:4.2f}x)",
    f"Blocking pin, non-blocking copy ({r2:4.2f}x)",
    f"Non-blocking pin, non-blocking copy ({r3:4.2f}x)",
]
values = [copy_blocking, copy_non_blocking, copy_pin_nb, copy_pin_multithread_nb]
colors = ["tab:blue", "tab:red", "tab:orange", "tab:green"]

ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime")
ax.set_xticks([])
ax.legend()

plt.show()

在此示例中，我們正在將許多大型張量從 CPU 傳輸到 GPU。這種情況非常適合利用多執行緒的 pin_memory()，這可以顯著提高效能。然而，如果張量很小，與多執行緒相關的開銷可能超過其帶來的好處。同樣，如果張量數量很少，在單獨執行緒上固定張量的優勢也變得有限。

另外需要注意的是，雖然在固定記憶體中建立永久緩衝區用於在將張量傳輸到 GPU 之前從可分頁記憶體中轉移張量可能看起來很有優勢，但這種策略不一定會加速計算。將資料複製到固定記憶體引起的固有瓶頸仍然是一個限制因素。

此外，將駐留在磁碟上（無論是在共享記憶體還是檔案中）的資料傳輸到 GPU 通常需要一箇中間步驟，即將資料複製到固定記憶體（位於 RAM 中）。在這種情況下，對大型資料傳輸利用 non_blocking 可能會顯著增加 RAM 消耗，從而可能導致不利影響。

實際上，沒有一刀切的解決方案。使用多執行緒 pin_memory 並結合 non_blocking 傳輸的有效性取決於多種因素，包括具體的系統、作業系統、硬體以及正在執行的任務的性質。以下是在嘗試加速 CPU 和 GPU 之間的資料傳輸或比較不同場景下的吞吐量時需要檢查的因素列表：

可用核心數量

有多少 CPU 核心可用？系統是否與其他可能爭奪資源的使用者或程序共享？
核心利用率

CPU 核心是否被其他程序大量佔用？應用程式是否在資料傳輸的同時併發執行其他 CPU 密集型任務？
記憶體利用率

當前使用了多少可分頁記憶體和頁鎖定記憶體？是否有足夠的可用記憶體來分配額外的固定記憶體而不影響系統性能？記住天下沒有免費的午餐，例如 pin_memory 會消耗 RAM 並可能影響其他任務。
CUDA 裝置能力

GPU 是否支援多個 DMA 引擎進行併發資料傳輸？正在使用的 CUDA 裝置有哪些具體能力和限制？
待發送張量數量

典型操作中傳輸了多少張量？
待發送張量大小

正在傳輸的張量的大小是多少？少量大張量或大量小張量可能無法從同一傳輸程式中獲益。
系統架構

系統架構如何影響資料傳輸速度（例如，匯流排速度、網路延遲）？

此外，在鎖定記憶體中分配大量張量或大尺寸張量可能會佔用相當一部分 RAM。這減少了用於其他關鍵操作（例如分頁）的可用記憶體，從而對演算法的整體效能產生負面影響。

結論¶

在本教程中，我們探討了將張量從主機發送到裝置時影響傳輸速度和記憶體管理的幾個關鍵因素。我們瞭解到，使用 non_blocking=True 通常會加速資料傳輸，並且 pin_memory() 如果正確實現，也可以提升效能。然而，這些技術需要精心設計和校準才能有效。

請記住，對程式碼進行效能分析並密切關注記憶體消耗對於最佳化資源使用和實現最佳效能至關重要。

其他資源¶

如果您在使用 CUDA 裝置時遇到記憶體複製問題，或者想了解更多關於本教程討論的內容，請查閱以下參考資料

指令碼總執行時間： ( 0 minutes 0.000 seconds)

由 Sphinx-Gallery 生成的相簿

PyTorch 中 non_blocking 和 pin_memory() 的良好使用指南¶

引言¶