瞭解 CUDA 記憶體使用量¶

為了除錯 CUDA 記憶體使用量，PyTorch 提供了一種產生記憶體快照的方法，可以記錄任何時間點已分配的 CUDA 記憶體狀態，並選擇性地記錄導致該快照的分配事件歷史記錄。

然後，產生的快照可以拖放到託管在 pytorch.org/memory_viz 的互動式檢視器上，可以使用該檢視器來探索快照。

產生快照¶

記錄快照的常見模式是啟用記憶體歷史記錄，執行要觀察的程式碼，然後將快照儲存到檔案中。

# enable memory history, which will
# add tracebacks and event history to snapshots
torch.cuda.memory._record_memory_history()

run_your_code()
torch.cuda.memory._dump_snapshot("my_snapshot.pickle")

使用視覺化工具¶

開啟 pytorch.org/memory_viz 並將快照檔案拖放到視覺化工具中。視覺化工具是一個在您的電腦上本地執行的 JavaScript 應用程式。它不會上傳任何快照資料。

活動記憶體時間軸¶

活動記憶體時間軸顯示了特定 GPU 上快照中所有活動張量隨時間變化的情況。在圖表上平移/縮放以查看較小的分配。將滑鼠懸停在已分配的區塊上，以查看分配該區塊時的堆疊追蹤，以及地址等詳細資訊。當資料量很大時，可以調整細節滑桿以呈現較少的分配並提高效能。

分配器狀態歷史記錄¶

分配器狀態歷史記錄在左側的時間軸中顯示了個別的分配器事件。在時間軸中選擇一個事件，以查看該事件發生時分配器狀態的視覺化摘要。此摘要顯示了從 cudaMalloc 返回的每個區段，以及如何將其劃分為個別分配或可用空間的區塊。將滑鼠懸停在區段和區塊上，以查看分配記憶體時的堆疊追蹤。將滑鼠懸停在事件上，以查看事件發生時的堆疊追蹤，例如釋放張量時。記憶體不足錯誤會報告為 OOM 事件。查看 OOM 期間的記憶體狀態可能會讓您瞭解為什麼即使保留的記憶體仍然存在，分配仍然失敗。

堆疊追蹤資訊還會報告分配發生的地址。地址 b7f064c000000_0 是指地址 7f064c000000 處的 (b) 區塊，這是該地址第“_0”次被分配。可以在活動記憶體時間軸中查找此唯一字串，並在活動狀態歷史記錄中搜尋，以檢查分配或釋放張量時的記憶體狀態。

快照 API 參考¶

torch.cuda.memory._record_memory_history(enabled='all', context='all', stacks='all', max_entries=9223372036854775807, device=None)[source]¶

啟用與記憶體分配關聯的堆疊追蹤記錄，以便您在 torch.cuda.memory._snapshot() 中分辨出分配了任何記憶體區塊的程式碼。

除了使用每個目前的分配和釋放來保留堆疊追蹤之外，這還將啟用所有分配/釋放事件歷史記錄的記錄。

使用 torch.cuda.memory._snapshot() 檢索此資訊，並使用 _memory_viz.py 中的工具來視覺化快照。

Python 追蹤收集速度很快（每次追蹤 2 微秒），因此如果您預計需要除錯記憶體問題，可以考慮在生產作業中啟用此功能。

C++ 追蹤收集速度也很快（~50 奈秒/幀），對於許多典型程式來說，每次追蹤大約需要 ~2 微秒，但可能會因堆疊深度而異。

參數

enabled (Literal[None, "state", "all"], optional) – None，停用記憶體歷史記錄記錄。 “state”，保留目前已分配記憶體的資訊。 “all”，另外保留所有分配/釋放呼叫的歷史記錄。預設為“all”。
context (Literal[None, "state", "alloc", "all"], 選用) – None，不記錄任何追蹤記錄。 “state”，記錄目前已配置記憶體的追蹤記錄。 “alloc”，額外保留 alloc 呼叫的追蹤記錄。 “all”，額外保留 free 呼叫的追蹤記錄。預設為「all」。
stacks (Literal["python", "all"], 選用) – “python”，在追蹤記錄中包含 Python、TorchScript 和 Inductor 框架。 “all”，額外包含 C++ 框架。預設為「all」。
max_entries (int, 選用) – 在記錄的歷史記錄中最多保留 max_entries 個 alloc/free 事件。

torch.cuda.memory._snapshot(device=None)[原始碼]¶

在呼叫時儲存 CUDA 記憶體狀態的快照。

狀態表示為具有以下結構的字典。

class Snapshot(TypedDict):
    segments : List[Segment]
    device_traces: List[List[TraceEntry]]

class Segment(TypedDict):
    # Segments are memory returned from a cudaMalloc call.
    # The size of reserved memory is the sum of all Segments.
    # Segments are cached and reused for future allocations.
    # If the reuse is smaller than the segment, the segment
    # is split into more then one Block.
    # empty_cache() frees Segments that are entirely inactive.
    address: int
    total_size: int #  cudaMalloc'd size of segment
    stream: int
    segment_type: Literal['small', 'large'] # 'large' (>1MB)
    allocated_size: int # size of memory in use
    active_size: int # size of memory in use or in active_awaiting_free state
    blocks : List[Block]

class Block(TypedDict):
    # A piece of memory returned from the allocator, or
    # current cached but inactive.
    size: int
    requested_size: int # size requested during malloc, may be smaller than
                        # size due to rounding
    address: int
    state: Literal['active_allocated', # used by a tensor
                'active_awaiting_free', # waiting for another stream to finish using
                                        # this, then it will become free
                'inactive',] # free for reuse
    frames: List[Frame] # stack trace from where the allocation occurred

class Frame(TypedDict):
        filename: str
        line: int
        name: str

class TraceEntry(TypedDict):
    # When `torch.cuda.memory._record_memory_history()` is enabled,
    # the snapshot will contain TraceEntry objects that record each
    # action the allocator took.
    action: Literal[
    'alloc'  # memory allocated
    'free_requested', # the allocated received a call to free memory
    'free_completed', # the memory that was requested to be freed is now
                    # able to be used in future allocation calls
    'segment_alloc', # the caching allocator ask cudaMalloc for more memory
                    # and added it as a segment in its cache
    'segment_free',  # the caching allocator called cudaFree to return memory
                    # to cuda possibly trying free up memory to
                    # allocate more segments or because empty_caches was called
    'oom',          # the allocator threw an OOM exception. 'size' is
                    # the requested number of bytes that did not succeed
    'snapshot'      # the allocator generated a memory snapshot
                    # useful to coorelate a previously taken
                    # snapshot with this trace
    ]
    addr: int # not present for OOM
    frames: List[Frame]
    size: int
    stream: int
    device_free: int # only present for OOM, the amount of
                    # memory cuda still reports to be free

回傳: 快照字典物件

torch.cuda.memory._dump_snapshot(filename='dump_snapshot.pickle')[原始碼]¶

將 torch.memory._snapshot() 字典的 pickle 版本儲存到檔案中。

此檔案可以使用 pytorch.org/memory_viz 上的互動式快照檢閱器開啟。

參數: filename (str, 選用) – 要建立的檔案名稱。預設為「dump_snapshot.pickle」。