SSD Embedding Operators

CUDA Operators

enum RocksdbWriteMode

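rocksdb write mode

In SSD offloading, each training iteration performs three kinds of writes:

FWD_ROCKSDB_READ: a cache lookup moves uncached data from rocksdb into the L2 cache on the forward path

FWD_L1_EVICTION: an L1 cache eviction evicts data into the L2 cache on the forward path

BWD_L1_CNFLCT_MISS_WRITE_BACK: an L1 conflict miss inserts data into L2 for the embedding update on the backward path

Any of the L2 cache fills above may trigger a rocksdb write once the L2 cache is full

In addition, SSD IO is performed on an L2 flush

Values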

enumerator FWD_ROCKSDB_READ

enumerator FWD_L1_EVICTION

enumerator BWD_L1_CNFLCT_MISS_WRITE_BACK

enumerator FLUSH

inline size_t hash_shard(int64_t id, size_t num_shards)

Hash function used by the sharding algorithm for the SSD L2 cache and rocksdb

Parameters:

id – the sharding key
num_shards – the sharding range

Returns:

A shard ID in the range [0, num_shards)
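The contract above (any 64-bit key maps to a shard in [0, num_shards)) can be illustrated with a minimal sketch. The hash used here (FNV-1a) and the multiply-shift range reduction are assumptions chosen for illustration; this section does not state which hash the library uses, and hash_shard_sketch is a hypothetical name.

```python
MASK64 = (1 << 64) - 1
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash of a byte string (assumed hash, for illustration)."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def hash_shard_sketch(id_: int, num_shards: int) -> int:
    """Map a signed 64-bit id to a shard ID in [0, num_shards)."""
    h = fnv1a_64(id_.to_bytes(8, "little", signed=True))
    # Multiply-shift range reduction: scales the 64-bit hash into the
    # shard range without a modulo. Since h < 2**64, the result is
    # always strictly less than num_shards.
    return (h * num_shards) >> 64


shards = [hash_shard_sketch(i, 8) for i in range(16)]
```

Whatever hash is used, the properties callers rely on are determinism (the same id always lands in the same shard) and the range guarantee.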

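std::tuple<at::Tensor, at::Tensor> get_bucket_sorted_indices_and_bucket_tensor(const at::Tensor &unordered_indices, int64_t hash_mode, int64_t bucket_start, int64_t bucket_end, std::optional<int64_t> bucket_size, std::optional<int64_t> total_num_buckets)

Given a tensor of ids in arbitrary order, returns 2 tensors.

Tensor 1 contains the ids sorted by bucket in ascending order. For example, given [1,2,3,4] and 2 buckets [1, 4) and [4, 7), the output is [1,2,3,4] or [2, 1, 3, 4]: ids 1, 2, 3 must come before 4, but 1, 2, 3 may appear in any order among themselves.

Tensor 2 contains the number of embeddings in each bucket ID (the tensor offset is the bucket). In the example above, tensor 2 is [3, 1]: the first entry corresponds to the first bucket ID, and the value 3 means the first bucket holds 3 ids.

Parameters:

unordered_indices – the unordered ids; these may be original (unlinearized) ids
hash_mode – 0 for hashing by mod, 1 for hashing by interleave
bucket_start – global bucket ID, the start of the bucket range
bucket_end – global bucket ID, the end of the bucket range
bucket_size – optional virtual size of a bucket (input space, e.g. 2^50)
total_num_buckets – optional total number of buckets per training model

Returns:

A tuple of 2 tensors: the first is the bucket-sorted ids, the second is the bucket sizes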
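The worked example in the description can be reproduced with a small list-based sketch. It takes explicit half-open bucket ranges rather than hash_mode, bucket_size, and total_num_buckets, because how those parameters map an id to a bucket is not spelled out in this section; bucket_sort_ids and bucket_ranges are hypothetical names.

```python
from typing import List, Sequence, Tuple


def bucket_sort_ids(
    ids: Sequence[int], bucket_ranges: Sequence[Tuple[int, int]]
) -> Tuple[List[int], List[int]]:
    """Group ids by half-open bucket range [lo, hi) and count ids per bucket."""
    grouped: List[List[int]] = [[] for _ in bucket_ranges]
    for id_ in ids:
        for b, (lo, hi) in enumerate(bucket_ranges):
            if lo <= id_ < hi:
                grouped[b].append(id_)
                break
        else:
            raise ValueError(f"id {id_} is outside every bucket range")
    # Buckets are emitted in ascending order; order inside a bucket is
    # simply the input order, which the contract leaves unspecified.
    sorted_ids = [id_ for bucket in grouped for id_ in bucket]
    counts = [len(bucket) for bucket in grouped]
    return sorted_ids, counts


sorted_ids, counts = bucket_sort_ids([2, 1, 4, 3], [(1, 4), (4, 7)])
print(sorted_ids, counts)  # [2, 1, 3, 4] [3, 1]
```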

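void cuda_callback_func(cudaStream_t stream, cudaError_t status, void *functor)

A callback function for cudaStreamAddCallback

A common callback function for cudaStreamAddCallback, i.e., cudaStreamCallback_t callback. This function casts functor to a void function, invokes it, and then deletes it (the deletion occurs in another thread)

Parameters:

stream – the CUDA stream that cudaStreamAddCallback operates on
status – the CUDA status
functor – the function object to be invoked

Returns:

None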

Tensor masked_index_put_cuda(Tensor self, Tensor indices, Tensor values, Tensor count, const bool use_pipeline, const int64_t preferred_sms)

Similar to torch.Tensor.index_put, but ignores indices < 0

masked_index_put_cuda only supports 2D input values. It puts count rows of values into self, using the row indices in indices that are >= 0.

# Equivalent PyTorch Python code
indices = indices[:count]
filter_ = indices >= 0
indices_ = indices[filter_]
self[indices_] = values[filter_.nonzero().flatten()]

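Parameters:

self – the 2D output tensor (the tensor being indexed)
indices – the 1D index tensor
values – the 2D input tensor
count – a tensor containing the length of indices to process
use_pipeline – a flag indicating that this kernel will overlap with other kernels. If true, only a fraction of the SMs is used to reduce resource contention
preferred_sms – the preferred number of SMs for the kernel to use when use_pipeline=true. Ignored when use_pipeline=false.

Returns:

The self tensor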
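To make the masking concrete, here is a list-based reference of the same semantics with a small worked input. It is only an illustration of the behavior described above, not the CUDA kernel; masked_index_put_ref is a hypothetical name, and use_pipeline and preferred_sms are omitted because they affect scheduling only.

```python
from typing import List


def masked_index_put_ref(
    dst: List[List[float]],
    indices: List[int],
    values: List[List[float]],
    count: int,
) -> List[List[float]]:
    """Copy values[i] into dst[indices[i]] for i < count, skipping indices[i] < 0."""
    for i in range(count):
        row = indices[i]
        if row >= 0:
            dst[row] = list(values[i])
    return dst


dst = [[0.0, 0.0] for _ in range(4)]
values = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
# Only the first 3 indices are processed; the -1 entry is skipped.
out = masked_index_put_ref(dst, [2, -1, 0, 3], values, count=3)
print(out)  # [[3.0, 3.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
```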

Tensor masked_index_select_cuda(Tensor self, Tensor indices, Tensor values, Tensor count, const bool use_pipeline, const int64_t preferred_sms)

Similar to torch.index_select, but ignores indices < 0

masked_index_select_cuda only supports 2D input values. It puts the count rows of values that are specified by indices (where indices >= 0) into self.

# Equivalent PyTorch Python code
indices = indices[:count]
filter_ = indices >= 0
indices_ = indices[filter_]
self[filter_.nonzero().flatten()] = values[indices_]

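Parameters:

self – the 2D output tensor
indices – the 1D index tensor
values – the 2D input tensor (the tensor being indexed)
count – a tensor containing the length of indices to process
use_pipeline – a flag indicating that this kernel will overlap with other kernels. If true, only a fraction of the SMs is used to reduce resource contention
preferred_sms – the preferred number of SMs for the kernel to use when use_pipeline=true. Ignored when use_pipeline=false.

Returns:

The self tensor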
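The select variant gathers instead of scatters: the output position is the position in indices, and the source row is the index value. A list-based reference with a worked input follows; masked_index_select_ref is a hypothetical name, and the scheduling parameters are omitted.

```python
from typing import List


def masked_index_select_ref(
    dst: List[List[float]],
    indices: List[int],
    values: List[List[float]],
    count: int,
) -> List[List[float]]:
    """Copy values[indices[i]] into dst[i] for i < count, skipping indices[i] < 0."""
    for i in range(count):
        row = indices[i]
        if row >= 0:
            dst[i] = list(values[row])
    return dst


dst = [[0.0, 0.0] for _ in range(4)]
values = [[10.0, 11.0], [20.0, 21.0], [30.0, 31.0]]
# Position 1 is masked out (-1); position 3 lies beyond count and is untouched.
out = masked_index_select_ref(dst, [2, -1, 0, 1], values, count=3)
print(out)  # [[30.0, 31.0], [0.0, 0.0], [10.0, 11.0], [0.0, 0.0]]
```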

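std::tuple<Tensor, Tensor> ssd_generate_row_addrs_cuda(const Tensor &lxu_cache_locations, const Tensor &assigned_cache_slots, const Tensor &linear_index_inverse_indices, const Tensor &unique_indices_count_cumsum, const Tensor &cache_set_inverse_indices, const Tensor &lxu_cache_weights, const Tensor &inserted_ssd_weights, const Tensor &unique_indices_length, const Tensor &cache_set_sorted_unique_indices)

Generate memory addresses for SSD TBE data.

Data retrieved from SSD can be stored in either a scratch pad (HBM) or the LXU cache (also HBM). lxu_cache_locations specifies where the data is. If the location is -1, the data for the associated index is in the scratch pad; otherwise, it is in the cache. To let TBE kernels access the data conveniently, this operator generates the memory address of the first byte for each index. When accessing data, a TBE kernel only needs to convert the address into a pointer.

In addition, this operator generates the list of post-backward evicted indices, which are essentially the indices whose data resides in the scratch pad.

Parameters:

lxu_cache_locations – the tensor containing the cache slots where the data of the full list of indices is stored. -1 is a sentinel value indicating that the data is not in the cache.
assigned_cache_slots – the tensor containing the cache slots for the list of unique indices. -1 indicates that the data is not in the cache
linear_index_inverse_indices – the tensor containing the original positions of the linear indices before they were sorted
unique_indices_count_cumsum – the tensor containing the exclusive prefix sum of the counts of unique indices
cache_set_inverse_indices – the tensor containing the original positions of the cache sets before they were sorted in the current iteration
lxu_cache_weights – the LXU cache tensor
inserted_ssd_weights – the scratch pad tensor
unique_indices_length – the tensor containing the number of unique indices (a GPU tensor)
cache_set_sorted_unique_indices – the tensor containing the unique indices associated with the sorted unique cache sets

Returns:

A tuple of tensors (the SSD row address tensor and the post-backward evicted indices tensor)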
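The address selection rule described above (-1 means scratch pad, anything else is a cache slot) can be sketched as follows. This is a deliberately simplified model: it assumes a fixed row size in bytes and a precomputed scratch pad row per position (scratch_rows), whereas the operator derives that mapping from the inverse-index and prefix-sum tensors. All names and base addresses here are hypothetical, and the second return value only records which positions resolved to the scratch pad.

```python
from typing import List, Tuple


def generate_row_addrs_sketch(
    lxu_cache_locations: List[int],
    scratch_rows: List[int],
    cache_base: int,
    scratch_base: int,
    row_bytes: int,
) -> Tuple[List[int], List[int]]:
    """Return the first-byte address per position and the scratch pad positions."""
    addrs: List[int] = []
    in_scratch: List[int] = []
    for pos, slot in enumerate(lxu_cache_locations):
        if slot == -1:
            # Sentinel: the row lives in the scratch pad.
            addrs.append(scratch_base + scratch_rows[pos] * row_bytes)
            in_scratch.append(pos)
        else:
            # The row lives in the LXU cache at the given slot.
            addrs.append(cache_base + slot * row_bytes)
    return addrs, in_scratch


addrs, in_scratch = generate_row_addrs_sketch(
    lxu_cache_locations=[5, -1, 0],
    scratch_rows=[-1, 0, -1],  # only consulted where the location is -1
    cache_base=0x1000,
    scratch_base=0x8000,
    row_bytes=64,
)
print(addrs, in_scratch)  # [4416, 32768, 4096] [1]
```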

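void ssd_update_row_addrs_cuda(const Tensor &ssd_row_addrs_curr, const Tensor &inserted_ssd_weights_curr_next_map, const Tensor &lxu_cache_locations_curr, const Tensor &linear_index_inverse_indices_curr, const Tensor &unique_indices_count_cumsum_curr, const Tensor &cache_set_inverse_indices_curr, const Tensor &lxu_cache_weights, const Tensor &inserted_ssd_weights_next, const Tensor &unique_indices_length_curr)

Update memory addresses for SSD TBE data.

When pipeline prefetching is enabled, data in the scratch pad of the current iteration can be moved to L1 or to the scratch pad of the next iteration during the prefetch step. This operator updates the memory addresses of the relocated data so that they point to the correct location.

Parameters:

ssd_row_addrs_curr – the tensor containing the row addresses for the current iteration
inserted_ssd_weights_curr_next_map – the tensor mapping each index of the current iteration to its location in the scratch pad of the next iteration (-1 = the data has not been moved). inserted_ssd_weights_curr_next_map[i] is that location
lxu_cache_locations_curr – the tensor containing the cache slots where the data of the full list of indices for the current iteration is stored. -1 is a sentinel value indicating that the data is not in the cache.
linear_index_inverse_indices_curr – the tensor containing the original positions of the linear indices before they were sorted, for the current iteration
unique_indices_count_cumsum_curr – the tensor containing the exclusive prefix sum of the counts of unique indices, for the current iteration
cache_set_inverse_indices_curr – the tensor containing the original positions of the cache sets before they were sorted, for the current iteration
lxu_cache_weights – the LXU cache tensor
inserted_ssd_weights_next – the scratch pad tensor for the next iteration
unique_indices_length_curr – the tensor containing the number of unique indices for the current iteration (a GPU tensor)

Returns:

None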
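A rough model of the update step, under stated assumptions: one entry per position, a fixed row size in bytes, and the cache check taking precedence over the next-iteration scratch pad check. That precedence, the per-position layout, and every name below are assumptions for illustration; the operator itself works through the inverse-index tensors and updates ssd_row_addrs_curr in place rather than returning a copy.

```python
from typing import List


def update_row_addrs_sketch(
    row_addrs: List[int],
    curr_next_map: List[int],
    lxu_cache_locations: List[int],
    cache_base: int,
    next_scratch_base: int,
    row_bytes: int,
) -> List[int]:
    """Return row addresses with relocated rows repointed to their new home."""
    updated = list(row_addrs)
    for pos in range(len(row_addrs)):
        if lxu_cache_locations[pos] != -1:
            # Row is (now) in the L1/LXU cache.
            updated[pos] = cache_base + lxu_cache_locations[pos] * row_bytes
        elif curr_next_map[pos] != -1:
            # Row moved to the next iteration's scratch pad.
            updated[pos] = next_scratch_base + curr_next_map[pos] * row_bytes
        # Otherwise the row has not moved and its address is kept.
    return updated


new_addrs = update_row_addrs_sketch(
    row_addrs=[0x8000, 0x8040, 0x8080],
    curr_next_map=[-1, 2, -1],
    lxu_cache_locations=[3, -1, -1],
    cache_base=0x1000,
    next_scratch_base=0xA000,
    row_bytes=64,
)
print(new_addrs)  # [4288, 41088, 32896]
```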

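void compact_indices_cuda(std::vector<Tensor> compact_indices, Tensor compact_count, std::vector<Tensor> indices, Tensor masks, Tensor count)

Compact the given list of indices.

This operator compacts the given list of indices based on the given masks (a tensor containing 0s and 1s). It removes the indices whose corresponding mask is 0. It operates on only count elements (not on the full tensor).

Example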

indices = [[0, 3, -1, 3, -1, -1, 7], [0, 2, 2, 3, -1, 9, 7]]
masks = [1, 1, 0, 1, 0, 0, 1]
count = 5

# x represents an arbitrary value
compact_indices = [[0, 3, 3, x, x, x, x], [0, 2, 3, x, x, x, x]]
compact_count = 3

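Parameters:

compact_indices – the list of compacted indices (the output indices)
compact_count – a tensor containing the number of elements after compaction
indices – the input list of indices to be compacted
masks – a tensor containing 0s and 1s that indicate whether to remove or keep each element. 0 = remove the corresponding index. 1 = keep the corresponding index.
count – a tensor containing the number of elements to be compacted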
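A list-based reference of the compaction reproduces the example above. Unlike the operator, which writes into preallocated outputs and leaves the tail as arbitrary values (the x entries), this sketch returns only the valid prefix; compact_indices_ref is a hypothetical name.

```python
from typing import List, Tuple


def compact_indices_ref(
    indices: List[List[int]], masks: List[int], count: int
) -> Tuple[List[List[int]], int]:
    """Keep, for each index list, the entries among the first count whose mask is 1."""
    keep = [i for i in range(count) if masks[i] == 1]
    compact = [[row[i] for i in keep] for row in indices]
    return compact, len(keep)


indices = [[0, 3, -1, 3, -1, -1, 7], [0, 2, 2, 3, -1, 9, 7]]
masks = [1, 1, 0, 1, 0, 0, 1]
compact, compact_count = compact_indices_ref(indices, masks, count=5)
print(compact, compact_count)  # [[0, 3, 3], [0, 2, 3]] 3
```

Note that the last element has mask 1 but is not kept, because it lies beyond count.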

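class CacheLibCache

#include <cachelib_cache.h>

A Cachelib wrapper class for interacting with Cachelib.

It maintains all cache-related operations, including initialization, insertion, lookup, and eviction. It is stateful with respect to eviction logic: the caller must explicitly fetch and reset eviction-related state. Cachelib-related optimizations are captured inside this class, e.g. fetch and delayed markUseful to improve get performance

Note

This class only handles a single Cachelib read/update. Parallelism is handled on the caller side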

class EmbeddingParameterServer : public EmbeddingKVDB

#include <ps_table_batched_embeddings.h>

An implementation of EmbeddingKVDB for the Training Parameter Service (TPS) client.

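class CacheContext

#include <kv_db_table_batched_embeddings.h>

Holds the L2 cache lookup results.

num_misses is the number of misses in the L2 cache lookup. cached_addr_list is preallocated with one slot per lookup for better parallelism, and invalid slots (cache misses) keep the sentinel value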
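The preallocation idea can be pictured with a small sketch: because every lookup owns exactly one slot in cached_addr_list, lookups never contend for an output position, and misses are visible as slots still holding the sentinel. The sentinel value (-1), the dict-based cache, and the names CacheContextSketch and lookup_all are assumptions for illustration, not the class's actual layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List

SENTINEL = -1  # assumed sentinel marking a cache miss


@dataclass
class CacheContextSketch:
    num_lookups: int
    num_misses: int = 0
    cached_addr_list: List[int] = field(init=False)

    def __post_init__(self) -> None:
        # One preallocated slot per lookup, all initially "miss".
        self.cached_addr_list = [SENTINEL] * self.num_lookups


def lookup_all(cache: Dict[int, int], keys: List[int]) -> CacheContextSketch:
    """Look up every key; hits fill their own slot, misses keep the sentinel."""
    ctx = CacheContextSketch(len(keys))
    for pos, key in enumerate(keys):
        addr = cache.get(key, SENTINEL)
        if addr == SENTINEL:
            ctx.num_misses += 1
        else:
            ctx.cached_addr_list[pos] = addr
    return ctx


ctx = lookup_all({7: 0x100, 9: 0x200}, [7, 8, 9])
print(ctx.num_misses, ctx.cached_addr_list)  # 1 [256, -1, 512]
```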

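struct QueueItem

#include <kv_db_table_batched_embeddings.h>

Queue item for background L2/rocksdb updates

indices/weights/count are the corresponding set() parameters

read_handles is the Cachelib-abstracted metadata for index/embedding pairs. It is used later when updating the Cachelib LRU queue, because that update is separated from EmbeddingKVDB::get_cache()

mode is used for monitoring rocksdb writes; see RocksdbWriteMode for a detailed explanation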

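class EmbeddingKVDB : public std::enable_shared_from_this<EmbeddingKVDB>

#include <kv_db_table_batched_embeddings.h>

A class for interacting with the different cache layers and storage layers. Public calls are executed on a CUDA stream.

It is currently used by TBE to offload key (embedding index) / value (embeddings) pairs to DRAM, SSD, or remote storage, providing better scalability without exhausting HBM resources

Subclassed by DramKVEmbeddingCache< weight_type >, EmbeddingParameterServer, EmbeddingRocksDB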

class EmbeddingRocksDB : public EmbeddingKVDB

#include <ssd_table_batched_embeddings.h>

An implementation of EmbeddingKVDB for RocksDB.

Subclassed by MockEmbeddingRocksDB
