(Prototype) Efficiently writing "sparse" semantics for Adagrad with MaskedTensor

Created on: Oct 28, 2022 | Last updated: Oct 28, 2022 | Last verified: not verified

Before working through this tutorial, please review the MaskedTensor Overview and Sparsity tutorials.

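Introduction and Motivation

Issue 1369 discussed the additional lines of code that were introduced while writing "sparse" semantics for Adagrad. In reality, that code uses sparsity as a proxy for masked semantics rather than for the intended use case of sparsity: a compression and optimization technique. Previously, we worked around the lack of formal masked semantics by introducing one-off semantics and operators, while forcing users to be aware of storage details such as indices and values.

Now that we have masked semantics, we are better equipped to point out when sparsity is being used as a semantic extension. We will also compare and contrast this with equivalent code written using MaskedTensor. At the end, the code snippets are repeated without the additional comments to show the difference in brevity.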

Preparation

import torch
import warnings

# Disable prototype warnings and such
warnings.filterwarnings(action='ignore', category=UserWarning)

# Some hyperparameters
eps = 1e-10
clr = 0.1

i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3, 4, 5], dtype=torch.float32)
grad = torch.sparse_coo_tensor(i, v, [2, 4])

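Simplifying Code with MaskedTensor

Before we get too far into the details, let's introduce the problem a bit more concretely. We will look at the Adagrad (functional) implementation in PyTorch, with the ultimate goal of simplifying it and representing the masked approach more faithfully.

For reference, this is the regular, dense code path without masked gradients or sparsity: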

state_sum.addcmul_(grad, grad, value=1)
std = state_sum.sqrt().add_(eps)
param.addcdiv_(grad, std, value=-clr)
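
The three lines above assume that param, state_sum, and grad already exist. A minimal, self-contained sketch of one dense Adagrad step follows. The concrete tensors are assumptions chosen to mirror the values used elsewhere in this tutorial (a dense copy of the gradient and a state sum initialized to 0.5), and the dense_ names are made up for this example so that they do not clash with the tutorial's own variables.

```python
import torch

eps = 1e-10  # numerical-stability term
clr = 0.1    # learning rate

# Assumed inputs: a dense gradient with the same non-zero entries as the
# sparse gradient in this tutorial, and a state sum initialized to 0.5.
dense_grad = torch.tensor([[0., 0., 3., 0.],
                           [4., 0., 5., 0.]])
dense_param = torch.arange(8).reshape(2, 4).float()
dense_state_sum = torch.full_like(dense_param, 0.5)

# state_sum += grad * grad
dense_state_sum.addcmul_(dense_grad, dense_grad, value=1)
# std = sqrt(state_sum) + eps, computed for every element
dense_std = dense_state_sum.sqrt().add_(eps)
# param -= clr * grad / std
dense_param.addcdiv_(dense_grad, dense_std, value=-clr)

print(dense_state_sum)
print(dense_param)
```

Because the gradient is exactly zero at every unspecified position, this dense step only changes the three entries of the parameter where the gradient is non-zero.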

The vanilla tensor implementation for sparse gradients is:

def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()

state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))   # a different _make_sparse per layout
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)

whereas MaskedTensor reduces the code to the following snippet:

state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)
std2 = std2.sqrt().add(eps)
param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)

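In this tutorial, we will go through each implementation line by line, but at first glance we can notice (1) how much shorter the MaskedTensor implementation is, and (2) how it avoids conversions between dense and sparse tensors.

Original Sparse Implementation

Now, let's break down the code with some inline comments: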

def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)

# We don't support sparse gradients
param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)  # initial value for state sum

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()
# pow(2) has the same semantics for both sparse and dense memory layouts since 0^2 is zero
state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))

# We take care to make std sparse, even though state_sum clearly is not.
# This means that we're only applying the gradient to parts of the state_sum
# for which it is specified. This further drives the point home that the passed gradient is not sparse, but masked.
# We currently dodge all these concerns using the private method `_values`.
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)

# Note here that we currently don't support div for sparse Tensors because zero / zero is not well defined,
# so we're forced to perform `grad_values / std_values` outside the sparse semantic and then convert back to a
# sparse tensor with `_make_sparse`.
# We'll later see that MaskedTensor will actually handle these operations for us as well as properly denote
# undefined / undefined = undefined!
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)

tensor([[0.0000, 1.0000, 1.9027, 3.0000],
        [3.9015, 5.0000, 5.9010, 7.0000]])

The third-to-last line, std = state_sum.sparse_mask(grad), is where we have a very important divergence.

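The addition of eps should technically be applied to all values, but here it is only applied to the specified values. We are using sparsity as a semantic extension, to enforce a certain pattern of defined and undefined values. If some of the gradient's values are zero, they are still included if they are materialized, even though other sparse storage layouts could compress them away. This is theoretically quite brittle! That said, one could argue that eps is always very small, so it might not matter much in practice.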
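
The difference between the two semantics can be made visible with a small sketch. It assumes the same gradient and accumulated state sum as in this tutorial; the demo_ and std_ names are introduced only for illustration. The dense path applies sqrt and eps to every element, whereas the sparse path only touches positions specified in the gradient, so all other positions materialize as zero.

```python
import torch

eps = 1e-10

# Same gradient as in the tutorial, and the state sum after the squared
# gradient has been accumulated into an initial value of 0.5.
demo_grad = torch.sparse_coo_tensor(
    torch.tensor([[0, 1, 1], [2, 0, 2]]), torch.tensor([3., 4., 5.]), [2, 4]
).coalesce()
demo_state_sum = torch.full((2, 4), 0.5) + demo_grad.to_dense().pow(2)

# Dense semantics: sqrt and eps are applied to every element.
std_dense = demo_state_sum.sqrt() + eps

# Sparsity used as a mask: only positions specified in demo_grad are kept.
kept = demo_state_sum.sparse_mask(demo_grad).coalesce()
std_sparse = torch.sparse_coo_tensor(
    kept.indices(), kept.values().sqrt() + eps, kept.shape
)

print(std_dense)
print(std_sparse.to_dense())  # unspecified positions show up as 0

# An explicitly stored zero still counts as a specified element in COO.
with_zero = torch.sparse_coo_tensor(
    torch.tensor([[0, 1], [2, 0]]), torch.tensor([0., 4.]), [2, 4]
).coalesce()
print(with_zero.values().numel())
```

The last lines illustrate the brittleness mentioned above: whether a zero-valued gradient entry takes part in the update depends on whether it happens to be stored explicitly.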

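Moreover, if sparsity is treated as a storage layout and compression scheme, an implementation of add_ should cause densification, but we force it not to for performance reasons. For this one-off case that is fine, until we want to introduce new compression schemes such as CSC, BSR, or BSC. We would then need to introduce a separate Tensor type for each format and write variants of the update for gradients compressed with different storage formats, which is inconvenient and neither scalable nor clean.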
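
To make the per-layout cost concrete, the sketch below converts the tutorial's gradient to CSR and writes a CSR counterpart of the helper. The function _make_sparse_csr is hypothetical (it is not part of PyTorch or of the Adagrad implementation), and it omits the empty-tensor handling that the COO helper has.

```python
import torch
import warnings

# Sparse CSR support may emit a beta-state UserWarning on some versions.
warnings.filterwarnings(action="ignore", category=UserWarning)

grad_coo = torch.sparse_coo_tensor(
    torch.tensor([[0, 1, 1], [2, 0, 2]]), torch.tensor([3., 4., 5.]), [2, 4]
).coalesce()
grad_csr = grad_coo.to_sparse_csr()

# CSR exposes compressed row pointers and column indices instead of COO
# indices, so a COO-based helper cannot be reused as-is.
print(grad_csr.crow_indices())
print(grad_csr.col_indices())


def _make_sparse_csr(grad, values):
    """Hypothetical CSR counterpart of _make_sparse (illustration only)."""
    return torch.sparse_csr_tensor(
        grad.crow_indices(), grad.col_indices(), values, size=grad.shape
    )


squared = _make_sparse_csr(grad_csr, grad_csr.values().pow(2))
print(squared.to_dense())
```

Each additional layout would need a similar helper of its own, which is the duplication the paragraph above describes.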

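MaskedTensor Sparse Implementation

We have been conflating sparsity as an optimization with sparsity as a semantic extension to PyTorch. MaskedTensor proposes to disentangle the sparsity optimization from the semantic extension; for example, we currently cannot have dense semantics with sparse storage, or masked semantics with dense storage. MaskedTensor enables these ideas by purposefully separating the storage from the semantics.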
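
As one illustration of masked semantics with dense storage, the sketch below repeats the Adagrad step with a MaskedTensor whose data and mask are ordinary strided tensors. It is a sketch under stated assumptions rather than the tutorial's own code: the names grad_dense, mask_dense, param3, state_sum3, and std3 are made up for this example, and to_tensor(0.0) is used to fill masked-out entries with zeros before a result is combined with regular tensors.

```python
import torch
import warnings
from torch.masked import masked_tensor

# MaskedTensor is a prototype feature and warns on use.
warnings.filterwarnings(action="ignore", category=UserWarning)

eps = 1e-10
clr = 0.1

# Dense data plus a boolean mask marking which gradient entries are defined.
grad_dense = torch.tensor([[0., 0., 3., 0.],
                           [4., 0., 5., 0.]])
mask_dense = grad_dense != 0
masked_grad_dense = masked_tensor(grad_dense, mask_dense)

param3 = torch.arange(8).reshape(2, 4).float()
state_sum3 = torch.full_like(param3, 0.5)

# Accumulate the squared gradient; masked-out entries contribute 0.
state_sum3 = state_sum3 + masked_grad_dense.pow(2).to_tensor(0.0)

# sqrt and eps are applied only where the mask is True.
std3 = masked_tensor(state_sum3, mask_dense).sqrt().add(eps)

# masked / masked keeps the same mask; fill undefined entries with 0.
update = (masked_grad_dense / std3).to_tensor(0.0)
param3 = param3.add(update, alpha=-clr)

print(param3)
```

The storage here is dense throughout, yet the update still follows the masked pattern, which is the separation of storage from semantics described above.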

Consider the above example using a masked gradient:

# Let's now import MaskedTensor!
from torch.masked import masked_tensor

# Create an entirely new set of parameters to avoid errors
param2 = torch.arange(8).reshape(2, 4).float()
state_sum2 = torch.full_like(param2, 0.5)  # initial value for state sum

mask = (grad.to_dense() != 0).to_sparse()
masked_grad = masked_tensor(grad, mask)

state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)

# We can add support for in-place operations later. Notice how this doesn't
# need to access any storage internals and is in general a lot shorter
std2 = std2.sqrt().add(eps)

param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)

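Note that the two implementations look quite similar, but the MaskedTensor implementation is shorter and simpler. In particular, much of the boilerplate code around _make_sparse (and the need for a separate implementation per layout) is handled for the user by MaskedTensor.

At this point, let's print both this version and the original version for easier comparison: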

print("state_sum:\n", state_sum)
print("state_sum2:\n", state_sum2)

state_sum:
 tensor([[ 0.5000,  0.5000,  9.5000,  0.5000],
        [16.5000,  0.5000, 25.5000,  0.5000]])
state_sum2:
 tensor([[ 0.5000,  0.5000,  9.5000,  0.5000],
        [16.5000,  0.5000, 25.5000,  0.5000]])

print("std:\n", std)
print("std2:\n", std2)

std:
 tensor(indices=tensor([[0, 1, 1],
                       [2, 0, 2]]),
       values=tensor([3.0822, 4.0620, 5.0498]),
       size=(2, 4), nnz=3, layout=torch.sparse_coo)
std2:
 MaskedTensor(
  [
    [      --,       --,   3.0822,       --],
    [  4.0620,       --,   5.0498,       --]
  ]
)

print("param:\n", param)
print("param2:\n", param2)

param:
 tensor([[0.0000, 1.0000, 1.9027, 3.0000],
        [3.9015, 5.0000, 5.9010, 7.0000]])
param2:
 tensor([[0.0000, 1.0000, 1.9027, 3.0000],
        [3.9015, 5.0000, 5.9010, 7.0000]])

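Conclusion

In this tutorial, we discussed how native masked semantics can enable a cleaner developer experience for Adagrad's existing implementation in PyTorch, which used sparsity as a proxy for writing masked semantics. More importantly, making masked semantics a first-class citizen through MaskedTensor removes the reliance on sparsity or unreliable hacks to mimic masking. This allows the two concepts to be developed independently, while still supporting sparse semantics such as the one shown here.

Further Reading

To continue learning more, see our final tutorial (for now) on MaskedTensor Advanced Semantics, which covers some of the differences in design decisions between MaskedTensor and NumPy's MaskedArray, as well as reduction semantics.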
