Sequence Models and Long Short-Term Memory Networks

Created On: Apr 8, 2017 | Last Updated: Jan 7, 2022 | Last Verified: Not Verified

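At this point, we have seen various feed-forward networks. That is, the network maintains no state at all. This might not be the behavior we want. Sequence models are central to NLP: they are models in which there is some sort of dependence through time between the inputs. The classical example of a sequence model is the Hidden Markov Model for part-of-speech tagging. Another example is the conditional random field.

A recurrent neural network is a network that maintains some kind of state. For example, its output could be used as part of the next input, so that information can propagate along as the network passes over the sequence. In the case of an LSTM, each element in the sequence has a corresponding hidden state \(h_t\), which in principle can contain information from arbitrary points earlier in the sequence. We can use the hidden state to predict words in a language model, part-of-speech tags, and many other things.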

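LSTMs in PyTorch

Before getting to the example, note a few things. PyTorch's LSTM expects all of its inputs to be 3D tensors. The semantics of the axes of these tensors is important. The first axis is the sequence itself, the second indexes instances in the mini-batch, and the third indexes elements of the input. We haven't discussed mini-batching, so let's ignore it for now and assume the second axis always has size 1. If we want to run the sequence model over the sentence “The cow jumped”, our input should look like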

\[\begin{bmatrix} \overbrace{q_\text{The}}^\text{row vector} \\ q_\text{cow} \\ q_\text{jumped} \end{bmatrix}\]

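Remember, though, that there is an additional 2nd dimension of size 1.

In addition, you can go through the sequence one element at a time, in which case the 1st axis will also have size 1.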
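To make the axis layout concrete, here is a minimal sketch. The three word vectors are random stand-ins (an assumption made for illustration, not real embeddings), and the names `q_the`, `q_cow`, and `q_jumped` simply mirror the matrix above.

```python
import torch

torch.manual_seed(0)

# Hypothetical 3-dimensional vectors for the three words (random stand-ins).
q_the, q_cow, q_jumped = (torch.randn(3) for _ in range(3))

# Stack the row vectors into a (seq_len, input_size) matrix ...
sentence = torch.stack([q_the, q_cow, q_jumped])
# ... then insert the mini-batch axis of size 1 in the middle.
lstm_input = sentence.unsqueeze(1)
print(lstm_input.shape)  # torch.Size([3, 1, 3])

# Feeding one element at a time: the sequence axis has size 1 as well.
single_step = q_cow.view(1, 1, -1)
print(single_step.shape)  # torch.Size([1, 1, 3])
```

Both tensors follow the (sequence, mini-batch, input element) ordering described above.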

Let's see a quick example.

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator object at 0x7f5713d9a470>

lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

# Alternatively, we can process the entire sequence in one call.
# The first value returned by the LSTM is the hidden state at every step of
# the sequence. The second is a tuple (h_n, c_n) holding the most recent
# hidden state and cell state.
# (Compare the last slice of "out" with "hidden[0]" below; they are the same.)
# The reason for returning both is that:
# "out" gives you access to all hidden states in the sequence
# "hidden" lets you continue the sequence and backpropagate,
# by passing it as an argument to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

tensor([[[-0.0187,  0.1713, -0.2944]],

        [[-0.3521,  0.1026, -0.2971]],

        [[-0.3191,  0.0781, -0.1957]],

        [[-0.1634,  0.0941, -0.1637]],

        [[-0.3368,  0.0959, -0.0538]]], grad_fn=<MkldnnRnnLayerBackward0>)
(tensor([[[-0.3368,  0.0959, -0.0538]]], grad_fn=<StackBackward0>), tensor([[[-0.9825,  0.4715, -0.0633]]], grad_fn=<StackBackward0>))
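The relationship between the two calling styles can be checked directly. The sketch below is an illustration rather than part of the original script: it uses the same initial state for both runs (the example above draws a fresh random state for the second run, so its two runs are not directly comparable), and the variable names are chosen here for clarity.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
lstm = nn.LSTM(3, 3)
seq = torch.randn(5, 1, 3)   # (seq_len, batch, input_size)
h0 = torch.randn(1, 1, 3)    # initial hidden state
c0 = torch.randn(1, 1, 3)    # initial cell state

with torch.no_grad():
    # Whole sequence in one call.
    out_all, (h_n, c_n) = lstm(seq, (h0, c0))

    # One element at a time, carrying the state forward.
    state = (h0, c0)
    step_outputs = []
    for t in range(seq.size(0)):
        out_t, state = lstm(seq[t].view(1, 1, -1), state)
        step_outputs.append(out_t)
    out_steps = torch.cat(step_outputs)

# Both styles should agree up to floating-point rounding.
print(torch.allclose(out_all, out_steps, atol=1e-5))
# The last slice of the output is the final hidden state h_n.
print(torch.allclose(out_all[-1], h_n[0]))
```

Both checks are expected to print `True`: stepping through the sequence and processing it in one call compute the same recurrence, as long as they start from the same state.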

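Example: An LSTM for Part-of-Speech Tagging

In this section, we will use an LSTM to get part-of-speech tags. We will not use Viterbi, Forward-Backward, or anything like that, but as a (challenging) exercise, think about how Viterbi could be used once you have seen what is going on. This example also refers to embeddings. If you are unfamiliar with embeddings, you can read about them in the word embeddings section.

The model is as follows: let the input sentence be \(w_1, \dots, w_M\), where \(w_i \in V\), our vocabulary. Also, let \(T\) be our tag set, and \(y_i\) the tag of word \(w_i\). Denote our prediction of the tag of word \(w_i\) by \(\hat{y}_i\).

This is a structured prediction model, whose output is a sequence \(\hat{y}_1, \dots, \hat{y}_M\), where \(\hat{y}_i \in T\).

To do the prediction, pass an LSTM over the sentence. Denote the hidden state at timestep \(i\) as \(h_i\). Also, assign each tag a unique index (just as we did with word_to_ix in the word embeddings section). Then our prediction rule for \(\hat{y}_i\) is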

\[\hat{y}_i = \text{argmax}_j \ (\log \text{Softmax}(Ah_i + b))_j \]

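That is, take the log softmax of the affine map of the hidden state, and the predicted tag is the one with the maximum value in this vector. Note that this immediately implies that the dimensionality of the target space of \(A\) is \(|T|\).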
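The prediction rule can be sketched for a single hidden state as follows. The sizes and the random `h_i` are assumptions chosen for illustration; `nn.Linear` plays the role of the affine map, with its `weight` as \(A\) and its `bias` as \(b\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim, num_tags = 6, 3   # illustrative sizes; num_tags stands in for |T|

affine = nn.Linear(hidden_dim, num_tags)  # weight is A, bias is b
h_i = torch.randn(hidden_dim)             # random stand-in for one hidden state

with torch.no_grad():
    scores = F.log_softmax(affine(h_i), dim=0)  # one log-probability per tag
    y_hat = torch.argmax(scores).item()         # index of the predicted tag

print(affine.weight.shape)  # torch.Size([3, 6]): A maps into a |T|-dimensional space
print(scores.shape)         # torch.Size([3])
print(y_hat)
```

Because log softmax only subtracts the same constant from every entry, the argmax of `scores` matches the argmax of the raw affine output; the log softmax matters for the training loss rather than for picking the tag.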

Prepare data:

def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    # Tags are: DET - determiner; NN - noun; V - verb
    # For example, the word "The" is a determiner
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}  # Assign each tag with a unique index

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
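As a quick usage check, the sketch below applies `prepare_sequence` to the first training sentence and its tags, using the dictionaries printed above. The final lookup of "cat" is a hypothetical out-of-vocabulary word added here to show the failure mode.

```python
import torch


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


word_to_ix = {'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4,
              'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

sentence_in = prepare_sequence("The dog ate the apple".split(), word_to_ix)
targets = prepare_sequence(["DET", "NN", "V", "DET", "NN"], tag_to_ix)
print(sentence_in)  # tensor([0, 1, 2, 3, 4])
print(targets)      # tensor([0, 1, 2, 0, 1])

# A word that never appeared in the training data has no index.
try:
    prepare_sequence(["cat"], word_to_ix)
except KeyError as err:
    print("unknown word:", err)
```

Note that "The" and "the" receive different indices, because the lookup is case-sensitive.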

Create the model:

class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
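To see what each reshape in `forward` does, the sketch below runs the same three layers outside the class and prints the tensor shape at every stage. The layer sizes match the constants above; the weights are freshly initialized, so the scores themselves are meaningless.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim, hidden_dim, tagset_size = 9, 6, 6, 3

word_embeddings = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([0, 1, 2, 3, 4], dtype=torch.long)  # five word indices

with torch.no_grad():
    embeds = word_embeddings(sentence)                         # (5, 6)
    lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))      # (5, 1, 6)
    tag_space = hidden2tag(lstm_out.view(len(sentence), -1))   # (5, 3)
    tag_scores = F.log_softmax(tag_space, dim=1)               # (5, 3)

for name, t in [("embeds", embeds), ("lstm_out", lstm_out),
                ("tag_space", tag_space), ("tag_scores", tag_scores)]:
    print(name, tuple(t.shape))

# Each row of tag_scores holds log-probabilities, so exp() of a row sums to 1.
print(tag_scores.exp().sum(dim=1))
```

The first `view` adds the mini-batch axis of size 1 that the LSTM expects, and the second removes it again so the linear layer sees one row per word.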

Train the model:

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The sentence is "The dog ate the apple". Element i,j is the score for
    # tag j for word i. The predicted tag is the maximum scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1
    # since 0 is the index of the maximum value of row 1,
    # 1 is the index of the maximum value of row 2, etc.
    # Which is DET NN V DET NN, the correct sequence!
    print(tag_scores)

tensor([[-1.1389, -1.2024, -0.9693],
        [-1.1065, -1.2200, -0.9834],
        [-1.1286, -1.2093, -0.9726],
        [-1.1190, -1.1960, -0.9916],
        [-1.0137, -1.2642, -1.0366]])
tensor([[-0.0462, -4.0106, -3.6096],
        [-4.8205, -0.0286, -3.9045],
        [-3.7876, -4.1355, -0.0394],
        [-0.0185, -4.7874, -4.6013],
        [-5.7881, -0.0186, -4.1778]])
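Reading tags off the score matrix can also be done programmatically. The sketch below copies the post-training scores printed above and inverts `tag_to_ix`; the `ix_to_tag` dictionary is a helper introduced here and is not part of the original script.

```python
import torch

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}  # helper: index -> tag

# Post-training scores for "The dog ate the apple", copied from the output above.
tag_scores = torch.tensor([[-0.0462, -4.0106, -3.6096],
                           [-4.8205, -0.0286, -3.9045],
                           [-3.7876, -4.1355, -0.0394],
                           [-0.0185, -4.7874, -4.6013],
                           [-5.7881, -0.0186, -4.1778]])

predicted = tag_scores.argmax(dim=1).tolist()  # best tag index for each word
print(predicted)                               # [0, 1, 2, 0, 1]
print([ix_to_tag[i] for i in predicted])       # ['DET', 'NN', 'V', 'DET', 'NN']
```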

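Exercise: Augmenting the LSTM Part-of-Speech Tagger with Character-Level Features

In the example above, each word had an embedding, which served as the input to our sequence model. Let's augment the word embeddings with a representation derived from the characters of the word. We expect this to help significantly, since character-level information such as affixes has a large bearing on part of speech. For example, words with the affix -ly are almost always tagged as adverbs in English.

To do this, let \(c_w\) be the character-level representation of word \(w\). Let \(x_w\) be the word embedding as before. Then the input to our sequence model is the concatenation of \(x_w\) and \(c_w\). So if \(x_w\) has dimension 5, and \(c_w\) has dimension 3, then our LSTM should accept an input of dimension 8.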
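The dimension arithmetic can be checked with a few lines. Both vectors below are random placeholders (an assumption for illustration), and the hidden size of 6 is an arbitrary choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x_w = torch.randn(5)  # placeholder word embedding, dimension 5
c_w = torch.randn(3)  # placeholder character-level representation, dimension 3

lstm_input = torch.cat([x_w, c_w])  # concatenation has dimension 5 + 3 = 8
print(lstm_input.shape)             # torch.Size([8])

# The tagger's LSTM must therefore be built with input size 8.
lstm = nn.LSTM(input_size=8, hidden_size=6)
with torch.no_grad():
    out, _ = lstm(lstm_input.view(1, 1, -1))
print(out.shape)                    # torch.Size([1, 1, 6])
```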

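To get the character-level representation, run an LSTM over the characters of a word, and let \(c_w\) be the final hidden state of this LSTM.

Hints:

There are going to be two LSTMs in your new model. The original one outputs part-of-speech tag scores, and the new one outputs a character-level representation of each word.
To build a sequence model over characters, you will have to embed characters. The character embeddings will be the input to the character LSTM.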
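As a starting point for the exercise, here is one possible sketch of how \(c_w\) might be computed for a single word. It is not a full solution: the sizes, the toy character vocabulary, and every name below are assumptions made for illustration, and wiring the result into the tagger is left to the reader.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
CHAR_EMBEDDING_DIM, CHAR_HIDDEN_DIM = 4, 3  # hypothetical sizes

word = "apple"
# Toy character vocabulary built from this one word (an assumption; a real
# model would index every character seen in the training data).
char_to_ix = {ch: i for i, ch in enumerate(sorted(set(word)))}

char_embeddings = nn.Embedding(len(char_to_ix), CHAR_EMBEDDING_DIM)
char_lstm = nn.LSTM(CHAR_EMBEDDING_DIM, CHAR_HIDDEN_DIM)

char_ids = torch.tensor([char_to_ix[ch] for ch in word], dtype=torch.long)
# Shape (num_chars, 1, CHAR_EMBEDDING_DIM): one character per step, batch of 1.
embeds = char_embeddings(char_ids).view(len(word), 1, -1)

# h_n is the hidden state after the last character; use it as c_w.
_, (h_n, _) = char_lstm(embeds)
c_w = h_n.view(-1)
print(c_w.shape)  # torch.Size([3])
```

In a complete model, `c_w` would be computed for every word in the sentence and concatenated with that word's embedding before the tagging LSTM.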

Total running time of the script: (0 minutes 0.506 seconds)
