Word Embeddings: Encoding Lexical Semantics

Created On: Apr 08, 2017 | Last Updated: Sep 14, 2021 | Last Verified: Nov 05, 2024

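Word embeddings are dense vectors of real numbers, one per word in your vocabulary. In NLP, it is almost always the case that your features are words! But how should you represent a word in a computer? You could store its ASCII character representation, but that only tells you what the word is; it says little about what the word means (you might be able to derive its part of speech from its affixes, or some properties from its capitalization, but not much more). More importantly, in what sense could you combine these representations? We often want dense outputs from our neural networks, where the inputs are \(|V|\)-dimensional (\(V\) being our vocabulary) but the outputs have only a few dimensions (for instance, if we are only predicting a handful of labels). How do we get from a massive-dimensional space to a much smaller one?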

How about using a one-hot encoding instead of ASCII representations? That is, we represent the word \(w\) by

\[\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements} \]

where the 1 is in a location unique to \(w\). Any other word will have a 1 in some other location, and a 0 everywhere else.
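A minimal sketch of this encoding in plain Python, assuming a tiny four-word vocabulary chosen only for illustration (the names `vocab`, `one_hot`, and `dot` are hypothetical helpers, not part of the tutorial's code):

```python
# Hypothetical toy vocabulary; a real vocabulary would have |V| in the thousands.
vocab = ["mathematician", "physicist", "ran", "store"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a |V|-element list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("physicist"))  # [0, 1, 0, 0]
# Any two distinct words have dot product 0, however related they are.
print(dot(one_hot("mathematician"), one_hot("physicist")))  # 0
```

Every vector is as long as the vocabulary, and the dot product between any two different words is always 0, so the encoding carries no information about how words relate.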

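This representation has an enormous drawback, besides just how huge it is. It treats all words as independent entities with no relation to each other. What we really want is some notion of similarity between words. Why? Let's see an example.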

Suppose we are building a language model, and suppose our training data contains the sentences

The mathematician ran to the store.
The physicist ran to the store.
The mathematician solved the open problem.

Now suppose we get a new sentence that never appeared in our training data:

The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much better if we could use the following two facts:

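We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.
We have seen mathematician in the same role in this new, unseen sentence as we are now seeing physicist.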

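and then infer that physicist is actually a good fit in the new, unseen sentence? This is what we mean by a notion of similarity: we mean semantic similarity, not simply having similar orthographic representations. It is a technique to combat the sparsity of linguistic data, by connecting the dots between what we have seen and what we haven't. This example of course relies on a fundamental linguistic assumption: that words appearing in similar contexts are related to each other semantically. This is called the distributional hypothesis.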

Getting Dense Word Embeddings

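How can we solve this problem? That is, how could we actually encode semantic similarity in words? Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the "is able to run" semantic attribute. Think of some other attributes, and imagine how you might score some common words on them.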

If each attribute is a dimension, then we might give each word a vector, like this:

\[ q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run}, \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\]

\[ q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run}, \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\]

Then we can get a measure of similarity between these words by computing:

\[\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician} \]

Although it is more common to normalize by the lengths:

\[ \text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}} {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\]

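where \(\phi\) is the angle between the two vectors. That way, extremely similar words (words whose embeddings point in the same direction) will have similarity 1. Extremely dissimilar words should have similarity -1.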
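As a rough illustration, the normalized similarity can be computed in plain Python. This sketch assumes the vectors are truncated to just the three attributes listed above (the elided remaining dimensions are ignored), and `cosine_similarity` is a hypothetical helper name:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Assumption: only the three hand-picked attributes (can run, likes coffee,
# majored in Physics); any further dimensions are left out.
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]

sim = cosine_similarity(q_physicist, q_mathematician)
print(f"cosine similarity: {sim:.3f}")

# Sanity checks at the extremes: same direction and opposite direction.
same = cosine_similarity(q_physicist, q_physicist)
opposite = cosine_similarity(q_physicist, [-x for x in q_physicist])
print(same, opposite)
```

With these three attributes the two words agree on the first two dimensions and disagree on the third, so the similarity lands between 0 and 1 rather than at either extreme.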

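You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors, where any two distinct words have similarity 0 and each word is given its own unique semantic attribute. The new vectors are dense, which is to say their entries are (typically) non-zero.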

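But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them by hand. So why not just let the word embeddings be parameters in our model, and update them during training? This is exactly what we will do. We will have some latent semantic attributes that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us.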

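In summary, word embeddings are a representation of the semantics of a word, efficiently encoding semantic information that might be relevant to the task at hand. You can embed other things too: part-of-speech tags, parse trees, anything! The idea of feature embeddings is central to the field.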

Word Embeddings in PyTorch

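Before we get to a worked example and an exercise, a few quick notes about how to use embeddings in PyTorch and in deep learning programming in general. Just as we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings. These indices will be keys into a lookup table. That is, embeddings are stored as a \(|V| \times D\) matrix, where \(D\) is the dimensionality of the embeddings, such that the word assigned index \(i\) has its embedding stored in the \(i\)-th row of the matrix. In all of my code, the mapping from words to indices is a dictionary named word_to_ix.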

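The module that allows you to use embeddings is torch.nn.Embedding, which takes two arguments: the vocabulary size and the dimensionality of the embeddings.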

To index into this table, you must use an integer tensor; the examples here use torch.LongTensor (the indices are integers, not floats).

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator object at 0x7f72ce596470>

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)
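The lookup-table view described above can be checked directly. The following is a small sketch, not part of the original example: it assumes the same two-word `word_to_ix` mapping and compares an embedding lookup against (a) reading the corresponding row of the weight matrix and (b) multiplying a one-hot vector by that matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(len(word_to_ix), 5)  # |V| x D weight matrix, here 2 x 5

idx = torch.tensor([word_to_ix["world"]], dtype=torch.long)
looked_up = embeds(idx)  # shape (1, 5)

# (a) The embedding of word i is simply row i of the weight matrix.
row = embeds.weight[word_to_ix["world"]]  # shape (5,)

# (b) The same row can be selected by a one-hot vector times the matrix.
one_hot = F.one_hot(idx, num_classes=len(word_to_ix)).float()  # shape (1, 2)
via_matmul = one_hot @ embeds.weight  # shape (1, 5)

print(torch.equal(looked_up[0], row))
print(torch.allclose(looked_up, via_matmul))
```

In this sense an embedding layer behaves like a linear layer applied to one-hot inputs, except that it skips the multiplication and indexes the row directly.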

An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words \(w\), we want to compute

\[P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} ) \]

where \(w_i\) is the \(i\)-th word of the sequence.
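To make the conditional probability concrete, here is a count-based sketch with \(n = 3\). This is a plain relative-frequency baseline, not the neural model built in this tutorial; it reuses the three training sentences from the introduction (lowercased, punctuation dropped), and `counts` and `prob` are hypothetical names.

```python
from collections import Counter, defaultdict

# The three training sentences from the introduction, lowercased.
sentences = [
    "the mathematician ran to the store",
    "the physicist ran to the store",
    "the mathematician solved the open problem",
]

# counts[(w_{i-2}, w_{i-1})][w_i] = number of times w_i followed that context
counts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for i in range(2, len(words)):
        context = (words[i - 2], words[i - 1])
        counts[context][words[i]] += 1


def prob(context, word):
    """Relative-frequency estimate of P(word | context); 0.0 if context unseen."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0


print(prob(("the", "mathematician"), "solved"))  # 0.5
print(prob(("the", "physicist"), "solved"))      # 0.0
```

The second probability is zero only because that exact trigram never occurred, which is the data-sparsity problem that learned embeddings are meant to soften.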

In this example, we will compute the loss function on some training examples and update the parameters with backpropagation.

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.
# Each tuple is ([ word_i-CONTEXT_SIZE, ..., word_i-1 ], target word)
ngrams = [
    (
        [test_sentence[i - j - 1] for j in range(CONTEXT_SIZE)],
        test_sentence[i]
    )
    for i in range(CONTEXT_SIZE, len(test_sentence))
]
# Print the first 3, just so you can see what they look like.
print(ngrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])

[(['forty', 'When'], 'winters'), (['winters', 'forty'], 'shall'), (['shall', 'winters'], 'besiege')]
[521.44149518013, 518.8340816497803, 516.2432668209076, 513.668018579483, 511.10753536224365, 508.561292886734, 506.02885699272156, 503.50800943374634, 500.99908232688904, 498.4997034072876]
tensor([-1.8804, -0.7788,  2.0251, -0.0871,  2.3550, -1.0376,  1.5748, -0.6295,
         2.4065,  0.2789], grad_fn=<SelectBackward0>)

Exercise: Computing Word Embeddings: Continuous Bag-of-Words

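The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict a target word given the context of a few words before and a few words after it. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are then used to initialize the embeddings of some more complicated model. This is usually referred to as pretraining embeddings. It almost always helps performance by a couple of percent.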

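The CBOW model is as follows. Given a target word \(w_i\) and a context window of \(N\) words on each side, \(w_{i-1}, \dots, w_{i-N}\) and \(w_{i+1}, \dots, w_{i+N}\), and referring to all context words collectively as \(C\), CBOW tries to minimize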

\[-\log p(w_i | C) = -\log \text{Softmax}\left(A(\sum_{w \in C} q_w) + b\right) \]

where \(q_w\) is the embedding for word \(w\).
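The objective can be evaluated for a single example as follows. This is only a sketch of the formula with randomly initialized parameters, not a full solution to the exercise: the sizes, the index values, and the names `embeddings` and `affine` are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

# Hypothetical sizes, chosen only for illustration.
vocab_size, embedding_dim = 6, 4

embeddings = nn.Embedding(vocab_size, embedding_dim)  # row w holds q_w
affine = nn.Linear(embedding_dim, vocab_size)         # plays the role of A and b

# Made-up indices: four context words (two on each side) and one target word.
context_idxs = torch.tensor([0, 1, 3, 4], dtype=torch.long)
target_idx = torch.tensor([2], dtype=torch.long)

summed = embeddings(context_idxs).sum(dim=0)  # sum of q_w over C, shape (4,)
scores = affine(summed).view(1, -1)           # A(sum q_w) + b, shape (1, 6)
log_probs = F.log_softmax(scores, dim=1)      # log Softmax over the vocabulary
loss = F.nll_loss(log_probs, target_idx)      # -log p(w_i | C)

print(loss.item())
```

Because the context embeddings are summed, the order of the context words does not affect the result, which is why the model is called a "bag" of words.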

Implement this model in PyTorch by filling in the class below. Some tips:

Think about which parameters you need to define.
Make sure you know what shape each operation expects. Use .view() if you need to reshape.

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
        [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass

# Create your model and train. Here are some functions to help you make
# the data ready for use by your module.


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example

[(['are', 'We', 'to', 'study'], 'about'), (['about', 'are', 'study', 'the'], 'to'), (['to', 'about', 'the', 'idea'], 'study'), (['study', 'to', 'idea', 'of'], 'the'), (['the', 'study', 'of', 'a'], 'idea')]

tensor([41, 21, 13, 46])

Total running time of the script: (0 minutes 0.456 seconds)