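NLP From Scratch: Generating Names with a Character-Level RNN

Created On: Mar 24, 2017 | Last Updated: Oct 21, 2024 | Last Verified: Nov 05, 2024

This tutorial is part of a three-part series.

This is the second of three tutorials in our "NLP From Scratch" series. In the first tutorial we used an RNN to classify names into their language of origin. This time we turn that around and generate names from languages.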

> python sample.py Russian RUS
Rovakov
Uantov
Shavakov

> python sample.py German GER
Gerren
Ereng
Rosher

> python sample.py Spanish SPA
Salla
Parer
Allan

> python sample.py Chinese CHI
Chan
Hang
Iun

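We are still hand-crafting a small RNN with a few linear layers. The main difference is that instead of predicting a category after reading in all the letters of a name, we input a category and output one letter at a time. Recurrently predicting characters to form language (this could also be done with words or other higher-order constructs) is often referred to as a "language model".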

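Recommended Reading

I assume you have at least installed PyTorch, know Python, and understand Tensors:

- https://pytorch.com.tw/ for installation instructions
- Deep Learning with PyTorch: A 60 Minute Blitz to get started with PyTorch in general
- Learning PyTorch with Examples for a wide and deep overview
- PyTorch for Former Torch Users if you are a former Lua Torch user

It would also be useful to know about RNNs and how they work:

- The Unreasonable Effectiveness of Recurrent Neural Networks shows a bunch of real-life examples
- Understanding LSTM Networks is about LSTMs specifically but is also informative about RNNs in general

I also suggest the previous tutorial, NLP From Scratch: Classifying Names with a Character-Level RNN.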

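Preparing the Data

Note

Download the data from https://download.pytorch.org/tutorial/data.zip and extract it to the current directory.

See the previous tutorial for more detail on this process. In short, there are a number of plain text files data/names/[Language].txt with one name per line. We split the lines into an array, convert Unicode to ASCII, and end up with a dictionary {language: [names ...]}.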

from io import open
import glob
import os
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1 # Plus EOS marker

def findFiles(path): return glob.glob(path)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as some_file:
        return [unicodeToAscii(line.strip()) for line in some_file]

# Build the category_lines dictionary, a list of lines per category
category_lines = {}
all_categories = []
for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

if n_categories == 0:
    raise RuntimeError('Data not found. Make sure that you downloaded data '
        'from https://download.pytorch.org/tutorial/data.zip and extract it to '
        'the current directory.')

print('# categories:', n_categories, all_categories)
print(unicodeToAscii("O'Néàl"))

# categories: 18 ['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French', 'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean', 'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish', 'Vietnamese']
O'Neal

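Creating the Network

This network extends the RNN from the previous tutorial with an extra argument for the category tensor, which is concatenated along with the others. The category tensor is a one-hot vector, just like the letter input.

The output, produced by a LogSoftmax layer, is interpreted as the log-probability of each possible next letter. When sampling, the most likely output letter is used as the next input letter.

I added a second linear layer o2o (after combining hidden and output) to give it more capacity to work with. There is also a dropout layer, which randomly zeros parts of its input with a given probability (here 0.1) and is usually used to fuzz inputs to prevent overfitting. Here we use it towards the end of the network to purposely add some chaos and increase sampling variety.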
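To make the dropout step concrete, here is a minimal framework-free sketch of what a dropout layer does in training mode, written in plain Python rather than with nn.Dropout. The function name dropout and its rng parameter are assumptions made for illustration. Like nn.Dropout, the sketch rescales the surviving values by 1 / (1 - p) so that the expected value of each element is unchanged.

```python
import random

def dropout(values, p=0.1, rng=random):
    """Zero each value with probability p and rescale the survivors.

    Illustrative stand-in for a dropout layer in training mode; not the
    PyTorch implementation.
    """
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

# Each call zeros a different random subset, so repeated calls differ.
scores = [1.0, 2.0, 3.0, 4.0]
print(dropout(scores))
print(dropout(scores))
```

nn.Dropout is only active while the module is in training mode. The code in this tutorial never calls rnn.eval(), so the layer presumably stays active during sampling, which is consistent with the stated goal of adding variety.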

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        input_combined = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
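As a rough check on the layer dimensions, the following sketch works out the sizes of the concatenated vectors and the resulting parameter count in plain Python. It assumes the 18 categories reported in the output above and the hidden size of 128 used when the network is instantiated later in this tutorial; linear_params is a hypothetical helper, and the count assumes each nn.Linear keeps its default bias.

```python
import string

n_letters = len(string.ascii_letters + " .,;'-") + 1  # 58 characters plus EOS
n_categories = 18   # number of language files reported in the tutorial output
hidden_size = 128   # value passed to RNN(...) later in the tutorial

combined_in = n_categories + n_letters + hidden_size  # input width of i2h and i2o
combined_out = hidden_size + n_letters                 # input width of o2o

def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

total = (linear_params(combined_in, hidden_size)    # i2h
         + linear_params(combined_in, n_letters)    # i2o
         + linear_params(combined_out, n_letters))  # o2o
print(combined_in, combined_out, total)  # 205 187 49614
```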

Training

Preparing for Training

First of all, helper functions to get random pairs of (category, line):

import random

# Random item from a list
def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

# Get a random category and random line from that category
def randomTrainingPair():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    return category, line

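For each timestep (that is, for each letter in a training word) the inputs of the network will be (category, current letter, hidden state) and the outputs will be (next letter, next hidden state). So for each training example, we need the category, a set of input letters, and a set of output/target letters.

Since we are predicting the next letter from the current letter at each timestep, the letter pairs are groups of consecutive letters from the line. For example, for "ABCD<EOS>" we would create ("A", "B"), ("B", "C"), ("C", "D"), ("D", "EOS").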
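The pairing can be sketched in a few lines of plain Python. The helper letter_pairs and the printable "<EOS>" string are assumptions for illustration only; the tutorial itself represents EOS as the last letter index rather than as a string.

```python
EOS = "<EOS>"  # hypothetical printable stand-in for the end-of-sequence marker

def letter_pairs(line):
    """Return (current letter, next letter) pairs, with EOS as the final target."""
    targets = list(line[1:]) + [EOS]
    return list(zip(line, targets))

print(letter_pairs("ABCD"))
# [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', '<EOS>')]
```

Every letter of the line appears once as an input, so a name of length n yields n training steps.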

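The category tensor is a one-hot tensor of size <1 x n_categories>. When training we feed it to the network at every timestep. This is a design choice; it could instead be included as part of the initial hidden state or through some other strategy.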

# One-hot vector for category
def categoryTensor(category):
    li = all_categories.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][li] = 1
    return tensor

# One-hot matrix of first to last letters (not including EOS) for input
def inputTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

# ``LongTensor`` of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1) # EOS
    return torch.LongTensor(letter_indexes)
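To see which indexes these helpers produce without building tensors, here is a small plain-Python sketch. The names input_indexes, target_indexes, and EOS_INDEX are hypothetical; the index arithmetic mirrors inputTensor and targetTensor above.

```python
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1   # the last index is reserved for EOS
EOS_INDEX = n_letters - 1

def input_indexes(line):
    """Index of the one-hot slot set for each input letter."""
    return [all_letters.find(ch) for ch in line]

def target_indexes(line):
    """Indexes of the second letter onward, followed by EOS."""
    return [all_letters.find(ch) for ch in line[1:]] + [EOS_INDEX]

print(input_indexes("Abe"))   # [26, 1, 4]
print(target_indexes("Abe"))  # [1, 4, 58]
print(all_letters.find("é"))  # -1
```

One caveat worth knowing: str.find returns -1 for a character outside all_letters, and index -1 addresses the final slot, which is the EOS slot. Lines read through unicodeToAscii are already filtered, but a start letter passed by hand is not, so it may be worth validating such input.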

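For convenience during training we'll make a randomTrainingExample function that fetches a random (category, line) pair and turns it into the required (category, input, target) tensors.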

# Make category, input, and target tensors from a random category, line pair
def randomTrainingExample():
    category, line = randomTrainingPair()
    category_tensor = categoryTensor(category)
    input_line_tensor = inputTensor(line)
    target_line_tensor = targetTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor

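Training the Network

In contrast to classification, where only the last output is used, we are making a prediction at every step, so we calculate the loss at every step.

The magic of autograd allows you to simply sum these losses at each step and call backward at the end.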

criterion = nn.NLLLoss()

learning_rate = 0.0005

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = rnn.initHidden()

    rnn.zero_grad()

    loss = torch.Tensor([0]) # you can also just simply use ``loss = 0``

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item() / input_line_tensor.size(0)
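The value returned by train is the summed per-step loss divided by the number of letters, that is, an average negative log-likelihood per letter. The sketch below reproduces that arithmetic in plain Python for made-up probabilities; average_nll is a hypothetical helper and the probabilities are illustrative, not taken from a trained model.

```python
import math

def average_nll(correct_letter_probs):
    """Mean negative log-likelihood over the time steps of one name."""
    total = sum(-math.log(p) for p in correct_letter_probs)
    return total / len(correct_letter_probs)

# Hypothetical probabilities the model assigned to the correct next letter.
loss = average_nll([0.5, 0.25, 0.125])
print(round(loss, 4))            # 1.3863, which is 2 * ln(2)
print(round(math.exp(loss), 2))  # 4.0, the effective number of equally likely choices

# Baseline: guessing uniformly over all 59 outputs (58 characters plus EOS).
print(round(math.log(59), 4))    # 4.0775
```

Read this way, an average loss near 2.77 corresponds to roughly 16 equally likely choices per letter, compared with 59 for uniform guessing. Each loss printed during training comes from a single random name, so individual values are noisy.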

To keep track of how long training takes I am adding a timeSince(timestamp) function which returns a human-readable string:

import time
import math

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

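Training is business as usual: call train a bunch of times and wait a few minutes, printing the current time and loss every print_every examples, and storing the average loss per plot_every examples in all_losses for plotting later.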

rnn = RNN(n_letters, 128, n_letters)

n_iters = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0 # Reset every ``plot_every`` ``iters``

start = time.time()

for iter in range(1, n_iters + 1):
    output, loss = train(*randomTrainingExample())
    total_loss += loss

    if iter % print_every == 0:
        print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, loss))

    if iter % plot_every == 0:
        all_losses.append(total_loss / plot_every)
        total_loss = 0

0m 10s (5000 5%) 2.7686
0m 20s (10000 10%) 2.5023
0m 31s (15000 15%) 3.0934
0m 42s (20000 20%) 2.2183
0m 52s (25000 25%) 2.7072
1m 3s (30000 30%) 2.0118
1m 13s (35000 35%) 2.6332
1m 24s (40000 40%) 2.5150
1m 34s (45000 45%) 2.1207
1m 45s (50000 50%) 2.8377
1m 55s (55000 55%) 2.6050
2m 5s (60000 60%) 2.5402
2m 16s (65000 65%) 2.7429
2m 26s (70000 70%) 1.5367
2m 37s (75000 75%) 0.8677
2m 47s (80000 80%) 3.1882
2m 58s (85000 85%) 1.9431
3m 8s (90000 90%) 2.7269
3m 19s (95000 95%) 3.2194
3m 29s (100000 100%) 1.7136

Plotting the Losses

Plotting the historical loss from all_losses shows the network learning:

import matplotlib.pyplot as plt

plt.figure()
plt.plot(all_losses)

[<matplotlib.lines.Line2D object at 0x7f12a1748f70>]

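Sampling the Network

To sample, we give the network a letter and ask what the next one is, feed that in as the next letter, and repeat until the EOS token.

- Create tensors for the input category, starting letter, and empty hidden state
- Create a string output_name with the starting letter
- Up to a maximum output length,
  - Feed the current letter to the network
  - Get the next letter from the highest output, and the next hidden state
  - If the letter is EOS, stop here
  - If it is a regular letter, add it to output_name and continue
- Return the final name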
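The control flow of these steps can be sketched without a trained network. In the toy below, a hand-written lookup table toy_scores stands in for the RNN; the table, its values, and the function greedy_sample are all assumptions for illustration. Unlike the real model, the toy has no category input and no hidden state, so it only shows the loop structure.

```python
EOS = "<EOS>"  # hypothetical printable stand-in for the end-of-sequence marker

# Made-up next-letter scores keyed by the current letter.
toy_scores = {
    "C": {"h": 0.9, "a": 0.1},
    "h": {"a": 0.8, EOS: 0.2},
    "a": {"n": 0.7, EOS: 0.3},
    "n": {EOS: 0.6, "g": 0.4},
}

def greedy_sample(start_letter, max_length=20):
    """Repeatedly take the highest-scoring next letter until EOS or max_length."""
    output_name = start_letter
    letter = start_letter
    for _ in range(max_length):
        scores = toy_scores[letter]
        letter = max(scores, key=scores.get)  # highest-scoring candidate
        if letter == EOS:
            break
        output_name += letter
    return output_name

print(greedy_sample("C"))  # Chan
```

The max_length bound matters because nothing guarantees that a model ever emits EOS.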

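Note

Rather than having to give it a starting letter, another strategy would have been to include a "start of string" token in training and have the network choose its own starting letter.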

max_length = 20

# Sample from a category and starting letter
def sample(category, start_letter='A'):
    with torch.no_grad():  # no need to track history in sampling
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        hidden = rnn.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_letters - 1:
                break
            else:
                letter = all_letters[topi]
                output_name += letter
            input = inputTensor(letter)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters='ABC'):
    for start_letter in start_letters:
        print(sample(category, start_letter))

samples('Russian', 'RUS')

samples('German', 'GER')

samples('Spanish', 'SPA')

samples('Chinese', 'CHI')

Rovakov
Uakinov
Shantov
Gangeng
Erenger
Ronger
Sarera
Parer
Aras
Chan
Han
Iu
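The sample function above always takes the top-scoring letter, so any variety between runs presumably comes only from the dropout layer. A common alternative, which this tutorial does not use, is to draw the next letter at random from the predicted distribution, optionally sharpened or flattened by a temperature. The sketch below shows that idea on a plain list of log-probabilities; sample_index and its parameters are hypothetical names, not part of the tutorial code.

```python
import math
import random

def sample_index(log_probs, temperature=1.0, rng=random):
    """Draw an index with probability proportional to exp(log_prob / temperature)."""
    scaled = [lp / temperature for lp in log_probs]
    top = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Illustrative log-probabilities for three candidate letters.
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
print([sample_index(log_probs, temperature=1.0) for _ in range(10)])
print([sample_index(log_probs, temperature=0.01) for _ in range(10)])
```

A temperature close to zero behaves like the greedy choice, while larger values spread the draws across less likely letters.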

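Exercises

- Try with a different dataset of category -> line, for example:
  - Fictional series -> Character name
  - Part of speech -> Word
  - Country -> City
- Use a "start of sentence" token so that sampling can be done without choosing a start letter
- Get better results with a bigger and/or better shaped network
  - Try the nn.LSTM and nn.GRU layers
  - Combine multiple of these RNNs as a higher level network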

Total running time of the script: (3 minutes 29.841 seconds)
