ASR Inference with CUDA CTC Decoder

Author: Yuekai Zhang

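This tutorial shows how to perform speech recognition inference using a CUDA-based CTC beam search decoder. We demonstrate this on a pretrained Zipformer model from the Next-gen Kaldi project.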

Overview

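Beam search decoding works by iteratively expanding text hypotheses (beams) with the next possible characters, and keeping only the highest-scoring hypotheses at each time step.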

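The underlying implementation uses CUDA to accelerate the whole decoding process. The mathematical formulation of the decoder can be found in this paper, and a more detailed description of the algorithm can be found in this blog.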
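To make the beam search procedure above concrete, here is a minimal pure-Python sketch of classic CTC prefix beam search. It is an illustration only and not the CUDA implementation used by torchaudio; the function name `ctc_prefix_beam_search`, the two-token vocabulary, and the probabilities are all made up for this example.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_prefix_beam_search(log_probs, beam_size, blank=0):
    """Toy CTC prefix beam search (hypothetical helper, not the torchaudio API).

    log_probs: list of frames, each a list of per-token log-probabilities.
    Returns a list of (prefix, log_score) sorted from best to worst.
    """
    # prefix -> (log prob of paths ending in blank, log prob of paths ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for token, lp in enumerate(frame):
                if token == blank:
                    # A blank keeps the prefix unchanged.
                    nb, nnb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb, p_b + lp, p_nb + lp), nnb)
                elif prefix and token == prefix[-1]:
                    # A repeated token collapses into the same prefix ...
                    nb, nnb = next_beams[prefix]
                    next_beams[prefix] = (nb, logsumexp(nnb, p_nb + lp))
                    # ... unless a blank separated the two occurrences.
                    extended = prefix + (token,)
                    nb, nnb = next_beams[extended]
                    next_beams[extended] = (nb, logsumexp(nnb, p_b + lp))
                else:
                    # A different token always extends the prefix.
                    extended = prefix + (token,)
                    nb, nnb = next_beams[extended]
                    next_beams[extended] = (nb, logsumexp(nnb, p_b + lp, p_nb + lp))
        # Keep only the beam_size highest-scoring prefixes.
        ranked = sorted(next_beams.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])
    return [(prefix, logsumexp(p_b, p_nb)) for prefix, (p_b, p_nb) in beams.items()]


# Two frames over a toy vocabulary: index 0 is blank, index 1 is a single token.
frames = [[math.log(0.6), math.log(0.4)]] * 2
for beam_size in (1, 2):
    best_prefix, best_score = ctc_prefix_beam_search(frames, beam_size)[0]
    print(beam_size, best_prefix, round(math.exp(best_score), 2))
```

With these toy numbers, a beam of size 1 ends with the empty prefix (probability 0.36), while a beam of size 2 finds the prefix containing token `1`, whose alignments sum to 0.64. This is the kind of gain a wider beam can provide.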

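Running ASR inference with the CUDA CTC beam search decoder requires the following components:

Acoustic model: a model that predicts modeling units (BPE in this tutorial) from acoustic features
BPE model: the byte-pair encoding (BPE) tokenizer file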

Acoustic Model and Set Up

First, we import the necessary utilities and fetch the data we will be working with.

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

2.7.0
2.7.0

import time
from pathlib import Path

import IPython
import sentencepiece as spm
from torchaudio.models.decoder import cuda_ctc_decoder
from torchaudio.utils import download_asset

We use a pretrained Zipformer model trained on the LibriSpeech dataset. The model is jointly trained with CTC and Transducer loss functions. In this tutorial, we only use the CTC head of the model.

def download_asset_external(url, key):
    path = Path(torch.hub.get_dir()) / "torchaudio" / Path(key)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        torch.hub.download_url_to_file(url, path)
    return str(path)


url_prefix = "https://huggingface.tw/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-ctc-2022-12-01"
model_link = f"{url_prefix}/resolve/main/exp/cpu_jit.pt"
model_path = download_asset_external(model_link, "cuda_ctc_decoder/cpu_jit.pt")

We will load a sample from the LibriSpeech test-other dataset.

speech_file = download_asset("tutorial-assets/ctc-decoding/1688-142285-0007.wav")
waveform, sample_rate = torchaudio.load(speech_file)
assert sample_rate == 16000
IPython.display.Audio(speech_file)

The transcript corresponding to this audio file is:

i really was very much afraid of showing him how much shocked i was at some parts of what he said

Files and Data for Decoder

Next, we load the tokens from the BPE model, which is the tokenizer used for decoding.

Tokens

# tokens
<blk>
<sos/eos>
<unk>
S
_THE
_A
T
_AND
...

bpe_link = f"{url_prefix}/resolve/main/data/lang_bpe_500/bpe.model"
bpe_path = download_asset_external(bpe_link, "cuda_ctc_decoder/bpe.model")

bpe_model = spm.SentencePieceProcessor()
bpe_model.load(bpe_path)
tokens = [bpe_model.id_to_piece(id) for id in range(bpe_model.get_piece_size())]
print(tokens)

['<blk>', '<sos/eos>', '<unk>', 'S', '▁THE', '▁A', 'T', '▁AND', 'ED', '▁OF', '▁TO', 'E', 'D', 'N', 'ING', '▁IN', 'Y', 'M', 'C', '▁I', 'A', 'P', '▁HE', 'R', 'O', 'L', 'RE', 'I', 'U', 'ER', '▁IT', 'LY', '▁THAT', '▁WAS', '▁', '▁S', 'AR', '▁BE', 'F', '▁C', 'IN', 'B', '▁FOR', 'OR', 'LE', "'", '▁HIS', '▁YOU', 'AL', '▁RE', 'V', '▁B', 'G', 'RI', '▁E', '▁WITH', '▁T', '▁AS', 'LL', '▁P', '▁HER', 'ST', '▁HAD', '▁SO', '▁F', 'W', 'CE', '▁IS', 'ND', '▁NOT', 'TH', '▁BUT', 'EN', '▁SHE', '▁ON', 'VE', 'ON', 'SE', '▁DE', 'UR', '▁G', 'CH', 'K', 'TER', '▁AT', 'IT', '▁ME', 'RO', 'NE', 'RA', 'ES', 'IL', 'NG', 'IC', '▁NO', '▁HIM', 'ENT', 'IR', '▁WE', 'H', '▁DO', '▁ALL', '▁HAVE', 'LO', '▁BY', '▁MY', '▁MO', '▁THIS', 'LA', '▁ST', '▁WHICH', '▁CON', '▁THEY', 'CK', 'TE', '▁SAID', '▁FROM', '▁GO', '▁WHO', '▁TH', '▁OR', '▁D', '▁W', 'VER', 'LI', '▁SE', '▁ONE', '▁CA', '▁AN', '▁LA', '▁WERE', 'EL', '▁HA', '▁MAN', '▁FA', '▁EX', 'AD', '▁SU', 'RY', '▁MI', 'AT', '▁BO', '▁WHEN', 'AN', 'THER', 'PP', 'ATION', '▁FI', '▁WOULD', '▁PRO', 'OW', 'ET', '▁O', '▁THERE', '▁HO', 'ION', '▁WHAT', '▁FE', '▁PA', 'US', 'MENT', '▁MA', 'UT', '▁OUT', '▁THEIR', '▁IF', '▁LI', '▁K', '▁WILL', '▁ARE', 'ID', '▁RO', 'DE', 'TION', '▁WA', 'PE', '▁UP', '▁SP', '▁PO', 'IGHT', '▁UN', 'RU', '▁LO', 'AS', 'OL', '▁LE', '▁BEEN', '▁SH', '▁RA', '▁SEE', 'KE', 'UL', 'TED', '▁SA', 'UN', 'UND', 'ANT', '▁NE', 'IS', '▁THEM', 'CI', 'GE', '▁COULD', '▁DIS', 'OM', 'ISH', 'HE', 'EST', '▁SOME', 'ENCE', 'ITY', 'IVE', '▁US', '▁MORE', '▁EN', 'ARD', 'ATE', '▁YOUR', '▁INTO', '▁KNOW', '▁CO', 'ANCE', '▁TIME', '▁WI', '▁YE', 'AGE', '▁NOW', 'TI', 'FF', 'ABLE', '▁VERY', '▁LIKE', 'AM', 'HI', 'Z', '▁OTHER', '▁THAN', '▁LITTLE', '▁DID', '▁LOOK', 'TY', 'ERS', '▁CAN', '▁CHA', '▁AR', 'X', 'FUL', 'UGH', '▁BA', '▁DAY', '▁ABOUT', 'TEN', 'IM', '▁ANY', '▁PRE', '▁OVER', 'IES', 'NESS', 'ME', 'BLE', '▁M', 'ROW', '▁HAS', '▁GREAT', '▁VI', 'TA', '▁AFTER', 'PER', '▁AGAIN', 'HO', 'SH', '▁UPON', '▁DI', '▁HAND', '▁COM', 'IST', 'TURE', '▁STA', '▁THEN', '▁SHOULD', '▁GA', 'OUS', 'OUR', '▁WELL', 
'▁ONLY', 'MAN', '▁GOOD', '▁TWO', '▁MAR', '▁SAY', '▁HU', 'TING', '▁OUR', 'RESS', '▁DOWN', 'IOUS', '▁BEFORE', '▁DA', '▁NA', 'QUI', '▁MADE', '▁EVERY', '▁OLD', '▁EVEN', 'IG', '▁COME', '▁GRA', '▁RI', '▁LONG', 'OT', 'SIDE', 'WARD', '▁FO', '▁WHERE', 'MO', 'LESS', '▁SC', '▁MUST', '▁NEVER', '▁HOW', '▁CAME', '▁SUCH', '▁RU', '▁TAKE', '▁WO', '▁CAR', 'UM', 'AK', '▁THINK', '▁MUCH', '▁MISTER', '▁MAY', '▁JO', '▁WAY', '▁COMP', '▁THOUGHT', '▁STO', '▁MEN', '▁BACK', '▁DON', 'J', '▁LET', '▁TRA', '▁FIRST', '▁JUST', '▁VA', '▁OWN', '▁PLA', '▁MAKE', 'ATED', '▁HIMSELF', '▁WENT', '▁PI', 'GG', 'RING', '▁DU', '▁MIGHT', '▁PART', '▁GIVE', '▁IMP', '▁BU', '▁PER', '▁PLACE', '▁HOUSE', '▁THROUGH', 'IAN', '▁SW', '▁UNDER', 'QUE', '▁AWAY', '▁LOVE', 'QUA', '▁LIFE', '▁GET', '▁WITHOUT', '▁PASS', '▁TURN', 'IGN', '▁HEAD', '▁MOST', '▁THOSE', '▁SHALL', '▁EYES', '▁COL', '▁STILL', '▁NIGHT', '▁NOTHING', 'ITION', 'HA', '▁TELL', '▁WORK', '▁LAST', '▁NEW', '▁FACE', '▁HI', '▁WORD', '▁FOUND', '▁COUNT', '▁OB', '▁WHILE', '▁SHA', '▁MEAN', '▁SAW', '▁PEOPLE', '▁FRIEND', '▁THREE', '▁ROOM', '▁SAME', '▁THOUGH', '▁RIGHT', '▁CHILD', '▁FATHER', '▁ANOTHER', '▁HEART', '▁WANT', '▁TOOK', 'OOK', '▁LIGHT', '▁MISSUS', '▁OPEN', '▁JU', '▁ASKED', 'PORT', '▁LEFT', '▁JA', '▁WORLD', '▁HOME', '▁WHY', '▁ALWAYS', '▁ANSWER', '▁SEEMED', '▁SOMETHING', '▁GIRL', '▁BECAUSE', '▁NAME', '▁TOLD', '▁NI', '▁HIGH', 'IZE', '▁WOMAN', '▁FOLLOW', '▁RETURN', '▁KNEW', '▁EACH', '▁KIND', '▁JE', '▁ACT', '▁LU', '▁CERTAIN', '▁YEARS', '▁QUITE', '▁APPEAR', '▁BETTER', '▁HALF', '▁PRESENT', '▁PRINCE', 'SHIP', '▁ALSO', '▁BEGAN', '▁HAVING', '▁ENOUGH', '▁PERSON', '▁LADY', '▁WHITE', '▁COURSE', '▁VOICE', '▁SPEAK', '▁POWER', '▁MORNING', '▁BETWEEN', '▁AMONG', '▁KEEP', '▁WALK', '▁MATTER', '▁TEA', '▁BELIEVE', '▁SMALL', '▁TALK', '▁FELT', '▁HORSE', '▁MYSELF', '▁SIX', '▁HOWEVER', '▁FULL', '▁HERSELF', '▁POINT', '▁STOOD', '▁HUNDRED', '▁ALMOST', '▁SINCE', '▁LARGE', '▁LEAVE', '▁PERHAPS', '▁DARK', '▁SUDDEN', '▁REPLIED', '▁ANYTHING', '▁WONDER', '▁UNTIL', 'Q']

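The tokens are the possible symbols that the acoustic model can predict, including the blank symbol used in CTC. In this tutorial, there are 500 BPE tokens. They can be passed in either as a file, where each line holds the token for the corresponding index, or as a list of tokens, where each token maps to a unique index.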
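The two ways of supplying tokens described above can be sketched as follows. This is a minimal illustration that assumes a plain one-token-per-line text file; the file name `tokens.txt` and the shortened token list are hypothetical, and the decoder itself is not called here.

```python
import tempfile
from pathlib import Path

# A shortened token list for illustration; the tutorial's real list has 500 entries.
tokens = ["<blk>", "<sos/eos>", "<unk>", "S", "▁THE", "▁A", "T", "▁AND"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # File form: line number i (0-based) holds the token with index i.
    tokens_file = Path(tmp_dir) / "tokens.txt"  # hypothetical file name
    tokens_file.write_text("\n".join(tokens) + "\n", encoding="utf-8")

    # Reading the file back recovers the list form.
    loaded = tokens_file.read_text(encoding="utf-8").splitlines()

# In both forms, a token's index is simply its position.
token_to_index = {token: index for index, token in enumerate(loaded)}
print(token_to_index["<blk>"], token_to_index["▁THE"])
```

Either form carries the same information, so the choice mostly depends on whether the vocabulary already lives on disk or is produced in code, as it is in this tutorial.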

Construct CUDA Decoder

In this tutorial, we construct a CUDA beam search decoder using the factory function `cuda_ctc_decoder()`.

cuda_decoder = cuda_ctc_decoder(tokens, nbest=10, beam_size=10, blank_skip_threshold=0.95)

Run Inference

Now that we have the data, acoustic model, and decoder, we can perform inference. The output of the beam search decoder is of type `CUCTCHypothesis`, consisting of the predicted token IDs, words (the symbols corresponding to the token IDs), and hypothesis scores. Recall that the transcript corresponding to the waveform is:

i really was very much afraid of showing him how much shocked i was at some parts of what he said

actual_transcript = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
actual_transcript = actual_transcript.split()

device = torch.device("cuda", 0)
acoustic_model = torch.jit.load(model_path)
acoustic_model.to(device)
acoustic_model.eval()

waveform = waveform.to(device)

feat = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80, snip_edges=False)
feat = feat.unsqueeze(0)
feat_lens = torch.tensor(feat.size(1), device=device).unsqueeze(0)

encoder_out, encoder_out_lens = acoustic_model.encoder(feat, feat_lens)
nnet_output = acoustic_model.ctc_output(encoder_out)
log_prob = torch.nn.functional.log_softmax(nnet_output, -1)

print(f"The shape of log_prob: {log_prob.shape}, the shape of encoder_out_lens: {encoder_out_lens.shape}")

The shape of log_prob: torch.Size([1, 175, 500]), the shape of encoder_out_lens: torch.Size([1])

The CUDA CTC decoder gives the following result.

results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))
beam_search_transcript = bpe_model.decode(results[0][0].tokens).lower()
beam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_transcript.split()) / len(
    actual_transcript
)

print(f"Transcript: {beam_search_transcript}")
print(f"WER: {beam_search_wer}")

Transcript: i really was very much afraid of showing him how much shocked i was at some parts of what he said
WER: 0.0
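In this section, token IDs are turned back into text with `bpe_model.decode`. Conceptually, SentencePiece-style pieces are concatenated and the word-boundary marker "▁" (U+2581) becomes a space. The sketch below illustrates that idea with a hypothetical helper `pieces_to_text`; the pieces are hand-picked from the token list shown earlier and are not necessarily the segmentation the model actually produced.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker that starts a new word in SentencePiece pieces


def pieces_to_text(pieces):
    """Join BPE pieces into text (illustrative stand-in for the tokenizer's decode)."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


# Hand-picked pieces, assumed for illustration only.
pieces = ["▁I", "▁RE", "AL", "LY", "▁WAS"]
text = pieces_to_text(pieces).lower()
print(text)
```

In practice the tokenizer's own decode method should be used, as the tutorial does, since it also handles special tokens and edge cases that this sketch ignores.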

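Beam Search Decoder Parameters

In this section, we go a little more in depth about some of the parameters and their tradeoffs. For the full list of customizable parameters, please refer to the `cuda_ctc_decoder()` documentation.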

Helper Function

def print_decoded(cuda_decoder, bpe_model, log_prob, encoder_out_lens, param, param_value):
    start_time = time.monotonic()
    results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))
    decode_time = time.monotonic() - start_time
    transcript = bpe_model.decode(results[0][0].tokens).lower()
    score = results[0][0].score
    print(f"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)")

nbest

for i in range(10):
    transcript = bpe_model.decode(results[0][i].tokens).lower()
    score = results[0][i].score
    print(f"{transcript} (score: {score})")

i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20280733704566956)
i really was very much afraid of showing him how much shocked i was at some part of what he said (score: -1.7408883571624756)
i really was very much afraid of sheowing him how much shocked i was at some parts of what he said (score: -6.67951774597168)
i reallyly very much afraid of showing him how much shocked i was at some parts of what he said (score: -7.597038745880127)
i really was very much afraid of sheowing him how much shocked i was at some part of what he said (score: -8.224080085754395)
i really was very much afraid of shwing him how much shocked i was at some parts of what he said (score: -8.439373970031738)
i really was very much afraid of showing him how much shocked i was in some parts of what he said (score: -8.781461715698242)
i really was very much afraid of showing him how much shocked i was at some parts of what said (score: -8.883706092834473)
i really was very much afraid of showing him how much shocked i was at some partes of what he said (score: -8.999059677124023)
i really was very much afraid of showing him how much shocked i was at some parts of what he say (score: -9.138861656188965)

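This parameter indicates the number of best hypotheses to return. For instance, by setting `nbest=10` when constructing the beam search decoder earlier, we can now access the 10 highest-scoring hypotheses.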

beam size

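The `beam_size` parameter determines the maximum number of best hypotheses to keep after each decoding step. A larger beam size explores a wider range of possible hypotheses, which can produce higher-scoring results, but it brings no additional gain beyond a certain point. We recommend setting `beam_size=10` for the CUDA beam search decoder.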

beam_sizes = [1, 2, 3, 10]

for beam_size in beam_sizes:
    beam_search_decoder = cuda_ctc_decoder(
        tokens,
        nbest=1,
        beam_size=beam_size,
        blank_skip_threshold=0.95,
    )
    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, "beam size", beam_size)

beam size 1  : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -1.35; 0.0010 secs)
beam size 2  : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.21; 0.0009 secs)
beam size 3  : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0009 secs)
beam size 10 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0010 secs)

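In the example above, the transcript is the same for every beam size, but the score of the best hypothesis improves from -1.35 to -0.20 as the beam size increases from 1 to 3. Note that a beam size of 3 already gives the same output as a beam size of 10.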

blank skip threshold

blank_skip_probs = [0.25, 0.95, 1.0]

for blank_skip_prob in blank_skip_probs:
    beam_search_decoder = cuda_ctc_decoder(
        tokens,
        nbest=10,
        beam_size=10,
        blank_skip_threshold=blank_skip_prob,
    )
    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, "blank_skip_threshold", blank_skip_prob)

del cuda_decoder

blank_skip_threshold 0.25: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: -0.01; 0.0009 secs)
blank_skip_threshold 0.95: i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0010 secs)
blank_skip_threshold 1.0: i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.21; 0.0043 secs)

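The `blank_skip_threshold` parameter is used to prune frames that have a large blank probability. Pruning these frames with a suitable `blank_skip_threshold` can greatly speed up decoding with no drop in accuracy. Following the rules of CTC, at least one blank frame is kept between two non-blank frames to avoid mistakenly merging two consecutive identical symbols. In the output above, an aggressive threshold of 0.25 changes the transcript ("part" instead of "parts"), whereas 0.95 gives the same transcript as 1.0 (no frame skipping) in about a quarter of the time. We recommend setting `blank_skip_threshold=0.95` for the CUDA beam search decoder.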
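The frame-pruning rule described above can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the logic of the actual CUDA kernel: the helper `select_frames`, the strict "greater than" comparison, and the blank probabilities are all made up for this example.

```python
import math


def select_frames(blank_log_probs, blank_skip_threshold):
    """Return the indices of the frames that survive blank skipping (illustrative only)."""
    log_threshold = math.log(blank_skip_threshold)
    kept = []
    skipped_blank = None  # first blank-dominated frame skipped since the last kept frame
    for t, blank_lp in enumerate(blank_log_probs):
        if blank_lp > log_threshold:
            # Blank-dominated frame: skip it, but remember one as a possible separator.
            if skipped_blank is None:
                skipped_blank = t
            continue
        # Keep one blank frame between two kept non-blank frames so that
        # consecutive identical symbols are not merged.
        if skipped_blank is not None and kept:
            kept.append(skipped_blank)
        skipped_blank = None
        kept.append(t)
    return kept


# Made-up per-frame blank probabilities for a 7-frame utterance.
blank_probs = [0.99, 0.10, 0.99, 0.98, 0.20, 0.50, 0.99]
blank_log_probs = [math.log(p) for p in blank_probs]

kept_095 = select_frames(blank_log_probs, 0.95)
kept_100 = select_frames(blank_log_probs, 1.0)
print(kept_095)
print(kept_100)
```

With a threshold of 0.95, only frames 1, 2, 4, and 5 remain: frame 2 is the single blank separator kept between the non-blank frames 1 and 4. With a threshold of 1.0, no frame exceeds the threshold, so nothing is skipped.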

Benchmark with flashlight CPU decoder

We benchmark the throughput and accuracy of the CUDA decoder against the CPU decoder on the LibriSpeech test_other set. To reproduce the benchmark results below, you may refer here.

Decoder	Setting	WER (%)	N-Best Oracle WER (%)	Decoding Time (seconds)
CUDA decoder	blank_skip_threshold 0.95	5.81	4.11	2.57
CUDA decoder	blank_skip_threshold 1.0 (no frame skipping)	5.81	4.09	6.24
CPU decoder	beam_size_token 10	5.86	4.30	28.61
CPU decoder	beam_size_token 500	5.86	4.30	791.80

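From the table above, the CUDA decoder gives a slight improvement in WER (5.81% versus 5.86%) and a significant boost in throughput (2.57 seconds versus at best 28.61 seconds for the CPU decoder).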
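The two accuracy columns in the table can be understood with a small sketch. WER is the word-level edit distance divided by the number of reference words, and N-best oracle WER is commonly computed by picking, for each utterance, the hypothesis in the n-best list with the fewest errors. The helpers below (`edit_distance`, `wer`, `oracle_wer`) are illustrative pure-Python stand-ins, not the benchmark's actual scoring code; the example hypotheses are taken from the outputs shown earlier in this tutorial.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(references, hypotheses))
    return errors / sum(len(r.split()) for r in references)


def oracle_wer(references, nbest_lists):
    """WER obtained when the best hypothesis of each n-best list is chosen."""
    errors = sum(
        min(edit_distance(r.split(), h.split()) for h in nbest)
        for r, nbest in zip(references, nbest_lists)
    )
    return errors / sum(len(r.split()) for r in references)


reference = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
# An n-best list whose top entry has one error ("part") and whose second entry is correct.
nbest = [
    "i really was very much afraid of showing him how much shocked i was at some part of what he said",
    "i really was very much afraid of showing him how much shocked i was at some parts of what he said",
]

one_best_wer = wer([reference], [nbest[0]])
nbest_oracle_wer = oracle_wer([reference], [nbest])
print(f"1-best WER: {one_best_wer:.4f}, oracle WER: {nbest_oracle_wer:.4f}")
```

Here the 1-best hypothesis has one substitution out of 21 reference words, while the oracle picks the second, error-free hypothesis. The oracle figure is therefore a lower bound on what rescoring the n-best list could achieve.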
