HiFi GAN

Model Description

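This notebook demonstrates a PyTorch implementation of the HiFi-GAN model described in the paper HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. HiFi-GAN is a spectrogram inversion model: it synthesizes speech waveforms from mel spectrograms. It follows the generative adversarial network (GAN) paradigm and consists of a generator and a discriminator. After training, the generator is used for synthesis and the discriminator is discarded.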

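Our implementation is based on the one published by the authors of the paper. We modify the original hyperparameters and provide an alternative training recipe, which enables training with larger batches and faster convergence. HiFi-GAN is trained on the publicly available LJ Speech dataset. The samples demonstrate speech synthesized with our publicly available FastPitch and HiFi-GAN checkpoints.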

Model Architecture

Example

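In the example below:

Pretrained FastPitch and HiFiGAN models are loaded from torch.hub.
Given a tensor representation of an input text ("Say this smoothly, to prove you are not a robot."), FastPitch generates a mel spectrogram.
HiFiGAN generates sound from the mel spectrogram.
The output sound is saved in an "audio.wav" file.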

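To run the example you need some extra Python packages installed. They are used for preprocessing text and audio, as well as for display and input/output handling. Finally, for better performance of the FastPitch model, we download the CMU pronunciation dictionary.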

pip install numpy scipy librosa unidecode inflect matplotlib==3.6.3
apt-get update
apt-get install -y libsndfile1 wget
wget https://raw.githubusercontent.com/NVIDIA/NeMo/263a30be71e859cee330e5925332009da3e5efbc/scripts/tts_dataset_files/heteronyms-052722 -qO heteronyms
wget https://raw.githubusercontent.com/NVIDIA/NeMo/263a30be71e859cee330e5925332009da3e5efbc/scripts/tts_dataset_files/cmudict-0.7b_nv22.08 -qO cmudict-0.7b

import torch
import matplotlib.pyplot as plt
from IPython.display import Audio
import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f'Using {device} for inference')

Download and set up the FastPitch generator model.

fastpitch, generator_train_setup = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_fastpitch')

Download and set up the vocoder and denoiser models.

hifigan, vocoder_train_setup, denoiser = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_hifigan')

Verify that the generator and vocoder models agree on input parameters.

CHECKPOINT_SPECIFIC_ARGS = [
    'sampling_rate', 'hop_length', 'win_length', 'p_arpabet', 'text_cleaners',
    'symbol_set', 'max_wav_value', 'prepend_space_to_text',
    'append_space_to_text']

for k in CHECKPOINT_SPECIFIC_ARGS:

    v1 = generator_train_setup.get(k, None)
    v2 = vocoder_train_setup.get(k, None)

    assert v1 is None or v2 is None or v1 == v2, \
        f'{k} mismatch in spectrogram generator and vocoder'
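
To make the behavior of this check concrete, here is a minimal sketch that applies the same rule to toy dictionaries. The helper name `find_setup_mismatches` and the toy values are illustrative assumptions, not part of the hub API. As in the loop above, a key that is missing from either setup is treated as imposing no constraint.

```python
def find_setup_mismatches(generator_setup, vocoder_setup, keys):
    """Return the keys on which both setups define a value and the values differ."""
    mismatches = []
    for key in keys:
        v1 = generator_setup.get(key)
        v2 = vocoder_setup.get(key)
        # A key missing from either setup is treated as "no constraint".
        if v1 is not None and v2 is not None and v1 != v2:
            mismatches.append(key)
    return mismatches


# Toy values for illustration only; real checkpoints may differ.
toy_generator = {'sampling_rate': 22050, 'hop_length': 256}
toy_vocoder = {'sampling_rate': 44100, 'win_length': 1024}

print(find_setup_mismatches(
    toy_generator, toy_vocoder,
    ['sampling_rate', 'hop_length', 'win_length']))  # prints ['sampling_rate']
```

Collecting all mismatches instead of asserting on the first one can make it easier to see everything that disagrees when pairing a generator and a vocoder from different sources.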

Put all models on the available device.

fastpitch.to(device)
hifigan.to(device)
denoiser.to(device)

Load the text processor.

tp = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_textprocessing_utils', cmudict_path="cmudict-0.7b", heteronyms_path="heteronyms")
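
If this call fails, one thing worth checking is whether the two files fetched with wget earlier are actually present in the working directory. The sketch below uses only the standard library; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# File names match the -O targets of the wget commands in this example.
missing = missing_files(["cmudict-0.7b", "heteronyms"])
if missing:
    print("Missing files:", ", ".join(missing))
```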

Set the text to be synthesized, prepare the input, and set additional generation parameters.

text = "Say this smoothly, to prove you are not a robot."

batches = tp.prepare_input_sequence([text], batch_size=1)

gen_kw = {'pace': 1.0,
          'speaker': 0,
          'pitch_tgt': None,
          'pitch_transform': None}
denoising_strength = 0.005

for batch in batches:
    with torch.no_grad():
        mel, mel_lens, *_ = fastpitch(batch['text'].to(device), **gen_kw)
        audios = hifigan(mel).float()
        audios = denoiser(audios.squeeze(1), denoising_strength)
        audios = audios.squeeze(1) * vocoder_train_setup['max_wav_value']
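
The loop above runs with `batch_size=1`, so `mel_lens` goes unused. With larger batches, shorter utterances would presumably be padded to the longest one in the batch. The following is a sketch of trimming each waveform back to its own length, under the assumption that the vocoder emits `hop_length` audio samples per mel frame. The helper name `trim_padded_audio` and the toy data are illustrative.

```python
import numpy as np


def trim_padded_audio(audios, mel_lens, hop_length):
    """Cut each waveform to mel_len * hop_length samples.

    Assumes the vocoder emits hop_length samples per mel frame.
    """
    return [np.asarray(audio)[: int(n) * hop_length]
            for audio, n in zip(audios, mel_lens)]


# Toy batch: two "waveforms" padded to 8 samples, hop_length of 2 (made up).
toy_audios = np.arange(16).reshape(2, 8)
trimmed = trim_padded_audio(toy_audios, mel_lens=[4, 2], hop_length=2)
print([len(t) for t in trimmed])  # prints [8, 4]
```

In this example, the inputs would be `audios.cpu().numpy()`, `mel_lens.cpu().numpy()`, and the `hop_length` entry of `vocoder_train_setup`, if the checkpoint provides that key.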

Plot the intermediate spectrogram.

plt.figure(figsize=(10,12))
res_mel = mel[0].detach().cpu().numpy()
plt.imshow(res_mel, origin='lower')
plt.xlabel('time')
plt.ylabel('frequency')
_=plt.title('Spectrogram')
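
The horizontal axis of this plot counts mel frames, not seconds. A rough conversion multiplies the frame index by the hop length and divides by the sampling rate. The values 256 and 22050 below are assumptions for illustration; the actual ones should come from the `hop_length` and `sampling_rate` entries of the training setup.

```python
def frames_to_seconds(n_frames, hop_length, sampling_rate):
    """Convert a count of mel frames to seconds of audio."""
    return n_frames * hop_length / sampling_rate


# Assumed values for illustration: hop_length=256, sampling_rate=22050.
print(round(frames_to_seconds(100, 256, 22050), 3))  # prints 1.161
```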

Synthesize audio.

audio_numpy = audios[0].cpu().numpy()
Audio(audio_numpy, rate=vocoder_train_setup['sampling_rate'])

Write the audio to a wav file.

from scipy.io.wavfile import write
write("audio.wav", vocoder_train_setup['sampling_rate'], audio_numpy)
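
`scipy.io.wavfile.write` chooses the WAV sample format from the array's dtype, so a float array is stored as floating-point samples. Because the audio here was multiplied by `max_wav_value`, some players may expect 16-bit integer PCM instead. A minimal sketch of that conversion, assuming the samples are already scaled to the 16-bit range (the helper name `to_int16_pcm` is hypothetical):

```python
import numpy as np


def to_int16_pcm(audio):
    """Clip samples already scaled to the 16-bit range and cast to int16."""
    info = np.iinfo(np.int16)
    return np.clip(np.asarray(audio), info.min, info.max).astype(np.int16)


# Toy samples, including values just outside the int16 range.
pcm = to_int16_pcm([0.0, 1234.6, 32768.0, -40000.0])
print(pcm.dtype, pcm.tolist())  # prints int16 [0, 1234, 32767, -32768]
```

The result could then be passed to `write` in place of `audio_numpy`. Clipping before the cast avoids integer wrap-around for samples that land exactly on or beyond the limits.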

Details

For detailed information on model inputs and outputs, training recipes, inference, and performance, see GitHub and/or NGC.

References

HiFi GAN model for generating waveforms from mel spectrograms

Model type: Audio

Submitted by: NVIDIA
