Vad

class torchaudio.transforms.Vad(sample_rate: int, trigger_level: float = 7.0, trigger_time: float = 0.25, search_time: float = 1.0, allowed_gap: float = 0.25, pre_trigger_time: float = 0.0, boot_time: float = 0.35, noise_up_time: float = 0.1, noise_down_time: float = 0.01, noise_reduction_amount: float = 1.35, measure_freq: float = 20.0, measure_duration: Optional[float] = None, measure_smooth_time: float = 0.4, hp_filter_freq: float = 50.0, lp_filter_freq: float = 6000.0, hp_lifter_freq: float = 150.0, lp_lifter_freq: float = 2000.0)

Voice Activity Detector. Similar to the SoX implementation.

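Attempts to trim silence and quiet background sounds from the ends of speech recordings. The algorithm currently uses a simple cepstral power measurement to detect voice, so it may be fooled by other sounds, especially music.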

The effect can trim only from the front of the audio, so to trim from the back as well, it must be combined with the reverse effect.
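The reverse, trim, reverse pattern described above can be sketched in plain Python. This is only an illustration: `trim_front_below` is a hypothetical amplitude-threshold stand-in for a front-only trimmer, not the cepstral detector that `Vad` uses, and `trim_both_ends` is likewise a made-up helper name.

```python
from typing import Callable, List


def trim_front_below(samples: List[float], threshold: float) -> List[float]:
    """Hypothetical stand-in for a front-only trimmer.

    Drops leading samples whose magnitude is below `threshold`.
    This is NOT the Vad algorithm, just a simple placeholder.
    """
    for i, x in enumerate(samples):
        if abs(x) >= threshold:
            return list(samples[i:])
    return []


def trim_both_ends(
    samples: List[float],
    trim_front: Callable[[List[float]], List[float]],
) -> List[float]:
    """Trim both ends using a trimmer that only works on the front."""
    front_trimmed = trim_front(samples)                 # trim the start
    reversed_trimmed = trim_front(front_trimmed[::-1])  # old end is now the front
    return reversed_trimmed[::-1]                       # restore time order


signal = [0.0, 0.01, 0.5, -0.4, 0.02, 0.3, 0.0, 0.0]
trimmed = trim_both_ends(signal, lambda s: trim_front_below(s, 0.1))
print(trimmed)  # [0.5, -0.4, 0.02, 0.3]
```

With tensors, the same pattern would presumably wrap the `Vad` transform between two time-axis reversals, for example with `torch.flip(waveform, dims=[-1])` or the SoX `reverse` effect.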

Parameters:

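sample_rate (int) – Sample rate of the audio signal.
trigger_level (float, optional) – The measurement level used to trigger activity detection. This may need to be changed depending on the noise level, signal level, and other characteristics of the input audio. (Default: 7.0)
trigger_time (float, optional) – The time constant (in seconds) used to help ignore short bursts of sound. (Default: 0.25)
search_time (float, optional) – The amount of audio (in seconds) to search for quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 1.0)
allowed_gap (float, optional) – The allowed gap (in seconds) between quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 0.25)
pre_trigger_time (float, optional) – The amount of audio (in seconds) to preserve before the trigger point and any found quieter/shorter bursts. (Default: 0.0)
boot_time (float, optional) – The algorithm (internally) uses adaptive noise estimation/reduction in order to detect the start of the wanted audio. This option sets the time for the initial noise estimate. (Default: 0.35)
noise_up_time (float, optional) – Time constant used by the adaptive noise estimator when the noise level is increasing. (Default: 0.1)
noise_down_time (float, optional) – Time constant used by the adaptive noise estimator when the noise level is decreasing. (Default: 0.01)
noise_reduction_amount (float, optional) – Amount of noise reduction to use in the detection algorithm. (Default: 1.35)
measure_freq (float, optional) – Frequency of the algorithm's processing/measurements. (Default: 20.0)
measure_duration (float or None, optional) – Measurement duration. (Default: twice the measurement period, i.e. with overlap.)
measure_smooth_time (float, optional) – Time constant used to smooth spectral measurements. (Default: 0.4)
hp_filter_freq (float, optional) – "Brick-wall" frequency of the high-pass filter applied at the input to the detector algorithm. (Default: 50.0)
lp_filter_freq (float, optional) – "Brick-wall" frequency of the low-pass filter applied at the input to the detector algorithm. (Default: 6000.0)
hp_lifter_freq (float, optional) – "Brick-wall" frequency of the high-pass lifter used in the detector algorithm. (Default: 150.0)
lp_lifter_freq (float, optional) – "Brick-wall" frequency of the low-pass lifter used in the detector algorithm. (Default: 2000.0)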
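The default for `measure_duration` is stated relative to `measure_freq`, so a small worked calculation may help. The sketch below assumes the measurement period is `1 / measure_freq`; the helper name `effective_measure_duration` is hypothetical and is not part of torchaudio.

```python
from typing import Optional


def effective_measure_duration(
    measure_freq: float = 20.0,
    measure_duration: Optional[float] = None,
) -> float:
    """Return the measurement window length in seconds.

    Assumption: the measurement period is 1 / measure_freq, and a value of
    None means "twice the measurement period", as the parameter list states.
    """
    if measure_duration is not None:
        return measure_duration
    measure_period = 1.0 / measure_freq
    return 2.0 * measure_period


# With the default measure_freq of 20.0, measurements start every 0.05 s
# and each one spans 0.1 s, so consecutive windows overlap by half.
default_window = effective_measure_duration()
hop = 1.0 / 20.0
overlap_fraction = 1.0 - hop / default_window
```

Under that assumption, passing an explicit `measure_duration` equal to the measurement period would remove the overlap between consecutive measurements.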

Example

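>>> import torchaudio
>>> from torchaudio import transforms
>>> from torchaudio.sox_effects import apply_effects_tensor
>>> waveform, sample_rate = torchaudio.load("test.wav", normalize=True)
>>> # Reverse the audio so that Vad, which trims only the front, trims the original end.
>>> waveform_reversed, sample_rate = apply_effects_tensor(waveform, sample_rate, [["reverse"]])
>>> transform = transforms.Vad(sample_rate=sample_rate, trigger_level=7.5)
>>> waveform_reversed_front_trim = transform(waveform_reversed)
>>> # Reverse again to restore the original time order.
>>> waveform_end_trim, sample_rate = apply_effects_tensor(
...     waveform_reversed_front_trim, sample_rate, [["reverse"]]
... )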

References

http://sox.sourceforge.net/sox.html

forward(waveform: Tensor) → Tensor

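Parameters: waveform (Tensor) – Tensor of audio of dimension (channels, time) or (time). A Tensor of shape (channels, time) is treated as a multi-channel recording of the same event, and the resulting output is trimmed to the earliest voice activity in any channel.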
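The multi-channel rule, where every channel is cut at the earliest activity found in any channel, can be illustrated with a small plain-Python sketch. The amplitude threshold here is a hypothetical stand-in for the real detector, and both helper names are made up for illustration.

```python
from typing import List


def first_active_index(channel: List[float], threshold: float) -> int:
    """Index of the first sample with |x| >= threshold, or len(channel) if none.

    Hypothetical stand-in for per-channel voice detection.
    """
    for i, x in enumerate(channel):
        if abs(x) >= threshold:
            return i
    return len(channel)


def trim_channels_together(
    channels: List[List[float]], threshold: float
) -> List[List[float]]:
    """Cut all channels at the earliest onset found in any channel."""
    start = min(first_active_index(ch, threshold) for ch in channels)
    return [list(ch[start:]) for ch in channels]


left = [0.0, 0.0, 0.0, 0.6, 0.5]   # activity starts at index 3
right = [0.0, 0.4, 0.3, 0.2, 0.1]  # activity starts at index 1
trimmed_channels = trim_channels_together([left, right], threshold=0.1)
print(trimmed_channels)  # [[0.0, 0.0, 0.6, 0.5], [0.4, 0.3, 0.2, 0.1]]
```

Using a single shared cut point keeps the channels the same length and time-aligned, which matches treating them as recordings of the same event.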
