
Accelerated video decoding with NVDEC

Author: Moto Hira

This tutorial shows how to use NVIDIA's hardware video decoder (NVDEC) with TorchAudio, and how it improves the performance of video decoding.

Note

This tutorial requires FFmpeg libraries compiled with hardware acceleration enabled.

Please refer to Enabling GPU video decoder/encoder for how to build FFmpeg with hardware acceleration.

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)
2.7.0
2.7.0
import os
import time

import matplotlib.pyplot as plt
from torchaudio.io import StreamReader

Check the prerequisites

First, we check that TorchAudio correctly detects FFmpeg libraries that support the hardware decoder/encoder.

from torchaudio.utils import ffmpeg_utils
print("FFmpeg Library versions:")
for k, ver in ffmpeg_utils.get_versions().items():
    print(f"  {k}:\t{'.'.join(str(v) for v in ver)}")
FFmpeg Library versions:
  libavcodec:   60.3.100
  libavdevice:  60.1.100
  libavfilter:  9.3.100
  libavformat:  60.3.100
  libavutil:    58.2.100
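In addition to the library versions, you can inspect the FFmpeg build configuration string for CUDA-related flags. This is a minimal sketch, assuming your installation exposes ffmpeg_utils.get_build_config():

# Hedged check: the exact flags depend on how this FFmpeg was configured.
# Builds with NVDEC support typically list entries such as --enable-cuvid.
print(ffmpeg_utils.get_build_config())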
print("Available NVDEC Decoders:")
for k in ffmpeg_utils.get_video_decoders().keys():
    if "cuvid" in k:
        print(f" - {k}")
Available NVDEC Decoders:
 - av1_cuvid
 - h264_cuvid
 - hevc_cuvid
 - mjpeg_cuvid
 - mpeg1_cuvid
 - mpeg2_cuvid
 - mpeg4_cuvid
 - vc1_cuvid
 - vp8_cuvid
 - vp9_cuvid
print("Avaialbe GPU:")
print(torch.cuda.get_device_properties(0))
Avaialbe GPU:
_CudaDeviceProperties(name='NVIDIA A10G', major=8, minor=6, total_memory=22502MB, multi_processor_count=80, uuid=3a6a8555-efc9-d0dc-972b-36624af6fad8, L2_cache_size=6MB)

We will use a video with the following properties:

  • Codec: H.264

  • Resolution: 960x540

  • FPS: 29.97

  • Pixel format: YUV420P

src = torchaudio.utils.download_asset(
    "tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4_small.mp4"
)
  0%|          | 0.00/31.8M [00:00<?, ?B/s]
100%|##########| 31.8M/31.8M [00:00<00:00, 545MB/s]

Decoding videos with NVDEC

To use the hardware video decoder, you need to specify the hardware decoder when defining the output video stream, by passing the decoder option to the add_video_stream() method.

s = StreamReader(src)
s.add_video_stream(5, decoder="h264_cuvid")
s.fill_buffer()
(video,) = s.pop_chunks()

The video frames are decoded and returned as a tensor in NCHW format.

print(video.shape, video.dtype)
torch.Size([5, 3, 540, 960]) torch.uint8

By default, the decoded frames are sent back to CPU memory, and CPU tensors are created.

print(video.device)
cpu

By specifying the hw_accel option, you can convert the decoded frames into CUDA tensors. The hw_accel option takes a string value, which is passed to torch.device.

Note

Currently, the hw_accel option is not compatible with add_basic_video_stream(). add_basic_video_stream adds a post-decoding process that is designed for frames in CPU memory. Please use add_video_stream().

s = StreamReader(src)
s.add_video_stream(5, decoder="h264_cuvid", hw_accel="cuda:0")
s.fill_buffer()
(video,) = s.pop_chunks()

print(video.shape, video.dtype, video.device)
torch.Size([5, 3, 540, 960]) torch.uint8 cuda:0

Note

When multiple GPUs are available, StreamReader uses the first GPU by default. You can change this by providing the "gpu" option.

# Video data is sent to CUDA device 0, decoded and
# converted on the same device.
s.add_video_stream(
    ...,
    decoder="h264_cuvid",
    decoder_option={"gpu": "0"},
    hw_accel="cuda:0",
)

Note

The "gpu" option and the hw_accel option can be specified independently. If they do not match, decoded frames are transferred automatically to the device specified by hw_accel.

# Video data is sent to CUDA device 0, and decoded there.
# Then it is transferred to CUDA device 1, and converted to
# CUDA tensor.
s.add_video_stream(
    ...,
    decoder="h264_cuvid",
    decoder_option={"gpu": "0"},
    hw_accel="cuda:1",
)

Visualization

Let's look at the frames decoded by the hardware decoder and compare them against the equivalent results from the software decoder.

The following function seeks to the given timestamp and decodes one frame with the specified decoder.

def test_decode(decoder: str, seek: float):
    s = StreamReader(src)
    s.seek(seek)
    s.add_video_stream(1, decoder=decoder)
    s.fill_buffer()
    (video,) = s.pop_chunks()
    return video[0]
timestamps = [12, 19, 45, 131, 180]

cpu_frames = [test_decode(decoder="h264", seek=ts) for ts in timestamps]
cuda_frames = [test_decode(decoder="h264_cuvid", seek=ts) for ts in timestamps]

Note

Currently, the hardware decoder does not support color-space conversion, so decoded frames are in YUV format. The following function performs YUV-to-RGB conversion (and axis shuffling for plotting).

def yuv_to_rgb(frames):
    frames = frames.cpu().to(torch.float)
    y = frames[..., 0, :, :]
    u = frames[..., 1, :, :]
    v = frames[..., 2, :, :]

    y /= 255
    u = u / 255 - 0.5
    v = v / 255 - 0.5

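    # Apply the (approximately BT.601) YUV -> RGB conversion coefficients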
    r = y + 1.14 * v
    g = y + -0.396 * u - 0.581 * v
    b = y + 2.029 * u

    rgb = torch.stack([r, g, b], -1)
    rgb = (rgb * 255).clamp(0, 255).to(torch.uint8)
    return rgb.numpy()

Now we visualize the results.

def plot():
    n_rows = len(timestamps)
    fig, axes = plt.subplots(n_rows, 2, figsize=[12.8, 16.0])
    for i in range(n_rows):
        axes[i][0].imshow(yuv_to_rgb(cpu_frames[i]))
        axes[i][1].imshow(yuv_to_rgb(cuda_frames[i]))

    axes[0][0].set_title("Software decoder")
    axes[0][1].set_title("HW decoder")
    plt.setp(axes, xticks=[], yticks=[])
    plt.tight_layout()


plot()
[Figure: decoded frames at each timestamp, "Software decoder" (left) vs. "HW decoder" (right)]

To the author's eyes they are indistinguishable. Feel free to let us know if you spot something. :)

HW resizing and cropping

You can use the decoder_option argument to provide decoder-specific options.

The following options are often relevant in preprocessing.

  • resize: Resize the frame into (width)x(height).

  • crop: Crop the frame (top)x(bottom)x(left)x(right). Note that the specified values are the amounts of rows/columns removed. The final image size is (width - left - right)x(height - top - bottom). For example, crop=135x135x240x240 on a 960x540 frame yields (960 - 240 - 240)x(540 - 135 - 135) = 480x270. If the crop and resize options are used together, crop is performed first.

For other available options, run ffmpeg -h decoder=h264_cuvid.

def test_options(option):
    s = StreamReader(src)
    s.seek(87)
    s.add_video_stream(1, decoder="h264_cuvid", hw_accel="cuda:0", decoder_option=option)
    s.fill_buffer()
    (video,) = s.pop_chunks()
    print(f"Option: {option}:\t{video.shape}")
    return video[0]
original = test_options(option=None)
resized = test_options(option={"resize": "480x270"})
cropped = test_options(option={"crop": "135x135x240x240"})
cropped_and_resized = test_options(option={"crop": "135x135x240x240", "resize": "640x360"})
Option: None:   torch.Size([1, 3, 540, 960])
Option: {'resize': '480x270'}:  torch.Size([1, 3, 270, 480])
Option: {'crop': '135x135x240x240'}:    torch.Size([1, 3, 270, 480])
Option: {'crop': '135x135x240x240', 'resize': '640x360'}:       torch.Size([1, 3, 360, 640])
def plot():
    fig, axes = plt.subplots(2, 2, figsize=[12.8, 9.6])
    axes[0][0].imshow(yuv_to_rgb(original))
    axes[0][1].imshow(yuv_to_rgb(resized))
    axes[1][0].imshow(yuv_to_rgb(cropped))
    axes[1][1].imshow(yuv_to_rgb(cropped_and_resized))

    axes[0][0].set_title("Original")
    axes[0][1].set_title("Resized")
    axes[1][0].set_title("Cropped")
    axes[1][1].set_title("Cropped and resized")
    plt.tight_layout()
    return fig


plot()
[Figure: Original, Resized, Cropped, Cropped and resized]
<Figure size 1280x960 with 4 Axes>

Comparing resizing methods

Unlike software rescaling, NVDEC does not provide an option to choose the scaling algorithm. In ML applications, it is often necessary to construct a preprocessing pipeline with similar numerical properties. So here we compare the result of hardware resizing with software resizing of different algorithms.

We will use the following video, which contains a test pattern generated with the command below.

ffmpeg -y -f lavfi -t 12.05 -i mptestsrc -movflags +faststart mptestsrc.mp4
test_src = torchaudio.utils.download_asset("tutorial-assets/mptestsrc.mp4")
  0%|          | 0.00/232k [00:00<?, ?B/s]
100%|##########| 232k/232k [00:00<00:00, 41.6MB/s]

The following function decodes the video and applies the specified scaling algorithm.

def decode_resize_ffmpeg(mode, height, width, seek):
    filter_desc = None if mode is None else f"scale={width}:{height}:sws_flags={mode}"
    s = StreamReader(test_src)
    s.add_video_stream(1, filter_desc=filter_desc)
    s.seek(seek)
    s.fill_buffer()
    (chunk,) = s.pop_chunks()
    return chunk

The following function decodes the video with the hardware decoder and resizes it.

def decode_resize_cuvid(height, width, seek):
    s = StreamReader(test_src)
    s.add_video_stream(1, decoder="h264_cuvid", decoder_option={"resize": f"{width}x{height}"}, hw_accel="cuda:0")
    s.seek(seek)
    s.fill_buffer()
    (chunk,) = s.pop_chunks()
    return chunk.cpu()

Now we execute them and visualize the resulting frames.

params = {"height": 224, "width": 224, "seek": 3}

frames = [
    decode_resize_ffmpeg(None, **params),
    decode_resize_ffmpeg("neighbor", **params),
    decode_resize_ffmpeg("bilinear", **params),
    decode_resize_ffmpeg("bicubic", **params),
    decode_resize_cuvid(**params),
    decode_resize_ffmpeg("spline", **params),
    decode_resize_ffmpeg("lanczos:param0=1", **params),
    decode_resize_ffmpeg("lanczos:param0=3", **params),
    decode_resize_ffmpeg("lanczos:param0=5", **params),
]
def plot():
    fig, axes = plt.subplots(3, 3, figsize=[12.8, 15.2])
    for i, f in enumerate(frames):
        h, w = f.shape[2:4]
        f = f[..., : h // 4, : w // 4]
        axes[i // 3][i % 3].imshow(yuv_to_rgb(f[0]))
    axes[0][0].set_title("Original")
    axes[0][1].set_title("nearest neighbor")
    axes[0][2].set_title("bilinear")
    axes[1][0].set_title("bicubic")
    axes[1][1].set_title("NVDEC")
    axes[1][2].set_title("spline")
    axes[2][0].set_title("lanczos(1)")
    axes[2][1].set_title("lanczos(3)")
    axes[2][2].set_title("lanczos(5)")

    plt.setp(axes, xticks=[], yticks=[])
    plt.tight_layout()


plot()
[Figure: Original, nearest neighbor, bilinear, bicubic, NVDEC, spline, lanczos(1), lanczos(3), lanczos(5)]

None of them are exactly the same. To the author's eyes, lanczos(1) appears to be the most similar to NVDEC; bicubic also looks close.
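If you want to quantify the similarity rather than judge it by eye, a minimal sketch (assuming the frames list computed above, in which index 4 is the NVDEC result) is to compute the mean absolute difference of each software-scaled frame against the NVDEC frame:

labels = ["neighbor", "bilinear", "bicubic", "spline", "lanczos(1)", "lanczos(3)", "lanczos(5)"]
ref = frames[4][0].to(torch.float)  # NVDEC-resized frame
for label, f in zip(labels, frames[1:4] + frames[5:]):
    # A lower mean absolute difference means numerically closer to NVDEC.
    mad = (f[0].to(torch.float) - ref).abs().mean().item()
    print(f"{label}: {mad:.2f}")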

Benchmarking NVDEC with StreamReader

In this section, we compare the performance of software video decoding and hardware video decoding.

Decode as CUDA frames

First, we compare the time it takes for the software decoder and the hardware decoder to decode the same video. To make the results comparable, when using the software decoder, we move the resulting tensor to CUDA.

The test procedure is as follows:

  • Use the hardware decoder and place the decoded frames on CUDA directly

  • Use the software decoder, generate CPU tensors and move them to CUDA

The following function implements the hardware-decoder test case.

def test_decode_cuda(src, decoder, hw_accel="cuda", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder=decoder, hw_accel=hw_accel)

    num_frames = 0
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
    elapsed = time.monotonic() - t0
    print(f" - Shape: {chunk.shape}")
    fps = num_frames / elapsed
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

The following function implements the software-decoder test case.

def test_decode_cpu(src, threads, decoder=None, frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder=decoder, decoder_option={"threads": f"{threads}"})

    num_frames = 0
    device = torch.device("cuda")
    t0 = time.monotonic()
    for i, (chunk,) in enumerate(s.stream()):
        if i == 0:
            print(f" - Shape: {chunk.shape}")
        num_frames += chunk.shape[0]
        chunk = chunk.to(device)
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

For each video resolution, we run several software-decoder test cases with different numbers of threads.

def run_decode_tests(src, frames_per_chunk=5):
    fps = []
    print(f"Testing: {os.path.basename(src)}")
    for threads in [1, 4, 8, 16]:
        print(f"* Software decoding (num_threads={threads})")
        fps.append(test_decode_cpu(src, threads))
    print("* Hardware decoding")
    fps.append(test_decode_cuda(src, decoder="h264_cuvid"))
    return fps

Now we run the tests with videos of different resolutions.

QVGA

src_qvga = torchaudio.utils.download_asset("tutorial-assets/testsrc2_qvga.h264.mp4")
fps_qvga = run_decode_tests(src_qvga)
  0%|          | 0.00/1.06M [00:00<?, ?B/s]
100%|##########| 1.06M/1.06M [00:00<00:00, 147MB/s]
Testing: testsrc2_qvga.h264.mp4
* Software decoding (num_threads=1)
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 900 frames in 0.50 seconds. (1814.82 fps)
* Software decoding (num_threads=4)
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 900 frames in 0.34 seconds. (2679.88 fps)
* Software decoding (num_threads=8)
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 900 frames in 0.34 seconds. (2674.27 fps)
* Software decoding (num_threads=16)
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 895 frames in 0.43 seconds. (2088.70 fps)
* Hardware decoding
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 900 frames in 2.01 seconds. (447.36 fps)

VGA

src_vga = torchaudio.utils.download_asset("tutorial-assets/testsrc2_vga.h264.mp4")
fps_vga = run_decode_tests(src_vga)
  0%|          | 0.00/3.59M [00:00<?, ?B/s]
 59%|#####9    | 2.12M/3.59M [00:00<00:00, 10.0MB/s]
100%|##########| 3.59M/3.59M [00:00<00:00, 16.3MB/s]
Testing: testsrc2_vga.h264.mp4
* Software decoding (num_threads=1)
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 900 frames in 1.20 seconds. (749.76 fps)
* Software decoding (num_threads=4)
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 900 frames in 0.71 seconds. (1274.24 fps)
* Software decoding (num_threads=8)
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 900 frames in 0.70 seconds. (1285.18 fps)
* Software decoding (num_threads=16)
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 895 frames in 0.64 seconds. (1402.77 fps)
* Hardware decoding
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 900 frames in 0.34 seconds. (2639.80 fps)

XGA

src_xga = torchaudio.utils.download_asset("tutorial-assets/testsrc2_xga.h264.mp4")
fps_xga = run_decode_tests(src_xga)
  0%|          | 0.00/9.22M [00:00<?, ?B/s]
 98%|#########7| 9.00M/9.22M [00:00<00:00, 35.8MB/s]
100%|##########| 9.22M/9.22M [00:00<00:00, 36.4MB/s]
Testing: testsrc2_xga.h264.mp4
* Software decoding (num_threads=1)
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 900 frames in 2.70 seconds. (333.73 fps)
* Software decoding (num_threads=4)
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 900 frames in 1.38 seconds. (652.84 fps)
* Software decoding (num_threads=8)
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 900 frames in 1.28 seconds. (703.55 fps)
* Software decoding (num_threads=16)
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 895 frames in 1.30 seconds. (690.26 fps)
* Hardware decoding
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 900 frames in 0.61 seconds. (1473.92 fps)

Results

Now we plot the result.

def plot():
    fig, ax = plt.subplots(figsize=[9.6, 6.4])

    for items in zip(fps_qvga, fps_vga, fps_xga, "ov^sx"):
        ax.plot(items[:-1], marker=items[-1])
    ax.grid(axis="both")
    ax.set_xticks([0, 1, 2], ["QVGA (320x240)", "VGA (640x480)", "XGA (1024x768)"])
    ax.legend(
        [
            "Software Decoding (threads=1)",
            "Software Decoding (threads=4)",
            "Software Decoding (threads=8)",
            "Software Decoding (threads=16)",
            "Hardware Decoding (CUDA Tensor)",
        ]
    )
    ax.set_title("Speed of processing video frames")
    ax.set_ylabel("Frames per second")
    plt.tight_layout()


plot()
[Figure: Speed of processing video frames]

We observe the following:

  • Increasing the number of threads in software decoding makes the pipeline faster, but the performance saturates around 8 threads.

  • The performance gain from using the hardware decoder depends on the resolution of the video.

  • At lower resolutions like QVGA, hardware decoding is slower than software decoding.

  • At higher resolutions like XGA, hardware decoding is faster than software decoding.

It is worth noting that the performance gain also depends on the type of GPU. We observed that when decoding VGA videos using a V100 or A100 GPU, hardware decoders were slower than software decoders. But using an A10 GPU, the hardware decoder was faster than the software decoder.

Decode and resize

Next, we add resizing to the pipeline. We will compare the following pipelines:

  1. Decode video with the software decoder and read the frames as PyTorch tensors. Resize the tensors with torch.nn.functional.interpolate(), then send the resulting tensors to a CUDA device.

  2. Decode video with the software decoder, resize the frames with FFmpeg's filter graph, read the resized frames as PyTorch tensors, then send them to a CUDA device.

  3. Decode and resize the video simultaneously with the hardware decoder, and read the resulting frames as CUDA tensors.

Pipeline 1 represents common video-loading implementations.

Pipeline 2 uses FFmpeg's filter graph, which allows raw frames to be manipulated before they are converted to tensors.

Pipeline 3 keeps the amount of data transferred from CPU to CUDA to a minimum, which contributes significantly to performant data loading.

The following function implements pipeline 1. It uses PyTorch's torch.nn.functional.interpolate() with bicubic mode, since we saw that it produces frames closest to NVDEC resizing.

def test_decode_then_resize(src, height, width, mode="bicubic", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder_option={"threads": "8"})

    num_frames = 0
    device = torch.device("cuda")
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
        chunk = torch.nn.functional.interpolate(chunk, [height, width], mode=mode, antialias=True)
        chunk = chunk.to(device)
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Shape: {chunk.shape}")
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

The following function implements pipeline 2. Frames are resized as part of the decoding process, then sent to a CUDA device.

We use bicubic mode here as well, to make the result comparable with the PyTorch-based implementation above.

def test_decode_and_resize(src, height, width, mode="bicubic", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(
        frames_per_chunk, filter_desc=f"scale={width}:{height}:sws_flags={mode}", decoder_option={"threads": "8"}
    )

    num_frames = 0
    device = torch.device("cuda")
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
        chunk = chunk.to(device)
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Shape: {chunk.shape}")
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

The following function implements pipeline 3. Resizing is performed by NVDEC, and the resulting tensors are placed in CUDA memory.

def test_hw_decode_and_resize(src, decoder, decoder_option, hw_accel="cuda", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder=decoder, decoder_option=decoder_option, hw_accel=hw_accel)

    num_frames = 0
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Shape: {chunk.shape}")
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

The following function runs the benchmark functions on the given source.

def run_resize_tests(src):
    print(f"Testing: {os.path.basename(src)}")
    height, width = 224, 224
    print("* Software decoding with PyTorch interpolate")
    cpu_resize1 = test_decode_then_resize(src, height=height, width=width)
    print("* Software decoding with FFmpeg scale")
    cpu_resize2 = test_decode_and_resize(src, height=height, width=width)
    print("* Hardware decoding with resize")
    cuda_resize = test_hw_decode_and_resize(src, decoder="h264_cuvid", decoder_option={"resize": f"{width}x{height}"})
    return [cpu_resize1, cpu_resize2, cuda_resize]

Now we run the tests.

QVGA

fps_qvga = run_resize_tests(src_qvga)
Testing: testsrc2_qvga.h264.mp4
* Software decoding with PyTorch interpolate
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.61 seconds. (1486.29 fps)
* Software decoding with FFmpeg scale
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.40 seconds. (2229.01 fps)
* Hardware decoding with resize
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 2.02 seconds. (444.56 fps)

VGA

fps_vga = run_resize_tests(src_vga)
Testing: testsrc2_vga.h264.mp4
* Software decoding with PyTorch interpolate
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 1.45 seconds. (620.26 fps)
* Software decoding with FFmpeg scale
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.69 seconds. (1300.24 fps)
* Hardware decoding with resize
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.34 seconds. (2653.73 fps)

XGA

fps_xga = run_resize_tests(src_xga)
Testing: testsrc2_xga.h264.mp4
* Software decoding with PyTorch interpolate
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 2.69 seconds. (334.90 fps)
* Software decoding with FFmpeg scale
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 1.06 seconds. (850.30 fps)
* Hardware decoding with resize
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.61 seconds. (1476.55 fps)

Results

Now we plot the result.

def plot():
    fig, ax = plt.subplots(figsize=[9.6, 6.4])

    for items in zip(fps_qvga, fps_vga, fps_xga, "ov^sx"):
        ax.plot(items[:-1], marker=items[-1])
    ax.grid(axis="both")
    ax.set_xticks([0, 1, 2], ["QVGA (320x240)", "VGA (640x480)", "XGA (1024x768)"])
    ax.legend(
        [
            "Software decoding\nwith resize\n(PyTorch interpolate)",
            "Software decoding\nwith resize\n(FFmpeg scale)",
            "NVDEC\nwith resizing",
        ]
    )
    ax.set_title("Speed of processing video frames")
    ax.set_xlabel("Input video resolution")
    ax.set_ylabel("Frames per second")
    plt.tight_layout()


plot()
[Figure: Speed of processing video frames, by input video resolution]

The hardware decoder shows a trend similar to the previous experiment. In fact, the performance is almost identical: hardware resizing adds almost no overhead when scaling frames down.

Software decoding also shows a similar trend: performing resizing as part of decoding is faster. One possible explanation is that video frames are internally stored as YUV420P, which has half the number of pixel samples of RGB24 or YUV444P. This means that if frames are resized before their data is copied to a PyTorch tensor, fewer pixels are manipulated and copied than when resizing is applied after the frames are converted to tensors.
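As a back-of-envelope check of this explanation (a sketch; the numbers follow from the chroma subsampling itself, not from any measurement above):

# YUV420P keeps Y at full resolution and subsamples U and V by 2x in both
# dimensions, i.e. 1.5 samples per pixel; RGB24 and YUV444P store 3 per pixel.
height, width = 540, 960
yuv420p_samples = height * width + 2 * (height // 2) * (width // 2)
rgb24_samples = height * width * 3
print(yuv420p_samples / rgb24_samples)  # 0.5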

Tags: torchaudio.io

Total running time of the script: (0 minutes 31.872 seconds)

Gallery generated by Sphinx-Gallery
