
Accelerated video decoding with NVDEC

Author: Moto Hira

This tutorial shows how to use NVIDIA's hardware video decoder (NVDEC) with TorchAudio, and how it improves the performance of video decoding.

Note

This tutorial requires FFmpeg libraries compiled with hardware acceleration enabled.

Please refer to Enabling GPU video decoder/encoder for how to build FFmpeg with hardware acceleration.

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)
2.7.0
2.7.0
import os
import time

import matplotlib.pyplot as plt
from torchaudio.io import StreamReader

Check the prerequisites

First, we check that TorchAudio correctly detects FFmpeg libraries that support the hardware decoder/encoder.

from torchaudio.utils import ffmpeg_utils
print("FFmpeg Library versions:")
for k, ver in ffmpeg_utils.get_versions().items():
    print(f"  {k}:\t{'.'.join(str(v) for v in ver)}")
FFmpeg Library versions:
  libavcodec:   60.3.100
  libavdevice:  60.1.100
  libavfilter:  9.3.100
  libavformat:  60.3.100
  libavutil:    58.2.100
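In addition to the library versions, you can inspect the FFmpeg build configuration string for CUDA-related flags. This is a minimal sketch, assuming your installation exposes ffmpeg_utils.get_build_config():

# Hedged check: the exact flags depend on how this FFmpeg was configured.
# Builds with NVDEC support typically list entries such as --enable-cuvid.
print(ffmpeg_utils.get_build_config())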
print("Available NVDEC Decoders:")
for k in ffmpeg_utils.get_video_decoders().keys():
    if "cuvid" in k:
        print(f" - {k}")
Available NVDEC Decoders:
 - av1_cuvid
 - h264_cuvid
 - hevc_cuvid
 - mjpeg_cuvid
 - mpeg1_cuvid
 - mpeg2_cuvid
 - mpeg4_cuvid
 - vc1_cuvid
 - vp8_cuvid
 - vp9_cuvid
print("Avaialbe GPU:")
print(torch.cuda.get_device_properties(0))
Avaialbe GPU:
_CudaDeviceProperties(name='NVIDIA A10G', major=8, minor=6, total_memory=22502MB, multi_processor_count=80, uuid=3a6a8555-efc9-d0dc-972b-36624af6fad8, L2_cache_size=6MB)

We will use a video with the following properties:

  • Codec: H.264

  • Resolution: 960x540

  • FPS: 29.97

  • Pixel format: YUV420P

src = torchaudio.utils.download_asset(
    "tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4_small.mp4"
)
  0%|          | 0.00/31.8M [00:00<?, ?B/s]
100%|##########| 31.8M/31.8M [00:00<00:00, 545MB/s]

Decoding videos with NVDEC

To use the hardware video decoder, you need to specify the hardware decoder when defining the output video stream, by passing the decoder option to the add_video_stream() method.

s = StreamReader(src)
s.add_video_stream(5, decoder="h264_cuvid")
s.fill_buffer()
(video,) = s.pop_chunks()

The video frames are decoded and returned as a tensor in NCHW format.

print(video.shape, video.dtype)
torch.Size([5, 3, 540, 960]) torch.uint8

By default, the decoded frames are sent back to CPU memory, and CPU tensors are created.

print(video.device)
cpu

By specifying the hw_accel option, you can convert the decoded frames into CUDA tensors. The hw_accel option takes a string value, which is passed to torch.device.

Note

Currently, the hw_accel option is not compatible with add_basic_video_stream(). add_basic_video_stream adds a post-decoding process that is designed for frames in CPU memory. Please use add_video_stream().

s = StreamReader(src)
s.add_video_stream(5, decoder="h264_cuvid", hw_accel="cuda:0")
s.fill_buffer()
(video,) = s.pop_chunks()

print(video.shape, video.dtype, video.device)
torch.Size([5, 3, 540, 960]) torch.uint8 cuda:0

Note

When multiple GPUs are available, StreamReader uses the first GPU by default. You can change this by providing the "gpu" option.

# Video data is sent to CUDA device 0, decoded and
# converted on the same device.
s.add_video_stream(
    ...,
    decoder="h264_cuvid",
    decoder_option={"gpu": "0"},
    hw_accel="cuda:0",
)

Note

The "gpu" option and the hw_accel option can be specified independently. If they do not match, decoded frames are transferred automatically to the device specified by hw_accel.

# Video data is sent to CUDA device 0, and decoded there.
# Then it is transferred to CUDA device 1, and converted to
# CUDA tensor.
s.add_video_stream(
    ...,
    decoder="h264_cuvid",
    decoder_option={"gpu": "0"},
    hw_accel="cuda:1",
)

Visualization

Let's look at the frames decoded by the hardware decoder and compare them against the equivalent results from the software decoder.

The following function seeks to the given timestamp and decodes one frame with the specified decoder.

def test_decode(decoder: str, seek: float):
    s = StreamReader(src)
    s.seek(seek)
    s.add_video_stream(1, decoder=decoder)
    s.fill_buffer()
    (video,) = s.pop_chunks()
    return video[0]
timestamps = [12, 19, 45, 131, 180]

cpu_frames = [test_decode(decoder="h264", seek=ts) for ts in timestamps]
cuda_frames = [test_decode(decoder="h264_cuvid", seek=ts) for ts in timestamps]

Note

Currently, the hardware decoder does not support color-space conversion, so decoded frames are in YUV format. The following function performs YUV-to-RGB conversion (and axis shuffling for plotting).

def yuv_to_rgb(frames):
    frames = frames.cpu().to(torch.float)
    y = frames[..., 0, :, :]
    u = frames[..., 1, :, :]
    v = frames[..., 2, :, :]

    y /= 255
    u = u / 255 - 0.5
    v = v / 255 - 0.5

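    # Apply the (approximately BT.601) YUV -> RGB conversion coefficients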
    r = y + 1.14 * v
    g = y + -0.396 * u - 0.581 * v
    b = y + 2.029 * u

    rgb = torch.stack([r, g, b], -1)
    rgb = (rgb * 255).clamp(0, 255).to(torch.uint8)
    return rgb.numpy()

Now we visualize the results.

def plot():
    n_rows = len(timestamps)
    fig, axes = plt.subplots(n_rows, 2, figsize=[12.8, 16.0])
    for i in range(n_rows):
        axes[i][0].imshow(yuv_to_rgb(cpu_frames[i]))
        axes[i][1].imshow(yuv_to_rgb(cuda_frames[i]))

    axes[0][0].set_title("Software decoder")
    axes[0][1].set_title("HW decoder")
    plt.setp(axes, xticks=[], yticks=[])
    plt.tight_layout()


plot()
[Figure: decoded frames at each timestamp, "Software decoder" (left) vs. "HW decoder" (right)]

To the author's eyes they are indistinguishable. Feel free to let us know if you spot something. :)

HW resizing and cropping

You can use the decoder_option argument to provide decoder-specific options.

The following options are often relevant in preprocessing.

  • resize: Resize the frame into (width)x(height).

  • crop: Crop the frame (top)x(bottom)x(left)x(right). Note that the specified values are the amounts of rows/columns removed. The final image size is (width - left - right)x(height - top - bottom). For example, crop=135x135x240x240 on a 960x540 frame yields (960 - 240 - 240)x(540 - 135 - 135) = 480x270. If the crop and resize options are used together, crop is performed first.

For other available options, run ffmpeg -h decoder=h264_cuvid.

def test_options(option):
    s = StreamReader(src)
    s.seek(87)
    s.add_video_stream(1, decoder="h264_cuvid", hw_accel="cuda:0", decoder_option=option)
    s.fill_buffer()
    (video,) = s.pop_chunks()
    print(f"Option: {option}:\t{video.shape}")
    return video[0]
original = test_options(option=None)
resized = test_options(option={"resize": "480x270"})
cropped = test_options(option={"crop": "135x135x240x240"})
cropped_and_resized = test_options(option={"crop": "135x135x240x240", "resize": "640x360"})
Option: None:   torch.Size([1, 3, 540, 960])
Option: {'resize': '480x270'}:  torch.Size([1, 3, 270, 480])
Option: {'crop': '135x135x240x240'}:    torch.Size([1, 3, 270, 480])
Option: {'crop': '135x135x240x240', 'resize': '640x360'}:       torch.Size([1, 3, 360, 640])
def plot():
    fig, axes = plt.subplots(2, 2, figsize=[12.8, 9.6])
    axes[0][0].imshow(yuv_to_rgb(original))
    axes[0][1].imshow(yuv_to_rgb(resized))
    axes[1][0].imshow(yuv_to_rgb(cropped))
    axes[1][1].imshow(yuv_to_rgb(cropped_and_resized))

    axes[0][0].set_title("Original")
    axes[0][1].set_title("Resized")
    axes[1][0].set_title("Cropped")
    axes[1][1].set_title("Cropped and resized")
    plt.tight_layout()
    return fig


plot()
[Figure: Original, Resized, Cropped, Cropped and resized]
<Figure size 1280x960 with 4 Axes>

Comparing resizing methods

Unlike software rescaling, NVDEC does not provide an option to choose the scaling algorithm. In ML applications, it is often necessary to construct a preprocessing pipeline with similar numerical properties. So here we compare the result of hardware resizing with software resizing of different algorithms.

We will use the following video, which contains a test pattern generated with the command below.

ffmpeg -y -f lavfi -t 12.05 -i mptestsrc -movflags +faststart mptestsrc.mp4
test_src = torchaudio.utils.download_asset("tutorial-assets/mptestsrc.mp4")
  0%|          | 0.00/232k [00:00<?, ?B/s]
100%|##########| 232k/232k [00:00<00:00, 41.6MB/s]

The following function decodes the video and applies the specified scaling algorithm.

def decode_resize_ffmpeg(mode, height, width, seek):
    filter_desc = None if mode is None else f"scale={width}:{height}:sws_flags={mode}"
    s = StreamReader(test_src)
    s.add_video_stream(1, filter_desc=filter_desc)
    s.seek(seek)
    s.fill_buffer()
    (chunk,) = s.pop_chunks()
    return chunk

The following function decodes the video with the hardware decoder and resizes it.

def decode_resize_cuvid(height, width, seek):
    s = StreamReader(test_src)
    s.add_video_stream(1, decoder="h264_cuvid", decoder_option={"resize": f"{width}x{height}"}, hw_accel="cuda:0")
    s.seek(seek)
    s.fill_buffer()
    (chunk,) = s.pop_chunks()
    return chunk.cpu()

Now we execute them and visualize the resulting frames.

params = {"height": 224, "width": 224, "seek": 3}

frames = [
    decode_resize_ffmpeg(None, **params),
    decode_resize_ffmpeg("neighbor", **params),
    decode_resize_ffmpeg("bilinear", **params),
    decode_resize_ffmpeg("bicubic", **params),
    decode_resize_cuvid(**params),
    decode_resize_ffmpeg("spline", **params),
    decode_resize_ffmpeg("lanczos:param0=1", **params),
    decode_resize_ffmpeg("lanczos:param0=3", **params),
    decode_resize_ffmpeg("lanczos:param0=5", **params),
]
def plot():
    fig, axes = plt.subplots(3, 3, figsize=[12.8, 15.2])
    for i, f in enumerate(frames):
        h, w = f.shape[2:4]
        f = f[..., : h // 4, : w // 4]
        axes[i // 3][i % 3].imshow(yuv_to_rgb(f[0]))
    axes[0][0].set_title("Original")
    axes[0][1].set_title("nearest neighbor")
    axes[0][2].set_title("bilinear")
    axes[1][0].set_title("bicubic")
    axes[1][1].set_title("NVDEC")
    axes[1][2].set_title("spline")
    axes[2][0].set_title("lanczos(1)")
    axes[2][1].set_title("lanczos(3)")
    axes[2][2].set_title("lanczos(5)")

    plt.setp(axes, xticks=[], yticks=[])
    plt.tight_layout()


plot()
[Figure: Original, nearest neighbor, bilinear, bicubic, NVDEC, spline, lanczos(1), lanczos(3), lanczos(5)]

None of them are exactly the same. To the author's eyes, lanczos(1) appears to be the most similar to NVDEC; bicubic also looks close.
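If you want to quantify the similarity rather than judge it by eye, a minimal sketch (assuming the frames list computed above, in which index 4 is the NVDEC result) is to compute the mean absolute difference of each software-scaled frame against the NVDEC frame:

labels = ["neighbor", "bilinear", "bicubic", "spline", "lanczos(1)", "lanczos(3)", "lanczos(5)"]
ref = frames[4][0].to(torch.float)  # NVDEC-resized frame
for label, f in zip(labels, frames[1:4] + frames[5:]):
    # A lower mean absolute difference means numerically closer to NVDEC.
    mad = (f[0].to(torch.float) - ref).abs().mean().item()
    print(f"{label}: {mad:.2f}")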

Benchmarking NVDEC with StreamReader

In this section, we compare the performance of software video decoding and hardware video decoding.

Decode as CUDA frames

First, we compare the time it takes for the software decoder and the hardware decoder to decode the same video. To make the results comparable, when using the software decoder, we move the resulting tensor to CUDA.

The test procedure is as follows:

  • Use the hardware decoder and place the decoded frames on CUDA directly

  • Use the software decoder, generate CPU tensors and move them to CUDA

The following function implements the hardware-decoder test case.

def test_decode_cuda(src, decoder, hw_accel="cuda", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder=decoder, hw_accel=hw_accel)

    num_frames = 0
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
    elapsed = time.monotonic() - t0
    print(f" - Shape: {chunk.shape}")
    fps = num_frames / elapsed
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

The following function implements the software-decoder test case.

def test_decode_cpu(src, threads, decoder=None, frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder=decoder, decoder_option={"threads": f"{threads}"})

    num_frames = 0
    device = torch.device("cuda")
    t0 = time.monotonic()
    for i, (chunk,) in enumerate(s.stream()):
        if i == 0:
            print(f" - Shape: {chunk.shape}")
        num_frames += chunk.shape[0]
        chunk = chunk.to(device)
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

For each video resolution, we run several software-decoder test cases with different numbers of threads.

def run_decode_tests(src, frames_per_chunk=5):
    fps = []
    print(f"Testing: {os.path.basename(src)}")
    for threads in [1, 4, 8, 16]:
        print(f"* Software decoding (num_threads={threads})")
        fps.append(test_decode_cpu(src, threads))
    print("* Hardware decoding")
    fps.append(test_decode_cuda(src, decoder="h264_cuvid"))
    return fps

Now we run the tests with videos of different resolutions.

QVGA

src_qvga = torchaudio.utils.download_asset("tutorial-assets/testsrc2_qvga.h264.mp4")
fps_qvga = run_decode_tests(src_qvga)
  0%|          | 0.00/1.06M [00:00<?, ?B/s]
100%|##########| 1.06M/1.06M [00:00<00:00, 147MB/s]
Testing: testsrc2_qvga.h264.mp4
* Software decoding (num_threads=1)
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 900 frames in 0.50 seconds. (1814.82 fps)
* Software decoding (num_threads=4)
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 900 frames in 0.34 seconds. (2679.88 fps)
* Software decoding (num_threads=8)
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 900 frames in 0.34 seconds. (2674.27 fps)
* Software decoding (num_threads=16)
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 895 frames in 0.43 seconds. (2088.70 fps)
* Hardware decoding
 - Shape: torch.Size([5, 3, 240, 320])
 - Processed 900 frames in 2.01 seconds. (447.36 fps)

VGA

src_vga = torchaudio.utils.download_asset("tutorial-assets/testsrc2_vga.h264.mp4")
fps_vga = run_decode_tests(src_vga)
  0%|          | 0.00/3.59M [00:00<?, ?B/s]
 59%|#####9    | 2.12M/3.59M [00:00<00:00, 10.0MB/s]
100%|##########| 3.59M/3.59M [00:00<00:00, 16.3MB/s]
Testing: testsrc2_vga.h264.mp4
* Software decoding (num_threads=1)
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 900 frames in 1.20 seconds. (749.76 fps)
* Software decoding (num_threads=4)
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 900 frames in 0.71 seconds. (1274.24 fps)
* Software decoding (num_threads=8)
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 900 frames in 0.70 seconds. (1285.18 fps)
* Software decoding (num_threads=16)
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 895 frames in 0.64 seconds. (1402.77 fps)
* Hardware decoding
 - Shape: torch.Size([5, 3, 480, 640])
 - Processed 900 frames in 0.34 seconds. (2639.80 fps)

XGA

src_xga = torchaudio.utils.download_asset("tutorial-assets/testsrc2_xga.h264.mp4")
fps_xga = run_decode_tests(src_xga)
  0%|          | 0.00/9.22M [00:00<?, ?B/s]
 98%|#########7| 9.00M/9.22M [00:00<00:00, 35.8MB/s]
100%|##########| 9.22M/9.22M [00:00<00:00, 36.4MB/s]
Testing: testsrc2_xga.h264.mp4
* Software decoding (num_threads=1)
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 900 frames in 2.70 seconds. (333.73 fps)
* Software decoding (num_threads=4)
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 900 frames in 1.38 seconds. (652.84 fps)
* Software decoding (num_threads=8)
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 900 frames in 1.28 seconds. (703.55 fps)
* Software decoding (num_threads=16)
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 895 frames in 1.30 seconds. (690.26 fps)
* Hardware decoding
 - Shape: torch.Size([5, 3, 768, 1024])
 - Processed 900 frames in 0.61 seconds. (1473.92 fps)

Results

Now we plot the result.

def plot():
    fig, ax = plt.subplots(figsize=[9.6, 6.4])

    for items in zip(fps_qvga, fps_vga, fps_xga, "ov^sx"):
        ax.plot(items[:-1], marker=items[-1])
    ax.grid(axis="both")
    ax.set_xticks([0, 1, 2], ["QVGA (320x240)", "VGA (640x480)", "XGA (1024x768)"])
    ax.legend(
        [
            "Software Decoding (threads=1)",
            "Software Decoding (threads=4)",
            "Software Decoding (threads=8)",
            "Software Decoding (threads=16)",
            "Hardware Decoding (CUDA Tensor)",
        ]
    )
    ax.set_title("Speed of processing video frames")
    ax.set_ylabel("Frames per second")
    plt.tight_layout()


plot()
[Figure: Speed of processing video frames]

We observe the following:

  • Increasing the number of threads in software decoding makes the pipeline faster, but the performance saturates around 8 threads.

  • The performance gain from using the hardware decoder depends on the resolution of the video.

  • At lower resolutions like QVGA, hardware decoding is slower than software decoding.

  • At higher resolutions like XGA, hardware decoding is faster than software decoding.

It is worth noting that the performance gain also depends on the type of GPU. We observed that when decoding VGA videos using a V100 or A100 GPU, hardware decoders were slower than software decoders. But using an A10 GPU, the hardware decoder was faster than the software decoder.

Decode and resize

Next, we add resizing to the pipeline. We will compare the following pipelines:

  1. Decode video with the software decoder and read the frames as PyTorch tensors. Resize the tensors with torch.nn.functional.interpolate(), then send the resulting tensors to a CUDA device.

  2. Decode video with the software decoder, resize the frames with FFmpeg's filter graph, read the resized frames as PyTorch tensors, then send them to a CUDA device.

  3. Decode and resize the video simultaneously with the hardware decoder, and read the resulting frames as CUDA tensors.

Pipeline 1 represents common video-loading implementations.

Pipeline 2 uses FFmpeg's filter graph, which allows raw frames to be manipulated before they are converted to tensors.

Pipeline 3 keeps the amount of data transferred from CPU to CUDA to a minimum, which contributes significantly to performant data loading.

The following function implements pipeline 1. It uses PyTorch's torch.nn.functional.interpolate() with bicubic mode, since we saw that it produces frames closest to NVDEC resizing.

def test_decode_then_resize(src, height, width, mode="bicubic", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder_option={"threads": "8"})

    num_frames = 0
    device = torch.device("cuda")
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
        chunk = torch.nn.functional.interpolate(chunk, [height, width], mode=mode, antialias=True)
        chunk = chunk.to(device)
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Shape: {chunk.shape}")
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

The following function implements pipeline 2. Frames are resized as part of the decoding process, then sent to a CUDA device.

We use bicubic mode here as well, to make the result comparable with the PyTorch-based implementation above.

def test_decode_and_resize(src, height, width, mode="bicubic", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(
        frames_per_chunk, filter_desc=f"scale={width}:{height}:sws_flags={mode}", decoder_option={"threads": "8"}
    )

    num_frames = 0
    device = torch.device("cuda")
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
        chunk = chunk.to(device)
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Shape: {chunk.shape}")
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

The following function implements pipeline 3. Resizing is performed by NVDEC, and the resulting tensors are placed in CUDA memory.

def test_hw_decode_and_resize(src, decoder, decoder_option, hw_accel="cuda", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder=decoder, decoder_option=decoder_option, hw_accel=hw_accel)

    num_frames = 0
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Shape: {chunk.shape}")
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps

The following function runs the benchmark functions on the given source.

def run_resize_tests(src):
    print(f"Testing: {os.path.basename(src)}")
    height, width = 224, 224
    print("* Software decoding with PyTorch interpolate")
    cpu_resize1 = test_decode_then_resize(src, height=height, width=width)
    print("* Software decoding with FFmpeg scale")
    cpu_resize2 = test_decode_and_resize(src, height=height, width=width)
    print("* Hardware decoding with resize")
    cuda_resize = test_hw_decode_and_resize(src, decoder="h264_cuvid", decoder_option={"resize": f"{width}x{height}"})
    return [cpu_resize1, cpu_resize2, cuda_resize]

Now we run the tests.

QVGA

fps_qvga = run_resize_tests(src_qvga)
Testing: testsrc2_qvga.h264.mp4
* Software decoding with PyTorch interpolate
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.61 seconds. (1486.29 fps)
* Software decoding with FFmpeg scale
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.40 seconds. (2229.01 fps)
* Hardware decoding with resize
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 2.02 seconds. (444.56 fps)

VGA

fps_vga = run_resize_tests(src_vga)
Testing: testsrc2_vga.h264.mp4
* Software decoding with PyTorch interpolate
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 1.45 seconds. (620.26 fps)
* Software decoding with FFmpeg scale
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.69 seconds. (1300.24 fps)
* Hardware decoding with resize
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.34 seconds. (2653.73 fps)

XGA

fps_xga = run_resize_tests(src_xga)
Testing: testsrc2_xga.h264.mp4
* Software decoding with PyTorch interpolate
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 2.69 seconds. (334.90 fps)
* Software decoding with FFmpeg scale
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 1.06 seconds. (850.30 fps)
* Hardware decoding with resize
 - Shape: torch.Size([5, 3, 224, 224])
 - Processed 900 frames in 0.61 seconds. (1476.55 fps)

Results

Now we plot the result.

def plot():
    fig, ax = plt.subplots(figsize=[9.6, 6.4])

    for items in zip(fps_qvga, fps_vga, fps_xga, "ov^sx"):
        ax.plot(items[:-1], marker=items[-1])
    ax.grid(axis="both")
    ax.set_xticks([0, 1, 2], ["QVGA (320x240)", "VGA (640x480)", "XGA (1024x768)"])
    ax.legend(
        [
            "Software decoding\nwith resize\n(PyTorch interpolate)",
            "Software decoding\nwith resize\n(FFmpeg scale)",
            "NVDEC\nwith resizing",
        ]
    )
    ax.set_title("Speed of processing video frames")
    ax.set_xlabel("Input video resolution")
    ax.set_ylabel("Frames per second")
    plt.tight_layout()


plot()
[Figure: Speed of processing video frames, by input video resolution]

The hardware decoder shows a trend similar to the previous experiment. In fact, the performance is almost identical: hardware resizing adds almost no overhead when scaling frames down.

Software decoding also shows a similar trend: performing resizing as part of decoding is faster. One possible explanation is that video frames are internally stored as YUV420P, which has half the number of pixel samples of RGB24 or YUV444P. This means that if frames are resized before their data is copied to a PyTorch tensor, fewer pixels are manipulated and copied than when resizing is applied after the frames are converted to tensors.
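As a back-of-envelope check of this explanation (a sketch; the numbers follow from the chroma subsampling itself, not from any measurement above):

# YUV420P keeps Y at full resolution and subsamples U and V by 2x in both
# dimensions, i.e. 1.5 samples per pixel; RGB24 and YUV444P store 3 per pixel.
height, width = 540, 960
yuv420p_samples = height * width + 2 * (height // 2) * (width // 2)
rgb24_samples = height * width * 3
print(yuv420p_samples / rgb24_samples)  # 0.5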

Tags: torchaudio.io

Total running time of the script: (0 minutes 31.872 seconds)

Gallery generated by Sphinx-Gallery
