光流：使用 RAFT 模型預測運動¶

注意

在 Colab 上嘗試，或跳轉至文末下載完整示例程式碼。

光流的任務是預測兩張影像（通常是影片的連續兩幀）之間的運動。光流模型接收兩張影像作為輸入，並預測一個流：該流指示第一張影像中每個畫素的位移，並將其對映到第二張影像中對應的畫素。流是 (2, H, W) 維的張量，其中第一個軸對應於預測的水平和垂直位移。

以下示例演示瞭如何使用 torchvision 透過我們實現的 RAFT 模型來預測流。我們還將展示如何將預測的流轉換為 RGB 影像進行視覺化。

import numpy as np
import torch
import matplotlib.pyplot as plt
import torchvision.transforms.functional as F


plt.rcParams["savefig.bbox"] = "tight"


def plot(imgs, **imshow_kwargs):
    if not isinstance(imgs[0], list):
        # Make a 2d grid even if there's just 1 row
        imgs = [imgs]

    num_rows = len(imgs)
    num_cols = len(imgs[0])
    _, axs = plt.subplots(nrows=num_rows, ncols=num_cols, squeeze=False)
    for row_idx, row in enumerate(imgs):
        for col_idx, img in enumerate(row):
            ax = axs[row_idx, col_idx]
            img = F.to_pil_image(img.to("cpu"))
            ax.imshow(np.asarray(img), **imshow_kwargs)
            ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])

    plt.tight_layout()

使用 Torchvision 讀取影片¶

我們將首先使用 read_video() 讀取影片。或者，也可以使用新的 VideoReader API（如果 torchvision 是從原始碼構建的）。我們此處使用的影片來自 pexels.com，可免費使用，感謝 Pavel Danilyuk。

import tempfile
from pathlib import Path
from urllib.request import urlretrieve


video_url = "https://download.pytorch.org/tutorial/pexelscom_pavel_danilyuk_basketball_hd.mp4"
video_path = Path(tempfile.mkdtemp()) / "basketball.mp4"
_ = urlretrieve(video_url, video_path)

read_video() 返回影片幀、音訊幀以及與影片相關的元資料。在我們的例子中，我們只需要影片幀。

在此，我們將在 2 對預選的幀之間進行 2 次預測，即幀 (100, 101) 和 (150, 151)。每一對幀都對應於一個模型輸入。

from torchvision.io import read_video
frames, _, _ = read_video(str(video_path), output_format="TCHW")

img1_batch = torch.stack([frames[100], frames[150]])
img2_batch = torch.stack([frames[101], frames[151]])

plot(img1_batch)

/pytorch/vision/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/pytorch/vision/torchvision/io/video.py:199: UserWarning: The pts_unit 'pts' gives wrong results. Please use pts_unit 'sec'.
  warnings.warn("The pts_unit 'pts' gives wrong results. Please use pts_unit 'sec'.")

RAFT 模型接受 RGB 影像。我們首先從 read_video() 獲取幀，並調整它們的大小以確保其維度能被 8 整除。請注意，我們顯式地使用了 antialias=False，因為這些模型是這樣訓練的。然後，我們使用捆綁在權重中的轉換來預處理輸入，並將其值縮放到所需的 [-1, 1] 範圍。

from torchvision.models.optical_flow import Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
transforms = weights.transforms()


def preprocess(img1_batch, img2_batch):
    img1_batch = F.resize(img1_batch, size=[520, 960], antialias=False)
    img2_batch = F.resize(img2_batch, size=[520, 960], antialias=False)
    return transforms(img1_batch, img2_batch)


img1_batch, img2_batch = preprocess(img1_batch, img2_batch)

print(f"shape = {img1_batch.shape}, dtype = {img1_batch.dtype}")

shape = torch.Size([2, 3, 520, 960]), dtype = torch.float32

使用 RAFT 估計光流¶

我們將使用 raft_large() 中的 RAFT 實現，該實現遵循了原始論文中描述的相同架構。我們還提供了 raft_small() 模型構建器，它更小、執行速度更快，但會犧牲一點精度。

from torchvision.models.optical_flow import raft_large

# If you can, run this example on a GPU, it will be a lot faster.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = raft_large(weights=Raft_Large_Weights.DEFAULT, progress=False).to(device)
model = model.eval()

list_of_flows = model(img1_batch.to(device), img2_batch.to(device))
print(f"type = {type(list_of_flows)}")
print(f"length = {len(list_of_flows)} = number of iterations of the model")

Downloading: "https://download.pytorch.org/models/raft_large_C_T_SKHT_V2-ff5fadd5.pth" to /root/.cache/torch/hub/checkpoints/raft_large_C_T_SKHT_V2-ff5fadd5.pth
type = <class 'list'>
length = 12 = number of iterations of the model

RAFT 模型輸出預測流的列表，其中每個條目都是一個 (N, 2, H, W) 的預測流批次，對應於模型中的給定“迭代”。有關模型的迭代性質的更多詳細資訊，請參閱原始論文。在此，我們只對最終預測的流（它們最準確）感興趣，因此我們只需檢索列表中的最後一項。

如上所述，流是維度為 (2, H, W) 的張量（對於流的批次而言是 (N, 2, H, W)），其中每個條目對應於每個畫素從第一張影像到第二張影像的水平和垂直位移。請注意，預測的流是以“畫素”為單位的，並未相對於影像的尺寸進行歸一化。

predicted_flows = list_of_flows[-1]
print(f"dtype = {predicted_flows.dtype}")
print(f"shape = {predicted_flows.shape} = (N, 2, H, W)")
print(f"min = {predicted_flows.min()}, max = {predicted_flows.max()}")

dtype = torch.float32
shape = torch.Size([2, 2, 520, 960]) = (N, 2, H, W)
min = -3.8997180461883545, max = 6.400400161743164

視覺化預測的流¶

Torchvision 提供了 flow_to_image() 工具函式，用於將流轉換為 RGB 影像。它也支援流的批次處理。流中的每個“方向”都會對映到給定的 RGB 顏色。在下面的影像中，模型假定顏色相似的畫素朝著相似的方向移動。該模型能夠正確預測球和球員的運動。特別注意球在第一張影像中（向左移動）和在第二張影像中（向上移動）預測的不同方向。

from torchvision.utils import flow_to_image

flow_imgs = flow_to_image(predicted_flows)

# The images have been mapped into [-1, 1] but for plotting we want them in [0, 1]
img1_batch = [(img1 + 1) / 2 for img1 in img1_batch]

grid = [[img1, flow_img] for (img1, flow_img) in zip(img1_batch, flow_imgs)]
plot(grid)

附加：建立預測流的 GIF¶

在上面的示例中，我們只展示了 2 對幀的預測流。應用光流模型的一種有趣方式是對整個影片執行模型，並從所有預測的流建立新的影片。下面是一個程式碼片段，可以幫助您開始。我們註釋掉了程式碼，因為這個示例是在沒有 GPU 的機器上渲染的，執行時間會太長。

# from torchvision.io import write_jpeg
# for i, (img1, img2) in enumerate(zip(frames, frames[1:])):
#     # Note: it would be faster to predict batches of flows instead of individual flows
#     img1, img2 = preprocess(img1, img2)

#     list_of_flows = model(img1.to(device), img2.to(device))
#     predicted_flow = list_of_flows[-1][0]
#     flow_img = flow_to_image(predicted_flow).to("cpu")
#     output_folder = "/tmp/"  # Update this to the folder of your choice
#     write_jpeg(flow_img, output_folder + f"predicted_flow_{i}.jpg")

儲存 .jpg 流影像後，您可以使用 ffmpeg 將它們轉換為影片或 GIF，例如：

ffmpeg -f image2 -framerate 30 -i predicted_flow_%d.jpg -loop -1 flow.gif

指令碼總執行時間： (0 分 8.949 秒)

由 Sphinx-Gallery 生成的圖集