


Abstract

Text-to-video (T2V) generation has recently garnered significant attention, largely due to the advanced multi-modal model Sora. However, T2V generation in the research community still faces two major challenges: 1) the absence of a precise, high-quality open-source dataset. Previous popular video datasets, such as WebVid-10M and Panda-70M, are either of low quality or too large for most research institutions, yet collecting precise, high-quality text-video pairs is both challenging and essential for T2V generation. 2) Inadequate utilization of textual information. Recent T2V methods focus on vision transformers and employ a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt.

To address these issues, we introduce OpenVid-1M, a high-quality dataset with expressive captions. This open-scenario dataset comprises over 1 million text-video pairs, facilitating T2V generation research. Additionally, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Furthermore, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT), capable of extracting structural information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies demonstrate the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
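To make the second point concrete, the sketch below shows one way a joint multi-modal attention block can let visual and text tokens attend to each other, in contrast to a plain cross-attention layer. It is an illustrative PyTorch sketch with hypothetical dimensions and layer names, not the published MVDiT block.

```python
import torch
import torch.nn as nn

class MultiModalAttention(nn.Module):
    """Joint self-attention over concatenated visual and text tokens (illustrative sketch).

    Unlike plain cross-attention, both modalities share one token sequence, so visual
    tokens can attend to text tokens and vice versa within the same attention layer.
    """

    def __init__(self, dim: int = 1152, num_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_proj = nn.Linear(dim, dim)   # align text tokens to the visual width
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N_vis, dim) patchified video latents
        # txt_tokens: (B, N_txt, dim) encoded caption tokens
        tokens = torch.cat([vis_tokens, self.text_proj(txt_tokens)], dim=1)
        tokens = self.norm(tokens)
        out, _ = self.attn(tokens, tokens, tokens)
        # keep only the updated visual tokens for the next transformer block
        return out[:, : vis_tokens.shape[1]]
```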



1024×1024 video results from our model trained on OpenVid-1M




Visual comparisons with SOTA models



Please allow a moment for the videos to load.



Video Statistics and Dataset Characteristics



Comparison of video statistics between OpenVid-1M and Panda-50M.

| Dataset | Scenario | Video clips | Aesthetics and Clarity | Resolution | Caption |
|---|---|---|---|---|---|
| UCF101 | Action | 13K | Low | 320×240 | N/A |
| Taichi-HD | Human | 3K | Low | 256×256 | N/A |
| SkyTimelapse | Sky | 35K | Medium | 640×360 | N/A |
| FaceForensics++ | Face | 1K | Medium | Diverse | N/A |
| WebVid | Open | 10M | Low | 596×336 | Short |
| ChronoMagic | Metamorphic | 2K | High | Diverse | Long |
| CelebvHQ | Portrait | 35K | High | 512×512 | N/A |
| OpenSoraPlan-V1.0 | Open | 400K | Medium | 512×512 | Long |
| Panda | Open | 70M | Medium | Diverse | Short |
| OpenVid-1M (Ours) | Open | 1M | High | Diverse | Long |
| OpenVidHD-0.4M (Ours) | Open | 433K | High (1080P) | 1920×1080 | Long |

Comparison with previous text-to-video datasets. Our OpenVid-1M is a million-scale, high-quality, open-scenario video dataset for training high-fidelity text-to-video models.


Video clarity distribution of OpenVid-1M, with four face samples visualizing the clarity differences. Faces outlined in green are blurry with low clarity scores, while those outlined in red are sharper with high clarity scores.
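As a rough illustration of this kind of curation, the snippet below filters per-clip metadata by clarity score, aesthetic score, and resolution to form a high-definition subset. The field names, file format, and thresholds are hypothetical, not the official OpenVid-1M pipeline.

```python
import json

def build_hd_subset(metadata_path: str,
                    min_clarity: float = 0.75,
                    min_aesthetic: float = 5.0,
                    min_height: int = 1080) -> list[dict]:
    """Keep only clips whose quality scores and resolution pass the thresholds."""
    with open(metadata_path) as f:
        clips = json.load(f)          # one dict of metadata per clip (hypothetical schema)
    subset = []
    for clip in clips:
        if (clip["clarity_score"] >= min_clarity
                and clip["aesthetic_score"] >= min_aesthetic
                and clip["height"] >= min_height):
            subset.append(clip)
    return subset

# e.g. hd_clips = build_hd_subset("openvid_metadata.json")
```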


Quantitative comparisons with SOTA models


| Method | Resolution | Training Data | VQAA↑ | VQAT↑ | Blip_bleu↑ | SD_score↑ | Clip_temp_score↑ | Warping_error↓ |
|---|---|---|---|---|---|---|---|---|
| LaVie [5] | 512×320 | Vimeo25M | 63.77 | 42.59 | 22.38 | 68.18 | 99.57 | 0.0089 |
| Show-1 [6] | 576×320 | WebVid-10M | 23.19 | 44.24 | 23.24 | 68.42 | 99.77 | 0.0067 |
| OpenSora-V1.1 [9] | 512×512 | Self-collected (10M) | 22.04 | 23.62 | 23.60 | 67.66 | 99.66 | 0.0170 |
| OpenSoraPlan-V1.1 [4] | 512×512 | Self-collected (4.8M) | 51.16 | 58.19 | 23.21 | 68.43 | 99.95 | 0.0026 |
| Latte [7] | 512×512 | Self-collected (330K) | 55.46 | 48.93 | 22.39 | 68.06 | 99.59 | 0.0203 |
| VideoCrafter [3] | 1024×576 | WebVid-10M; LAION-600M | 66.18 | 58.93 | 22.17 | 68.73 | 99.78 | 0.0295 |
| ModelScope [8] | 1280×720 | Self-collected (billions) | 40.06 | 32.93 | 22.54 | 67.93 | 99.74 | 0.0162 |
| Pika [2] | 1088×612 | Unknown | 59.09 | 64.96 | 21.14 | 68.57 | 99.97 | 0.0006 |
| Ours | 1024×1024 | OpenVid-1M | 73.46 | 68.58 | 23.45 | 68.04 | 99.87 | 0.0052 |
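For reference, a Clip_temp_score of this kind is commonly computed as the average CLIP cosine similarity between consecutive frames of a generated video. The sketch below shows that generic recipe; the paper's exact protocol and frame sampling may differ, and the values in the table appear to be reported on a 0-100 scale.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

@torch.no_grad()
def clip_temporal_score(frames) -> float:
    """Average CLIP cosine similarity between adjacent frames.

    `frames` is a list of PIL images sampled from one generated video.
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)        # unit-normalize embeddings
    sims = (feats[:-1] * feats[1:]).sum(dim=-1)              # cosine similarity of adjacent frames
    return sims.mean().item()
```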


Comparisons with previous representative text-to-video training datasets


| Resolution | Training Data | VQAA↑ | VQAT↑ | Blip_bleu↑ | SD_score↑ | Clip_temp_score↑ | Warping_error↓ |
|---|---|---|---|---|---|---|---|
| 256×256 | WebVid-10M [10] | 13.40 | 13.34 | 23.45 | 67.64 | 99.62 | 0.0138 |
| 256×256 | Panda-50M [11] | 17.08 | 9.60 | 24.06 | 67.47 | 99.60 | 0.0200 |
| 256×256 | OpenVid-1M (Ours) | 17.78 | 12.98 | 24.93 | 67.77 | 99.75 | 0.0134 |
| 1024×1024 | WebVid-10M (4× super-resolution) | 69.26 | 65.74 | 23.15 | 67.60 | 99.64 | 0.0137 |
| 1024×1024 | Panda-50M (4× super-resolution) | 63.25 | 53.21 | 23.60 | 67.44 | 99.57 | 0.0163 |
| 1024×1024 | OpenVidHD-0.4M (Ours) | 73.46 | 68.58 | 23.45 | 68.04 | 99.85 | 0.0132 |
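Similarly, warping error is typically measured by warping each frame to the next with estimated optical flow and averaging the photometric difference. Below is a generic sketch using torchvision's RAFT model; it omits the occlusion masking that full implementations usually add, so it may not match the paper's exact protocol.

```python
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

@torch.no_grad()
def warping_error(frames: torch.Tensor) -> float:
    """frames: (T, 3, H, W) float tensor in [0, 1], H and W divisible by 8."""
    model = raft_large(weights=Raft_Large_Weights.DEFAULT).eval()
    prev, nxt = frames[:-1], frames[1:]
    # RAFT expects inputs scaled to [-1, 1]; take the final flow refinement
    flow = model(nxt * 2 - 1, prev * 2 - 1)[-1]     # flow mapping frame t+1 pixels back to frame t
    n, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                                # sampling locations in frame t
    grid_x = coords[:, 0] / (w - 1) * 2 - 1             # normalize to [-1, 1] for grid_sample
    grid_y = coords[:, 1] / (h - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)        # (T-1, H, W, 2)
    warped = F.grid_sample(prev, grid, align_corners=True)
    return F.mse_loss(warped, nxt).item()
```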


Ablations on Resolutions, Architectures, and Training Data


| Model | Resolution | Training Data | Pretrained Weight | VQAA↑ | VQAT↑ | Blip_bleu↑ | SD_score↑ | Clip_temp_score↑ | Warping_error↓ |
|---|---|---|---|---|---|---|---|---|---|
| STDiT | 256×256 | Ours-0.4M | PixArt-α | 11.11 | 12.46 | 24.55 | 67.96 | 99.81 | 0.0105 |
| STDiT | 512×512 | Ours-0.4M | STDiT-256 | 65.15 | 59.57 | 23.73 | 68.24 | 99.80 | 0.0089 |
| MVDiT | 256×256 | Ours-0.4M | PixArt-α | 22.39 | 14.15 | 23.72 | 67.73 | 99.71 | 0.0091 |
| MVDiT | 256×256 | OpenVid-1M | PixArt-α | 24.87 | 14.57 | 24.01 | 67.64 | 99.75 | 0.0081 |
| MVDiT | 512×512 | OpenVid-1M | MVDiT-256 | 66.65 | 63.96 | 24.14 | 68.31 | 99.83 | 0.0008 |
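The "Pretrained Weight" column reflects progressive-resolution training: the 512×512 models are initialized from their 256×256 counterparts. The sketch below shows one common way to reuse such a checkpoint, copying shared weights non-strictly and interpolating a learned positional embedding to the larger token grid; the key names are hypothetical and the paper's exact procedure may differ.

```python
import torch
import torch.nn.functional as F

def load_low_res_checkpoint(model: torch.nn.Module, ckpt_path: str) -> None:
    """Initialize a higher-resolution model from a lower-resolution checkpoint (illustrative only)."""
    state = torch.load(ckpt_path, map_location="cpu")
    # If the model uses a learned positional embedding, resize it to the new grid.
    if "pos_embed" in state and "pos_embed" in model.state_dict():
        old = state["pos_embed"]                            # (1, N_old, dim)
        new_n = model.state_dict()["pos_embed"].shape[1]
        old_side, new_side = int(old.shape[1] ** 0.5), int(new_n ** 0.5)
        grid = old.reshape(1, old_side, old_side, -1).permute(0, 3, 1, 2)
        grid = F.interpolate(grid, size=(new_side, new_side), mode="bilinear")
        state["pos_embed"] = grid.permute(0, 2, 3, 1).reshape(1, new_n, -1)
    # Copy everything that matches; report what could not be transferred.
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```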

[1] Gen2 https://research.runwayml.com/gen2

[2] Pika https://pikalabs.org/

[3] VideoCrafter2 https://github.com/AILab-CVC/VideoCrafter

[4] OpenSoraPlan-V1.1 https://github.com/PKU-YuanGroup/Open-Sora-Plan

[5] Lavie https://github.com/Vchitect/LaVie

[6] Show-1 https://github.com/showlab/Show-1

[7] Latte https://github.com/Vchitect/Latte

[8] Modelscope https://github.com/modelscope/modelscope

[9] OpenSora-V1.1 https://github.com/hpcaitech/Open-Sora

[10] WebVid-10M Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.

[11] Panda-50M Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70M: Captioning 70M videos with multiple cross-modality teachers. arXiv preprint arXiv:2402.19479, 2024.

[12] Luma https://lumalabs.ai/dream-machine