Text-to-video (T2V) generation has recently garnered significant attention, largely due to the advanced multi-modal model Sora. However, T2V generation in the research community still faces two major challenges: 1) the absence of a precise, high-quality open-source dataset. Popular video datasets such as WebVid-10M and Panda-70M are either of low quality or too large for most research institutions, and collecting precise, high-quality text-video pairs is both challenging and essential for T2V generation. 2) Inadequate utilization of textual information. Recent T2V methods focus on vision transformers, employing a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from text prompts.
To address these issues, we introduce OpenVid-1M, a high-quality dataset with expressive captions. This open-scenario dataset comprises over 1 million text-video pairs, facilitating T2V generation research. Additionally, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Furthermore, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT), capable of extracting structural information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies demonstrate the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
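At the architecture level, MVDiT replaces the usual single text-conditioned cross-attention with joint modeling of visual and text tokens. As a rough, hypothetical illustration of that idea (not the released implementation; all names and dimensions below are chosen for the example), a joint visual-text attention block in PyTorch might look like:

```python
import torch
import torch.nn as nn

class JointTokenAttention(nn.Module):
    """Attend over visual and text tokens jointly so each modality can
    read structure/semantics from the other (illustrative only)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # Concatenate both modalities into one sequence and self-attend,
        # letting every visual token see every text token and vice versa.
        tokens = torch.cat([visual_tokens, text_tokens], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)
        # Hand the updated visual tokens back to the diffusion backbone.
        return tokens[:, : visual_tokens.size(1)]

# Example: 16 frames × 64 patches of visual tokens plus 77 text tokens.
block = JointTokenAttention()
vis = torch.randn(2, 16 * 64, 512)
txt = torch.randn(2, 77, 512)
out = block(vis, txt)  # shape: (2, 1024, 512)
```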
Comparison of video statistics between OpenVid-1M and Panda-50M.
Dataset | Scenario | Video clips | Aesthetics and Clarity | Resolution | Caption
---|---|---|---|---|---
UCF101 | Action | 13K | Low | 320×240 | N/A
Taichi-HD | Human | 3K | Low | 256×256 | N/A
SkyTimelapse | Sky | 35K | Medium | 640×360 | N/A
FaceForensics++ | Face | 1K | Medium | Diverse | N/A
WebVid | Open | 10M | Low | 596×336 | Short
ChronoMagic | Metamorphic | 2K | High | Diverse | Long
CelebV-HQ | Portrait | 35K | High | 512×512 | N/A
OpenSoraPlan-V1.0 | Open | 400K | Medium | 512×512 | Long
Panda | Open | 70M | Medium | Diverse | Short
OpenVid-1M (Ours) | Open | 1M | High | Diverse | Long
OpenVidHD-0.4M (Ours) | Open | 433K | High (1080p) | 1920×1080 | Long
Comparison with previous text-to-video datasets. Our OpenVid-1M is a million-scale, high-quality, open-scenario video dataset for training high-fidelity text-to-video models.
Video clarity distribution of OpenVid-1M. We also present four face samples to visualize the clarity differences: faces outlined in green are blurry, with low clarity scores, while those outlined in red are clearer, with high clarity scores.
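To make the curation step concrete, below is a minimal sketch of how clips could be filtered into OpenVidHD-0.4M using per-clip clarity and aesthetic scores plus a 1080p resolution check. The metadata field names and thresholds are illustrative assumptions, not the paper's exact pipeline:

```python
import json

def select_hd_clips(metadata_path: str,
                    min_clarity: float = 0.7,
                    min_aesthetic: float = 0.5):
    """Keep clips that clear hypothetical clarity/aesthetic thresholds
    and are at least 1080p (field names are assumptions)."""
    with open(metadata_path) as f:
        clips = json.load(f)  # one metadata dict per video clip
    return [
        c for c in clips
        if c["clarity"] >= min_clarity
        and c["aesthetic"] >= min_aesthetic
        and c["width"] >= 1920 and c["height"] >= 1080
    ]
```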
Method | Resolution | Training Data | VQAA↑ | VQAT↑ | Blip_bleu↑ | SD_score↑ | Clip_temp_score↑ | Warping_error↓
---|---|---|---|---|---|---|---|---
LaVie [5] | 512×320 | Vimeo25M | 63.77 | 42.59 | 22.38 | 68.18 | 99.57 | 0.0089
Show-1 [6] | 576×320 | WebVid-10M | 23.19 | 44.24 | 23.24 | 68.42 | 99.77 | 0.0067
OpenSora-V1.1 [9] | 512×512 | Self-collected 10M | 22.04 | 23.62 | 23.60 | 67.66 | 99.66 | 0.0170
OpenSoraPlan-V1.1 [4] | 512×512 | Self-collected 4.8M | 51.16 | 58.19 | 23.21 | 68.43 | 99.95 | 0.0026
Latte [7] | 512×512 | Self-collected 330K | 55.46 | 48.93 | 22.39 | 68.06 | 99.59 | 0.0203
VideoCrafter [3] | 1024×576 | WebVid-10M; LAION-600M | 66.18 | 58.93 | 22.17 | 68.73 | 99.78 | 0.0295
ModelScope [8] | 1280×720 | Self-collected (billions) | 40.06 | 32.93 | 22.54 | 67.93 | 99.74 | 0.0162
Pika [2] | 1088×612 | Unknown | 59.09 | 64.96 | 21.14 | 68.57 | 99.97 | 0.0006
Ours | 1024×1024 | OpenVid-1M | 73.46 | 68.58 | 23.45 | 68.04 | 99.87 | 0.0052
Comparison with state-of-the-art text-to-video generation models.
Resolution | Training Data | VQAA↑ | VQAT↑ | Blip_bleu↑ | SD_score↑ | Clip_temp_score↑ | Warping_error↓
---|---|---|---|---|---|---|---
256×256 | WebVid-10M [10] | 13.40 | 13.34 | 23.45 | 67.64 | 99.62 | 0.0138
256×256 | Panda-50M [11] | 17.08 | 9.60 | 24.06 | 67.47 | 99.60 | 0.0200
256×256 | OpenVid-1M (Ours) | 17.78 | 12.98 | 24.93 | 67.77 | 99.75 | 0.0134
1024×1024 | WebVid-10M (4× super-resolution) | 69.26 | 65.74 | 23.15 | 67.60 | 99.64 | 0.0137
1024×1024 | Panda-50M (4× super-resolution) | 63.25 | 53.21 | 23.60 | 67.44 | 99.57 | 0.0163
1024×1024 | OpenVidHD-0.4M (Ours) | 73.46 | 68.58 | 23.45 | 68.04 | 99.85 | 0.0132
Comparison of models trained on different datasets at 256×256 and 1024×1024 resolution.
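One way to read Clip_temp_score in the tables above is as the mean cosine similarity between CLIP embeddings of consecutive frames, scaled by 100, so higher values indicate smoother, more temporally consistent videos. A minimal sketch under that assumption (the benchmark's exact implementation may differ), operating on precomputed frame embeddings:

```python
import torch
import torch.nn.functional as F

def clip_temp_score(frame_embeds: torch.Tensor) -> float:
    """frame_embeds: (T, D) CLIP image embeddings, one row per frame."""
    e = F.normalize(frame_embeds, dim=-1)
    # Cosine similarity between each pair of consecutive frames.
    sims = (e[:-1] * e[1:]).sum(dim=-1)
    return 100.0 * sims.mean().item()

score = clip_temp_score(torch.randn(16, 512))  # toy embeddings
```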
Model | Resolution | Training Data | Pretrained Weight | VQAA↑ | VQAT↑ | Blip_bleu↑ | SD_score↑ | Clip_temp_score↑ | Warping_error↓
---|---|---|---|---|---|---|---|---|---
STDiT | 256×256 | Ours-0.4M | PixArt-α | 11.11 | 12.46 | 24.55 | 67.96 | 99.81 | 0.0105
STDiT | 512×512 | Ours-0.4M | STDiT-256 | 65.15 | 59.57 | 23.73 | 68.24 | 99.80 | 0.0089
MVDiT | 256×256 | Ours-0.4M | PixArt-α | 22.39 | 14.15 | 23.72 | 67.73 | 99.71 | 0.0091
MVDiT | 256×256 | OpenVid-1M | PixArt-α | 24.87 | 14.57 | 24.01 | 67.64 | 99.75 | 0.0081
MVDiT | 512×512 | OpenVid-1M | MVDiT-256 | 66.65 | 63.96 | 24.14 | 68.31 | 99.83 | 0.0008
Ablation study on model architecture (STDiT vs. our MVDiT), training data, and progressive-resolution training.
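The ablation follows a progressive-resolution schedule: train at 256×256 from PixArt-α weights, then fine-tune at 512×512 from the 256 checkpoint. One detail such a warm start typically requires is resizing learnable positional embeddings to the larger token grid; the sketch below assumes 2D learnable embeddings and bicubic resampling, which is a common choice but not necessarily what was used here:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int):
    """Resize (1, old_grid*old_grid, D) positional embeddings to a
    (1, new_grid*new_grid, D) grid via bicubic interpolation."""
    d = pos_embed.size(-1)
    grid = pos_embed.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)

# Example: 256-res model (16×16 patches) warm-starting a 512-res model (32×32).
new_pe = resize_pos_embed(torch.randn(1, 16 * 16, 512), 16, 32)
```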
[1] Gen2 https://research.runwayml.com/gen2
[2] Pika https://pikalabs.org/
[3] VideoCrafter2 https://github.com/AILab-CVC/VideoCrafter
[4] OpenSoraPlan-V1.1 https://github.com/PKU-YuanGroup/Open-Sora-Plan
[5] Lavie https://github.com/Vchitect/LaVie
[6] Show-1 https://github.com/showlab/Show-1
[7] Latte https://github.com/Vchitect/Latte
[8] Modelscope https://github.com/modelscope/modelscope
[9] OpenSora-V1.1 https://github.com/hpcaitech/Open-Sora
[10] WebVid-10M Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
[11] Panda-50M Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70M: Captioning 70M videos with multiple cross-modality teachers. arXiv preprint arXiv:2402.19479, 2024.
[12] Luma https://lumalabs.ai/dream-machine