Text-to-video (T2V) generation has recently garnered significant attention, largely due to the advanced multi-modal model Sora. However, T2V generation in the research community still faces two major challenges: 1) the absence of a precise, high-quality open-source dataset. Popular video datasets such as WebVid-10M and Panda-70M are either of low quality or too large for most research institutions, and collecting precise, high-quality text-video pairs is both challenging and essential for T2V generation. 2) Inadequate utilization of textual information. Recent T2V methods focus on vision transformers, employing a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from text prompts.
To address these issues, we introduce OpenVid-1M, a high-quality dataset with expressive captions. This open-scenario dataset comprises over 1 million text-video pairs, facilitating T2V generation research. Additionally, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Furthermore, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT), capable of extracting structural information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies demonstrate the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
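At the architecture level, MVDiT replaces the usual single text-conditioned cross-attention with joint modeling of visual and text tokens. As a rough, hypothetical illustration of that idea (not the released implementation; all names and dimensions below are chosen for the example), a joint visual-text attention block in PyTorch might look like:

```python
import torch
import torch.nn as nn

class JointTokenAttention(nn.Module):
    """Attend over visual and text tokens jointly so each modality can
    read structure/semantics from the other (illustrative only)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # Concatenate both modalities into one sequence and self-attend,
        # letting every visual token see every text token and vice versa.
        tokens = torch.cat([visual_tokens, text_tokens], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)
        # Hand the updated visual tokens back to the diffusion backbone.
        return tokens[:, : visual_tokens.size(1)]

# Example: 16 frames × 64 patches of visual tokens plus 77 text tokens.
block = JointTokenAttention()
vis = torch.randn(2, 16 * 64, 512)
txt = torch.randn(2, 77, 512)
out = block(vis, txt)  # shape: (2, 1024, 512)
```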
Comparison of video statistics between OpenVid-1M and Panda-50M.
Dataset | Scenario | Video clips | Aesthetics and Clarity | Resolution | Caption
---|---|---|---|---|---
UCF101 | Action | 13K | Low | 320×240 | N/A
Taichi-HD | Human | 3K | Low | 256×256 | N/A
SkyTimelapse | Sky | 35K | Medium | 640×360 | N/A
FaceForensics++ | Face | 1K | Medium | Diverse | N/A
WebVid | Open | 10M | Low | 596×336 | Short
ChronoMagic | Metamorphic | 2K | High | Diverse | Long
CelebV-HQ | Portrait | 35K | High | 512×512 | N/A
OpenSoraPlan-V1.0 | Open | 400K | Medium | 512×512 | Long
Panda | Open | 70M | Medium | Diverse | Short
OpenVid-1M (Ours) | Open | 1M | High | Diverse | Long
OpenVidHD-0.4M (Ours) | Open | 433K | High (1080p) | 1920×1080 | Long
Comparison with previous text-to-video datasets. Our OpenVid-1M is a million-scale, high-quality, open-scenario video dataset for training high-fidelity text-to-video models.
Video clarity distribution of OpenVid-1M. We also present four face samples to visualize the clarity differences: faces outlined in green are blurry, with low clarity scores, while those outlined in red are clearer, with high clarity scores.
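To make the curation step concrete, below is a minimal sketch of how clips could be filtered into OpenVidHD-0.4M using per-clip clarity and aesthetic scores plus a 1080p resolution check. The metadata field names and thresholds are illustrative assumptions, not the paper's exact pipeline:

```python
import json

def select_hd_clips(metadata_path: str,
                    min_clarity: float = 0.7,
                    min_aesthetic: float = 0.5):
    """Keep clips that clear hypothetical clarity/aesthetic thresholds
    and are at least 1080p (field names are assumptions)."""
    with open(metadata_path) as f:
        clips = json.load(f)  # one metadata dict per video clip
    return [
        c for c in clips
        if c["clarity"] >= min_clarity
        and c["aesthetic"] >= min_aesthetic
        and c["width"] >= 1920 and c["height"] >= 1080
    ]
```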
Method | Resolution | Training Data | VQAA↑ | VQAT↑ | Blip_bleu↑ | SD_score↑ | Clip_temp_score↑ | Warping_error↓
---|---|---|---|---|---|---|---|---
LaVie [5] | 512×320 | Vimeo25M | 63.77 | 42.59 | 22.38 | 68.18 | 99.57 | 0.0089
Show-1 [6] | 576×320 | WebVid-10M | 23.19 | 44.24 | 23.24 | 68.42 | 99.77 | 0.0067
OpenSora-V1.1 [9] | 512×512 | Self-collected 10M | 22.04 | 23.62 | 23.60 | 67.66 | 99.66 | 0.0170
OpenSoraPlan-V1.1 [4] | 512×512 | Self-collected 4.8M | 51.16 | 58.19 | 23.21 | 68.43 | 99.95 | 0.0026
Latte [7] | 512×512 | Self-collected 330K | 55.46 | 48.93 | 22.39 | 68.06 | 99.59 | 0.0203
VideoCrafter [3] | 1024×576 | WebVid-10M; LAION-600M | 66.18 | 58.93 | 22.17 | 68.73 | 99.78 | 0.0295
ModelScope [8] | 1280×720 | Self-collected (billions) | 40.06 | 32.93 | 22.54 | 67.93 | 99.74 | 0.0162
Pika [2] | 1088×612 | Unknown | 59.09 | 64.96 | 21.14 | 68.57 | 99.97 | 0.0006
Ours | 1024×1024 | OpenVid-1M | 73.46 | 68.58 | 23.45 | 68.04 | 99.87 | 0.0052
Comparison with state-of-the-art text-to-video generation models.
Resolution | Training Data | VQAA↑ | VQAT↑ | Blip_bleu↑ | SD_score↑ | Clip_temp_score↑ | Warping_error↓
---|---|---|---|---|---|---|---
256×256 | WebVid-10M [10] | 13.40 | 13.34 | 23.45 | 67.64 | 99.62 | 0.0138
256×256 | Panda-50M [11] | 17.08 | 9.60 | 24.06 | 67.47 | 99.60 | 0.0200
256×256 | OpenVid-1M (Ours) | 17.78 | 12.98 | 24.93 | 67.77 | 99.75 | 0.0134
1024×1024 | WebVid-10M (4× super-resolution) | 69.26 | 65.74 | 23.15 | 67.60 | 99.64 | 0.0137
1024×1024 | Panda-50M (4× super-resolution) | 63.25 | 53.21 | 23.60 | 67.44 | 99.57 | 0.0163
1024×1024 | OpenVidHD-0.4M (Ours) | 73.46 | 68.58 | 23.45 | 68.04 | 99.85 | 0.0132
Comparison of models trained on different datasets at 256×256 and 1024×1024 resolution.
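One way to read Clip_temp_score in the tables above is as the mean cosine similarity between CLIP embeddings of consecutive frames, scaled by 100, so higher values indicate smoother, more temporally consistent videos. A minimal sketch under that assumption (the benchmark's exact implementation may differ), operating on precomputed frame embeddings:

```python
import torch
import torch.nn.functional as F

def clip_temp_score(frame_embeds: torch.Tensor) -> float:
    """frame_embeds: (T, D) CLIP image embeddings, one row per frame."""
    e = F.normalize(frame_embeds, dim=-1)
    # Cosine similarity between each pair of consecutive frames.
    sims = (e[:-1] * e[1:]).sum(dim=-1)
    return 100.0 * sims.mean().item()

score = clip_temp_score(torch.randn(16, 512))  # toy embeddings
```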
Model | Resolution | Training Data | Pretrained Weight | VQAA↑ | VQAT↑ | Blip_bleu↑ | SD_score↑ | Clip_temp_score↑ | Warping_error↓
---|---|---|---|---|---|---|---|---|---
STDiT | 256×256 | Ours-0.4M | PixArt-α | 11.11 | 12.46 | 24.55 | 67.96 | 99.81 | 0.0105
STDiT | 512×512 | Ours-0.4M | STDiT-256 | 65.15 | 59.57 | 23.73 | 68.24 | 99.80 | 0.0089
MVDiT | 256×256 | Ours-0.4M | PixArt-α | 22.39 | 14.15 | 23.72 | 67.73 | 99.71 | 0.0091
MVDiT | 256×256 | OpenVid-1M | PixArt-α | 24.87 | 14.57 | 24.01 | 67.64 | 99.75 | 0.0081
MVDiT | 512×512 | OpenVid-1M | MVDiT-256 | 66.65 | 63.96 | 24.14 | 68.31 | 99.83 | 0.0008
Ablation study on model architecture (STDiT vs. our MVDiT), training data, and progressive-resolution training.
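The ablation follows a progressive-resolution schedule: train at 256×256 from PixArt-α weights, then fine-tune at 512×512 from the 256 checkpoint. One detail such a warm start typically requires is resizing learnable positional embeddings to the larger token grid; the sketch below assumes 2D learnable embeddings and bicubic resampling, which is a common choice but not necessarily what was used here:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int):
    """Resize (1, old_grid*old_grid, D) positional embeddings to a
    (1, new_grid*new_grid, D) grid via bicubic interpolation."""
    d = pos_embed.size(-1)
    grid = pos_embed.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)

# Example: 256-res model (16×16 patches) warm-starting a 512-res model (32×32).
new_pe = resize_pos_embed(torch.randn(1, 16 * 16, 512), 16, 32)
```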
[1] Gen2 https://research.runwayml.com/gen2
[2] Pika https://pikalabs.org/
[3] VideoCrafter2 https://github.com/AILab-CVC/VideoCrafter
[4] OpenSoraPlan-V1.1 https://github.com/PKU-YuanGroup/Open-Sora-Plan
[5] Lavie https://github.com/Vchitect/LaVie
[6] Show-1 https://github.com/showlab/Show-1
[7] Latte https://github.com/Vchitect/Latte
[8] Modelscope https://github.com/modelscope/modelscope
[9] OpenSora-V1.1 https://github.com/hpcaitech/Open-Sora
[10] WebVid-10M Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
[11] Panda-50M Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70M: Captioning 70M videos with multiple cross-modality teachers. arXiv preprint arXiv:2402.19479, 2024.
[12] Luma https://lumalabs.ai/dream-machine