DiVE: DiT-based Video Generation with Enhanced Control

¹Harbin Institute of Technology (Shenzhen), ²Li Auto Inc., ³Tsinghua University, ⁴Westlake University, ⁵National University of Singapore
(*Co-first authors. Corresponding Author)

DiVE won first place in the Corner Case Scene Generation track of the Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving challenge.


The overall architecture of the proposed method and its individual components.

Abstract

Generating high-fidelity, temporally consistent videos of autonomous driving scenarios remains a significant challenge, for example for problematic maneuvers in corner cases. Although recent video generation works, such as models built on Diffusion Transformers (DiT), have been proposed to tackle this problem, work that explores their potential for multi-view video generation is still missing. We propose the first DiT-based framework specifically designed to generate temporally and multi-view consistent videos that precisely match a given bird's-eye-view layout. The framework uses a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency, and integrates joint cross-attention modules and a ControlNet-Transformer to further improve the precision of control. To demonstrate these advantages, we conduct extensive qualitative comparisons on the nuScenes dataset, particularly in some of the most challenging corner cases. The results show that the proposed method produces long, controllable, and highly consistent videos under difficult conditions.
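To make the idea of a parameter-free view-inflated attention more concrete, here is a minimal NumPy sketch of one plausible reading: the tokens of all camera views are folded into a single sequence, so that an existing self-attention layer attends across views without any new weights. The function names, tensor layout, and projection matrices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Plain single-head self-attention over tokens of shape (batch, seq, dim).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def view_inflated_attention(tokens, w_q, w_k, w_v):
    # Hypothetical layout: tokens has shape (batch, views, seq, dim).
    # Folding the view axis into the sequence axis lets every token attend
    # to tokens from all views while reusing the same projection weights,
    # so no parameters are added.
    b, n_views, s, d = tokens.shape
    flat = tokens.reshape(b, n_views * s, d)
    out = self_attention(flat, w_q, w_k, w_v)
    return out.reshape(b, n_views, s, -1)
```

With a single view, this sketch reduces to ordinary self-attention. With several views, a change in one view's tokens influences the output for every other view, which is the property that cross-view consistency relies on.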

Long video generation on nuScenes dataset

Long videos generated by DiVE (up to 240 frames at 12 Hz) on the nuScenes dataset.

Sunny Day

Rainy Day

At Night

Quality of Controllable Generation


Quantitative comparison with MagicDrive. DTC, CTC and IQ represent DINO Temporal Consistency, CLIP Temporal Consistency and Imaging Quality, respectively. The best performances are presented in bold.
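This page does not define how DTC and CTC are computed. A common way to score temporal consistency, assumed here purely for illustration, is the mean cosine similarity between embeddings of consecutive frames, with the embeddings coming from a DINO or CLIP image encoder. The sketch below takes precomputed per-frame features as input; the function name and input layout are hypothetical.

```python
import numpy as np

def temporal_consistency(features):
    # features: array of shape (frames, dim), one embedding per video frame.
    # Assumption: the embeddings come from an image encoder such as DINO or
    # CLIP; feature extraction itself is outside the scope of this sketch.
    f = np.asarray(features, dtype=np.float64)
    if f.ndim != 2 or f.shape[0] < 2:
        raise ValueError("expected at least two frame embeddings of shape (frames, dim)")
    # L2-normalise each embedding so the dot product is a cosine similarity.
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    # Cosine similarity between each frame and the next one.
    sims = np.sum(f[:-1] * f[1:], axis=1)
    return float(sims.mean())
```

Under this definition a static video scores 1.0, and lower values indicate larger frame-to-frame changes in the embedding space.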


Qualitative comparison of multi-view videos generated by our model and MagicDrive.


A scene-editing use case.

Challenge Results

