GlobalPaint: Spatiotemporal Coherent Video Outpainting with Global Feature Guidance


Abstract

Video outpainting, the process of filling in missing regions at the edges of videos based on existing context, poses substantial challenges in maintaining both local and global coherence. In this paper, we introduce GlobalPaint, a novel approach for video outpainting. GlobalPaint adopts a hierarchical processing framework and employs a diffusion-based model enriched with Enhanced Spatiotemporal (EST) modules and guided by global features. Our EST modules extend pretrained spatial layers by incorporating 3D windowed attention layers alongside conventional 1D temporal layers, ensuring seamless frame transitions for local coherence. To enhance global coherence, GlobalPaint efficiently distills OpenCLIP features into a compact set of global features and integrates them into the outpainting process through cross-attention operations. Comprehensive evaluations on benchmark datasets demonstrate that GlobalPaint surpasses state-of-the-art models in both image quality and motion naturalness, setting a new reference point for video outpainting.
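To make the three attention patterns named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a video feature tensor of shape (T, H, W, C), uses identity query/key/value projections instead of learned ones, and picks a hypothetical 2x2x2 window. The function names (`temporal_attention_1d`, `windowed_attention_3d`, `global_cross_attention`) are illustrative only. How GlobalPaint actually distills OpenCLIP features is not described here, so `global_feats` is simply an assumed array of K feature vectors.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention; tokens live on the second-to-last axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def temporal_attention_1d(x):
    # Each spatial location attends only across the T frames (1D temporal layer).
    t, h, w, c = x.shape
    tokens = x.transpose(1, 2, 0, 3).reshape(h * w, t, c)  # (H*W, T, C)
    out = attention(tokens, tokens, tokens)
    return out.reshape(h, w, t, c).transpose(2, 0, 1, 3)


def windowed_attention_3d(x, window=(2, 2, 2)):
    # Tokens attend within a local (time, height, width) window.
    # The window size is an assumption made for this sketch.
    t, h, w, c = x.shape
    wt, wh, ww = window
    assert t % wt == 0 and h % wh == 0 and w % ww == 0
    nt, nh, nw = t // wt, h // wh, w // ww
    # Split every axis into (number of windows, window size).
    x = x.reshape(nt, wt, nh, wh, nw, ww, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # (nt, nh, nw, wt, wh, ww, C)
    tokens = x.reshape(-1, wt * wh * ww, c)  # one row of tokens per window
    out = attention(tokens, tokens, tokens)
    # Undo the window partition.
    out = out.reshape(nt, nh, nw, wt, wh, ww, c)
    out = out.transpose(0, 3, 1, 4, 2, 5, 6)  # (nt, wt, nh, wh, nw, ww, C)
    return out.reshape(t, h, w, c)


def global_cross_attention(x, global_feats):
    # Every video token queries the K global feature vectors of shape (K, C).
    t, h, w, c = x.shape
    q = x.reshape(t * h * w, c)
    out = attention(q, global_feats, global_feats)
    return out.reshape(t, h, w, c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 4, 4, 8))      # (T, H, W, C)
    global_feats = rng.normal(size=(3, 8))     # assumed: K=3 global tokens
    # Residual combination; the actual layer ordering is an assumption.
    y = feats + temporal_attention_1d(feats)
    y = y + windowed_attention_3d(y)
    y = y + global_cross_attention(y, global_feats)
    print(y.shape)
```

The 1D temporal layer only links the same pixel location across frames, while the 3D window lets a token also see its spatial neighbours in nearby frames. The cross-attention step is the only one in this sketch that injects information from outside the local neighbourhood.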


Demo Results

[Video gallery: seven examples, each pairing a source video with the result outpainted by GlobalPaint.]



Comparison

[Video comparisons: four examples showing the source video, M3DDM, and GlobalPaint side by side, followed by two examples showing the source video, MagicEdit, and GlobalPaint.]



Ablation Study

[Video ablation 1: source video, Baseline, EST (Enhanced Spatiotemporal modules), and GlobalPaint.]

[Video ablation 2: source video, Sequential, and Hierarchical.]