Video outpainting, the task of filling in missing regions at the edges of a video from the existing context, makes it difficult to maintain both local and global coherence. In this paper, we introduce GlobalPaint, a novel approach to video outpainting. GlobalPaint adopts a hierarchical processing framework built on a diffusion-based model that is augmented with Enhanced Spatiotemporal (EST) modules and guided by global features. The EST modules extend pretrained spatial layers with 3D windowed attention layers alongside conventional 1D temporal layers, promoting smooth frame-to-frame transitions for local coherence. To improve global coherence, GlobalPaint distills OpenCLIP features into a compact set of global features and injects them into the outpainting process through cross-attention. Comprehensive evaluations on benchmark datasets demonstrate that GlobalPaint surpasses state-of-the-art models in both image quality and motion naturalness, establishing a new state of the art in video outpainting.
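To illustrate how 3D windowed attention differs from conventional 1D temporal attention, the following minimal sketch lists which tokens would attend to one another under each scheme. It is not the GlobalPaint implementation: the non-overlapping partition, the window size, the grid size, and the function names `temporal_groups` and `window_groups_3d` are all assumptions made for illustration, since the abstract does not specify them.

```python
from itertools import product


def temporal_groups(T, H, W):
    """1D temporal attention: one group per spatial location, spanning all frames."""
    return [[(t, y, x) for t in range(T)] for y, x in product(range(H), range(W))]


def window_groups_3d(T, H, W, wt, wh, ww):
    """3D windowed attention (assumed non-overlapping): one group per
    (wt, wh, ww) block of neighbouring tokens in time, height, and width."""
    if T % wt or H % wh or W % ww:
        raise ValueError("grid dimensions must be divisible by the window size")
    groups = []
    for t0, y0, x0 in product(range(0, T, wt), range(0, H, wh), range(0, W, ww)):
        groups.append([
            (t, y, x)
            for t in range(t0, t0 + wt)
            for y in range(y0, y0 + wh)
            for x in range(x0, x0 + ww)
        ])
    return groups


# Hypothetical 4-frame, 4x4 token grid with 2x2x2 windows.
temporal = temporal_groups(4, 4, 4)
windows = window_groups_3d(4, 4, 4, 2, 2, 2)
```

In this sketch a temporal group only links a token to the same spatial position in other frames, whereas a 3D window also links it to spatially neighbouring tokens in nearby frames, which is the kind of local spatiotemporal context the 1D layers alone do not see.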