PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention

Abstract

We propose PostCam, a framework for novel-view video generation that enables post-capture editing of camera trajectories in dynamic scenes. Existing methods either rely solely on camera parameters, which leads to imprecise motion control, or rely on rendered proxy videos, which are vulnerable to depth-estimation errors. In contrast, PostCam integrates both modalities via a novel query-shared cross-attention module that aligns 6-DoF camera poses with 2D rendered frames in a shared feature space. To further improve generation quality, we introduce a two-stage training strategy: the model first learns coarse camera control from pose inputs alone, then incorporates visual information from the rendered frames to refine motion accuracy and visual fidelity. Experiments on both real-world and synthetic datasets show that PostCam outperforms state-of-the-art methods by over 20% in camera control precision and achieves the highest video quality. Our code will be released publicly.
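
To make the idea of query sharing concrete, the following is a minimal NumPy sketch of one possible reading of the module, not the authors' implementation. It assumes that a single query projection of the video tokens is reused by two cross-attention branches, one attending to camera-pose tokens and one to rendered-frame tokens, and that the two outputs are fused by a residual sum. The function names, the parameter dictionary, the 12-value pose encoding (a flattened 3x4 extrinsic matrix), and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def init_params(d_model, d_pose, d_render, rng):
    """Random projection matrices (hypothetical layout, illustration only)."""
    def w(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)

    return {
        "W_q": w(d_model, d_model),          # the single, shared query projection
        "W_k_pose": w(d_pose, d_model),
        "W_v_pose": w(d_pose, d_model),
        "W_k_render": w(d_render, d_model),
        "W_v_render": w(d_render, d_model),
    }


def query_shared_cross_attention(x, pose_tokens, render_tokens, params):
    """Attend from video tokens to both conditioning streams with one query.

    x:             (n_video_tokens, d_model) video latent tokens
    pose_tokens:   (n_pose_tokens, d_pose) camera-pose embeddings
    render_tokens: (n_render_tokens, d_render) rendered-frame features
    """
    # Assumption: the query is computed once and reused by both branches,
    # so both modalities are read out in the same query space.
    q = x @ params["W_q"]

    out_pose = attend(
        q, pose_tokens @ params["W_k_pose"], pose_tokens @ params["W_v_pose"]
    )
    out_render = attend(
        q, render_tokens @ params["W_k_render"], render_tokens @ params["W_v_render"]
    )

    # Assumption: residual sum as the fusion rule.
    return x + out_pose + out_render


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(d_model=8, d_pose=12, d_render=16, rng=rng)
    x = rng.standard_normal((5, 8))         # 5 video tokens
    pose = rng.standard_normal((3, 12))     # 3 frames, flattened 3x4 extrinsics
    render = rng.standard_normal((7, 16))   # 7 rendered-frame patch features
    print(query_shared_cross_attention(x, pose, render, params).shape)
```

Under the same reading, a first training stage that uses pose inputs only would correspond to dropping the `out_render` term, with the rendered-frame branch enabled in the second stage; how the actual model schedules this is not specified in the abstract.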