VASE: Object-Centric Shape and Appearance Manipulation of Real Videos

1University of Trento   2Picsart AI Research   3UT Austin   4Georgia Tech
arXiv · Code
TL;DR: We edit an object in a real video, modifying its shape through an edited keyframe and its appearance through a driving image.

Abstract

Recently, several works have tackled the video editing task, fostered by the success of large-scale text-to-image generative models. However, most of these methods edit the frame holistically from text, exploiting the prior of foundation diffusion models and focusing on improving temporal consistency across frames. In this work, we introduce an object-centric framework designed both to control the object's appearance and, notably, to execute precise and explicit structural modifications on it. We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control. We evaluate our method on the image-driven video editing task, showing performance comparable to the state of the art while showcasing novel shape-editing capabilities.
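As context for the temporal layers mentioned in the abstract, below is a minimal PyTorch sketch of the common "inflation" technique: a frame-axis self-attention layer whose output projection is zero-initialized, so that the inflated network initially reproduces the pre-trained image model. This illustrates the general technique only; it is not VASE's exact architecture, and all names and shapes are assumptions.

        # Minimal sketch of adding a temporal attention layer to a pre-trained
        # 2D diffusion UNet. Illustrative only, not VASE's implementation.
        import torch
        import torch.nn as nn

        class TemporalAttention(nn.Module):
            """Self-attention over the frame axis, initialized close to identity."""

            def __init__(self, channels: int, num_heads: int = 8):
                super().__init__()
                self.norm = nn.LayerNorm(channels)
                self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
                # Zero-init the output projection so the inflated model initially
                # behaves exactly like the pre-trained image model.
                nn.init.zeros_(self.attn.out_proj.weight)
                nn.init.zeros_(self.attn.out_proj.bias)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                # x: (batch, frames, channels, height, width)
                b, t, c, h, w = x.shape
                # Fold spatial positions into the batch so attention mixes frames only.
                tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
                q = self.norm(tokens)
                attended, _ = self.attn(q, q, q)
                tokens = tokens + attended  # residual connection
                return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

        if __name__ == "__main__":
            video_feats = torch.randn(2, 8, 64, 16, 16)  # (B, T, C, H, W)
            print(TemporalAttention(channels=64)(video_feats).shape)  # (2, 8, 64, 16, 16)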

Method

We enable video editing by conditioning the synthesis of the video on two branches: one controls the appearance, while the other is responsible for the motion and structure of the object. To enable shape modifications, we propose a Joint Flow-Structure Augmentation pipeline that outputs an augmented flow, which is processed by a Flow-Completion Network before being fed to the final ControlNet module. Furthermore, we introduce an auxiliary loss to enhance the model's fidelity to the input segmentation map.
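To make the data flow concrete, here is a schematic PyTorch sketch of how the two branches could be wired together. Every module and interface below (appearance encoder, Flow-Completion Network, ControlNet, UNet) is an assumed placeholder for illustration, not VASE's released code, and the auxiliary loss is written in one plausible form (binary cross-entropy against the segmentation map) rather than the paper's exact formulation.

        # Schematic sketch of the two-branch conditioning described above.
        # All sub-modules and their interfaces are assumed placeholders.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class TwoBranchEditor(nn.Module):
            def __init__(self, appearance_encoder, flow_completion_net, controlnet, unet):
                super().__init__()
                self.appearance_encoder = appearance_encoder    # appearance branch
                self.flow_completion_net = flow_completion_net  # completes the augmented flow
                self.controlnet = controlnet                    # motion/structure branch
                self.unet = unet                                # temporally-extended diffusion UNet

            def forward(self, noisy_latents, timestep, driver_image, augmented_flow):
                # Appearance branch: embed the driver image.
                appearance_emb = self.appearance_encoder(driver_image)
                # Structure branch: the augmented flow from the Joint Flow-Structure
                # Augmentation is filled in by the Flow-Completion Network before it
                # conditions the ControlNet.
                completed_flow = self.flow_completion_net(augmented_flow)
                control_residuals = self.controlnet(noisy_latents, timestep, completed_flow)
                # Denoising step conditioned on both branches.
                return self.unet(noisy_latents, timestep, appearance_emb, control_residuals)

        def auxiliary_segmentation_loss(predicted_logits, target_mask):
            """One plausible form of the auxiliary loss: binary cross-entropy
            against the input segmentation map (an assumption, not the paper's
            exact formulation)."""
            return F.binary_cross_entropy_with_logits(predicted_logits, target_mask)

        if __name__ == "__main__":
            pred = torch.randn(1, 1, 64, 64)
            target = (torch.rand(1, 1, 64, 64) > 0.5).float()
            print(auxiliary_segmentation_loss(pred, target).item())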

Results

Joint Appearance and Shape Edits

The objective is to alter the structure of one object in a source video, guided by an edited initial keyframe; simultaneously, a driver image is employed to edit the appearance. Note that the edit is restricted to one specific object, and the background should remain unaffected. The shape edit is shown for each frame only for visualization purposes; in practice, only the first keyframe is provided.
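To summarize what the model receives in this task, the small snippet below lists the inputs as tensors; the shapes and field names are assumptions for exposition, not a public API.

        # Illustrative task inputs for the joint shape-and-appearance edit.
        # Shapes and names are assumptions, not a released interface.
        from dataclasses import dataclass
        import torch

        @dataclass
        class JointEditInputs:
            source_frames: torch.Tensor    # (T, 3, H, W) source video
            edited_keyframe: torch.Tensor  # (1, H, W) edited object shape, frame 0 only
            driver_image: torch.Tensor     # (3, H, W) appearance reference

        inputs = JointEditInputs(
            source_frames=torch.randn(16, 3, 256, 256),
            edited_keyframe=torch.zeros(1, 256, 256),
            driver_image=torch.randn(3, 256, 256),
        )
        # The shape edit is specified once, on the first frame; the model is
        # responsible for propagating it to the remaining T - 1 frames.
        assert inputs.edited_keyframe.dim() == 3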

[Video comparison grid, four examples. Columns: Source Video · Shape Edits · Shape-NLA [1] · VASE (Ours)]


Image-Driven Appearance Editing

The model is tasked with editing a target object in a source video under the guidance of a driver image. It is important to emphasize that both the structure of the object and the background should remain unchanged.
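As an illustration of this constraint, the snippet below sketches a generic sanity check: the mean absolute change outside the object mask should be near zero. This is an illustrative check written for exposition, not an evaluation metric from the paper.

        # Generic sanity check (not a metric from the paper): the edited video
        # should differ from the source only inside the target object's mask.
        import torch

        def background_drift(source, edited, object_mask):
            """Mean absolute change outside the object mask; near zero is good.

            source, edited: (T, 3, H, W) videos in [0, 1]
            object_mask:    (T, 1, H, W) with 1 inside the edited object
            """
            background = 1.0 - object_mask
            diff = (source - edited).abs() * background
            return diff.sum() / (background.sum() * source.shape[1] + 1e-8)

        if __name__ == "__main__":
            src = torch.rand(8, 3, 64, 64)
            edit = src.clone()
            mask = torch.zeros(8, 1, 64, 64)
            mask[:, :, 16:48, 16:48] = 1.0
            edit[:, :, 16:48, 16:48] = torch.rand(8, 3, 32, 32)  # change only the object
            print(background_drift(src, edit, mask).item())  # ~0.0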

[Video comparison grid, two examples. Columns: Source Video · Reference Image · PAIR-Diffusion [2] · FILM [3] · TokenFlow [4] · VASE (Ours)]


BibTeX


@misc{peruzzo2024vase,
  title={VASE: Object-Centric Appearance and Shape Manipulation of Real Videos},
  author={Elia Peruzzo and Vidit Goel and Dejia Xu and Xingqian Xu and Yifan Jiang and Zhangyang Wang and Humphrey Shi and Nicu Sebe},
  year={2024},
  eprint={2401.02473},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

[1] Yao-Chih Lee, Ji-Ze Genevieve Jang, Yi-Ting Chen, Elizabeth Qiu, Jia-Bin Huang. Shape-aware Text-driven Layered Video Editing. CVPR, 2023.

[2] Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Xingqian Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, Humphrey Shi. PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor. arXiv preprint, 2023.

[3] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, Brian Curless. FILM: Frame Interpolation for Large Motion. ECCV, 2022.

[4] Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel. TokenFlow: Consistent Diffusion Features for Consistent Video Editing. arXiv preprint, 2023.