Text-Driven Stylization of Video Objects

Sebastian Loeschcke1, Serge Belongie2, Sagie Benaim2

   Aarhus University1 University of Copenhagen2

Paper Presentation


We tackle the task of stylizing video objects in an intuitive and semantic manner following a user-specified text prompt. This is a challenging task, as the resulting video must satisfy multiple properties: (1) it has to be temporally consistent and avoid jittering or similar artifacts, (2) the resulting stylization must preserve both the global semantics of the object and its fine-grained details, and (3) it must adhere to the user-specified text prompt. To this end, our method stylizes an object in a video according to two target texts: the first describes the global semantics and the second describes the local semantics. To modify the style of an object, we harness the representational power of CLIP to obtain a similarity score between (1) the local target text and a set of local stylized views, and (2) the global target text and a set of stylized global views. We use a pretrained atlas decomposition network to propagate the edits in a temporally consistent manner. We demonstrate that our method can generate consistent style changes over time for a variety of objects and videos that adhere to the specification of the target texts. We also show how varying the specificity of the target texts and augmenting the texts with a set of prefixes results in stylizations with different levels of detail.
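The CLIP-based objective described above can be sketched as follows: a mean cosine-distance between a batch of stylized-view embeddings and a single target-text embedding, computed once for local views and once for global views. This is a minimal sketch, not the paper's exact loss; the 512-dimensional random tensors stand in for real CLIP encoder outputs, and `clip_style_loss` is an illustrative helper name.

```python
import torch
import torch.nn.functional as F


def clip_style_loss(view_embs, text_emb):
    """Mean cosine distance between a batch of view embeddings
    (n_views, d) and one target-text embedding (d,).
    Lower is better: stylized views align with the target text."""
    view_embs = F.normalize(view_embs, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = view_embs @ text_emb          # cosine similarity per view, shape (n_views,)
    return (1.0 - sim).mean()


# Illustrative stand-ins for CLIP encoder outputs (CLIP's embedding dim is 512
# for ViT-B models); in practice these come from encode_image / encode_text.
local_views = torch.randn(8, 512)    # local crops of the stylized object
local_text = torch.randn(512)        # e.g. "Crocodile skin"
global_views = torch.randn(4, 512)   # full-frame stylized views
global_text = torch.randn(512)       # e.g. "A swan with crocodile skin"

total = clip_style_loss(local_views, local_text) \
      + clip_style_loss(global_views, global_text)
```

In practice, gradients of this objective flow back through the stylization network's edit parameters, while the atlas decomposition propagates the resulting edit consistently across frames.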

TL;DR We propose a framework for stylizing objects in videos using a text prompt.


Our results show how we can successfully apply a stylization to a variety of objects and videos. We also demonstrate how using prefixes and increasing the specificity of the target texts results in more detailed stylizations.

Example Results

The following results show how we can successfully apply a stylization to a variety of objects and videos for a different set of target texts. The first video corresponds to the input video and the others correspond to stylizations applied to the input video.

Target Text Specificity

The following results show an experiment in which we vary the specificity of the target text and observe how it affects the stylization. The first row shows the swan examples and the second row shows the boat examples. Global target text prompts:

Prefix Augmentations

The following results show how varying the number of prefixes used for the target text affects the stylization. The experiment configurations are (a) no prefixes, (b) 4 local, no global, (c) 4 global, no local, (d) 4 global & 4 local, (e) 8 global & 8 local. The specific prefixes are described in Section 6.3 of the paper. The examples use the following texts:
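Prefix augmentation amounts to scoring each target text under several prefixed variants and averaging the CLIP scores over them. The sketch below illustrates only the text construction; the prefix strings are hypothetical placeholders, since the actual set is listed in Section 6.3 of the paper.

```python
def augment_with_prefixes(target_text, prefixes):
    """Pair a target text with each prefixed variant; CLIP similarity
    scores are then averaged over all returned prompts."""
    return [target_text] + [f"{p} {target_text}" for p in prefixes]


# Hypothetical prefixes for illustration; see Section 6.3 for the real set.
local_prefixes = ["a photo of", "a close-up of", "the texture of", "the surface of"]
prompts = augment_with_prefixes("crocodile skin", local_prefixes)
# configuration (b) above would use 4 such local prefixes and no global ones
```

The same construction is applied independently to the global target text, which is how the (global, local) prefix counts in configurations (a)–(e) are varied.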

Ablation of Losses

The following results show the effect of each of our loss terms. We ablate each loss term and show how it affects the stylization and its quality. The global target text used: "A swan with crocodile skin". The local target text used: "Crocodile skin".

Global and Local Semantics

The following videos show how changing the local and global target texts affects the stylization. The first row of target texts shows the global target text and the second row shows the local target text:



This research was supported by the Pioneer Centre for AI, DNRF grant number P1.