Text-Driven Stylization of Video Objects

Sebastian Loeschcke1, Serge Belongie2, Sagie Benaim2

   Aarhus University1 University of Copenhagen2

Paper Presentation


We tackle the task of stylizing video objects in an intuitive and semantic manner following a user-specified text prompt. This is a challenging task, as the resulting video must satisfy multiple properties: (1) it has to be temporally consistent and avoid jittering or similar artifacts, (2) the resulting stylization must preserve both the global semantics of the object and its fine-grained details, and (3) it must adhere to the user-specified text prompt. To this end, our method stylizes an object in a video according to two target texts: the first describes the global semantics and the second describes the local semantics. To modify the style of an object, we harness the representational power of CLIP to obtain a similarity score between (1) the local target text and a set of local stylized views, and (2) the global target text and a set of stylized global views. We use a pretrained atlas decomposition network to propagate the edits in a temporally consistent manner. We demonstrate that our method can generate consistent style changes over time for a variety of objects and videos that adhere to the specification of the target texts. We also show how varying the specificity of the target texts and augmenting the texts with a set of prefixes results in stylizations with different levels of detail.
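The CLIP-based objective described above can be sketched as follows: a mean cosine-distance between a batch of stylized-view embeddings and a single target-text embedding, computed once for local views and once for global views. This is a minimal sketch, not the paper's exact loss; the 512-dimensional random tensors stand in for real CLIP encoder outputs, and `clip_style_loss` is an illustrative helper name.

```python
import torch
import torch.nn.functional as F


def clip_style_loss(view_embs, text_emb):
    """Mean cosine distance between a batch of view embeddings
    (n_views, d) and one target-text embedding (d,).
    Lower is better: stylized views align with the target text."""
    view_embs = F.normalize(view_embs, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = view_embs @ text_emb          # cosine similarity per view, shape (n_views,)
    return (1.0 - sim).mean()


# Illustrative stand-ins for CLIP encoder outputs (CLIP's embedding dim is 512
# for ViT-B models); in practice these come from encode_image / encode_text.
local_views = torch.randn(8, 512)    # local crops of the stylized object
local_text = torch.randn(512)        # e.g. "Crocodile skin"
global_views = torch.randn(4, 512)   # full-frame stylized views
global_text = torch.randn(512)       # e.g. "A swan with crocodile skin"

total = clip_style_loss(local_views, local_text) \
      + clip_style_loss(global_views, global_text)
```

In practice, gradients of this objective flow back through the stylization network's edit parameters, while the atlas decomposition propagates the resulting edit consistently across frames.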

TL;DR We propose a framework for stylizing objects in videos using a text prompt.


Our results show how we can successfully apply a stylization to a variety of objects and videos. We also demonstrate how using prefixes and increasing the specificity of the target texts results in more detailed stylizations.

Example Results

The following results show how we can successfully apply a stylization to a variety of objects and videos for a different set of target texts. The first video corresponds to the input video and the others correspond to stylizations applied to the input video.

Target Text Specificity

The following results show an experiment in which we vary the specificity of the target text and observe how it affects the stylization. The first row shows the swan examples and the second row shows the boat examples. Global target text prompts:

Prefix Augmentations

The following results show how varying the number of prefixes used for the target text affects the stylization. The experiment configurations are (a) no prefixes, (b) 4 local, no global, (c) 4 global, no local, (d) 4 global & 4 local, (e) 8 global & 8 local. The specific prefixes are described in Section 6.3 of the paper. The examples use the following texts:
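Prefix augmentation amounts to scoring each target text under several prefixed variants and averaging the CLIP scores over them. The sketch below illustrates only the text construction; the prefix strings are hypothetical placeholders, since the actual set is listed in Section 6.3 of the paper.

```python
def augment_with_prefixes(target_text, prefixes):
    """Pair a target text with each prefixed variant; CLIP similarity
    scores are then averaged over all returned prompts."""
    return [target_text] + [f"{p} {target_text}" for p in prefixes]


# Hypothetical prefixes for illustration; see Section 6.3 for the real set.
local_prefixes = ["a photo of", "a close-up of", "the texture of", "the surface of"]
prompts = augment_with_prefixes("crocodile skin", local_prefixes)
# configuration (b) above would use 4 such local prefixes and no global ones
```

The same construction is applied independently to the global target text, which is how the (global, local) prefix counts in configurations (a)–(e) are varied.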

Ablation of Losses

The following results show the effect of each of our loss terms. We ablate each loss term and show how it affects the stylization and its quality. The global target text used: "A swan with crocodile skin". The local target text used: "Crocodile skin".

Global and Local Semantics

The following videos show how changing the local and global target texts affects the stylization. The first row of target texts shows the global target text and the second row shows the local target text:



This research was supported by the Pioneer Centre for AI, DNRF grant number P1.