

Poster

Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion

Xuantong Liu · Tianyang Hu · Wenjia Wang · Kenji Kawaguchi · Yuan Yao

Hall C 4-9 #413
Tue 23 Jul 4:30 a.m. PDT — 6 a.m. PDT

Abstract:

As a dominant force in text-to-image generation, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images under the supervision of discriminative VLMs, the proposed method can potentially achieve better text-image alignment. As a proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance image fidelity, a Score Distillation Sampling (SDS) module based on Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-CompBench. The code is available at https://github.com/Pepper-lll/VLMinv.
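The abstract describes an optimization loop rather than a sampler: the image itself is optimized to maximize a discriminative VLM's matching score for the prompt, with an SDS term from Stable Diffusion added for fidelity. Below is a minimal sketch of that loop, assuming PyTorch. The callables blip2_itm_score and sds_grad are hypothetical placeholders standing in for a BLIP-2 image-text matching head and a Score Distillation Sampling gradient; this is not the authors' implementation (see the linked repository for that).

import torch

def optimize_image(prompt, blip2_itm_score, sds_grad,
                   steps=500, lr=0.05, lambda_sds=0.1):
    # Start from noise and optimize the pixels directly; optimizing a latent
    # code under a decoder would work analogously.
    image = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Alignment term: maximize the discriminative VLM's image-text
        # matching score for the target prompt (hypothetical callable).
        align_loss = -blip2_itm_score(image.clamp(0, 1), prompt)
        align_loss.backward()
        # Fidelity term: SDS supplies a gradient directly rather than a
        # scalar loss, so add it to the image gradient before the step.
        with torch.no_grad():
            image.grad.add_(lambda_sds * sds_grad(image.clamp(0, 1), prompt))
        opt.step()
    return image.detach().clamp(0, 1)

In this sketch, lambda_sds controls the trade-off between the two gradients, mirroring the balancing of alignment and fidelity mentioned in the abstract.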
