Timezone: »

LLM-grounded Text-to-Image Diffusion Models
Long (Tony) Lian · Boyi Li · Adam Yala · Trevor Darrell
Event URL: https://llm-grounded-diffusion.github.io/ »

Recent advancements in text-to-image generation with diffusion models have yielded remarkable results in synthesizing highly realistic and diverse images. However, these models still encounter difficulties when it comes to generating images based on prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with a prompt describing the image to be generated, the LLM outputs a scene layout in the form of captioned bounding boxes along with a background caption. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that require both language and spatial reasoning. Furthermore, our method naturally supports dialog-based scene specification and is able to handle prompts in languages that are not well-supported by the underlying diffusion model.

Author Information

Long (Tony) Lian (UC Berkeley)
Boyi Li (UC Berkeley / NVIDIA)
Adam Yala (UC Berkeley and UCSF)
Trevor Darrell (University of California at Berkeley)

More from the Same Authors