Improving the Diffusability of Autoencoders
Abstract
Lay Summary
In recent years, image and video generation models have rapidly advanced, with both industry and academia investing heavily. Most of these models follow the latent diffusion approach: an autoencoder first compresses images or videos into a smaller latent space, and then a diffusion model is trained to generate samples in that space.

So far, most work has focused on improving the autoencoder's reconstruction quality and compression rate. But our work shows that the choice of autoencoder has a deeper effect: it shapes how well a diffusion model can generate realistic outputs. We call this diffusability: how easy it is for a diffusion model to learn to generate in a given representation space.

Diffusion models build images by gradually refining noise, starting from a blurry outline and adding details step by step. This process tends to struggle with high-frequency details (like textures or fine edges), where errors can accumulate. Normally, the human eye is less sensitive to these errors in pixel space. But we found that some autoencoders place more emphasis on high frequencies in their latent space than RGB images do. As a result, critical image structures get encoded in unstable high-frequency components, making them harder for the diffusion model to learn and sample correctly.

To address this, we introduce a simple training technique: during autoencoder training, we downsample the latent representation and require the decoder to still produce a meaningful reconstruction. This encourages the autoencoder to store important information in more robust, low-frequency components.

We show that this small change leads to large improvements. It makes latent spaces more suitable for diffusion models, improving both image and video generation quality on benchmarks like ImageNet and Kinetics.
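To make the frequency argument concrete, the following is a minimal sketch, not the paper's code, of one way to compare how much spectral energy latents and RGB pixels carry at high frequencies: apply a 2D FFT per channel and measure the fraction of energy above a radial frequency cutoff. The names `encoder`, `images`, and `dataloader` in the usage comment are placeholders.

```python
import torch


def highfreq_energy_ratio(x: torch.Tensor, cutoff: float = 0.5) -> torch.Tensor:
    """Fraction of spectral energy above `cutoff` * Nyquist for a batch of
    2D feature maps `x` of shape (B, C, H, W)."""
    # Per-channel 2D power spectrum, shifted so zero frequency is centered.
    spec = torch.fft.fftshift(torch.fft.fft2(x.float()), dim=(-2, -1))
    power = spec.abs() ** 2

    # Radial frequency grid, normalized so 1.0 corresponds to Nyquist.
    _, _, H, W = x.shape
    fy = torch.linspace(-1, 1, H, device=x.device).view(H, 1)
    fx = torch.linspace(-1, 1, W, device=x.device).view(1, W)
    radius = torch.sqrt(fy ** 2 + fx ** 2)

    # Energy in the high-frequency band relative to total energy.
    high = power * (radius > cutoff)
    return high.sum() / power.sum()


# Hypothetical usage: a noticeably higher ratio for latents than for pixels
# would indicate the autoencoder pushes information into high frequencies.
# images = next(iter(dataloader))   # (B, 3, H, W) RGB batch
# latents = encoder(images)         # (B, c, h, w) latent maps
# print(highfreq_energy_ratio(images), highfreq_energy_ratio(latents))
```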
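The training technique itself can likewise be illustrated with a short sketch. The following is a minimal PyTorch-style training step assuming a generic convolutional `encoder`/`decoder` pair whose decoder accepts latents of different spatial sizes; the loss weight `lam`, the downsampling `factor`, and the use of bilinear resizing are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def training_step(encoder, decoder, images, lam: float = 0.25, factor: int = 2):
    # Standard autoencoder reconstruction loss on the full-resolution latent.
    z = encoder(images)                    # (B, c, h, w)
    recon = decoder(z)
    loss_recon = F.mse_loss(recon, images)

    # Regularizer: decode a downsampled latent and compare against a
    # correspondingly downsampled image, encouraging the autoencoder to
    # store important content in low-frequency latent components.
    z_low = F.interpolate(z, scale_factor=1 / factor,
                          mode="bilinear", align_corners=False)
    recon_low = decoder(z_low)
    target_low = F.interpolate(images, size=recon_low.shape[-2:],
                               mode="bilinear", align_corners=False)
    loss_low = F.mse_loss(recon_low, target_low)

    return loss_recon + lam * loss_low
```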