Improving the Diffusability of Autoencoders
Abstract
Lay Summary
In recent years, image and video generation models have rapidly advanced, with both industry and academia investing heavily. Most of these models follow the latent diffusion approach: an autoencoder first compresses images or videos into a smaller latent space, and then a diffusion model is trained to generate samples in that space.

So far, most work has focused on improving the autoencoder's reconstruction quality and compression rate. But our work shows that the choice of autoencoder has a deeper effect: it shapes how well a diffusion model can generate realistic outputs. We call this diffusability: how easy it is for a diffusion model to learn to generate in a given representation space.

Diffusion models build images by gradually refining noise, starting from a blurry outline and adding details step by step. This process tends to struggle with high-frequency details (like textures or fine edges), where errors can accumulate. Normally, the human eye is less sensitive to these errors in pixel space. But we found that some autoencoders place more emphasis on high frequencies in their latent space than RGB images do. As a result, critical image structures get encoded in unstable high-frequency components, making them harder for the diffusion model to learn and sample correctly.

To address this, we introduce a simple training technique: during autoencoder training, we downsample the latent representation and require the decoder to still produce a meaningful reconstruction. This encourages the autoencoder to store important information in more robust, low-frequency components.

We show that this small change leads to large improvements. It makes latent spaces more suitable for diffusion models, improving both image and video generation quality on benchmarks like ImageNet and Kinetics.
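To make the frequency argument concrete, the following is a minimal sketch, not the paper's code, of one way to compare how much spectral energy latents and RGB pixels carry at high frequencies: apply a 2D FFT per channel and measure the fraction of energy above a radial frequency cutoff. The names `encoder`, `images`, and `dataloader` in the usage comment are placeholders.

```python
import torch


def highfreq_energy_ratio(x: torch.Tensor, cutoff: float = 0.5) -> torch.Tensor:
    """Fraction of spectral energy above `cutoff` * Nyquist for a batch of
    2D feature maps `x` of shape (B, C, H, W)."""
    # Per-channel 2D power spectrum, shifted so zero frequency is centered.
    spec = torch.fft.fftshift(torch.fft.fft2(x.float()), dim=(-2, -1))
    power = spec.abs() ** 2

    # Radial frequency grid, normalized so 1.0 corresponds to Nyquist.
    _, _, H, W = x.shape
    fy = torch.linspace(-1, 1, H, device=x.device).view(H, 1)
    fx = torch.linspace(-1, 1, W, device=x.device).view(1, W)
    radius = torch.sqrt(fy ** 2 + fx ** 2)

    # Energy in the high-frequency band relative to total energy.
    high = power * (radius > cutoff)
    return high.sum() / power.sum()


# Hypothetical usage: a noticeably higher ratio for latents than for pixels
# would indicate the autoencoder pushes information into high frequencies.
# images = next(iter(dataloader))   # (B, 3, H, W) RGB batch
# latents = encoder(images)         # (B, c, h, w) latent maps
# print(highfreq_energy_ratio(images), highfreq_energy_ratio(latents))
```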
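The training technique itself can likewise be illustrated with a short sketch. The following is a minimal PyTorch-style training step assuming a generic convolutional `encoder`/`decoder` pair whose decoder accepts latents of different spatial sizes; the loss weight `lam`, the downsampling `factor`, and the use of bilinear resizing are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def training_step(encoder, decoder, images, lam: float = 0.25, factor: int = 2):
    # Standard autoencoder reconstruction loss on the full-resolution latent.
    z = encoder(images)                    # (B, c, h, w)
    recon = decoder(z)
    loss_recon = F.mse_loss(recon, images)

    # Regularizer: decode a downsampled latent and compare against a
    # correspondingly downsampled image, encouraging the autoencoder to
    # store important content in low-frequency latent components.
    z_low = F.interpolate(z, scale_factor=1 / factor,
                          mode="bilinear", align_corners=False)
    recon_low = decoder(z_low)
    target_low = F.interpolate(images, size=recon_low.shape[-2:],
                               mode="bilinear", align_corners=False)
    loss_low = F.mse_loss(recon_low, target_low)

    return loss_recon + lam * loss_low
```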