ArcDAE: Asymmetric Rectified Contrastive Diffusion Autoencoder for Unified Representation Learning
Abstract
The unification of generative detail and discriminative semantics poses a structural paradox for \textit{diffusion-based representation learning}. Early approaches decouple semantics from generation and thereby compromise representational completeness (i.e., \textit{information split}). Recent bridge-based methods achieve unification through a tightly coupled mapping, but they suffer from \textit{information overload}: unconstrained reconstruction objectives incentivize the encoder to entangle high-frequency stochastic noise into the latent bottleneck. To resolve this paradox, we introduce the \textit{asymmetric rectified contrastive diffusion autoencoder} (ArcDAE), which rebuilds the diffusion bridge as a \textit{dynamic sifter}. By imposing a \textit{timestep-aware rectification constraint} that orthogonalizes the semantic manifold against the stochastic noise space, ArcDAE compels the bottleneck to distill discriminative features while actively shedding high-frequency redundancy. Our approach thus escapes the overload trap without reverting to decoupling. Extensive experiments validate our FFHQ-trained ArcDAE, which surpasses state-of-the-art methods by up to 6.4\% on downstream semantic regression and 9.7\% on reconstruction fidelity.
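To make the core idea concrete, the timestep-aware rectification constraint can be sketched as a penalty on the alignment between the semantic latent and the sampled diffusion noise, weighted by the timestep. This is an illustrative sketch only, not the paper's actual loss: the function name, the linear weight $w(t) = t/T$, and the squared-cosine penalty are all assumptions chosen for clarity.

```python
import numpy as np

def rectification_loss(z_sem, eps, t, T=1000):
    """Hypothetical sketch of a timestep-aware rectification constraint.

    Penalizes the cosine alignment between the semantic latent z_sem and the
    sampled noise eps, so the semantic manifold is pushed orthogonal to the
    stochastic noise space. The weight w(t) = t / T (an assumption, not the
    paper's schedule) emphasizes high-noise timesteps, where high-frequency
    stochastic detail dominates the signal.

    z_sem, eps: arrays of shape (batch, dim); t: array of shape (batch,).
    """
    # Normalize both vectors so only their direction matters.
    z = z_sem / (np.linalg.norm(z_sem, axis=-1, keepdims=True) + 1e-8)
    e = eps / (np.linalg.norm(eps, axis=-1, keepdims=True) + 1e-8)
    cos = np.sum(z * e, axis=-1)   # per-sample cosine alignment
    w = t / T                      # timestep-aware weight (assumed linear)
    # Squared cosine is zero iff z_sem is orthogonal to eps.
    return float(np.mean(w * cos ** 2))
```

Under this sketch, a semantic latent orthogonal to the noise incurs zero penalty, while one aligned with the noise is penalized in proportion to the timestep, which is the sifting behavior the abstract describes.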