Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture
Shuchen Xue · Tianyu Xie · Tianyang Hu · Zijin Feng · Jiacheng Sun · Kenji Kawaguchi · Zhenguo Li · Zhi-Ming Ma
2025 Oral
in
Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
Abstract
Efficiently scaling Large Language Models (LLMs) necessitates exploring alternatives to the dominant autoregressive (AR) approach, with Masked Diffusion Models (MDMs) emerging as candidates. However, comparisons between the AR (typically decoder-only) and MDM (often encoder-only) paradigms are confounded by differing architectures, obscuring the true algorithmic and efficiency trade-offs. This work decouples these factors by evaluating MDMs within a decoder-only framework in order to: (1) equitably compare the MDM (viewed as Any-Order AR) and standard AR paradigms by isolating discrepancies that arise from the generation order; and (2) investigate the impact of MDM architectural choices on computational efficiency. We show that decoder-only MDMs, despite operating in a larger modeling space, can achieve significant inference speedups ($\sim25\times$) and comparable perplexity with techniques such as temperature annealing, offering a path to reduced inference compute. This work provides insights for developing more computationally efficient foundation models by disentangling core modeling choices from architectural influences. Code is available at https://github.com/scxue/AO-GPT-MDM.
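For context, the "Any-Order AR" view referenced above is usually expressed through the following training objective (a standard, illustrative formulation; the notation is ours and not necessarily the paper's exact one):
$$\mathcal{L}_{\mathrm{AO}}(\theta) \;=\; \mathbb{E}_{\sigma \sim \mathrm{Unif}(S_L)}\Big[-\sum_{t=1}^{L} \log p_\theta\big(x_{\sigma(t)} \mid x_{\sigma(<t)}\big)\Big],$$
where $x = (x_1, \dots, x_L)$ is a token sequence and $\sigma$ is a uniformly random permutation of positions. Standard AR training is the special case with $\sigma$ fixed to the left-to-right order, while averaging over all orders corresponds, up to a per-position reweighting, to the masked-diffusion (absorbing-state) objective.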