Poster Mon, Jul 6, 2026 • 6:30 PM – 8:15 PM PDT HALL A #2506

ScalingAR: Scaling Confidence for Autoregressive Image Generation

Harold Haodong Chen ⋅ Xianfeng Wu ⋅ Wenjie Shu ⋅ Rongjin Guo ⋅ Disen Lan ⋅ Harry Yang ⋅ Yingcong Chen

Abstract

Test-time strategies have shown remarkable success in improving large language models, but their application to next-token prediction (NTP) autoregressive (AR) image generation remains largely underexplored. Existing test-time scaling (TTS) methods for visual autoregressive models (VAR) rely on frequent partial decoding and external reward models, which are inefficient and often ineffective for NTP-based image generation due to the inherent instability of intermediate decoding results. To address these limitations, we propose ScalingAR, a novel test-time scaling framework tailored for NTP-based AR image generation. ScalingAR introduces token entropy as a confidence signal and operates at two complementary levels: (i) the Profile Level, which integrates intrinsic uncertainty and conditional utilization into a unified confidence state, and (ii) the Policy Level, which leverages this state for adaptive trajectory pruning and dynamic guidance scheduling. Without requiring early decoding or auxiliary rewards, ScalingAR achieves significant improvements across diverse benchmarks. Experiments show that ScalingAR (I) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (II) reduces visual token consumption by 62.0% while outperforming baselines, and (III) enhances robustness, mitigating performance degradation by 26.0% in challenging scenarios. These results establish ScalingAR as a robust and efficient test-time scaling solution for autoregressive image generation.
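
The two levels described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so every function name, the moving-average update, the keep ratio, and the guidance rule below are assumptions made for illustration. The sketch covers only the entropy part of the confidence state and omits the conditional-utilization term. It also assumes that "guidance" means a scalar strength such as a classifier-free guidance scale.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_confidence(probs: Sequence[float]) -> float:
    """Map entropy to [0, 1]: 1 for a one-hot distribution, 0 for a uniform one."""
    vocab_size = len(probs)
    if vocab_size < 2:
        return 1.0
    return 1.0 - token_entropy(probs) / math.log(vocab_size)


def update_confidence(state: float, probs: Sequence[float],
                      momentum: float = 0.9) -> float:
    """Hypothetical profile-level update: exponential moving average of
    per-token confidence along one trajectory (assumed, not from the paper)."""
    return momentum * state + (1.0 - momentum) * normalized_confidence(probs)


def prune_trajectories(states: Sequence[float],
                       keep_ratio: float = 0.5) -> List[int]:
    """Hypothetical policy-level pruning: return the indices of the most
    confident trajectories, always keeping at least one."""
    keep = max(1, int(len(states) * keep_ratio))
    ranked = sorted(range(len(states)), key=lambda i: states[i], reverse=True)
    return sorted(ranked[:keep])


def guidance_scale(state: float, base: float = 3.0,
                   max_boost: float = 2.0) -> float:
    """Hypothetical guidance schedule: strengthen guidance as confidence
    drops (assumed linear rule, not taken from the paper)."""
    return base + max_boost * (1.0 - state)


if __name__ == "__main__":
    # Four candidate trajectories with illustrative confidence states.
    states = [0.2, 0.9, 0.5, 0.7]
    survivors = prune_trajectories(states, keep_ratio=0.5)
    print("kept trajectories:", survivors)
    for i in survivors:
        print(i, "guidance scale:", guidance_scale(states[i]))
```

In this sketch a trajectory is scored purely from its own next-token distributions, so no partial image is decoded and no external reward model is queried, which mirrors the property the abstract highlights.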

Lay Summary

When artificial intelligence generates images, giving it extra "thinking time" to explore different drafts usually improves the final result. However, current methods for doing this in step-by-step image generation are slow and clumsy; they force the AI to constantly pause, fully render a messy, half-finished picture, and ask a separate "critic" program if it looks good. To fix this, we created ScalingAR, a system that allows the AI to rely on its own "gut feeling" or internal confidence. Instead of relying on external critics and early visual checks, ScalingAR constantly monitors the AI's internal uncertainty and how well it is remembering the user's prompt. If the AI intuitively senses a draft is getting too messy or drifting from the instructions, it immediately scraps that path and focuses its energy on the promising drafts. It also dynamically adjusts its focus to stay on track when it senses it might be wandering. The result is a much smarter and more efficient AI artist. By trusting its own internal signals, ScalingAR creates significantly better, more accurate images. Furthermore, it uses 62% less computational effort than older methods and proves to be highly reliable even when given incredibly complex or challenging requests.