ScalingAR: Scaling Confidence for Autoregressive Image Generation
Abstract
Test-time strategies have shown remarkable success in improving large language models, but their application to next-token prediction (NTP) autoregressive (AR) image generation remains largely underexplored. Existing test-time scaling (TTS) methods for visual autoregressive models (VAR) rely on frequent partial decoding and external reward models, which are inefficient and often ineffective for NTP-based image generation because intermediate decoding results are inherently unstable. To address these limitations, we propose ScalingAR, a novel test-time scaling framework tailored for NTP-based AR image generation. ScalingAR introduces token entropy as a confidence signal and operates at two complementary levels: (i) the Profile Level, which integrates intrinsic uncertainty and conditional utilization into a unified confidence state, and (ii) the Policy Level, which leverages this state for adaptive trajectory pruning and dynamic guidance scheduling. Without requiring early decoding or auxiliary rewards, ScalingAR achieves significant improvements across diverse benchmarks. Experiments show that ScalingAR (I) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (II) reduces visual token consumption by 62.0% while outperforming baselines, and (III) enhances robustness, mitigating performance degradation by 26.0% in challenging scenarios. These results establish ScalingAR as a robust and efficient test-time scaling solution for autoregressive image generation.
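To make the idea of token entropy as a confidence signal concrete, the following is a minimal sketch, not the ScalingAR implementation. It computes the Shannon entropy of a next-token distribution from raw logits and, as a deliberately simplified stand-in for trajectory pruning, keeps the candidate trajectories with the lowest mean entropy. The function names `token_entropy` and `prune_by_entropy`, the mean-entropy score, and the fixed `keep` budget are all assumptions made for illustration; the abstract does not specify how the confidence state or pruning rule is actually defined.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one decoding step."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prune_by_entropy(trajectories, keep):
    """Hypothetical pruning rule: keep the `keep` trajectories with lowest mean entropy.

    `trajectories` maps a trajectory name to a list of per-step logit lists.
    Lower mean entropy is treated here as higher confidence (an assumption).
    """
    scores = {
        name: sum(token_entropy(step) for step in steps) / len(steps)
        for name, steps in trajectories.items()
    }
    ranked = sorted(scores, key=scores.get)
    return ranked[:keep]


if __name__ == "__main__":
    candidates = {
        "confident": [[10.0, 0.0, 0.0, 0.0], [8.0, 0.0, 0.0, 0.0]],
        "uncertain": [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0]],
    }
    print(prune_by_entropy(candidates, keep=1))
```

Because the score depends only on the logits already produced during sampling, a rule of this kind needs neither partial image decoding nor an external reward model, which is the property the abstract emphasizes.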