ICML 2026 Orals

Oral

Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

Sagnik Mukherjee ⋅ Lifan Yuan ⋅ Pavan Jayasinha ⋅ Dilek Hakkani-Tür ⋅ Hao Peng

Jul 7, 10:00 AM - 10:15 AM HALL C

Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token-prediction stages (e.g., pretraining and supervised fine-tuning), despite the fundamental differences between RL and these stages emphasized by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both the momentum and the adaptive learning rate of AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam’s per-parameter adaptive learning rates and momentum. Confirming our hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1,000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. Our findings provide fresh insights into the optimization dynamics of RL in LLMs and demonstrate that RL can be substantially more parameter-efficient than previously recognized.
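
The memory gap and the update sparsity mentioned in this abstract can be made concrete with a small sketch. Plain SGD without momentum keeps no per-parameter state, whereas AdamW keeps two moment estimates per parameter; the second helper measures what fraction of parameters changed between two checkpoints. The function names, the 7B parameter count, the 4-byte state width, and the tolerance are illustrative assumptions, not details from the paper.

```python
def optimizer_state_bytes(num_params, optimizer, bytes_per_value=4):
    """Extra memory held by the optimizer, beyond weights and gradients."""
    # Plain SGD (no momentum) stores nothing per parameter;
    # AdamW stores a first-moment and a second-moment estimate.
    states_per_param = {"sgd": 0, "adamw": 2}[optimizer]
    return num_params * states_per_param * bytes_per_value


def updated_fraction(before, after, tol=0.0):
    """Fraction of parameters whose value changed by more than tol."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


if __name__ == "__main__":
    n = 7_000_000_000  # hypothetical 7B-parameter model (assumption)
    print(optimizer_state_bytes(n, "adamw") / 2**30, "GiB of AdamW state")
    print(optimizer_state_bytes(n, "sgd") / 2**30, "GiB of SGD state")
    # Toy checkpoints: one of four parameters moved.
    print(updated_fraction([0.1, 0.2, 0.3, 0.4], [0.1, 0.25, 0.3, 0.4]))
```

Applied to real checkpoints, `updated_fraction` would be run over flattened weight tensors before and after RL fine-tuning; whether an exact or a tolerance-based comparison is appropriate depends on the numeric precision used in training.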

Oral

dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning

Arnav Shah ⋅ Junzhe Li ⋅ Parsa Idehpour ⋅ Adibvafa Fallahpour ⋅ Brandon Wang ⋅ Sukjun Hwang ⋅ BO WANG ⋅ Patrick Hsu ⋅ Hani Goodarzi ⋅ Albert Gu

Jul 7, 10:00 AM - 10:15 AM HALL D2

Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff. Standard subword tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end to end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling $>3 \times$ inference speedup over Transformers. On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision. These results establish dnaHNet as a scalable, interpretable framework for next-generation genomic modeling.

Oral

DiScoFormer: Plug-In Density and Score Estimation with Transformers

Vasily Ilin ⋅ Peter Sushko ⋅ Ranjay Krishna

Jul 7, 10:00 AM - 10:15 AM HALL D1

Estimating probability density and its score from samples remains a core problem in generative modeling, Bayesian inference, and kinetic theory. Existing methods are bifurcated: classical kernel density estimators (KDE) generalize across distributions but suffer from the curse of dimensionality, while modern neural score models achieve high precision but require retraining for every target distribution. We introduce DiScoFormer (Density and Score Transformer), a "train-once, infer-anywhere" equivariant Transformer that maps i.i.d. samples to both density values and score vectors, generalizing across distributions and sample sizes. Analytically, we prove that self-attention can recover normalized KDE, establishing it as a functional generalization of kernel methods; empirically, individual attention heads learn multi-scale, kernel-like behaviors. The model converges faster and achieves higher precision than KDE for density estimation, and provides a high-fidelity plug-in score oracle for score-debiased KDE, Fisher information computation, and Fokker-Planck-type PDEs.
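
The link between attention and kernel estimators can be illustrated with a standard identity, sketched below in one dimension. Softmax attention with scores -(x - x_i)^2 / (2h^2) produces exactly the normalized Gaussian kernel weights, and the score of a Gaussian KDE is an attention-weighted average of (x_i - x) / h^2. This is a textbook illustration under assumed choices (Gaussian kernel, scalar data, a fixed bandwidth `h`), not the construction or proof used in the paper; all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, samples, bandwidth):
    """Softmax attention with scores -(x - x_i)^2 / (2 h^2)."""
    scores = [-((query - x) ** 2) / (2 * bandwidth**2) for x in samples]
    return softmax(scores)


def kde_weights(query, samples, bandwidth):
    """Gaussian kernel values at the query, normalized to sum to one."""
    kernel = [math.exp(-((query - x) ** 2) / (2 * bandwidth**2)) for x in samples]
    total = sum(kernel)
    return [k / total for k in kernel]


def kde_score(query, samples, bandwidth):
    """d/dx log p_hat(x) for a Gaussian KDE, written as an attention readout."""
    w = attention_weights(query, samples, bandwidth)
    return sum(wi * (x - query) for wi, x in zip(w, samples)) / bandwidth**2


if __name__ == "__main__":
    xs = [-1.2, 0.3, 0.9, 2.0]
    print(attention_weights(0.5, xs, 0.7))
    print(kde_weights(0.5, xs, 0.7))  # agrees with the line above up to rounding
    print(kde_score(0.5, xs, 0.7))
```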

Oral

Position: Don't Just "Fix it in Post": A Science of AI Must Study Learning Dynamics

Stella Biderman ⋅ Mohammad Aflah Khan ⋅ Niloofar Mireshghallah ⋅ Catherine Arnett ⋅ Fazl Barez ⋅ Naomi Saphra

Jul 7, 10:00 AM - 10:15 AM GRAND BALLROOM 101-105

What would it mean to have a *scientific* understanding of AI? Language models are not static objects—they are snapshots of time-evolving processes shaped by data, objectives, and optimization dynamics. Yet the field predominantly treats models as fixed artifacts, analyzing behaviors after training rather than asking *why* they emerge. **This position paper argues that AI research should move beyond *post hoc* fixes and study the learning dynamics of models.** We envision a hierarchy of scientific maturity: first *predict* outcomes from early training signals, then *intervene* when trajectories go wrong, ultimately *design* training procedures that guarantee desired properties. Scaling laws have reached the first level for loss; the challenge is extending all three levels to general capabilities, biases, and safety. We articulate requirements for such theories, survey progress across mechanistic interpretability, fairness, memorization, and learning dynamics, and identify concrete open problems. The path forward requires treating models as processes to be understood, not just artifacts to be patched.

Oral

Benchmarking at the Edge of Comprehension

Samuele Marro ⋅ Jialin Yu ⋅ Emanuele La Malfa ⋅ Oishi Deb ⋅ Jiawei Li ⋅ Yibo Yang ⋅ Ebey Abraham ⋅ Sunando Sengupta ⋅ Eric Sommerlade ⋅ Michael Wooldridge ⋅ Phil Torr

Jul 7, 10:00 AM - 10:15 AM HALL B2

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the *post-comprehension regime*. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of *critique-resilient correctness*: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.
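
For readers unfamiliar with Bradley-Terry ranking, the sketch below fits the plain model from pairwise win counts using the classical minorization-maximization update. The abstract describes an itemized bipartite variant that jointly ranks solvers and question generators; that variant is not reproduced here. The function name, the iteration count, and the assumption that every item wins at least once are choices made for this illustration.

```python
def fit_bradley_terry(wins, iterations=200):
    """Fit plain Bradley-Terry strengths.

    wins[i][j] is the number of times item i beat item j.
    Returns strengths normalized to sum to one.
    Assumes at least one recorded win overall (otherwise normalization fails).
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (sum of the two strengths).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


if __name__ == "__main__":
    # Item 0 beat item 1 three times and lost once.
    print(fit_bradley_terry([[0, 3], [1, 0]]))
```

Under the model, the probability that item i beats item j is strengths[i] / (strengths[i] + strengths[j]), so the two-item example recovers the empirical win rate.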

Oral

Asymmetric Perturbation in Solving Bilinear Saddle-Point Optimization

Kenshi Abe ⋅ Mitsuki Sakamoto ⋅ Kaito Ariu ⋅ Atsushi Iwasaki

Jul 7, 10:00 AM - 10:15 AM ASEM BALLROOM 201-203

This paper proposes asymmetric perturbation, where only one player's payoff function is perturbed, for solving bilinear saddle-point optimization problems, commonly arising in minimax problems, game theory, and constrained optimization. Symmetric perturbation is known to require decreasing its strength to ensure convergence to a solution, i.e., an equilibrium in the original game, resulting in a slower rate. First, with asymmetric perturbation, we show that, for a sufficiently small perturbation strength, the equilibrium strategy of the asymmetrically perturbed game coincides with an equilibrium strategy of the original unperturbed game. Second, building on this coincidence, we construct a learning algorithm with a linear last-iterate convergence rate. Third, motivated by the fact that the coincidence relies on the perturbation strength being sufficiently small, we also provide a parameter-free variant, retaining the linear rate. Finally, we empirically demonstrate fast convergence toward equilibria in both normal-form and extensive-form games.

Oral

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Chufan Shi ⋅ Cheng Yang ⋅ Yaokang Wu ⋅ Linghao Jin ⋅ Bo Shui ⋅ Taylor Berg-Kirkpatrick ⋅ Xuezhe Ma

Jul 7, 10:00 AM - 10:15 AM AUDITORIUM

Vision-Language Models (VLMs) often produce self-reflective statements like “let me check the figure again” during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VISUALSWAP, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-BENCH, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io/

Oral

CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Qingdong He ⋅ Chaoyi Wang ⋅ Peng TANG ⋅ Yifan Yang ⋅ Xiaobin Hu

Jul 7, 10:15 AM - 10:30 AM AUDITORIUM

Video subtitle removal is essential for content localization and media re-editing, yet existing mask-guided diffusion methods face critical limitations: training inefficiency requiring extensive annotations and full model fine-tuning, inference complexity demanding explicit mask sequences, and static prior utilization unable to adapt to quality variations. We present CLEAR (Context-aware Learning for End-to-end Adaptive subtitle Removal), a lightweight adapter-based framework addressing these challenges through three technical innovations. First, self-supervised prior learning (Stage I) extracts occlusion guidance from video pairs using pixel differences as weak supervision, eliminating annotation dependency while learning generalizable subtitle features across languages. Second, LoRA-based adaptive refinement (Stage II) enables parameter-efficient training that preserves pre-trained visual priors while achieving true mask-free end-to-end inference without external detection modules. Third, adaptive focal weighting dynamically adjusts prior influence based on local quality assessment, effectively handling diverse subtitle styles and noisy guidance signals. Extensive experiments demonstrate CLEAR's superior performance in multilingual subtitle removal while requiring only 0.77% trainable parameters, establishing a new paradigm for efficient video text removal without inference-time mask dependencies.

Oral

daVinci-Dev: Agent-native Mid-training for Software Engineering

Ji Zeng ⋅ Dayuan Fu ⋅ Tiantian Mi ⋅ Zhuang Yumin ⋅ Yaxing Huang ⋅ Xuefeng Li ⋅ Lyumanshan Ye ⋅ Muhang Xie ⋅ Qishuo Hua ⋅ Zhen Huang ⋅ Mohan Jiang ⋅ Hanning Wang ⋅ Shijie Xia ⋅ Yang Xiao ⋅ Jie Sun ⋅ Yunze Wu ⋅ Pengfei Liu

Jul 7, 10:15 AM - 10:30 AM HALL B2

While the emerging field of agentic software engineering has spurred extensive research into post-training, this paradigm alone does not fully address the distribution mismatch between traditional static pre-training and dynamic deployment environments. In this paper, we instead investigate agentic mid-training as a scalable complementary approach. Central to our approach is *agent-native data* comprising two complementary components: *contextually-native trajectories* that preserve the complete information flow an agent experiences, offering broad coverage and diversity; and *environmentally-native trajectories* whose observations stem from actual tool invocations and test executions, providing interaction authenticity. On `SWE-Bench Verified`, our recipe outperforms the previous open software engineering mid-training recipe `Kimi-Dev` under two post-training settings with the same base model and agentic scaffold, while using fewer than half as many mid-training tokens (73.1B). Furthermore, our 32B and 72B models achieve state-of-the-art resolution rates of **56.1%** and **58.5%** among open agentic recipes using agentic scaffolds, despite starting from non-coder `Qwen2.5` base models. We also observe performance gains on general code generation and scientific benchmarks. We open-source a significant portion of our datasets, recipes, and model checkpoints to facilitate further research.

Oral

Learning Unmasking Policies for Diffusion Language Models

Metod Jazbec ⋅ Theo X. Olausson ⋅ Louis Béthune ⋅ Pierre Ablin ⋅ Michael Kirchhof ⋅ Joao Monteiro ⋅ Victor Guilherme Turrisi da Costa ⋅ Jason Ramapuram ⋅ Marco Cuturi

Jul 7, 10:15 AM - 10:30 AM HALL C

Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the *sampling procedure* that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger block sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation, while outperforming them in the full-diffusion setting. Our code is available at [https://github.com/apple/ml-rl-dllm](https://github.com/apple/ml-rl-dllm).
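
The confidence-thresholding heuristic that serves as the baseline here can be sketched as a single decision step. The function name, the 0.9 threshold, and the fallback that unmasks the most confident position when nothing clears the threshold are assumptions made for illustration; they are not taken from the paper or its repository.

```python
def confidence_threshold_unmask(confidences, masked, threshold=0.9):
    """Choose which masked positions to unmask at one diffusion step.

    confidences: per-position probability of the model's top token.
    masked: set of positions that are still masked.
    Returns a sorted list of positions to unmask now.
    """
    chosen = [i for i in sorted(masked) if confidences[i] >= threshold]
    if not chosen and masked:
        # Assumed fallback to guarantee progress: unmask the single
        # most confident masked position.
        chosen = [max(sorted(masked), key=lambda i: confidences[i])]
    return chosen


if __name__ == "__main__":
    conf = [0.95, 0.50, 0.92, 0.30]
    print(confidence_threshold_unmask(conf, {0, 1, 2, 3}))
    print(confidence_threshold_unmask(conf, {1, 3}))
```

The threshold is the manually tuned quantity: a higher value unmasks fewer tokens per step (slower, typically safer), a lower value unmasks more. A learned policy replaces this fixed rule with a mapping from the confidence vector to unmasking decisions.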

Oral

FLIP2: Expanding Protein Fitness Landscape Benchmarks for Real-World Machine Learning Applications

Kieran Didi ⋅ Sarah Alamdari ⋅ Alex Lu ⋅ Bruce Wittmann ⋅ Kadina Johnston ⋅ Ava Amini ⋅ Ali Madani ⋅ Maya Czeneszew ⋅ Christian Dallago ⋅ Kevin Yang

Jul 7, 10:15 AM - 10:30 AM HALL D2

Machine learning methods that predict protein fitness from sequence remain sensitive to changes in data distributions, limiting generalization across common conditions encountered in protein engineering. Practically, protein engineers are thus left wondering about the effective utility of ML tools. The FLIP benchmark established protocols for testing generalization under some domain shifts, but it was limited to measurements of stability, binding, and viral capsid viability. We introduce FLIP2, a protein fitness benchmark spanning seven new datasets, including enzymes, protein-protein interactions, and light-sensitive proteins, as well as splits that measure generalization relevant to real-world protein engineering campaigns. Evaluating a suite of benchmark models across these datasets and suites reveals that simpler models often match or outperform fine-tuned protein language models on FLIP2, challenging the utility of existing transfer learning techniques. Provenance for all datasets has been recorded, and we redistribute all data under CC-BY 4.0 to facilitate continued progress.

Oral

Mixtures Closest To A Given Measure: A Semidefinite Programming Approach

Srećko Ðurašinović ⋅ Jean B Lasserre ⋅ Victor Magron

Jul 7, 10:15 AM - 10:30 AM ASEM BALLROOM 201-203

Mixture models, such as Gaussian mixture models (GMMs), are widely used in machine learning to represent complex data distributions. A key challenge, especially in high-dimensional settings, is to determine the mixture order and estimate the mixture parameters. We study the problem of approximating a target measure, available only through finitely many of its moments, by a mixture of distributions from a parametric family (e.g., Gaussian, exponential, Poisson), with approximation quality measured by the 2-Wasserstein ($\operatorname{W_2}$) or the total variation ($\operatorname{TV}$) distance. Unlike many existing approaches, the parameter set is not assumed to be finite; it is modeled as a compact basic semi-algebraic set. We introduce a hierarchy of semidefinite relaxations with asymptotic convergence to the desired optimal value. In addition, when a certain rank condition is satisfied, the convergence is even finite and recovery of an optimal mixing measure is obtained. We also present an application to clustering, where our framework serves either as a stand-alone method or as a preprocessing step that yields both the number of clusters and strong initial parameter estimates, thereby accelerating convergence of standard (local) clustering algorithms.

Oral

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

Ishaan Singh Chandok ⋅ Core Francisco Park

Jul 7, 10:15 AM - 10:30 AM GRAND BALLROOM 101-105

Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the “last mile” problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient within our scale range. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.

Oral

LASER: Learning Active Sensing for Continuum Field Reconstruction

Huayu Deng ⋅ Jinghui Zhong ⋅ Xiangming Zhu ⋅ Yunbo Wang ⋅ Xiaokang Yang

Jul 7, 10:15 AM - 10:30 AM HALL D1

High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate "what-if" sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.

Oral

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Lukasz Borchmann ⋅ Jordy Van Landeghem ⋅ Michał Turski ⋅ Shreyansh Padarha ⋅ Ryan Kearns ⋅ Adam Mahdi ⋅ Niels Rogge ⋅ Clémentine Fourrier ⋅ Siwei Han ⋅ Huaxiu Yao ⋅ Artemis Llabrés ⋅ Yiming Xu ⋅ Dimosthenis Karatzas ⋅ Hao Zhang ⋅ Anupam Datta

Jul 7, 10:30 AM - 10:45 AM HALL B2

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behavior, we introduce a novel protocol that measures the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

Oral

Protein Autoregressive Modeling via Multiscale Structure Generation

Yanru Qu ⋅ Cheng-Yen Hsieh ⋅ Zaixiang Zheng ⋅ Ge Liu ⋅ Quanquan Gu

Jul 7, 10:30 AM - 10:45 AM HALL D2

We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Using the hierarchical nature of proteins, PAR generates structures in a way that mimics sculpting a statue, forming a coarse topology and refining structural details over scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the mismatch between the training and generation procedures, which substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.

Oral

AI Engram: In Search of Memory Traces in Artificial Intelligence

Jea Kwon ⋅ Dong-Kyum Kim ⋅ Jiwon Kim ⋅ Yonghyun Kim ⋅ Woong Kook ⋅ MEEYOUNG CHA

Jul 7, 10:30 AM - 10:45 AM GRAND BALLROOM 101-105

Memory formation is fundamental to intelligence, yet whether deep neural networks preserve identifiable memory traces analogous to biological memory units remains an open question. This work introduces a geometric framework to identify such “AI engrams” by formalizing the neuroscientific criteria of specificity, reactivation, sufficiency, and necessity into a constrained inverse problem. We derive a closed-form estimator that isolates individual memory traces from globally entangled parameters, and show that this biologically-derived solution corresponds to a natural gradient update on the parameter manifold. AI engrams enable surgical manipulation of learned knowledge: any subset of memories can be composed or erased through linear arithmetic, without iterative optimization. Experiments ranging from simple MLPs to LLMs demonstrate the causal validity and substantial scalability of AI engrams. Together, these results bridge theories of biological memory and artificial representation learning and offer geometric insight into how deep networks simultaneously support functional specificity within distributed storage.

Oral

Motion Attribution for Video Generation

Xindi Wu ⋅ Despoina Paschalidou ⋅ Jun Gao ⋅ Antonio Torralba ⋅ Laura Leal-Taixé ⋅ Olga Russakovsky ⋅ Sanja Fidler ⋅ Jonathan Lorraine

Jul 7, 10:30 AM - 10:45 AM AUDITORIUM

Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, we improve both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.

Oral

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

Zhongzhi Li ⋅ Xuansheng Wu ⋅ Yijiang Li ⋅ Lijie Hu ⋅ Ninghao Liu

Jul 7, 10:30 AM - 10:45 AM HALL C

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce ***Feature Activation Coverage* (FAC)** which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named **FAC Synthesis**, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.

Oral

Multimodal Nested Learning for Decoupled and Coordinated Optimization

Yanglin Feng ⋅ Yang Qin ⋅ Dezhong Peng ⋅ Rui Wang ⋅ Xiaomin Song ⋅ Peng Hu

Jul 7, 10:30 AM - 10:45 AM HALL D1

Multimodal learning aims to integrate multi-sensor data to exploit their complementary information, embracing a more comprehensive real-world perception and understanding. However, heterogeneous discrepancies across modalities consistently trigger imbalanced multimodal optimization, restricting the joint learning performance. Although existing methods mitigate this issue through optimization modulation and conflict alleviation, they still suffer from entangled optimization and uniform learning pace in conventional monolithic frameworks, limiting the effectiveness of multimodal learning. To address this issue, we propose a novel Multimodal Nested Learning Framework (MoNet), which reformulates the monolithic framework into nested sub-processes, decoupling and coordinating multimodal learning. To achieve this, we present a Decoupled Multimodal Stable Memory block (DMSM) as the outermost nested level, which decouples multimodal learning into independent optimization streams for semantic exploitation across modalities. Additionally, we develop an Adaptive Multimodal Coordinated Fusion block (AMCF), which constitutes the inner nested level. It attempts to coordinate multimodal information integration across multi-timescale nested memories, balancing multimodal fusion. Extensive experimental results on eight datasets across three tasks demonstrate the superiority of MoNet. Code is available at https://github.com/Yangl1nFeng/MoNet.

Oral

On the Convergence Rate of LoRA Gradient Descent

Siqiao Mu ⋅ Diego Klabjan

Jul 7, 10:30 AM - 10:45 AM ASEM BALLROOM 201-203

The low-rank adaptation (LoRA) algorithm for fine-tuning large models has grown popular in recent years due to its remarkable performance and low computational requirements. LoRA trains two "adapter" matrices that form a low-rank representation of the model parameters, thereby massively reducing the number of parameters that need to be updated at every step. Although LoRA is simple, its convergence is poorly understood due to the lack of Lipschitz smoothness, a key condition for classic convergence analyses. As a result, current theoretical results only consider asymptotic behavior or assume strong boundedness conditions which artificially enforce Lipschitz smoothness. In this work, we provide for the first time a non-asymptotic convergence analysis of the *original LoRA gradient descent* algorithm, which reflects widespread practice, without such assumptions. Our work relies on three key steps: i) reformulating the problem in terms of the outer product of the stacked adapter matrices, ii) a modified descent lemma for the "Lipschitz-like" reparametrized function, and iii) controlling the step size. With this approach, we prove that LoRA gradient descent converges to a stationary point at rate $O(\frac{1}{\log T})$, where $T$ is the number of iterations.
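
For orientation, the iteration under study can be written out under the usual LoRA parametrization. The symbols below ($W_0$, $B$, $A$, $\eta$, $\mathcal{L}$, $G_t$, $Z$) are notation assumed for this sketch and may differ from the paper's; the stacked matrix $Z$ is one plausible reading of step i), not a statement of the authors' exact construction.

```latex
% Assumed setup: frozen weights W_0 in R^{m x n}, adapters B in R^{m x r}, A in R^{r x n}.
W = W_0 + BA, \qquad G_t = \nabla_W \mathcal{L}(W_0 + B_t A_t)

% Gradient descent on (B, A) with step size \eta, by the chain rule:
B_{t+1} = B_t - \eta\, G_t A_t^{\top}, \qquad
A_{t+1} = A_t - \eta\, B_t^{\top} G_t

% Stacking the adapters: the product BA appears as an off-diagonal block of Z Z^T.
Z = \begin{pmatrix} B \\ A^{\top} \end{pmatrix} \in \mathbb{R}^{(m+n) \times r}, \qquad
Z Z^{\top} = \begin{pmatrix} B B^{\top} & BA \\ A^{\top} B^{\top} & A^{\top} A \end{pmatrix}
```

Because the gradients with respect to $B$ and $A$ are multiplied by $A_t^{\top}$ and $B_t^{\top}$, which are not bounded a priori, the objective viewed as a function of $(B, A)$ need not have a globally Lipschitz gradient even when $\mathcal{L}$ is smooth in $W$. This is the obstacle the abstract refers to.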

Oral

Riemannian Metric Matching for Scalable Geometric Modeling of Distributions

Jacob Bamberger ⋅ Adam Gosztolai ⋅ Pierre Vandergheynst ⋅ Michael Bronstein ⋅ Iolo Jones

Jul 7, 10:45 AM - 11:00 AM HALL D1

High-dimensional datasets often concentrate near low-dimensional structures, but estimating their geometry from samples typically relies on graphs and kernels that scale poorly with dataset size and dimension. We propose **Riemannian metric matching**: a denoising probabilistic framework for learning the Riemannian geometry of data using neural networks. Specifically, we learn the *carré du champ* operator, which, using diffusion geometry, gives us access to the Riemannian geometry toolkit for downstream machine learning and statistical tasks. Our key observation is that the carré du champ operator can be formulated as a conditional expectation over random perturbations of the data, which can be exploited for sample-wise training and constant cost, amortized inference without explicit kernel construction. Empirically, metric matching rivals or improves the accuracy of $k$-NN-based diffusion geometry estimators, while enabling amortized inference that is up to $400\times$ faster, and supports graph-free geometric analysis on high-dimensional images where nearest neighbors break down.

Oral

VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

Yichen Gong ⋅ Zhuohan Cai ⋅ Sunhao Dai ⋅ Yuqi Zhou ⋅ Zhangxuan Gu ⋅ Changhua Meng ⋅ Shuheng Shen

Jul 7, 10:45 AM - 11:00 AM HALL B2

Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile builds two core evaluation pillars: defining what to evaluate via user-intent-driven task design that reflects real mobile usage, and how to evaluate through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which are largely obscured by coarse-grained evaluations. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents. Code and data are available at https://github.com/inclusionAI/UI-Venus/tree/VenusBench-Mobile.

Oral

Guaranteed Optimal Compositional Explanations for Neurons

Biagio La Rosa ⋅ Leilani Gilpin

Jul 7, 10:45 AM - 11:00 AM GRAND BALLROOM 101-105

Compositional explanations are a family of methods that aim to describe the spatial alignment between neurons' receptive field activations and concepts through logical rules, typically computed via a search over all possible concept combinations. Since computing the spatial alignment over the entire state space is computationally infeasible, the literature commonly adopts assumptions related to the structure of the combinations and beam search to restrict the state space. However, beam search cannot provide any theoretical guarantees of optimality, and it remains unclear how close current explanations are to the true optimum. In this theoretical paper, we address this gap by introducing the first framework for computing guaranteed optimal compositional explanations over the entire state space spanned by the adopted assumptions. Specifically, we propose: (i) a decomposition that identifies the factors influencing the spatial alignment, (ii) a heuristic to estimate the alignment at any stage of the search, and (iii) the first algorithm that can compute optimal compositional explanations in a time comparable to exhaustive beam search. Using this framework, we demonstrate that 10-40% of explanations previously obtained with beam search are suboptimal when overlapping concepts are involved. Finally, we evaluate a beam-search variant guided by our proposed decomposition and heuristic, showing that it matches or improves runtime over prior methods while offering greater flexibility in hyperparameters and computational resources.

Oral

PhotoAgent: Exploratory Visual Aesthetic Planning with Large Vision Models

Mingde Yao ⋅ Zhiyuan You ⋅ King-Man Tam ⋅ Menglu Wang ⋅ Tianfan Xue

Jul 7, 10:45 AM - 11:00 AM AUDITORIUM

With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent significantly outperforms existing methods in both instruction faithfulness and visual quality across a diverse range of editing scenarios.

Oral

Revenue Guarantees of No-Swap-Regret Dynamics in First Price Auctions

Anders Bo Ipsen ⋅ Stratis Skoulakis

Jul 7, 10:45 AM - 11:00 AM ASEM BALLROOM 201-203

We study the revenue of approximate correlated equilibria in discrete first-price auctions, where the set of allowable bids is $\mathcal{B} = \{0, 1/k, \dots, 1 - 1/k, 1\}$ for some $k \in \mathbb{N}$. We show that the revenue of any $\epsilon$-*approximate* correlated equilibrium is at least $v_2 - \Theta(1/k) - \Theta(\epsilon k^2)$, where $v_2 \geq 0$ is the second-highest valuation. Our results establish the first polynomial convergence rates on the revenue generated by no-swap-regret bidders in first-price auctions. For instance, if bidders attain the optimal swap regret of $\mathcal{O}(\sqrt{k T})$, then the time-averaged revenue is at least $v_2 - \Theta(1/k) - \Theta(\epsilon)$ after $\mathcal{O}(k^5/\epsilon^2)$ rounds.
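
The round count in the final sentence can be checked against the revenue bound. The check assumes a standard fact that the abstract does not state: bidders with swap regret $R_T$ over $T$ rounds induce an empirical distribution of play that is an $(R_T/T)$-approximate correlated equilibrium. Writing $\epsilon_T$ for that approximation level and $\epsilon$ for the target revenue loss, and suppressing constants:

$$
\begin{aligned}
\epsilon_T &= \frac{R_T}{T} = \mathcal{O}\!\left(\frac{\sqrt{kT}}{T}\right) = \mathcal{O}\!\left(\sqrt{\frac{k}{T}}\right), \\
\epsilon_T k^2 &= \mathcal{O}\!\left(\frac{k^{5/2}}{\sqrt{T}}\right), \qquad \frac{k^{5/2}}{\sqrt{T}} \le \epsilon \iff T \ge \frac{k^5}{\epsilon^2}.
\end{aligned}
$$

So once $T$ is of order $k^5/\epsilon^2$, the $\Theta(\epsilon_T k^2)$ term in the bound is at most of order $\epsilon$, which matches the stated guarantee.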

Oral

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Shaobo Wang ⋅ Xuan Ouyang ⋅ Tianyi Xu ⋅ Yuzheng Hu ⋅ Jialin Liu ⋅ Guo Chen ⋅ Tianyu Zhang ⋅ Junhao Zheng ⋅ Kexin Yang ⋅ Xingzhang Ren ⋅ Dayiheng Liu ⋅ Linfeng Zhang

Jul 7, 10:45 AM - 11:00 AM HALL C

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, LLM pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. It also outperforms previous data selection methods across different stages of training, including from-scratch pre-training and mid-training. Beyond online selection, the OPUS utility score also demonstrates potential as a static filter for flagging and removing toxic documents from contaminated training corpora prior to training.

Oral

Protein Fold Classification at Scale: Benchmarking and Pretraining

Dexiong Chen ⋅ Andrei Manolache ⋅ Mathias Niepert ⋅ Karsten Borgwardt

Jul 7, 10:45 AM - 11:00 AM HALL D2

Classifying protein topology is essential for deciphering biological function, but progress is held back by the lack of large-scale benchmarks that avoid duplicates and by models that do not scale well. We introduce TEDBench, a large-scale, non-redundant benchmark for protein fold classification constructed from the Encyclopedia of Domains (TED) and Foldseek-clustered AlphaFold structures. We show that on TEDBench, current protein representation learning methods either require very large models or fail to deliver strong performance. To address this challenge, we propose Masked Invariant Autoencoders (MiAE), a self-supervised framework for protein structure representation learning. MiAE uses an extremely high masking ratio of up to $90\%$ with an $\mathrm{SE(3)}$-invariant encoder and a lightweight decoder that reconstructs backbone coordinates from the latent representation and mask tokens. MiAE scales well and outperforms supervised counterparts and state-of-the-art baselines on TEDBench, establishing a strong recipe for protein fold classification. To test transfer beyond AlphaFold structures, we further benchmark on a curated dataset from experimental structures of CATH v4.4. TEDBench is available at https://github.com/BorgwardtLab/TEDBench.

Oral

Controlled LLM Training on Spectral Sphere

Tian Xie ⋅ Haoming Luo ⋅ Haoyu Tang ⋅ Hu Yiwen ⋅ Jason Liu ⋅ Qingnan Ren ⋅ Yang Wang ⋅ Xin Zhao ⋅ Rui Yan ⋅ Bing Su ⋅ Chong Luo ⋅ Baining Guo

Jul 7, 1:30 PM - 1:45 PM HALL C

Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization ($\boldsymbol{\mu}$P) provides a theoretical safeguard for width-invariant $\Theta(1)$ activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the **Spectral Sphere Optimizer (SSO)**, which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully $\boldsymbol{\mu}$P-aligned optimization process. To enable large‑scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.

Oral

Don't Force the Fit: Bounded Log-Likelihood Loss for Enhanced Reasoning in Large Language Models

Feng Zhao ⋅ Hong Zhang ⋅ Yu Yang ⋅ Ruilin Zhao ⋅ Guandong Xu

Jul 7, 1:30 PM - 1:45 PM HALL B2

Supervised fine-tuning (SFT) is central to aligning large language models (LLMs) with instruction following and task-specific reasoning. Despite its success, SFT optimizes token-level likelihoods under the implicit assumption that strictly fitting all tokens in expert demonstrations induces the desired downstream behavior. However, in reasoning tasks where correctness is defined by logical validity or final outcomes rather than exact token realizations, this assumption can lead to optimization misalignment. We empirically observe that low-probability tokens in reasoning demonstrations often correspond to realization-specific or stylistic variations, and that reducing their influence during training consistently improves generalization on reasoning benchmarks. Motivated by this insight, we propose the *Bounded Log-Likelihood Loss* (BLL-Loss), a simple and parameter-free alternative to standard likelihood training that bounds gradient contributions from low-probability tokens while preserving conventional optimization behavior. We provide theoretical insights and extensive empirical results demonstrating that BLL-Loss improves reasoning generalization across diverse model scales and challenging benchmarks.

Oral

From Feasible to Practical: Pareto-Optimal Synthesis Planning

Friedrich Hastedt ⋅ Dongda Zhang ⋅ Antonio Del rio chanona

Jul 7, 1:30 PM - 1:45 PM HALL D1

Current computer-aided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, focusing primarily on convergence or shortest-path metrics. This view is misaligned with real-world practice, where chemists must balance competing objectives such as cost, sustainability, toxicity, and overall yield. To address this, we formulate synthesis planning as a multi-objective search problem and introduce MORetro$^\ast$, an algorithm that generates a Pareto front of synthesis routes to explicitly capture trade-offs between user-defined criteria. MORetro$^\ast$ uses weighted scalarization and solution-informed sampling to efficiently navigate the combinatorial search space and prioritize promising trade-offs. Building on multi-objective A$^\ast$-search, we provide optimality guarantees showing that, for a fixed single-step model, MORetro$^\ast$ recovers the true Pareto front under admissibility. Across multiple retrosynthesis benchmarks, MORetro$^\ast$ produces diverse, high-quality Pareto fronts, uncovering solutions overlooked by single-objective approaches and better aligning CASP outputs with industrial decision-making.

Oral

Diffract: Spectral View of LLM Domain Adaptation

Nikita Borodin ⋅ Maria Krylova ⋅ Artem Zabolotnyi ⋅ Dmitry Aspisov ⋅ Egor Shikov ⋅ Nikita Tyuplyaev ⋅ Oleg Travkin ⋅ Roman Alferov ⋅ Dmitry Vinichenko

Jul 7, 1:30 PM - 1:45 PM ASEM BALLROOM 201-203

We study continual pre-training (CPT) as a mechanism for adapting general-purpose large language models to specialized domains: mathematics, instruction, code, and natural text. Using singular value decomposition of weight matrices, we find that CPT leaves singular value spectra largely invariant, with adaptation driven mainly by changes in singular vectors. An analysis of attention-head projection matrices reveals strong, domain-dependent **head heterogeneity**, which we exploit to define a head importance criterion: up to **60%** of head updates can be removed without measurable quality loss. Selectively rewinding low-importance heads to their pre-trained state improves benchmark accuracy by up to **4%** versus the fully trained baseline. Finally, we identify **domain connectivity**—linear interpolation between CPT checkpoints yields smooth domain-quality interpolation without notable degradation on either domain—and release Diffract, an open-source toolkit for scalable spectral analysis of billion-parameter models.

Oral

Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks

Stefan Huber ⋅ Hannes Unger ⋅ Georg Schäfer ⋅ Jakob Rehrl

Jul 7, 1:30 PM - 1:45 PM GRAND BALLROOM 101-105

We analytically solve the Mountain Car problem, a canonical benchmark in RL, and derive an optimal control solution, closing a gap after 36 years. This enables us to reveal two surprising insights: The optimal control is quite simple, yet modern RL agents display a large gap to optimality. Motivated by the analysis of the optimal control, we introduce Chebyshev policies as a universal (i.e. dense) class of RL policies from first principles. They can be trained as drop-in replacements of neural nets, reducing the regret by a factor of 4.18, while requiring 277 times fewer parameters, fostering sample efficiency, explainability and realtime capability. Chebyshev policies are evaluated on further RL tasks, including a real-world nonlinear motion control testbed. They consistently improve performance over neural nets with PPO, ARS and REINFORCE. Our results demonstrate how Chebyshev policies offer a compelling and lightweight alternative or addition to neural nets for low-dimensional control tasks.
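
To make the idea of a polynomial policy concrete, here is a minimal sketch of a deterministic controller built from a tensor product of Chebyshev polynomials over position and velocity. The parameterization, the clipping to $[-1, 1]$, the function names, and the example coefficients are all assumptions for illustration and may differ from the paper's construction. The state bounds are those of the standard Mountain Car environment.

```python
def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] using T_{k+1} = 2 x T_k - T_{k-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]

def rescale(value, low, high):
    """Map value from [low, high] to [-1, 1], the natural Chebyshev domain."""
    return 2.0 * (value - low) / (high - low) - 1.0

def chebyshev_policy(coeffs, position, velocity):
    """Action from a tensor-product expansion; coeffs is a square matrix and
    coeffs[i][j] multiplies T_i(position) * T_j(velocity)."""
    degree = len(coeffs) - 1
    t_pos = chebyshev_basis(rescale(position, -1.2, 0.6), degree)
    t_vel = chebyshev_basis(rescale(velocity, -0.07, 0.07), degree)
    raw = sum(
        coeffs[i][j] * t_pos[i] * t_vel[j]
        for i in range(degree + 1)
        for j in range(degree + 1)
    )
    # Clip to the continuous Mountain Car action range [-1, 1].
    return max(-1.0, min(1.0, raw))

# Hypothetical coefficients: a single term that pushes in the direction of
# the current velocity (a well-known heuristic, not the paper's optimum).
coeffs = [[0.0, 5.0], [0.0, 0.0]]
print(chebyshev_policy(coeffs, position=-0.5, velocity=0.07))
print(chebyshev_policy(coeffs, position=-0.5, velocity=-0.07))
```

A degree-1 policy of this form has four coefficients, which gives a sense of how small such a parameterization can be compared with a neural network; in practice the coefficients would be trained rather than set by hand.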

Oral

Position: Anthropomorphic Misalignment Research Needs Stronger Evidence

Vansh Gupta ⋅ Peter Nutter ⋅ Samuel Stante ⋅ Andreas Krause ⋅ Florian Tramer ⋅ Lukas Fluri ⋅ Xin Chen ⋅ Anna Hedström

Jul 7, 1:30 PM - 1:45 PM HALL D2

We argue that many Anthropomorphized Misalignment Research (AMR) studies need stronger evidence to ensure that they can provide a robust foundation for critical safety decisions, such as model deployment and regulation. By evaluating failure modes across different misalignment concepts, such as deception, emergent misalignment, and sycophancy, we show how conceptual ambiguity, non-robust datasets and experimental design, and insufficient causal interventions can lead to overinterpretation of model behaviors. This position paper aims to offer guidance on evidentiary considerations that can help improve methodological rigor in AMR. To achieve this, we provide a clear call to action through a proposed framework of evidence levels and a diagnostic checklist. These shared standards will enable more productive scientific discourse and ensure that claims about AI risks rest on solid empirical foundations.

Oral

Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization

Hao Zhang ⋅ Yaru Niu ⋅ Yikai Wang ⋅ Ding Zhao ⋅ Eric Tseng

Jul 7, 1:30 PM - 1:45 PM AUDITORIUM

To improve generalization and resilience in human–robot collaboration (HRC), robots must contend with diverse combinations of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG), where decentralized policy updates deviate from cooperative joint optimization. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALO), a framework that stabilizes decentralized MARL by enforcing Lyapunov-based contraction in policy-parameter space. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALO uses Lyapunov certification to stabilize decentralized policy learning. HALO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.

Oral

Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges

Usman A Khan ⋅ Joseph Durham

Jul 7, 1:45 PM - 2:00 PM AUDITORIUM

We consider anonymous multi-agent path finding (MAPF) where a set of robots is tasked to travel to a set of targets on a finite, connected graph. We show that MAPF can be cast as a special class of multi-marginal optimal transport (MMOT) problems with an underlying Markovian structure, under which the exponentially large MMOT collapses to a linear program (LP) polynomial in size. Focusing on the anonymous setting, we establish conditions under which the corresponding LP is feasible, totally unimodular, and yields min-cost, integral ($\{0,1\}$) transports that do not overlap in both space and time. To adapt the approach to large-scale problems, we cast the MAPF-MMOT in a probabilistic framework via Schrödinger bridges. Under standard assumptions, we show that the Schrödinger bridge formulation reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn-type solution. The Schrödinger bridge, being a probabilistic framework, provides a shadow (fractional) transport that we use as a template to solve a reduced LP and demonstrate that it results in near-optimal, integral transports at a significant reduction in complexity. Extensive experiments highlight the optimality and scalability of the proposed approaches.

Oral

Evaluating Robustness of Reasoning Models on Parameterized Logical Problems

Naïm Es-sebbani ⋅ Esteban Marquer ⋅ Yakoub Salhi ⋅ Zied Bouraoui

Jul 7, 1:45 PM - 2:00 PM HALL B2

Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that test abstraction under renaming and redundant structure. We evaluate LLM-based reasoners on decision accuracy and assignment validity, and quantify robustness under semantics-preserving perturbations such as clause reordering, filler clauses, and variable renaming. Across models, we observe sharp performance transitions under targeted structural interventions even when surface statistics are held fixed, revealing brittleness regimes that are invisible to aggregate SAT accuracy.
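
The implication-graph characterization mentioned above can be sketched as follows: each clause $(a \lor b)$ contributes the edges $\lnot a \to b$ and $\lnot b \to a$, and the formula is unsatisfiable exactly when some variable shares a strongly connected component with its negation. This is the textbook 2-SAT procedure (Kosaraju's algorithm here), not the benchmark's generator code; the function names and example formulas are illustrative assumptions.

```python
def solve_2sat(num_vars, clauses):
    """Clauses are pairs of non-zero ints (DIMACS-style literals).
    Returns a list of booleans (one per variable) or None if UNSAT."""
    n = 2 * num_vars
    node = lambda lit: 2 * (abs(lit) - 1) + (lit < 0)  # x -> even, not x -> odd
    graph = [[] for _ in range(n)]
    reverse = [[] for _ in range(n)]
    for a, b in clauses:
        # (a or b) is equivalent to (not a -> b) and (not b -> a).
        for src, dst in ((node(-a), node(b)), (node(-b), node(a))):
            graph[src].append(dst)
            reverse[dst].append(src)

    # Pass 1: iterative DFS on the implication graph, recording finish order.
    order, seen = [], [False] * n
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, 0)]
        while stack:
            v, i = stack.pop()
            if i < len(graph[v]):
                stack.append((v, i + 1))
                w = graph[v][i]
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, 0))
            else:
                order.append(v)

    # Pass 2: DFS on the reversed graph; labels increase in topological order.
    comp, label = [-1] * n, 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = label
        stack = [start]
        while stack:
            v = stack.pop()
            for w in reverse[v]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1

    assignment = []
    for var in range(num_vars):
        pos, neg = 2 * var, 2 * var + 1
        if comp[pos] == comp[neg]:
            return None  # x and not x imply each other: contradiction cycle
        assignment.append(comp[pos] > comp[neg])
    return assignment

def satisfies(assignment, clauses):
    """Check that every clause has at least one true literal."""
    value = lambda lit: assignment[abs(lit) - 1] == (lit > 0)
    return all(value(a) or value(b) for a, b in clauses)

sat_formula = [(1, 2), (-1, 3), (-2, -3)]
# Contradiction cycle: x1 -> x2 -> not x1 and not x1 -> x3 -> x1.
unsat_formula = [(-1, 2), (-2, -1), (1, 3), (-3, 1)]
print(solve_2sat(3, sat_formula), solve_2sat(3, unsat_formula))
```

Because the decision depends only on the graph structure, reordering clauses or renaming variables leaves the SAT/UNSAT answer unchanged, which is the kind of semantics-preserving perturbation the benchmark applies to LLM-based reasoners.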

Oral

ReViT: Rotational-equivariant Vision Transformers for Neural PDE Solvers

Hao Wei ⋅ Björn List ⋅ Nils Thuerey

Jul 7, 1:45 PM - 2:00 PM HALL D1

Physics obeys strict symmetries like rotational equivariance. However, the standard Transformer architectures widely used in physics foundation models do not enforce these constraints by construction. We introduce ReViT, a rotationally equivariant Vision Transformer framework for neural PDE solvers operating on grid-based physical fields that achieves exact equivariance for the discrete groups $C_4$ (2D) and the chiral octahedral group $O$ (3D), with bounded approximate $\mathrm{SO}(d)$ equivariance for continuous rotations. ReViT maps scalar and vector inputs into locally invariant representations derived from physics-based canonical bases, enabling the use of standard self-attention without symmetry violations. Built on a hierarchical Swin-style backbone with a precomputed reference basis pyramid, ReViT preserves equivariance across multi-scale operations. We evaluate ReViT on a wide range of 2D and 3D PDE benchmarks, such as Magnetohydrodynamics and Turbulent Channel Flows, demonstrating significant gains over state-of-the-art baselines. ReViT exhibits strong generalization and reduces MSE by up to 65% compared with the best-performing alternatives.

Oral

Foundations of Equivariant Deep Learning: Unifying Graph and Sheaf Neural Networks

Yoshihiro Maruyama

Jul 7, 1:45 PM - 2:00 PM ASEM BALLROOM 201-203

Symmetry is everywhere in nature and society. Geometric deep learning exploits symmetries in data to improve the performance and efficiency of deep learning systems. In this paper, we extend geometric deep learning to utilize richer symmetry structures. Specifically, we develop order-equivariant neural networks (OENN), which generalize standard graph message passing and sheaf neural networks via the theory of equivariant bundles over face posets (face categories). We (i) characterize all linear order-equivariant maps, (ii) build OENN layers, and (iii) prove universal approximation theorems (UATs) for continuous order-equivariant maps, which are new results even when restricted to sheaf neural networks (for which no UAT was known before). We illustrate the framework on graph and sheaf models. Our results can also be seen as extending the known UAT for graph neural networks to a more general setting that subsumes sheaf neural networks as well. In addition, we show that OENN can be extended further to CENN, Category-Equivariant Neural Network, which gives the general form of equivariant neural networks as well as of equivariant universal approximation theorems, allowing us to leverage categorical symmetry in data (e.g., non-invertible symmetries on multiple objects with compositional relations on those symmetries).

Oral

Monitoring Monitorability

Melody Guan ⋅ Miles Wang ⋅ Micah Carroll ⋅ Zehao Dou ⋅ Annie Wei ⋅ Marcus Williams ⋅ Benjamin Arnav ⋅ Joost Huizinga ⋅ Ian Kivlichan ⋅ Amelia Glaese ⋅ Jakub Pachocki ⋅ Bowen Baker

Jul 7, 1:45 PM - 2:00 PM HALL D2

Safe deployment of increasingly capable AI agents may require visibility into how they make decisions. Chain-of-thought (CoT) monitoring can detect misbehavior in today’s reasoning models, but this “monitorability” may be fragile under different training procedures, data sources, or continued system scaling. We propose three evaluation archetypes (intervention, process, and outcome-property), a new monitorability metric, and a broad evaluation suite. We show CoT monitoring outperforms action-only monitoring in practical settings, and that frontier models are generally, but not perfectly, monitorable. We study scaling trends with pre-training model size and inference-time compute, finding longer CoTs are typically more monitorable. We find that, for a fixed capability level, using a smaller model at higher reasoning effort can yield higher monitorability, at greater inference compute cost. We further find that increasing a weak monitor’s test-time compute when monitoring a strong agent improves monitorability, and giving the monitor access to the CoT both boosts monitorability and steepens the compute-to-monitorability scaling trend. Finally, we show monitorability can be improved by asking follow-up questions and giving the follow-up CoT to the monitor.

Oral

Detecting the Semantic Fixed Point: A Geometric Framework for Efficient Inference

Jiawei Gu ⋅ Ziyue Qiao ⋅ Xiao Luo

Jul 7, 1:45 PM - 2:00 PM HALL C

Each layer of a Transformer refines the hidden state toward a prediction, an iterative process resembling fixed-point iteration. Yet when should this iteration terminate? Existing early exit methods rely on output confidence as a proxy for internal convergence. We take a more direct approach by examining the geometry of the hidden state trajectory. We find that layer-wise updates exhibit a two-phase structure: large, volatile updates in early layers, followed by small, aligned updates as the model propagates an already-formed representation. The transition is remarkably sharp. This yields a simple criterion: exit when step size vanishes and direction stabilizes. We track the normalized update norm and cosine similarity between consecutive updates, exiting when both indicate convergence. The overhead is $O(d)$ per layer, independent of vocabulary size, requiring no learned components or architectural modifications. On LLaMA-2-7B and LLaMA-2-13B across question answering and commonsense reasoning tasks, this geometric criterion reduces FLOPs by 30-35% while retaining over 98% of full-depth accuracy.

Oral

Maximum Likelihood Reinforcement Learning

Fahim Tajwar ⋅ Guanning Zeng ⋅ Yueer Zhou ⋅ Yuda Song ⋅ Daman Arora ⋅ Yiding Jiang ⋅ Jeff Schneider ⋅ Russ Salakhutdinov ⋅ Haiwen Feng ⋅ Andrea Zanette

Jul 7, 1:45 PM - 2:00 PM GRAND BALLROOM 101-105

Reinforcement learning (RL) is the method of choice for training models in setups where the objective function can only be evaluated by sampling from the model. Our key observation is that when the feedback is terminal and binary, models implicitly induce a likelihood over correct rollouts. Maximum likelihood would be the natural framework in such settings, but RL is used instead as a workaround to the non-differentiability. We prove that the standard, expected-reward RL formulation is only a first-order approximation of the likelihood. To remedy this mismatch, we introduce **Maximum Likelihood Reinforcement Learning (MaxRL)**, a compute-indexed family of sample-based objectives that interpolate between expected-reward RL and maximum likelihood as sampling compute is scaled. The resulting objective is a one-line change to standard RL implementations. MaxRL Pareto-dominates existing methods in all tested models and tasks, achieves up to $\mathbf{20\times}$ gains in test-time scaling efficiency over GRPO, and scales more favorably with additional training data and compute.

Oral

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

Mohammad Taufeeque ⋅ Stefan Heimersheim ⋅ Adam Gleave ⋅ Chris Cundy

Jul 7, 2:00 PM - 2:15 PM HALL D2

Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful output. We construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs, and show that obfuscation emerges in this setting. We introduce a taxonomy of possible outcomes when training against a deception detector. The model either remains honest, or becomes deceptive via two possible obfuscation strategies. (i) *Obfuscated activations*: the model outputs deceptive text while modifying its internal representations to no longer trigger the detector. (ii) *Obfuscated policy*: the model outputs deceptive text that evades the detector, typically by including a justification for the reward hack. Empirically, obfuscated activations arise from representation drift during RL, with or without a detector penalty. The detector penalty only incentivizes obfuscated policies; we theoretically show this is expected for policy gradient methods. Sufficiently high KL regularization and detector penalty can yield honest policies, establishing white-box deception detectors as viable training signals for tasks prone to reward hacking.

Oral

ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

Xinyi Hu ⋅ Yuhao Shen ⋅ Zhang Baolin ⋅ Hengxin Zhang ⋅ Jun Dai ⋅ Shuang Ge ⋅ Chen Lei ⋅ Yue Li ⋅ Mingcheng Wan

Jul 7, 2:00 PM - 2:15 PM HALL C

Speculative Decoding promises to accelerate Large Language Model inference, yet its efficacy often degrades in production-grade scenarios. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high-concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales, particularly the industrial-grade Qwen3-235B, demonstrate that ECHO consistently outperforms state-of-the-art baselines in both low-load and high-load scenarios, achieving up to 5.35$\times$ walltime speedup and delivering over 20% relative speedup gain against the strongest baselines.

Oral

Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

Harin Lee ⋅ Kevin Jamieson

Jul 7, 2:00 PM - 2:15 PM GRAND BALLROOM 101-105

We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$, where $S$ and $A$ are the cardinalities of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs, where their transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this abstract setting, which may be of independent interest.

View full details

Oral

TG-RAG: A Retrieval-Augmented Framework for Reasoning Guidance in Specialized Domains

Liang Su ⋅ Mingyang Zhang ⋅ Yun Xiong ⋅ Tengfei LIU ⋅ Siwei Zhang ⋅ Xi Chen ⋅ Li Sun

Jul 7, 2:00 PM - 2:15 PM HALL B2

Enhancing Large Reasoning Models (LRMs) for specialized domains remains a critical challenge. While recent industrial frameworks attempt to encapsulate Standard Operating Procedures into modular "skills" for dynamic retrieval, utilizing them via context engineering often proves insufficient for complex workflows, leading to "Cognitive Drift." To mitigate this, we propose $\textbf{Thought Guidance-Retrieval Augmented Generation (TG-RAG)}$, a Retrieval-Augmented framework that effectively steers the generation process without relying solely on the model's self-correction. Built upon an Expert Procedure Graph (EPG) that formalizes unstructured SOPs, the framework uniquely employs a dynamic $\textbf{``Interrupt-Retrieve-Generate" (IRG)}$ mechanism to actively inject step-specific directives into the model's reasoning process. Extensive evaluations show that TG-RAG achieves competitive performance, demonstrating advantages in specialized domains by ensuring faithful adherence to domain SOPs. Code is available at https://github.com/V1ncent-S/Thought-Guidance.

View full details

Oral

Rex: A Family of Reversible Exponential (Stochastic) Runge-Kutta Solvers

Zander Blasingame ⋅ Chen Liu

Jul 7, 2:00 PM - 2:15 PM HALL D1

Deep generative models based on neural differential equations have become state-of-the-art for many generation tasks. These models rely on ODE/SDE solvers that integrate from a prior distribution to the data distribution; in many applications it is also highly desirable to integrate in the inverse direction. Standard solvers, however, accumulate discretization errors that prohibit *exact inversion*, an inaccuracy that is unacceptable in precision-critical applications. Existing inversion methods suffer from poor stability and low order of convergence, and are strictly limited to the ODE setting. In this work, we propose *Rex*, a family of reversible exponential (stochastic) Runge-Kutta solvers obtained by applying Lawson methods to convert any explicit (stochastic) Runge-Kutta scheme into an algebraically reversible one for both diffusion ODEs *and* SDEs. Beyond a rigorous theoretical analysis---establishing arbitrary-order convergence and a non-zero region of linear stability---we empirically demonstrate that *Rex* achieves near-machine-precision reconstruction and improves Boltzmann sampling with flow models as well as image generation and editing with diffusion models.

View full details

Oral

On Minimum Depth and Width of Floating-Point Neural Networks for Representing Floating-Point Functions

Sejun Park ⋅ Yeachan Park ⋅ Geonho Hwang

Jul 7, 2:00 PM - 2:15 PM ASEM BALLROOM 201-203

Research on the expressive power of neural networks has identified the minimum depth and width of neural networks that enable universal approximation and memorization. However, existing results are derived under exact arithmetic and cannot be directly applied to real implementations on computers, which can only use a finite set of numbers and inexact machine operations with round-off errors. In this work, we study floating-point ReLU networks that have floating-point parameters and use floating-point operations. Specifically, we investigate their minimum depth and width to represent all functions from the set of floating-point vectors $\mathbb F^d$ to the set of floating-point numbers $\mathbb F$. We first show that the minimum depth for representing all functions from $\mathbb F^d$ to $\mathbb F$ is exactly three, where two layers can be sufficient if we consider a smaller domain and/or codomain. We further show that the minimum width for representing all functions from $\mathbb F^d$ to $\mathbb F$ lies between $2d$ and $2d+4$. In addition, if we restrict the domain to non-negative floats, it lies between $d$ and $d+4$, where it can be smaller for a smaller domain, even beyond $d$. Our results show that the existing results analyzed under exact arithmetic do not extend to the floating-point setup.
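
As a small illustration of why exact-arithmetic reasoning can fail on real hardware, the following Python snippet (IEEE 754 double precision, not the paper's network setting) shows that floating-point addition is not associative and that sufficiently large floats absorb small increments.

```python
# Floating-point addition is not associative: the grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Doubles carry 53 significant bits, so adding 1 to 2**53 is lost to rounding.
big = 2.0 ** 53
print(big + 1.0 == big)  # True
```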

View full details

Oral

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Yinpei Dai ⋅ Hongze Fu ⋅ Jayjun Lee ⋅ Yuejiang Liu ⋅ Haoran Zhang ⋅ Jianing Yang ⋅ Chelsea Finn ⋅ Nima Fazeli ⋅ Joyce Chai

Jul 7, 2:00 PM - 2:15 PM AUDITORIUM

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits systematic understanding, comparison, and progress measurement. To address these challenges, we introduce **RoboMME**: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates *temporal*, *spatial*, *object*, and *procedural* memory. We further develop a suite of 14 memory-augmented VLA variants built on the $\pi_{0.5}$ backbone to systematically explore different memory representations across multiple integration strategies. We show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at https://robomme.github.io

View full details

Oral

Towards Sub-Second Molecular Docking as a Structural Primitive: A Quantized Consistency Diffusion Framework

Kexin Zhang ⋅ Weichen Qin ⋅ Yue Teng ⋅ Jiale Yu ⋅ Yuanyuan Ma ⋅ jinyu lin ⋅ Liping Sun ⋅ Jie Zheng ⋅ Jingyi Yu

Jul 7, 2:15 PM - 2:30 PM HALL D1

Agent-centered scientific discovery is turning scientific models into always-on computational infrastructure. In this paradigm, AI agents coordinate tools, interpret feedback, and drive high-frequency research loops, requiring domain models that are both accurate and callable in real time. Molecular docking exposes this bottleneck: it provides essential structural feedback for drug discovery, yet current high-fidelity docking and co-folding models remain limited by iterative generative refinement and heavy computation. We present a compute-efficient co-folding framework that turns molecular docking into a sub-second structural primitive. Because docking methods operate under different levels of structural prior, we report accuracy under information-level-matched protocols, comparing blind settings with blind generative methods and interface-informed settings with surface- or interface-informed baselines. Our framework combines two ideas. First, Progressive Consistency Regularization (PCR) compresses diffusion dynamics into reliable few-step inference through reconstruction-anchored consistency tuning. Second, Residual-Safe Quantization preserves high-fidelity residual streams and geometry-sensitive operations in BF16 while quantizing selected compute-intensive linear transformations. Our model achieves state-of-the-art docking accuracy under the matched interface-informed protocol, reports blind docking performance separately under the matched blind protocol, and generates five conformations for a representative 256-token complex in 0.17 seconds on a single NVIDIA H20 GPU, delivering a $>300\times$ speedup over AlphaFold3 under the benchmarked setting. Together, these results move molecular docking from an offline generative simulator toward a real-time structural primitive for agent-centered drug discovery.

View full details

Oral

VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

Woojin Kim ⋅ Sieun Hyeon ⋅ Jusang Oh ⋅ Jaeyoung Do

Jul 7, 2:15 PM - 2:30 PM HALL D2

Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, a unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HiVES, a hierarchical value embedding space that captures intra- and cross-theory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of value-labeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchor-based evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.

View full details

Oral

Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

Constantin Ruhdorfer ⋅ Matteo Bortoletto ⋅ Victor Oei ⋅ Anna Penzkofer ⋅ Andreas Bulling

Jul 7, 2:15 PM - 2:30 PM AUDITORIUM

We introduce Unsupervised Partner Design (UPD), a population-free multi-agent reinforcement learning method for robust ad-hoc teamwork. UPD generates training partners on-the-fly and selects them adaptively based on a learnability criterion, removing the need for pre-trained partner populations or manual parameter tuning. We show that this simple mechanism enables effective partner diversity and can be extended to joint partner-environment selection when a procedural level generator is available. Across Level-Based Foraging, Overcooked-AI, and the Overcooked Generalisation Challenge, UPD consistently achieves strong performance compared to both population-based and population-free baselines. In a human-AI user study, agents trained with UPD achieve higher returns and are rated as more adaptive, more human-like, and less frustrating than all evaluated baseline methods.

View full details

Oral

Optimal Decision-Making Based on Prediction Sets

Tao Wang ⋅ Edgar Dobriban

Jul 7, 2:15 PM - 2:30 PM GRAND BALLROOM 101-105

Prediction sets can wrap around any ML model to cover unknown test outcomes with a guaranteed probability. Yet, it remains unclear how to use them optimally for downstream decision-making. Here, we propose a decision-theoretic framework that seeks to minimize the expected loss (risk) against a worst-case distribution consistent with the prediction set's coverage guarantee. We first characterize the minimax optimal policy for a fixed prediction set, showing that it balances the worst-case loss inside the set with a penalty for potential losses outside the set. Building on this, we derive the optimal prediction set construction that minimizes the resulting robust risk subject to a coverage constraint. Finally, we introduce Risk-Optimal Conformal Prediction (ROCP), a practical algorithm that targets these risk-minimizing sets while maintaining finite-sample distribution-free marginal coverage. Empirical evaluations on medical diagnosis and a toy static hazard-decision benchmark demonstrate that ROCP reduces critical mistakes compared to baselines, particularly when out-of-set errors are costly. The source code to reproduce our experiments is available at https://github.com/TaoWangPenn/Risk-Optimal-Conformal-Prediction.
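
To make the "worst-case loss inside the set plus a penalty outside it" idea concrete, here is a minimal sketch for a finite outcome space. It assumes the only constraint on the adversarial distribution is that the prediction set has probability at least 1 - alpha; under that assumption the worst case puts mass 1 - alpha on the worst outcome inside the set and mass alpha on the worst outcome overall. The function names and the toy loss table are hypothetical, and the paper's exact formulation may differ.

```python
def robust_risk(action, pred_set, outcomes, loss, alpha):
    """Worst-case expected loss when P(Y in pred_set) >= 1 - alpha."""
    worst_inside = max(loss(action, y) for y in pred_set)
    worst_overall = max(loss(action, y) for y in outcomes)
    return (1 - alpha) * worst_inside + alpha * worst_overall


def minimax_action(actions, pred_set, outcomes, loss, alpha):
    """Action minimizing the robust risk above."""
    return min(actions, key=lambda a: robust_risk(a, pred_set, outcomes, loss, alpha))


# Toy medical example (made-up numbers, for illustration only).
OUTCOMES = ["healthy", "mild", "severe"]
ACTIONS = ["discharge", "monitor", "treat"]
LOSS = {
    "discharge": {"healthy": 0, "mild": 1, "severe": 20},
    "monitor": {"healthy": 2, "mild": 2, "severe": 5},
    "treat": {"healthy": 4, "mild": 3, "severe": 0},
}


def toy_loss(action, outcome):
    return LOSS[action][outcome]


pred_set = {"healthy", "mild"}
choice_robust = minimax_action(ACTIONS, pred_set, OUTCOMES, toy_loss, alpha=0.1)
choice_naive = minimax_action(ACTIONS, pred_set, OUTCOMES, toy_loss, alpha=0.0)
```

With these numbers, treating the set as certain (`alpha=0.0`) selects `discharge`, while accounting for the 10% chance of an out-of-set outcome selects `monitor`, because the out-of-set loss of discharging a severe case is large.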

View full details

Oral

MuonSSM: Orthogonalizing State Space Models for Sequence Modeling

Thai Khanh Nguyen ⋅ Ngoc Bich Uyen Vo ⋅ Thieu Vo ⋅ Tan Nguyen ⋅ Cuong Pham

Jul 7, 2:15 PM - 2:30 PM HALL C

State space models (SSMs) have emerged as efficient linear-time alternatives to attention for long-sequence modeling. However, existing SSMs often suffer from instability and memory degradation over extended horizons due to poorly conditioned first-order updates and unbalanced update geometry. We introduce MuonSSM, a general framework that stabilizes SSM training by explicitly conditioning the geometry of memory updates rather than the recurrent transition matrix. MuonSSM augments SSMs with a momentum-based pathway and a lightweight Newton-Schulz transformation on low-rank input injections, yielding bounded and spectrally conditioned updates while preserving parallel scan complexity. Theory shows that MuonSSM improves gradient propagation, mitigates spectral amplification, and enriches memory representations over long horizons. Extensive experiments across language, vision, and time-series benchmarks show consistent gains in accuracy, robustness, and long-context performance when integrated into diverse SSM backbones. These results establish geometric conditioning of updates as a principled pathway to stable, scalable sequence modeling.
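
For readers unfamiliar with Newton-Schulz orthogonalization, the sketch below shows the classic cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X in pure Python. It is an assumption that this resembles the transformation used in MuonSSM; the abstract does not give the exact polynomial, the number of steps, or how it is applied to low-rank injections. The helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(m, steps=15):
    """Approximate the orthogonal polar factor of a nonzero, full-rank matrix.

    Dividing by the Frobenius norm puts every singular value in (0, 1],
    inside the convergence region of the cubic iteration. Each step maps a
    singular value s to 1.5*s - 0.5*s**3, which drives it toward 1.
    Zero singular values stay at zero, so rank-deficient inputs yield a
    partial isometry rather than an orthogonal matrix.
    """
    fro = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, xxtx)]
    return x


q = newton_schulz([[3.0, 1.0], [0.0, 2.0]])
gram = matmul(q, transpose(q))  # close to the 2x2 identity
```

The iteration uses only matrix multiplications, which is what makes it attractive as a lightweight, accelerator-friendly conditioning step.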

View full details

Oral

RAGEN-2: Reasoning Collapse in Agentic RL

Zihan (Zenus) Wang ⋅ Chi Gui ⋅ Xing Jin ⋅ Qineng Wang ⋅ Licheng Liu ⋅ Kangrui Wang ⋅ Shiqi Chen ⋅ Linjie Li ⋅ Zhengyuan Yang ⋅ Pingyue Zhang ⋅ Yiping Lu ⋅ Jiajun Wu ⋅ Li Fei-Fei ⋅ Lijuan Wang ⋅ Yejin Choi ⋅ Manling Li

Jul 7, 2:15 PM - 2:30 PM HALL B2

RL training of multi-turn LLM agents is unstable, and reasoning quality drives task performance. Entropy, the standard reasoning-stability monitor, only measures within-input diversity and misses whether reasoning depends on the input. We identify **template collapse**: stable entropy alongside input-agnostic boilerplate, invisible to entropy and existing metrics. We diagnose it via a **mutual-information (MI) proxy** that scores cross-input distinguishability online; across tasks, MI correlates with final performance far more strongly than entropy. We then explain collapse via a **signal-to-noise ratio (SNR)** mechanism: low within-input reward variance weakens task gradients, letting input-agnostic regularization dominate and erase cross-input differences. We mitigate this with **SNR-Aware Filtering**, prioritizing high-variance prompts each iteration. Across planning, math reasoning, web navigation, and code execution, the method consistently improves input dependence and task performance.
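
A minimal sketch of the filtering idea, assuming it amounts to ranking prompts by the variance of rewards across their own rollouts and keeping the top fraction. The function name, the `keep_fraction` parameter, and the top-fraction rule are assumptions for illustration; the abstract only states that high-variance prompts are prioritized each iteration.

```python
from statistics import pvariance


def filter_high_variance(rewards_by_prompt, keep_fraction=0.5):
    """Return the prompts whose rollouts show the highest reward variance.

    rewards_by_prompt maps a prompt id to the rewards of its rollouts.
    Prompts where every rollout gets the same reward have zero variance
    and therefore carry no within-input learning signal.
    """
    ranked = sorted(
        rewards_by_prompt.items(),
        key=lambda item: pvariance(item[1]),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [prompt for prompt, _ in ranked[:n_keep]]


rollout_rewards = {
    "p1": [1, 1, 1, 1],  # always solved: variance 0
    "p2": [0, 1, 0, 1],  # variance 0.25
    "p3": [0, 0, 0, 1],  # variance 0.1875
    "p4": [0, 0, 0, 0],  # never solved: variance 0
}
kept = filter_high_variance(rollout_rewards, keep_fraction=0.5)
```

Here `kept` contains `p2` and `p3`, the two prompts on which rollouts disagree.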

View full details

Oral

How Many Different Outputs Can a Transformer Generate?

Maxime Meyer ⋅ Mario Michelessa ⋅ Caroline Chaux ⋅ Vincent Tan

Jul 7, 2:15 PM - 2:30 PM ASEM BALLROOM 201-203

We study how we can leverage only a handful of characteristics of a transformer's architecture to closely predict the number of different sequences it can output, both qualitatively and quantitatively. We provide an upper bound depending on the length of the prompt, which we show empirically to be tight up to a factor less than 10, across architectures and model sizes. Our analysis also provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks—such as copying and cramming. Formally, we prove that (i) the maximal length of accessible sequences (those that the transformer can output for some prompt) grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time.

View full details

Oral

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

Haozhe WANG ⋅ Qixin Xu ⋅ Changpeng Wang ⋅ Taofeng Xue ⋅ Chong Peng ⋅ Wenhu Chen ⋅ Fangzhen Lin

Jul 8, 10:00 AM - 10:15 AM HALL D1

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often producing a "seesaw effect" between perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error, either bad seeing or bad thinking, enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

View full details

Oral

Exact Functional ANOVA Decomposition for Categorical Inputs Models

Baptiste Ferrere ⋅ Nicolas Bousquet ⋅ Gamboa Fabrice ⋅ Jean-Michel Loubes ⋅ Joseph Muré

Jul 8, 10:00 AM - 10:15 AM ASEM BALLROOM 201-203

Functional ANOVA offers a principled framework for interpretability by decomposing a model’s prediction into main effects and higher-order interactions. For independent features, this decomposition is well-defined, strongly linked with SHAP values, and serves as a cornerstone of additive explainability. However, the lack of an explicit closed-form expression for general dependent distributions has forced practitioners to rely on costly sampling-based approximations. We completely resolve this limitation for categorical inputs. By bridging functional analysis with the extension of discrete Fourier analysis, we derive a closed-form decomposition without any assumption. Our formulation is computationally very efficient. It seamlessly recovers the classical independent case and extends to arbitrary dependence structures, including distributions with non-rectangular support. Furthermore, leveraging the intrinsic link between SHAP and ANOVA under independence, our framework yields a natural generalization of SHAP values for the general categorical setting.

View full details

Oral

Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture

Shuchen Xue ⋅ Tianyu Xie ⋅ Tianyang Hu ⋅ Zijin Feng ⋅ Jiacheng Sun ⋅ Kenji Kawaguchi ⋅ Zhenguo Li ⋅ Zhi-Ming Ma

Jul 8, 10:00 AM - 10:15 AM HALL C

Efficiently scaling Large Language Models (LLMs) necessitates exploring alternatives to dominant autoregressive (AR) methods, with Masked Diffusion Models (MDMs) emerging as candidates. However, comparing AR (typically decoder-only) and MDM (often encoder-only) paradigms is confounded by differing architectures, obscuring true algorithmic and efficiency trade-offs. This research decouples these factors by evaluating MDMs within a decoder-only framework to: (1) Equitably compare MDM (as Any-Order AR) and standard AR paradigms through discrepancies on orders. (2) Investigate MDM architectural impacts on computational efficiency. We show decoder-only MDMs, despite a larger modeling space, can achieve significant inference speedups ($\sim25\times$) and comparable perplexity with techniques like temperature annealing, offering a path to reduced inference compute. This work provides insights for developing more computationally efficient foundation models by disentangling core modeling choices from architectural influences. Code is available at https://github.com/scxue/AO-GPT-MDM.

View full details

Oral

Position: Stop Automating Peer Review Without Rigorous Evaluation

Joachim Baumann ⋅ Jiaxin Pei ⋅ Sanmi Koyejo ⋅ Dirk Hovy

Jul 8, 10:00 AM - 10:15 AM GRAND BALLROOM 101-105

Large language models offer a tempting solution to address the peer review crisis. This position paper argues that **today's AI systems should not be used to produce paper reviews**. We ground this position in an empirical comparison of human- versus AI-generated ICLR 2026 reviews and an evaluation of the effect of automated paper rewriting on different AI reviewers. We identify two critical issues: 1) AI reviewers exhibit a *hivemind effect* of excessive agreement within and across papers that reduces perspective diversity. 2) AI review scores are trivially gameable through *paper laundering*: prompting an LLM to rewrite a paper could significantly increase the scores from AI reviewers, demonstrating that LLM reviewers are easy to game through stylistic changes rather than scientific results. However, non-gameability and review diversity are *necessary but not sufficient* conditions for automation. We argue that **addressing the peer review crisis requires a science of peer review automation**, not general-purpose LLMs deployed without rigorous evaluation.

View full details

Oral

Position: AI Should Facilitate Democratic Deliberation at Scale

José Ramón Enríquez ⋅ Jiaxin Pei ⋅ Alex Pentland

Jul 8, 10:00 AM - 10:15 AM HALL D2

AI systems can strengthen democracy by supporting deliberation at scale: addressing cognitive, social, platform-design, and market-driven frictions while preserving human agency. In this position paper, we argue that, unlike proposals such as liquid democracy that restructure representation through vote delegation, AI-assisted deliberation offers a more promising path by lowering barriers to meaningful engagement without substituting machine judgment for human choice. Drawing on evidence from online platforms and experimental research, we identify four guiding principles: preserving agency and autonomy, encouraging mutual respect, promoting equality and inclusiveness, and augmenting rather than substituting active citizenship. We also address critical challenges, including alignment, sycophancy, training bias, and over-reliance on AI systems. We call on the machine learning community to develop deliberation-focused AI systems evaluated not on engagement metrics but on their capacity to facilitate informed, representative, and friction-robust discourse.

View full details

Oral

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Zhicheng Fang ⋅ Jingjie Zheng ⋅ Chenxu Fu ⋅ Wei Xu

Jul 8, 10:00 AM - 10:15 AM AUDITORIUM

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce **JAILBREAK FOUNDRY (JBF)**, a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) *JBF-LIB* for shared contracts and reusable utilities; (ii) *JBF-FORGE* for the multi-agent paper-to-module translation; and (iii) *JBF-EVAL* for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity with a mean (reproduced$-$reported) attack success rate (ASR) deviation of $+0.26$ percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by more than half relative to original repositories and achieves an 82.5% mean reused-code ratio. This system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.

View full details

Oral

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

Zhibin Duan ⋅ Guowei Rong ⋅ Zhuo Li ⋅ Bo Chen ⋅ Mingyuan Zhou ⋅ Dandan Guo

Jul 8, 10:00 AM - 10:15 AM HALL B2

Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose the Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into the Bradley–Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
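
For context, the standard Bradley–Terry preference model that the abstract builds on scores a pair of responses by the sigmoid of their reward difference. The sketch below covers only that standard likelihood; it does not attempt the non-negative factor structure or the variational inference described above, and the function names are hypothetical.

```python
import math


def bt_preference_prob(reward_chosen, reward_rejected):
    """P(chosen is preferred) = sigmoid(reward_chosen - reward_rejected)."""
    diff = reward_chosen - reward_rejected
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def bt_negative_log_likelihood(reward_chosen, reward_rejected):
    """Loss for one labeled pair: softplus(-(reward_chosen - reward_rejected))."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

Equal rewards give a preference probability of 0.5 and a loss of log 2. The loss shrinks as the chosen response's reward pulls ahead of the rejected one's.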

View full details

Oral

Joint Learning in the Gaussian Single Index Model

Loucas Pillaud-Vivien ⋅ Adrien Schertzer

Jul 8, 10:15 AM - 10:30 AM ASEM BALLROOM 201-203

We consider the problem of jointly learning a one-dimensional projection and a univariate function in high-dimensional Gaussian models. Specifically, we study predictors of the form $f(x)=\varphi^\star(\langle w^\star, x \rangle)$, where both the direction $w^\star \in \mathcal{S}_{d-1}$, the sphere of $\mathbb{R}^d$, and the function $\varphi^\star: \mathbb{R} \to \mathbb{R}$ are learned from Gaussian data. This setting captures a fundamental non-convex problem at the intersection of representation learning and nonlinear regression. We analyze the gradient flow dynamics of a natural alternating scheme and prove convergence, with a rate controlled by the information exponent reflecting the *Gaussian regularity* of the function $\varphi^\star$. Strikingly, our analysis shows that convergence still occurs even when the initial direction is negatively correlated with the target. On the practical side, we demonstrate that such joint learning can be effectively implemented using a Reproducing Kernel Hilbert Space (RKHS) adapted to the structure of the problem, enabling efficient and flexible estimation of the univariate function. Our results offer both theoretical insight and practical methodology for learning low-dimensional structure in high-dimensional settings.

View full details

Oral

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Adam Karvonen ⋅ James Chua ⋅ Clément Dumas ⋅ Kit Fraser-Taliente ⋅ Subhash Kantamneni ⋅ Julian Minder ⋅ Euan Ong ⋅ Arnab Sen Sharma ⋅ Daniel Wen ⋅ Owain Evans ⋅ Samuel Marks

Jul 8, 10:15 AM - 10:30 AM HALL D2

Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Our best AOs match or exceed white-box baselines on all four tasks and the best overall baseline on 3 of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.

View full details

Oral

Position: The AI Imperative: Scaling High-Quality Peer Review in Machine Learning

Qiyao Wei ⋅ Samuel Holt ⋅ Jing Yang ⋅ Markus Wulfmeier ⋅ Mihaela van der Schaar

Jul 8, 10:15 AM - 10:30 AM GRAND BALLROOM 101-105

Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manuscript submissions to premier ML venues such as NeurIPS, ICML, and ICLR is outpacing the finite capacity of qualified reviewers, leading to concerns about review quality, consistency, and reviewer fatigue. This position paper argues that AI-assisted peer review must become an urgent research and infrastructure priority. We advocate for a comprehensive AI-augmented ecosystem, leveraging Large Language Models (LLMs) not as replacements for human judgment, but as sophisticated collaborators for authors, reviewers, and Area Chairs (ACs). We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making. Crucially, we contend that the development of such systems hinges on access to more granular, structured, and ethically-sourced peer review process data. We outline a research agenda, including illustrative experiments, to develop and validate these AI assistants, and discuss significant technical and ethical challenges. We call upon the ML community to proactively build this AI-assisted future, ensuring the continued integrity and scalability of scientific validation, while maintaining high standards of peer review.

View full details

Oral

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Rulin Shao ⋅ Akari Asai ⋅ Shannon Shen ⋅ Hamish Ivison ⋅ Varsha Kishore ⋅ Jingming Zhuo ⋅ Xinran Zhao ⋅ Molly Park ⋅ Samuel Finlayson ⋅ David Sontag ⋅ Tyler Murray ⋅ Sewon Min ⋅ Pradeep Dasigi ⋅ Luca Soldaini ⋅ Faeze Brahman ⋅ Scott Yih ⋅ Sherry Wu ⋅ Luke Zettlemoyer ⋅ Yoon Kim ⋅ Hannaneh Hajishirzi ⋅ Pang Wei Koh

Jul 8, 10:15 AM - 10:30 AM HALL B2

Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), where rubrics are constructed and maintained to co-evolve with the policy model during training. This allows the rubrics to incorporate newly explored information from search and contrasting model responses, enabling better fact checking and more discriminative on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first fully open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu-8B substantially outperforms existing open deep research agents (by 15.6% over Tongyi DR on average) and matches or exceeds proprietary deep research agents (by 0.7% over OpenAI DR on average), while being significantly smaller and cheaper per query (1000x cheaper than OpenAI DR per query).

Oral

DroneDINO: Towards Heterogeneous Routed Mixture of Experts for Drone-based Unified Object Detection

Dongdong Li ⋅ Rui Chen ⋅ Yan Fan ⋅ Yan Liu ⋅ Yangliu Kuai ⋅ Pengfei Zhu

Jul 8, 10:15 AM - 10:30 AM HALL D1

Recently, the rapid development of low-altitude aerial applications has driven the need for drone-based unified detectors. In contrast to task-specific detectors that suffer from poor scalability across diverse scenarios, existing unified detectors leverage the Mixture-of-Experts (MoE) architecture to learn task-aware features from diverse datasets. However, the imbalanced multi-task data distribution leads to over-activation of experts for dominant tasks and under-activation for others. To enable balanced feature learning, this paper combines three detection paradigms (RGB, IR, and RGB-IR) into a unified framework termed DroneDINO. DroneDINO extends DINO by introducing heterogeneous routed MoEs that organize experts into three functional groups: shared, task-specific, and dynamic. Unlike conventional dynamic experts where the top-$k$ experts are activated for each input, the shared expert is activated for all inputs, while each task-specific expert is activated exclusively for the matching task. To ensure inputs are routed to appropriate experts and yield task-discriminative features, we propose a task-recognition auxiliary training strategy to penalize features with low task-discriminability. Experiments demonstrate the effectiveness and generalizability of DroneDINO, which consistently outperforms state-of-the-art unified and task-specific detectors across multiple drone-based detection benchmarks.
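
The three-group activation rule described above can be sketched as follows. This is a minimal illustration only: the function name `route`, the expert naming scheme, and the gate scores are hypothetical and not taken from the paper, and a real router would operate on learned feature tensors.

```python
def route(task, gate_scores, k=2):
    """Return the names of the experts activated for one input.

    task        -- task label of the input, e.g. "RGB", "IR", or "RGB-IR"
    gate_scores -- dict mapping dynamic-expert name to a router score (hypothetical)
    k           -- number of dynamic experts to activate
    """
    active = ["shared"]              # shared expert: activated for every input
    active.append(f"task:{task}")    # task-specific expert: only the matching task
    # dynamic experts: conventional top-k selection by gate score
    top = sorted(gate_scores, key=gate_scores.get, reverse=True)[:k]
    active.extend(f"dynamic:{name}" for name in top)
    return active


experts = route("IR", {"d0": 0.1, "d1": 0.7, "d2": 0.2}, k=2)
```

Only the dynamic group depends on the router scores here; the shared and task-specific experts are selected deterministically.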

Oral

Quantifying Frontier LLM Capabilities for Container Sandbox Escape

Rahul Marchand ⋅ Art Cathain ⋅ Jerome Wynne ⋅ Philippos Giavridis ⋅ Sam Deverett ⋅ John Wilkinson ⋅ Jason Gwartz ⋅ Harry Coppock

Jul 8, 10:15 AM - 10:30 AM AUDITORIUM

Large language models (LLMs) increasingly act as autonomous agents, using tools to execute code, read and write files, and access networks, creating novel security risks. To mitigate these risks, agents are commonly deployed and evaluated in isolated "sandbox" environments, often implemented using Docker/OCI containers. We introduce SandboxEscapeBench, an open benchmark that safely measures an LLM's capacity to break out of these sandboxes. The benchmark is implemented as an Inspect AI Capture the Flag (CTF) evaluation utilising a nested sandbox architecture with the outer layer containing the flag and no known vulnerabilities. Following a threat model of a motivated adversarial agent with shell access inside a container, SandboxEscapeBench covers a spectrum of sandbox-escape mechanisms spanning misconfiguration, privilege allocation mistakes, kernel flaws, and runtime/orchestration weaknesses. We find that, when vulnerabilities are added, LLMs are able to identify and exploit them, showing that evaluations like SandboxEscapeBench are needed to ensure sandboxing continues to provide the encapsulation required for highly capable models.

Oral

Error Propagation Mechanisms and Compensation Strategies for Quantized Diffusion Models

Songwei Liu ⋅ Chao Zeng ⋅ Chenqian Yan ⋅ Xurui Peng ⋅ WANG ⋅ Fangmin Chen ⋅ Xing Mei

Jul 8, 10:15 AM - 10:30 AM HALL C

Diffusion models have transformed image synthesis by establishing unprecedented quality and creativity benchmarks. Nevertheless, their large-scale deployment faces challenges due to computationally intensive iterative denoising processes. Although post-training quantization (PTQ) provides an effective pathway for accelerating sampling, the iterative nature of diffusion models causes stepwise quantization errors to accumulate progressively during generation, inevitably compromising output fidelity. To address this challenge, we develop a theoretical framework that mathematically formulates error propagation in Diffusion Models (DMs), deriving per-step quantization error propagation equations and establishing the first closed-form solution for cumulative error. Building on this theoretical foundation, we propose a timestep-aware cumulative error compensation scheme. Extensive experiments on multiple image datasets demonstrate that our compensation strategy effectively mitigates error propagation, significantly enhancing existing PTQ methods. Specifically, it achieves a 1.2 PSNR improvement over SVDQuant on SDXL W4A4, while incurring less than 0.5% additional time overhead.

Oral

Large Language Models Develop Novel Social Biases Through Adaptive Exploration

Addison J. Wu ⋅ Ryan Liu ⋅ Xuechunzi Bai ⋅ Thomas Griffiths

Jul 8, 10:30 AM - 10:45 AM HALL D2

As large language models (LLMs) are adopted into frameworks that grant them the capacity to make real decisions, it is increasingly important to ensure that they are unbiased. In this paper, we argue that the predominant approach of simply removing existing biases from models is not enough. Using a paradigm from the psychology literature, we demonstrate that LLMs can spontaneously develop novel social biases about artificial demographic groups even when no inherent differences exist. These biases result in highly stratified task allocations, which are less fair than assignments by human participants and are exacerbated by newer and larger models. In humans, emergent biases like these have been shown to result from exploration-exploitation trade-offs, where the decision-maker explores too little, allowing early observations to strongly influence impressions about entire demographic groups. To alleviate this effect, we examine a series of interventions targeting model inputs, problem structure, and explicit steering. We find that explicitly incentivizing exploration most robustly reduces stratification, highlighting the need for better multifaceted objectives to mitigate bias. These results reveal that LLMs are not merely passive mirrors of human social biases, but can actively create new ones from experience, raising urgent questions about how these systems will shape societies over time.

Oral

Incentivizing Truthfulness and Collaborative Fairness in Bayesian Learning

Rachael Hwee Ling Sim ⋅ Jue Fan ⋅ Xiao Tian ⋅ Xinyi Xu ⋅ Patrick Jaillet ⋅ Bryan Kian Hsiang Low

Jul 8, 10:30 AM - 10:45 AM GRAND BALLROOM 101-105

Collaborative machine learning involves training high-quality models using datasets from a number of sources. To incentivize sources to share data, existing data valuation methods fairly reward each source based on its data submitted as is. However, as these methods neither verify nor incentivize data truthfulness, the sources can manipulate their data (e.g., by submitting duplicated or noisy data) to artificially increase their valuations and rewards or prevent others from benefiting. This paper presents the first mechanism that provably ensures (**F**) collaborative fairness and incentivizes (**T**) truthfulness at equilibrium for Bayesian models. Our mechanism combines semivalues (e.g., Shapley value), which ensure fairness, and a truthful data valuation function (DVF) based on a validation set that is unknown to the sources. As semivalues are influenced by others' data, we introduce an additional condition to prove that a source can maximize its expected data values in coalitions and semivalues by submitting a dataset that captures its true knowledge. Additionally, we discuss the implications and suitable relaxations of (**F**) and (**T**) when the mediator has a limited budget for rewards or lacks a validation set. Our theoretical findings are validated on synthetic and real-world datasets.
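
As a hedged illustration of the semivalue component only, the sketch below computes exact Shapley values for a toy three-source game. The valuation function `toy_value` and the source names are hypothetical and not taken from the paper; the validation-based DVF and the truthfulness condition are not modeled here.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical valuation: number of distinct data points a coalition covers."""
    data = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5}}
    covered = set()
    for source in coalition:
        covered |= data[source]
    return float(len(covered))


phi = shapley_values(["A", "B", "C"], toy_value)
```

In this toy game, sources A and B share one data point, so each receives less than the size of its own dataset, while the values still sum to the value of the full coalition.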

Oral

SVRG and Beyond via Posterior Correction

Nico Daheim ⋅ Thomas Moellenhoff ⋅ James Ming Liang Ang ⋅ Mohammad Emtiyaz Khan

Jul 8, 10:30 AM - 10:45 AM ASEM BALLROOM 201-203

Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections. Originally proposed over a decade ago, these methods have never been connected to any Bayesian method at a fundamental level. Here, we fill this gap and derive surprising new connections of SVRG to a recently proposed Bayesian method called "posterior correction". Our main contribution is to show that SVRG can be recovered as a special case of posterior correction over isotropic-Gaussian posteriors. Novel extensions of SVRG are automatically obtained by using more flexible exponential-family posteriors. We derive two new such extensions by using Gaussian families: a Newton-like variant with novel Hessian corrections, and an Adam-like extension that scales to large problems. Our work is the first to connect SVRG to Bayes and use it to speed up training.
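
For readers unfamiliar with the gradient correction that SVRG uses, the following is a minimal sketch of the classic SVRG loop on a one-dimensional least-squares problem. It illustrates standard SVRG only, not the posterior-correction derivation or the new variants; the function name, data, and hyperparameters are illustrative assumptions.

```python
import random


def svrg_least_squares(a, b, lr=0.01, epochs=30, inner_steps=100, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (a_i * w - b_i)^2 over a scalar w with SVRG."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(w, i):
        # Gradient of the i-th loss term 0.5 * (a_i * w - b_i)^2
        return a[i] * (a[i] * w - b[i])

    w = 0.0
    for _ in range(epochs):
        w_snap = w  # snapshot of the current iterate
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Stochastic gradient corrected by the snapshot gradient and the full gradient
            w -= lr * (grad_i(w, i) - grad_i(w_snap, i) + full_grad)
    return w


a = [1.0, 2.0, 3.0, 4.0]
b = [2.1, 3.9, 6.2, 7.8]
w_hat = svrg_least_squares(a, b)
w_star = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)  # closed-form minimizer
```

The corrected gradient is an unbiased estimate of the full gradient, and its variance shrinks as the iterate and the snapshot both approach the minimizer, which is why a constant step size suffices here.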

Oral

Simultaneous Speech-to-Speech Translation Without Aligned Data

Tom Labiausse ⋅ Romain Fabre ⋅ Yannick Estève ⋅ Alexandre Défossez ⋅ Neil Zeghidour

Jul 8, 10:30 AM - 10:45 AM HALL B2

Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across four X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide [examples](https://huggingface.co/spaces/kyutai/hibiki-zero-samples), [model weights](https://huggingface.co/kyutai/hibiki-zero-3b-pytorch-bf16), [inference code](https://github.com/kyutai-labs/hibiki-zero) and we release a [benchmark](https://huggingface.co/datasets/kyutai/Audio-NTREX-4L) containing 45h of multilingual data for speech translation evaluation.

Oral

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

Yanchen Yin ⋅ Dongqi Han ⋅ Linghui Li

Jul 8, 10:30 AM - 10:45 AM AUDITORIUM

Jailbreak attacks bypass LLM safety alignment, yet their mechanisms remain poorly understood. We provide evidence that attacks do not comprehensively eliminate safety features, but instead selectively suppress specific attention heads. We identify two functionally differentiated types: **Adversarially Compromised Heads (ACHs)** concentrated in early layers, which are suppressed under attacks, and **Safety-Aligned Heads (SAHs)** in mid-layers, which maintain robust activations even when attacks succeed. Ablation studies support the causal role of ACHs and the contribution of SAHs to robust activations: suppressing a small number of ACHs is sufficient to induce jailbreak-like behavior on normally refused inputs, while removing SAHs substantially weakens mid-layer safety activations. Token-level attribution further shows that ACH suppression is driven specifically by attack-template tokens, providing a mechanistic account of why attacks can bypass refusal decisions through ACH suppression while leaving internal safety signals sustained by SAHs—a phenomenon we term **Robust Harmful Features**. To validate the practical significance of this robustness, we show that simply reading these persistent activations—without any training—yields competitive aggregate detection performance with strong adversarial robustness.

Oral

High-accuracy and dimension-free sampling with diffusions

Khashayar Gatmiry ⋅ Sitan Chen ⋅ Adil Salim

Jul 8, 10:30 AM - 10:45 AM HALL C

Diffusion models have shown remarkable empirical success in sampling from rich multi-modal distributions. Their inference relies on numerically solving a certain differential equation. This differential equation cannot be solved in closed form, and its resolution via discretization typically requires many small iterations to produce *high-quality* samples. More precisely, prior works have shown that the iteration complexity of discretization methods for diffusion models scales polynomially in the ambient dimension and the inverse accuracy $1/\varepsilon$. In this work, we propose a new solver for diffusion models relying on a subtle interplay between low-degree approximation and the collocation method, and we prove that its iteration complexity scales *polylogarithmically* in $1/\varepsilon$, yielding the first "high-accuracy" guarantee for a diffusion-based sampler that only uses (approximate) access to the scores of the data distribution. In addition, our bound does not depend explicitly on the ambient dimension; more precisely, the dimension affects the complexity of our solver only through the *effective radius* of the support of the target distribution.

Oral

Lottery Prior: Randomized Neural Compression for Zero-Shot Inverse Problems

Haotian Wu ⋅ Di You ⋅ Pier Luigi Dragotti ⋅ Deniz Gunduz

Jul 8, 10:30 AM - 10:45 AM HALL D1

We study zero-shot inverse problems, where a clean signal is recovered from a single degraded observation without external training data. Contrary to the common belief that such problems require highly complex models, we show that a lightweight neural network, when combined with entropy and complexity regularization in a compression-based formulation, is sufficient for high-quality restoration. We propose Lottery Prior, a compression-based inverse solver that leverages architectural priors from random networks and induces a family of implicit priors through randomness, enabling ensemble-based refinement. We further derive non-asymptotic error bounds for compression-based maximum-likelihood inverse solvers, revealing how rate–distortion constraints act as implicit regularizers. Experiments on denoising, noisy super-resolution, and inpainting demonstrate that our method achieves state-of-the-art performance with significantly fewer effective parameters. Project page: https://eedavidwu.github.io/LotteryPrior/

Oral

Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning

Minh-Tung Luu ⋅ Hwanhee Kim ⋅ Younghwan Lee ⋅ Chang D. Yoo

Jul 8, 10:45 AM - 11:00 AM HALL B2

Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by high labeling costs. Inspired by advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised framework that learns effective reward functions from only a handful of labels. By leveraging optimal transport to align visual trajectories within the rich representation space of ViFMs, VOTP effectively generates high-fidelity pseudo-labels for large amounts of unlabeled data, substantially reducing human supervision. Extensive experiments across locomotion and manipulation benchmarks demonstrate the superiority of VOTP, which outperforms state-of-the-art offline PbRL methods under limited feedback budgets. We also showcase the robustness of VOTP in the presence of visual distractors and validate its utility on real robotic tasks, where it learns meaningful rewards with minimal human input.

Oral

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Jianhui Chen ⋅ Yuzhang Luo ⋅ Liangming Pan

Jul 8, 10:45 AM - 11:00 AM HALL D2

While mechanistic interpretability has identified interpretable circuits in large language models (LLMs), their causal origins in training data remain elusive. We introduce *mechanistic data attribution* (MDA), a scalable framework that employs influence functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention—removing or augmenting a small fraction of high-influence samples—significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model’s in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

Oral

When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Jiacheng Hou ⋅ Yining Sun ⋅ Ruochong Jin ⋅ Haochen Han ⋅ Fangming Liu ⋅ Victor Chan ⋅ Alex Jinpeng Wang

Jul 8, 10:45 AM - 11:00 AM AUDITORIUM

Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual–text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities and provide both a benchmark and a practical defense to advance safe and trustworthy modern image editing systems.

Oral

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

Zanlin Ni ⋅ Shenzhi Wang ⋅ Yang Yue ⋅ Tianyu Yu ⋅ Weilin Zhao ⋅ Yeguo Hua ⋅ Tianyi Chen ⋅ Jun Song ⋅ YuCheng ⋅ Bo Zheng ⋅ Gao Huang

Jul 8, 10:45 AM - 11:00 AM HALL C

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit the reasoning potential of dLLMs. We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We show that effective reasoning can be elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Code: https://github.com/LeapLabTHU/JustGRPO.
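
As background on the "standard GRPO" ingredient, the sketch below shows the group-relative advantage computation at the core of GRPO: each sampled response is scored against the mean and standard deviation of its own group. It is not the JustGRPO implementation; the function name, the `eps` constant, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt with binary (e.g., correct/incorrect) rewards
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group in which every answer receives the same reward carries no learning signal
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline is computed within the group, no separate value network is needed; the policy-gradient loss and clipping steps of GRPO are omitted here.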

Oral

A Causal Decomposition Approach for Fair Contextual Multi-Armed Bandits

Jiajun Chen ⋅ Jin Tian ⋅ Chris Quinn

Jul 8, 10:45 AM - 11:00 AM ASEM BALLROOM 201-203

Counterfactual reasoning is one of the fundamental facets of human cognition, involved in various tasks such as explanation, credit assignment, blame, and responsibility. It concerns queries of the form "what would have happened had some intervention been performed, given that something else in fact occurred", corresponding to Layer 3 of the Pearl Causal Hierarchy. In this work, we examine a specific class of counterfactual quantities, called counterfactual direct (Ctf-DE), indirect (Ctf-IE), and spurious (Ctf-SE) effects, for quantifying fairness in a sequential decision-making framework. Building on these measures, we formulate an online causally-fair learning problem with multiple long-term constraints and study it in both non-parametric contextual bandit and parametric logistic bandit settings. We achieve sublinear regret and violation bounds for both bandit settings with roundwise counterfactual fairness constraints (that are a priori unknown) without Slater's condition. For logistic bandits, our method achieves $\mathcal{O}(1)$ per-round time complexity using an online mirror descent estimator, yielding an efficient algorithm.

Oral

Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

Ander Artola Velasco ⋅ Stratis Tsirtsis ⋅ Nastaran Okati ⋅ Manuel Gomez-Rodriguez

Jul 8, 10:45 AM - 11:00 AM GRAND BALLROOM 101-105

State-of-the-art large language models require specialized hardware and substantial energy to operate. Consequently, cloud-based services that provide access to these models have become very popular. In these services, the price users pay depends on the number of tokens a model uses to generate an output: they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly in their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription that allows a provider to maintain their average profit margin when transitioning to an incentive-compatible pricing mechanism. To complement our theoretical results, we conduct experiments with large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and prompts from a popular benchmarking platform.
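
The incentive problem can be seen in a small sketch comparing the two pricing rules on two tokenizations of the same output string. The prices, function names, and token splits below are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
def per_token_cost(tokens, price_per_token=1.0):
    """Pay-per-token pricing: the cost depends on how the output was tokenized."""
    return price_per_token * len(tokens)


def per_char_cost(tokens, price_per_char=0.2):
    """Pricing linear in character count: the cost depends only on the output text."""
    return price_per_char * sum(len(t) for t in tokens)


text = "hello world"
honest = ["hello", " world"]               # a short tokenization of the text
inflated = ["he", "llo", " wor", "ld"]     # a longer tokenization of the same text

token_costs = (per_token_cost(honest), per_token_cost(inflated))
char_costs = (per_char_cost(honest), per_char_cost(inflated))
```

Under per-token pricing, reporting the longer tokenization doubles the bill for an identical output, whereas under character-linear pricing both reports cost the same, so misreporting gains nothing.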

Oral

Scalable Event Cloud Network for Event-based Classification

Hongwei Ren ⋅ Fei Ma ⋅ Xiaopeng LIN ⋅ Yuetong Fang ⋅ Hongxiang Huang ⋅ Yue Zhou ⋅ Yulong Huang ⋅ Haotian FU ⋅ Ziyi Yang ⋅ Youxin Jiang ⋅ Xiangqian Wu ⋅ Bojun Cheng

Jul 8, 10:45 AM - 11:00 AM HALL D1

Event cameras are biologically inspired sensors garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach satisfactory performance but introduce time-consuming transformations and bulky models, and sacrifice fine-grained temporal information. Alternatively, the Point Cloud representation shows promise in addressing these weaknesses, but it has limited scalability in abstracting features from events with higher spatial resolution and longer temporal sequences. In this paper, we propose a scalable network named SECNet to leverage the Event Cloud representation. SECNet integrates polarity at the structural level, rather than only at the input level, through a novel Event-based Group and Sampling module. To accommodate the surge in the number of events, SECNet performs feature extraction in the frequency domain via the Fourier transform. This approach not only substantially curbs the explosion in multiply-accumulate operations but also effectively abstracts spatio-temporal features. We conduct extensive experiments on ten event-based datasets and substantiate the scalability, effectiveness, and efficiency of SECNet. Our code will be available at: https://github.com/rhwxmx/SECNet_ICML.

Oral

DPO Unchained: Your Training Algorithm is Secretly Disentangled in Human Choice Theory (and Its Loss' Convexity is Dispensable)

Wenxuan Zhou ⋅ Shujian Zhang ⋅ brice magdalou ⋅ John Lambert ⋅ Ehsan Amid ⋅ Richard Nock ⋅ Andrew Hard

Jul 8, 4:00 PM - 4:15 PM AUDITORIUM

Normative theories allow one to elicit key parts of an ML algorithm from first principles, which is crucial at a time of championed scrutiny for ML work. Direct Preference Optimization (DPO) cleverly bypasses reward modeling by making an explicit link with a specific normative model of human choice. Our paper elevates this connection to the full generality of DPO's normative framework. Getting there requires reworking social choice theory's textbook path for a better RLHF/ML fit. It elevates the connection to a remarkably broad viewpoint on preference optimization, considering the current panorama of DPO follow-ups. It also unveils unexpected riches for ML, chief among which are the support for *non-convex* losses, the fact that *any* compliant ML analytical choice can be embedded with *any* human choice model, and a normative framework's umbrella wide enough to safeguard DPO's *extensions* (margins, length correction, ...). A *toy* experiment "far away" from the DPO crowd is given.

Oral

Position: The Alignment Community is Unintentionally Building a Censor’s Toolkit

Sarah Ball ⋅ Phil Hackemann

Jul 8, 4:00 PM - 4:15 PM HALL D2

This position paper argues that modern alignment methods, originally designed to prevent harmful output, are dual-use technologies that may easily be misused by malicious actors for censorship and manipulation. By mapping current alignment techniques to the possibility and actual cases of misuse, we show that the quest for a "perfectly aligned" model inadvertently also provides malicious actors with an ever-improving tool for informational dominance. We need to discuss this dual-use potential *now*, as its risk is exacerbated by rapid user adoption of AI as an information provider and a political landscape that increasingly shifts towards authoritarianism. We conclude by urging the community to consider the intentional misuse of safety mechanisms and propose mitigation strategies to safeguard against this dual-use potential.

Oral

CoEvol-NO: State and Coordinate Co-Evolution with an Error-Driven Predictor-Corrector Paradigm for Neural Operator Transformer

Jianqiao Zeng ⋅ Ruocheng Wang ⋅ Yanzhi Liu ⋅ Hao Xiong ⋅ Junchi Yan

Jul 8, 4:00 PM - 4:15 PM GRAND BALLROOM 101-105

Despite rapid progress in neural operator learning, long-sequence modeling remains a standing challenge, for which latent states have been introduced along with well-derived techniques. Diverging from existing methods that treat latent states as transient variables or decoupled representations, CoEvol-NO introduces a persistent state to establish a co-evolutionary framework, where the latent state and mesh sequence are updated jointly and bidirectionally. Inspired by classical numerical methods, we model the layer-wise state evolution as a Predictor-Corrector (PC) process. Specifically, a "Predictor" generates a tentative target, followed by a "Corrector" that refines the persistent state via an *error-driven update mechanism*. Furthermore, our theoretical analysis reveals that the widely used *direct substitution* and *residual update* paradigms are essentially *first-order approximations* of this error-driven correction under different loss assumptions. We theoretically prove that CoEvol-NO achieves strict linear time complexity. Extensive experiments on five standard benchmarks and two large-scale industrial design tasks demonstrate that CoEvol-NO consistently achieves state-of-the-art (SOTA) performance.

Oral

ConFlux: Multivariate Time Series in Flux, One Unified Forecast in Confluence

Shiyu Wang ⋅ Juntong Ni ⋅ Ziyi Zhang ⋅ Baichuan Mo ⋅ Xinyue Zhong ⋅ Chengxin Wang ⋅ Yuchen Fang ⋅ Zhou Ye ⋅ Yang Xiang

Jul 8, 4:00 PM - 4:15 PM ASEM BALLROOM 201-203

Real-world multivariate time series are inherently in flux: different variables evolve asynchronously and interact in complex, time-varying ways, yet accurate forecasting requires these dispersed signals to converge into a single unified prediction. This structural mismatch between dynamic, heterogeneous inputs and a unified forecasting objective poses a fundamental challenge for building general-purpose multivariate forecasting models, especially in zero-shot and large-scale settings. To this end, inspired by the idea that "*all rivers run into the sea*", we propose **ConFlux**, a *general-purpose foundation model for multivariate time-series forecasting* that learns to adaptively integrate cross-channel information under a unified forecasting objective. Specifically, ConFlux first reorders variables to reduce cross-variable entanglement, then aggregates adjacent variables into compact patches that can be processed by a Vision Transformer-style architecture. This design shortens the effective context, reduces attention complexity, and provides a unified token representation for pre-training and downstream tasks. Experiments on 25 public datasets show that ConFlux achieves state-of-the-art performance in zero-shot, fine-tuning, and from-scratch settings, while offering faster inference and lower memory usage.

Oral

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

Bing Hu ⋅ Zaijing Li ⋅ Rui Shao ⋅ Junda Chen ⋅ April Hua Liu ⋅ Wei-Shi Zheng ⋅ Liqiang Nie

Jul 8, 4:00 PM - 4:15 PM HALL B2

Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution-alignment, leading to inconsistent behaviors in complex scenarios. To address these limitations, we propose **BehaviorVLA**, a framework that facilitates robust manipulation through the learning of temporally coherent behavioral representations. Our approach features two symmetric components: (1) the **Visuomotor Behavior Encoder (VBE)**, which utilizes a causal Mamba-based architecture to aggregate long-horizon trajectory information into a unified behavior representation; and (2) the **Phase-conditioned Behavior Decoder (PBD)**, which decodes this representation into precise actions by dynamically aligning task-level priors with real-time execution progress. Experiments on RoboTwin 2.0, LIBERO, and CALVIN demonstrate state-of-the-art success rates of 58%, 98%, and 4.36 (Avg.Len), respectively. Notably, in real-world sim-to-real transfer, BehaviorVLA matches the performance of OpenVLA-OFT using only 50% of the demonstration data, showcasing its superior data efficiency and generalization.

View full details

Oral

A Recursive Decomposition Framework for Causal Structure Learning in the Presence of Latent Variables

Zheng Li ⋅ Feng Xie ⋅ Shenglan Nie ⋅ Xichen Guo ⋅ Ruxin Wang ⋅ Hao Zhang

Jul 8, 4:00 PM - 4:15 PM HALL D1

Constraint-based causal discovery is widely used for learning causal structures, but heavy reliance on conditional independence (CI) testing makes it computationally expensive in high-dimensional settings. To mitigate this limitation, many divide-and-conquer frameworks have been proposed, but most assume causal sufficiency, i.e., no latent variables. In this paper, we show that divide-and-conquer strategies can be theoretically generalized beyond causal sufficiency to settings with latent variables. Specifically, we propose a recursive decomposition framework, termed DiCoLa, that enables divide-and-conquer causal discovery in the presence of latent variables. It recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global structure. We theoretically establish the soundness and completeness of the proposed framework. Extensive experiments on synthetic data demonstrate that our approach significantly improves computational efficiency across a range of causal discovery algorithms, while experiments on a real-world dataset further illustrate its practical effectiveness.

View full details

Oral

CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

Shigeng Wang ⋅ Chao Li ⋅ Yangyuxuan Kang ⋅ Jiawei Fan ⋅ Anbang Yao

Jul 8, 4:00 PM - 4:15 PM HALL C

In this paper, we present CAT-Q, **C**ost-efficient and **A**ccurate **T**ernary **Q**uantization, for compressing and accelerating LLMs. Unlike existing state-of-the-art ternary quantization methods that rely on data-intensive and costly quantization-aware training to mitigate severe performance degradation, CAT-Q is a simple yet effective post-training quantization scheme that is readily applicable to LLMs with diverse architectures and model sizes. It has two key components, learnable modulation (LM) and softened ternarization (ST), which are coupled from an optimization perspective. LM leverages a composition of learnable factors to modulate the distribution of pre-trained high-precision weights and the ternary threshold, making them less sensitive to ternarization. ST further introduces a differentiable transition function to guide the ternarization process toward stable convergence. We show that, for pre-trained LLMs with 1.7B to 8B parameters, CAT-Q can efficiently quantize them into ternary models using only 512 calibration samples, while achieving better performance than the seminal BitNet 1.58-bit v1 and v2 families (with 1.3B to 7B parameters) trained with 100B tokens, yielding about a 100,000$\times$ reduction in training tokens. Moreover, we show for the first time that CAT-Q can quantize much larger pre-trained LLMs with 14B to 235B parameters into leading ternary models within just 8 to 60 hours on 8 A100-80GB GPUs. Code is available at https://github.com/IntelChina-AI/BitTern.
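
To make the idea of a softened, differentiable ternarization concrete, here is a minimal NumPy sketch. It is not CAT-Q's actual formulation, which the abstract does not spell out: the tanh-based transition function, the `temperature` parameter, and the threshold heuristic `0.7 * mean(|w|)` are all assumptions chosen for illustration.

```python
import numpy as np

def hard_ternarize(w, delta):
    """Map weights to {-1, 0, +1}: zero inside [-delta, delta], sign outside."""
    return np.sign(w) * (np.abs(w) > delta)

def soft_ternarize(w, delta, temperature):
    """Differentiable surrogate (an assumption, not the paper's function).

    Sum of two shifted tanh steps; it approaches hard_ternarize as
    temperature -> 0 for weights away from the thresholds +/-delta.
    """
    return 0.5 * (np.tanh((w - delta) / temperature)
                  + np.tanh((w + delta) / temperature))

w = np.array([-1.2, -0.3, 0.05, 0.6, 0.9])
delta = 0.7 * np.mean(np.abs(w))  # heuristic threshold, assumed for illustration
hard = hard_ternarize(w, delta)
soft = soft_ternarize(w, delta, temperature=1e-3)
```

With a small temperature the soft mapping agrees with the hard one on these weights; a larger temperature gives a smoother transition with usable gradients near the thresholds.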

View full details

Oral

Information Flow Reveals When to Trust Language Models

Rui Xu ⋅ Yi Chen ⋅ Jiujiu Chen ⋅ Sihong Xie

Jul 8, 4:15 PM - 4:30 PM HALL D2

In retrieval-augmented generation, language models can generate incorrect responses if they fail to utilize query-relevant content from the retrieved evidence. This shifts the focus of uncertainty quantification (UQ) toward assessing contextual grounding, i.e., whether predictions are supported by query-relevant tokens. Recent UQ methods unpack language models to characterize how inputs are processed. Nevertheless, these methods focus on a few layers and overlook the whole progressive propagation within the model, thereby failing to fully capture the grounding dynamics essential for reliable uncertainty estimation. We use information flow to build a layer-wise trace that reveals each context token’s contribution to the output, providing an interpretable basis for assessing reliability. From this analysis, we introduce two measures to calibrate prediction confidence. The first, \textit{simulatability}, posits that a prediction is more likely to be correct when context token contributions align closely with their true relevance. The second, \textit{concentration}, asserts that a response is more likely to be correct when it is derived from a narrow, focused subset of tokens. Experiments show that our method achieves an average AUROC of 0.709, exceeding the runner-up performance of 0.676, while maintaining moderate computational cost.
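
As a rough illustration of what a concentration-style score could look like, the sketch below scores a vector of per-token contributions by one minus its normalized Shannon entropy. This definition is an assumption made for illustration only; the abstract does not give the paper's formula, and the function name `concentration` is hypothetical.

```python
import numpy as np

def concentration(contributions):
    """1 - normalized Shannon entropy of non-negative token contributions.

    Returns 1.0 when a single token carries all the contribution and 0.0
    when every token contributes equally. Illustrative definition only.
    """
    c = np.asarray(contributions, dtype=float)
    if c.size < 2:
        return 1.0
    total = c.sum()
    if np.any(c < 0) or total <= 0:
        raise ValueError("contributions must be non-negative with a positive sum")
    p = c / total
    nz = p[p > 0]                       # treat 0 * log(0) as 0
    entropy = -np.sum(nz * np.log(nz))
    return float(1.0 - entropy / np.log(c.size))
```

Under this toy definition, a response that draws on one or two context tokens scores close to 1, while one that spreads its reliance evenly across the context scores close to 0.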

View full details

Oral

DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

Emre Kavak ⋅ Tom Nuno Wolf ⋅ Christian Wachinger

Jul 8, 4:15 PM - 4:30 PM HALL D1

Dataset bias often leads deep learning models to exploit spurious correlations instead of task-relevant signals. We introduce the Standard Anti-Causal Model (SAM), a unifying causal framework that characterizes bias mechanisms and yields a conditional independence criterion for causal stability. Building on this theory, we propose DISCO$_m$ and sDISCO, efficient and scalable estimators of conditional distance correlation that enable independence regularization in gradient-based models. Across six diverse datasets, our methods consistently outperform or are competitive with existing observed-bias mitigation approaches, while requiring fewer hyperparameters and scaling seamlessly to multi-bias scenarios. This work bridges causal theory and practical deep learning, providing both a principled foundation and effective tools for robust prediction. Source Code: https://github.com/yakamoz5/DISCO.

View full details

Oral

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Zeju Qiu ⋅ Lixin LIU ⋅ Adrian Weller ⋅ Han Shi ⋅ Weiyang Liu

Jul 8, 4:15 PM - 4:30 PM HALL C

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single NVIDIA H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.

View full details

Oral

Geometric Flow Grounding: A Unified Manifold Decoupling Framework for Dynamics Discovery and Verification

Chang Yu ⋅ Yuxuan Luo ⋅ Yixuan Du ⋅ Yuqing Zhou ⋅ Siyuan Li ⋅ Jingbo Zhou ⋅ jiawei jiang ⋅ Zhen Lei ⋅ Stan Z Li

Jul 8, 4:15 PM - 4:30 PM GRAND BALLROOM 101-105

Modeling complex dynamics from observational data is fundamental to scientific discovery and artificial intelligence. However, existing approaches are often plagued by the entanglement of static state representations and instantaneous motion, leading to accumulated errors and off-manifold hallucinations where predicted trajectories violate intrinsic geometric constraints. To address this, we propose Geometric Flow Grounding, a unified framework that enforces dynamic evolution strictly along the tangent bundle of the learned data manifold via a differentiable Neural Tangent Projection Layer. By geometrically decoupling state representation from tangential dynamics, our method generalizes across diverse data regimes. In scientific discovery, GFG reduces numerical aliasing and improves long-horizon stability in sparse dynamical systems, while recovering interpretable gene regulatory motifs from single-cell data. For trustworthy AI, the projection residual provides a zero-shot metric for deepfake video detection by revealing inconsistencies with the implicit flow of pre-trained world models. Our results establish manifold-constrained projection as a universal operator for both discovering natural laws and verifying synthetic content. Code will be available at \url{https://github.com/yuchang97/GFG-public}

View full details

Oral

High-Accuracy Sampling for Diffusion Models and Log-Concave Distributions

Fan Chen ⋅ Sinho Chewi ⋅ Constantinos Daskalakis ⋅ Alexander Rakhlin

Jul 8, 4:15 PM - 4:30 PM AUDITORIUM

We present algorithms for diffusion model sampling which obtain $\delta$-error in $\mathrm{polylog}(1/\delta)$ steps, given access to $\widetilde O(\delta)$-accurate score estimates in $L^2$. This is an exponential improvement over all previous results. Specifically, under minimal data assumptions, the complexity is $\widetilde O(d\mathrm{polylog}(1/\delta))$ where $d$ is the dimension of the data; under a non-uniform $L$-Lipschitz condition, the complexity is $\widetilde O(\sqrt{dL}\mathrm{polylog}(1/\delta))$; and if the data distribution has intrinsic dimension $d_\star$, then the complexity reduces to $\widetilde O(d_\star\mathrm{polylog}(1/\delta))$. Our approach also yields the first $\mathrm{polylog}(1/\delta)$ complexity sampler for general log-concave distributions using only gradient evaluations.

View full details

Oral

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

Yihan Lin ⋅ Haoyang Li ⋅ Yang Li ⋅ Haitao Shen ⋅ Yihan Zhao ⋅ Chao Shao ⋅ Jing Zhang

Jul 8, 4:15 PM - 4:30 PM HALL B2

Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the best performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data settings, suggesting a promising direction for VLA training.

View full details

Oral

From Text to Forecasts: Bridging Modality Gap with Temporal Evolution Semantic Space

Lehui Li ⋅ Yuyao Wang ⋅ Jisheng Yan ⋅ Wei Zhang ⋅ Jinliang Deng ⋅ Haoliang Sun ⋅ Zhongyi Han ⋅ Yongshun Gong

Jul 8, 4:15 PM - 4:30 PM ASEM BALLROOM 201-203

Incorporating textual information into time-series forecasting holds promise for addressing event-driven non-stationarity; however, a fundamental modality gap hinders effective fusion: textual descriptions express temporal impacts implicitly and qualitatively, whereas forecasting models rely on explicit and quantitative signals. Through controlled semi-synthetic experiments, we show that existing methods over-attend to redundant tokens and struggle to reliably translate textual semantics into usable numerical cues. To bridge this gap, we propose TESS, which introduces a Temporal Evolution Semantic Space as an intermediate bottleneck between modalities. This space consists of interpretable, numerically grounded temporal primitives (distribution shift, volatility, shape, and lag) extracted from text by an LLM via structured prompting and filtered through confidence-aware gating. Experiments on four real-world datasets demonstrate up to a 29% reduction in forecasting error compared to state-of-the-art unimodal and multimodal baselines. Code is available at: https://github.com/olivia3395/TESS.

View full details

Oral

Modeling Hierarchical Thinking in Large Reasoning Models

G M Shahariar ⋅ Erfan Shayegani ⋅ Ali Nazari ⋅ Nael Abu-Ghazaleh

Jul 8, 4:30 PM - 4:45 PM HALL D2

Large Reasoning Models (LRMs) solve complex tasks by generating long Chain-of-Thought (CoT) sequences; however, the emergent dynamics governing reasoning trajectories are not well understood and can lead to inconsistencies and reasoning pathologies. In this work, we propose to approximate an LRM's emergent hierarchical reasoning dynamics as a trajectory within a Finite State Machine (FSM) transitioning among six abstract cognitive states. We demonstrate that these states and transitions can be captured in the latent state of the model. We believe that this representation can have different applications in the interpretability and optimization of LRMs. For example, by analyzing the topology of these transitions, we identify statistical shifts in reasoning strategies that help distinguish effective reasoning chains from those that fail. To illustrate these potential advantages, we propose $Q$-Value guided steering, a training-free inference-time control method that treats reasoning as a planning problem. We estimate the long-horizon utility of state transitions and apply sparse, orthogonal activation steering at sentence boundaries to align the CoT generation with optimal reasoning policies. Experiments across four benchmarks (AIME25, MATH-500, GSM8k, and GPQA Diamond) using three state-of-the-art open reasoning models demonstrate that the $Q$-Value steering policy achieves significant performance gains with ``surgical'' efficiency, often requiring $25\times$ fewer interventions than greedy and weighted baselines, which suggests that reasoning can be effectively controlled by guiding high-level cognitive dynamics rather than micro-managing token generation. Code is available at: https://github.com/shahariar-shibli/CoT-FSM.

View full details

Oral

Solving Time-Dependent Differential Equations with Physical Dynamical Systems

Chuan Liu ⋅ Yijie Chen ⋅ Ruibing Song ⋅ Wenhao Huang ⋅ Chunshu Wu ⋅ Deqian Kong ⋅ Ying Nian Wu ⋅ Kaiyuan Yang ⋅ Ang Li ⋅ Tony Geng

Jul 8, 4:30 PM - 4:45 PM GRAND BALLROOM 101-105

Time-Dependent Differential Equations (TDDEs) model dynamical processes across science and engineering, but time-critical applications require solvers that deliver high-fidelity trajectories under stringent latency constraints. Most existing TDDE solvers are limited by time discretization, forcing a latency-accuracy trade-off where smaller step sizes capture high-fidelity trajectories but incur prohibitive runtime, while larger steps meet real-time budgets at the cost of trajectory distortion. Dynamical System Machines (DSMs) offer a promising alternative by computing through continuous physical evolution, yet existing DSMs struggle to capture the spatiotemporal complexity of TDDEs. This work introduces DS-TS, a novel TDDE solver that is both accurate and efficient by leveraging the unique computational advantages of DSMs. DS-TS integrates three key innovations: (1) Excitatory-Inhibitory Inspired Coupling to better model complex spatial interactions; (2) State-aware Dynamic Nonlinearity to enable rich inter-node interactions and state-dependent spatiotemporal correlations; and (3) Hierarchical Temporal Integration to capture high-order temporal dependencies. Experiments demonstrate that DS-TS achieves high-fidelity solutions while delivering orders-of-magnitude improvements in speed ($\sim 10^3\times$) and energy efficiency ($\sim 10^5\times$) compared to baseline solvers.

View full details

Oral

ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training

Janghwan Lee ⋅ Sihwa Lee ⋅ Jinseok Kim ⋅ Yongjik Kim ⋅ Jieun Lim ⋅ Jinwook Oh ⋅ Jungwook Choi

Jul 8, 4:30 PM - 4:45 PM HALL C

Large Reasoning Models (LRMs) achieve strong problem-solving through long chain-of-thought, but their deployment is constrained by the high cost of full-precision inference and growing KV cache footprints. Microscaled FP4 formats enable efficient FP4 deployment; however, fully quantizing weights, activations, and KV caches (W4A4KV4) causes severe reasoning degradation that existing PTQ and QAT fail to recover. We identify that FP4 failures concentrate on low-entropy tokens—precise symbolic commitments such as digits and operators—where quantization noise inflates sampling errors that cascade through reasoning traces. Based on this insight, we propose ReQAT, a reasoning-centric FP4 training framework with three components: (i) Trace-Aligned QAT (TAQ), which revisits identical reasoning traces to focus updates on critical low-entropy decisions; (ii) Selective Entropy Minimization (SEM), which reinforces confidence at low-entropy positions; and (iii) Q-FIT, a quantization-friendly initialization that jointly calibrates RoPE-consistent KV cache transformations to stabilize QAT. Under the same training budget, ReQAT not only recovers but surpasses BF16 fine-tuning accuracy, while delivering up to $3.9\times$ throughput speedup on NVIDIA DGX Spark and $3.1\times$ on B200. The project repository is available at https://github.com/aiha-lab/ReQAT.

View full details

Oral

Disentangling Latent Risk Pathways via Bayesian Hypergraph Inference

Shengxian Ding ⋅ Haonan Gao ⋅ Pangpang Liu ⋅ Xinyuan Tian ⋅ Yize Zhao

Jul 8, 4:30 PM - 4:45 PM HALL D1

Electronic health records (EHR) pose large-scale multi-disease modeling problems in which many outcomes are rare and strongly influenced by shared risk factors. While modern approaches achieve strong predictive performance, they often treat diseases independently or rely on black-box architectures, offering limited insight into how risk factors organize disease risk and little principled uncertainty quantification. We introduce a Bayesian hypergraph inference framework that reframes multi-disease modeling around **latent, risk-factor-modulated disease pathways**. Risk factors act on hyperedges, latent disease subsets with shared risk patterns, allowing diseases to participate in multiple distinct pathways and enabling interpretable, higher-order structure beyond pairwise associations. A repulsion prior encourages parsimonious and identifiable structure, while posterior inference provides calibrated uncertainty over both disease groupings and risk-factor influence. To enable scalable inference on large EHR datasets, we develop a structured variational inference algorithm that preserves logical dependencies among hyperedge existence, disease membership, and pathway-level effects. Experiments on simulated data and UK Biobank demonstrate stable and interpretable disease pathway structure, well-calibrated uncertainty, improved estimation for rare diseases, and competitive predictive performance.

View full details

Oral

Rational Transductors

Mehryar Mohri

Jul 8, 4:30 PM - 4:45 PM AUDITORIUM

Standard Transformers excel at semantic modeling but struggle with rigid sequential logic and state tracking. Theoretical work establishes that self-attention is limited to $\mathsf{AC}^0$ (under hard attention) or $\mathsf{TC}^0$ (under soft attention), complexity classes that often fail to support robust length generalization on sequential problems without intermediate chain-of-thought (Hahn, 2020; Merrill et al., 2022). In this work, we introduce \emph{Rational Transductors}, a dual-stream architecture that augments the Transformer with a matrix-valued recurrence derived from Weighted Finite Automata (WFA). By injecting rational state information into the attention mechanism via a *Deep Rational Injection* scheme, our framework strictly generalizes Transformers to capture all Regular Languages, $\mathsf{NC}^1$-complete problems (such as Boolean Formula Evaluation), and fundamental separations like Parity and Modular Counting, while preserving $O(\log T)$ parallel training efficiency. Theoretical analysis and empirical results demonstrate that Rational Transductors solve the "Regular Gap," enabling robust length generalization on algorithmic tasks where standard Transformers fail, without the sequential computational bottlenecks of traditional RNNs.
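
For readers unfamiliar with weighted finite automata, the following sketch shows a two-state WFA that computes parity as a product of per-symbol matrices, first sequentially and then by pairwise (tree-style) combination. It illustrates only the general fact that a matrix-valued recurrence is associative and can therefore be reduced in logarithmically many rounds; it is not the Rational Transductor architecture, and all names (`ALPHA`, `BETA`, `M`, `parity_tree`) are hypothetical.

```python
import numpy as np

# Hypothetical 2-state WFA for parity (names are illustrative).
ALPHA = np.array([1, 0])              # start in the "even" state
BETA = np.array([0, 1])               # read out the "odd" state
M = {0: np.eye(2, dtype=int),         # reading 0 keeps the state
     1: np.array([[0, 1], [1, 0]])}   # reading 1 swaps even and odd

def parity_sequential(bits):
    """Left-to-right recurrence: h_t = h_{t-1} @ M[x_t]."""
    h = ALPHA
    for b in bits:
        h = h @ M[b]
    return int(h @ BETA)

def parity_tree(bits):
    """Same matrix product, combined pairwise in ceil(log2 T) rounds."""
    mats = [M[b] for b in bits] or [np.eye(2, dtype=int)]
    while len(mats) > 1:
        paired = [mats[i] @ mats[i + 1] for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:             # carry an unpaired last matrix forward
            paired.append(mats[-1])
        mats = paired
    return int(ALPHA @ mats[0] @ BETA)
```

Both functions return 1 exactly when the input contains an odd number of ones. Each round of `parity_tree` consists of independent multiplications, which is what would allow a parallel implementation to finish in a number of rounds logarithmic in the sequence length.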

View full details

Oral

Markov Chain Monte Carlo without Evaluating the Target: an Auxiliary Variable Approach

Wei Yuan ⋅ Guanyang Wang

Jul 8, 4:30 PM - 4:45 PM ASEM BALLROOM 201-203

In sampling tasks, it is common for target distributions to be known up to a normalizing constant. However, in many situations, even evaluating the unnormalized distribution can be costly or infeasible. This issue arises in scenarios such as sampling from the Bayesian posterior for tall datasets and `doubly-intractable' distributions. In this paper, we begin by observing that seemingly different Markov chain Monte Carlo (MCMC) algorithms, such as the exchange algorithm, PoissonMH, and TunaMH, can be unified under a simple common procedure. We then extend this procedure into a novel framework that allows the use of auxiliary variables in both the proposal and the acceptance--rejection step. Several new MCMC algorithms emerge from this framework that use estimated gradients to guide the proposal moves. They demonstrate significantly better performance than existing methods on both synthetic and real datasets. We also develop theory for the new framework and use it to simplify and extend results for existing algorithms. The code to reproduce the experimental results can be found at https://github.com/ywwes26/Auxiliary-MCMC.

View full details

Oral

Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Huihan Liu ⋅ Changyeon Kim ⋅ Bo Liu ⋅ Minghuan Liu ⋅ Yuke Zhu

Jul 8, 4:30 PM - 4:45 PM HALL B2

Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we find that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay data size. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer size while maintaining strong forward learning capabilities. Furthermore, we find that VLAs can retain relevant knowledge from prior tasks despite performance degradation while learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay.

View full details

Oral

On the Identifiability of Poisson Branching Structural Causal Model Under Latent Confounding

Jie Qiao ⋅ Zihuai Zeng ⋅ Ruichu Cai ⋅ Zhengming Chen ⋅ Zhifeng Hao

Jul 8, 4:45 PM - 5:00 PM HALL D1

Causal discovery from observational count data poses unique challenges, particularly when the data exhibit inherent branching structures, such as an upstream ad impression event triggering a downstream purchase event with a certain probability. Such branching dynamics are naturally modeled by thinning operators (for branching) and an independent Poisson distribution (for exogenous noise), constituting a Poisson Branching Structural Causal Model (PB-SCM). However, existing approaches based on PB-SCM rely on the restrictive assumption of causal sufficiency, failing to account for ubiquitous latent confounders. In this work, we propose a Latent Confounding Poisson Branching Structural Causal Model (LC-PB-SCM) to bridge this gap. We leverage the Probability Generating Function (PGF) to characterize the complex dependencies introduced by latent confounding. Then, we establish a Trie representation theorem that maps the branching structure to algebraic properties of PGF monomials. Based on the local PGF, we establish a complete identifiability condition for local 3-variables covering all causal patterns distinguishable up to monomial equivalence. Finally, we propose a practical algorithm to learn causal structures under latent confounding and demonstrate its effectiveness through experiments on both synthetic and real-world datasets.
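
The branching mechanism described above can be simulated in a few lines. The sketch below draws a single cause-effect pair in which the effect is a binomially thinned copy of the cause plus independent Poisson noise. It covers only this two-variable case without latent confounders, and the function name and parameters (`simulate_pair`, `lam_x`, `lam_eps`, `alpha`) are hypothetical.

```python
import numpy as np

def simulate_pair(n, lam_x, lam_eps, alpha, seed=0):
    """Sample n draws from X -> Y with Y = thin(X, alpha) + eps.

    Binomial thinning: each of the X upstream events independently
    triggers one downstream event with probability alpha.
    """
    rng = np.random.default_rng(seed)
    x = rng.poisson(lam_x, size=n)        # upstream counts
    thinned = rng.binomial(x, alpha)      # events that propagate downstream
    eps = rng.poisson(lam_eps, size=n)    # exogenous Poisson noise
    return x, thinned + eps

x, y = simulate_pair(n=200_000, lam_x=4.0, lam_eps=1.0, alpha=0.5)
```

Because a thinned Poisson variable is again Poisson, `y` here is Poisson with rate `alpha * lam_x + lam_eps`, so its sample mean and variance should both be close to 3.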

View full details

Oral

Training-Free Bayesian Filtering with Generative Emulators

Thomas Savary ⋅ François Rozet ⋅ Gilles Louppe

Jul 8, 4:45 PM - 5:00 PM GRAND BALLROOM 101-105

Bayesian filtering is a well-known problem that aims to estimate plausible states of a dynamical system from observations. Among existing approaches to solve this problem, particle filters are theoretically exact for non-linear dynamics and observations, but suffer from poor scalability in high dimensions. In this work, we show that diffusion-based emulators of dynamical systems can be used to implement, without additional training, an optimal variant of particle filters that has remained largely unexplored due to implementation challenges with classical numerical solvers. Experiments on nonlinear chaotic systems, including atmospheric dynamics, demonstrate that the proposed approach successfully scales particle filtering to high-dimensional settings.

View full details

Oral

Skip a Layer or Loop It? Learning Program-of-Layers in LLMs

Ziyue Li ⋅ Yang Li ⋅ Tianyi Zhou

Jul 8, 4:45 PM - 5:00 PM HALL C

Large language models (LLMs) perform inference through a fixed-depth, fixed-order, non-recurrent execution of all layers. We reveal the widespread existence of training-free, flexible, dynamic *program-of-layers (PoLar)*, where pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter program executions can achieve the same or better accuracy, while incorrect predictions of the original LLM can be corrected by alternative programs with fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To efficiently achieve PoLar in practice, we propose a lightweight PoLar prediction network, which learns to generate execution programs that dynamically skip or repeat pretrained layers for each input. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM’s latent reasoning capacity.
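
The notion of an execution program over layers can be pictured with a toy interpreter. In the sketch below, a program is simply a list of layer indices: leaving an index out skips that layer and repeating it loops the layer. The arithmetic lambdas are stand-ins for pretrained layers, and `run_program` is a hypothetical helper rather than anything from the paper.

```python
def run_program(layers, program, x):
    """Apply layers in the order given by program (a list of layer indices).

    Omitting an index skips that layer; repeating one loops it.
    """
    for idx in program:
        x = layers[idx](x)
    return x

# Toy stand-ins for pretrained layers (purely illustrative).
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

standard = run_program(layers, [0, 1, 2], 5)    # full forward pass
skipped = run_program(layers, [0, 2], 5)        # skip layer 1
looped = run_program(layers, [0, 1, 1, 2], 5)   # run layer 1 twice
```

A per-input predictor, such as the prediction network the abstract mentions, would then be responsible for emitting the `program` list.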

View full details

Oral

Path-dependent Discrete Amortized Inference

Tiago Silva ⋅ Esmeralda S Whitammer ⋅ Salem Lahlou

Jul 8, 4:45 PM - 5:00 PM ASEM BALLROOM 201-203

We consider the problem of sampling compositional and discrete objects from a given unnormalized posterior distribution. Notably, recent studies have shown that this problem can be efficiently solved by learning a deterministic Markov Decision Process (MDP) that progressively builds each object in proportion to the posterior. In this work, however, we demonstrate that the Markovian assumption can both hamper signal propagation during training and catastrophically reduce the learned sampler's expressivity due to state aliasing. To address these issues, we propose lifting the MDP with learnable latent dynamics that allow the underlying policy to depend on the entire past trajectory, and not only on the current state. In view of this, we refer to the resulting method as \emph{path-dependent discrete amortized inference}. Importantly, we provably extend existing learning algorithms for amortized samplers to our setting. In experiments on standard benchmark problems, we also show that our approach often leads to faster learning convergence and improved state space exploration relative to prior techniques.

View full details

Oral

Reward-free Alignment for Conflicting Objectives

Peter Chen ⋅ Xiaopeng Li ⋅ Xi Chen ⋅ Tianyi Lin

Jul 8, 4:45 PM - 5:00 PM HALL D2

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a **R**eward-free **A**lignment framework for **C**onflicted **O**bjectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve the convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework with LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.

View full details

Oral

To Grok Grokking: Provable Grokking in Ridge Regression

Mingyue Xu ⋅ Gal Vardi ⋅ Itay Safran

Jul 8, 4:45 PM - 5:00 PM AUDITORIUM

We study *grokking* — the onset of generalization long after overfitting — in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.
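
The training procedure studied here, gradient descent with weight decay on an over-parameterized linear model, is easy to reproduce at toy scale. The sketch below runs it on synthetic data and compares the result with the closed-form ridge solution. The problem sizes, step size, initialization, and loss scaling are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def ridge_gd(X, y, lam, steps, w0):
    """Gradient descent on (1/2n)||Xw - y||^2 + (lam/2)||w||^2."""
    n = X.shape[0]
    # Step size 1/L, where L is the largest curvature of the objective.
    L = np.linalg.eigvalsh(X.T @ X / n).max() + lam
    w = w0.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + lam * w   # weight decay enters as lam * w
        w -= grad / L
    return w

def ridge_closed_form(X, y, lam):
    """Minimizer of the same objective, via the normal equations."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
n, d = 20, 100                                   # over-parameterized: d > n
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) / np.sqrt(d)
w_gd = ridge_gd(X, y, lam=0.1, steps=5000, w0=rng.normal(size=d))
w_star = ridge_closed_form(X, y, lam=0.1)
```

In this objective, the components of `w` orthogonal to the rows of `X` are untouched by the data-fit term and shrink only through weight decay, by a factor of `1 - lam / L` per step, so a small `lam` makes that part of the iterate converge slowly even after the training loss is already small.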

View full details

Oral

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

Shichao Fan ⋅ Kun Wu ⋅ Zhengping Che ⋅ Xinhua Wang ⋅ Di Wu ⋅ Fei Liao ⋅ Ning Liu ⋅ Yixue Zhang ⋅ Zhen Zhao ⋅ Zhiyuan Xu ⋅ Meng Li ⋅ Qingjie Liu ⋅ Shanghang Zhang ⋅ Min Wan ⋅ Jian Tang

Jul 8, 4:45 PM - 5:00 PM HALL B2

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present **XR-1**, a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. At its core, XR-1 introduces the *Unified Vision-Motion Codes (UVMC)*, a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a *three-stage training paradigm*: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 12,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $\pi_0$ and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.

Oral

FlatLand: Personalized Graph Federated Learning via Tailored Lorentz Space

Jiahong Liu ⋅ Ram Samarth B B ⋅ Xinyu Fu ⋅ Menglin Yang ⋅ Weixi Zhang ⋅ ZHITAO YING ⋅ Irwin King

Jul 9, 10:00 AM - 10:15 AM AUDITORIUM

Personalization has become a pivotal field of study in contemporary intelligent systems. Federated learning enables privacy-preserving collaborative training, but highly heterogeneous client data remain challenging, especially in graph federated learning where clients possess structurally diverse graphs. Existing personalized federated learning (PFL) methods ignore the intrinsic geometric properties of diverse graph structures. We propose FlatLand, a novel personalized federated learning method that embeds different clients' data in a tailored Lorentz space of hyperbolic geometry. Our key insight is that hyperbolic geometry naturally accommodates the intrinsic negative curvature prevalent in real-world graphs, while the time-like dimension in Lorentz space provides a principled way to encode client-specific heterogeneity. We develop a parameter decoupling strategy that separates heterogeneous information (captured in time-like parameters) from common knowledge (preserved in space-like parameters), enabling direct aggregation without requiring client similarity estimation or extra calculation modules. Empirical results on diverse federated graph learning tasks demonstrate that FlatLand achieves superior performance, particularly in low-dimensional settings. Code is available in our GitHub repository.

Oral

Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?

Muquan Li ⋅ Yingyi Ma ⋅ Yihong Huang ⋅ Hang Gou ⋅ KE QIN ⋅ Ming Li ⋅ Yuan-Fang Li ⋅ Tao He

Jul 9, 10:00 AM - 10:15 AM HALL B2

Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy–robustness trade-off because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C$^2$R), a framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a *perturbation score* that approximates each sample’s robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C$^2$R achieves the best robust accuracy, outperforming prior robust DD by 2.8% on average.

Oral

Midtraining Bridges Pretraining and Posttraining Distributions

Emmy Liu ⋅ Graham Neubig ⋅ Chenyan Xiong

Jul 9, 10:00 AM - 10:15 AM HALL C

Midtraining, the practice of mixing specialized data with more general pretraining data in an intermediate training phase, has become widespread in language model development, yet there is little understanding of what makes it effective. We propose that midtraining functions as distributional bridging by providing better initialization for posttraining. We conduct controlled pretraining experiments, and find that midtraining benefits are largest for domains distant from general pretraining data, such as code and math, and scale with the proximity advantage the midtraining data provides toward the target distribution. In these domains, midtraining consistently outperforms continued pretraining on specialized data alone both in-domain and in terms of mitigating forgetting. We further conduct an investigation on the starting time and mixture weight of midtraining data, using code as a case study, and find that time of introduction and mixture weight interact strongly such that early introduction of specialized data is amenable to high mixture weights, while late introduction requires lower ones. This suggests that late introduction of specialized data outside a plasticity window cannot be compensated for by increasing data mixtures later in training. Beyond midtraining itself, this suggests that distributional transitions between any training phases may benefit from similar bridging strategies.

Oral

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Arnas Uselis ⋅ Andrea Dittadi ⋅ Seong Joon Oh

Jul 9, 10:00 AM - 10:15 AM GRAND BALLROOM 101-105

Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems, yet modern models, despite massive training sets, see only a tiny fraction of the combinatorial input space. We ask what structure representations *must* have to support generalization to unseen combinations. We formalize three desiderata (divisibility, transferability, stability) and show they impose necessary geometric constraints under standard training: representations must decompose linearly into per-concept components, orthogonal across concepts. This grounds the Linear Representation Hypothesis as a necessary consequence of compositional generalization, and yields dimension bounds linking the number of composable concepts to embedding geometry. Empirically, across CLIP, SigLIP, and DINO, we find partial linear factorization with low-rank near-orthogonal per-concept factors, and the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the geometry they may converge to. Code: https://github.com/oshapio/necessary-compositionality

Oral

Distributional Inverse Reinforcement Learning

Feiyang Wu ⋅ Ye Zhao ⋅ Anqi Wu

Jul 9, 10:00 AM - 10:15 AM HALL D2

We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a deterministic reward estimate or match only expected returns, our method captures richer structure in expert behavior, particularly in learning the reward distribution, by minimizing first-order stochastic dominance (FSD) violations and thus integrating distortion risk measures (DRMs) into policy learning, enabling the recovery of both reward distributions and distribution-aware policies. This formulation is well-suited for behavior analysis and risk-aware imitation learning. Theoretical analysis shows that the algorithm converges with $\mathcal{O}(\varepsilon^{-2})$ iteration complexity. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks demonstrate that our method recovers expressive reward representations and achieves state-of-the-art imitation performance.

Oral

Position: There are futures that benchmark-driven AI cannot see

Sobhan Lotfi ⋅ Ava Iranmanesh ⋅ Lachin Naghashyar ⋅ Ali Shirali ⋅ Fateme Haredasht ⋅ Sanmi Koyejo ⋅ Phil Torr ⋅ Yong Suk Lee ⋅ Fazl Barez ⋅ Joel Lehman ⋅ Peter Norvig ⋅ Arvind Narayanan

Jul 9, 10:00 AM - 10:15 AM HALL D1

Breakthroughs often come from ideas we could not have predicted in advance. In biology, this is called exaptation: traits evolved for one function become decisive for another. Scientific progress works similarly, but only if ideas survive periods when they appear uncompetitive by current metrics. This position paper argues that AI's benchmark-centered selection environment, while successful at bypassing complex debates about the nature of intelligence, taxes exaptation. When one selection rule dominates, ideas that do not fit it have nowhere to persist. The cost grows acute as the field shifts from asking "can machines exhibit intelligent behavior?" to asking "can machines exhibit intelligent behavior such that they are aligned, interpretable, and safe?" These are philosophically distinct questions that may require discoveries that we cannot specify. We propose mechanisms to restore exaptive capacity without abandoning benchmarking: plural evaluation regimes, protected venues for non-comparable work, long-horizon funding, and training norms that encourage researchers to question selection rules, not only optimize within them.

Oral

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

Shaoxiong Zhan ⋅ Yanlin Lai ⋅ Zheng Liu ⋅ Lin Hai ⋅ Shen Li ⋅ Xiaodong Cai ⋅ Zijian Lin ⋅ Wen Huang ⋅ Hai-Tao Zheng

Jul 9, 10:00 AM - 10:15 AM ASEM BALLROOM 201-203

Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce **3ViewSense**, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.

Oral

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Jiaqi Liu ⋅ Kaiwen Xiong ⋅ Peng Xia ⋅ Yiyang Zhou ⋅ Haonian Ji ⋅ Lu Feng ⋅ Siwei Han ⋅ Mingyu Ding ⋅ Huaxiu Yao

Jul 9, 10:15 AM - 10:30 AM ASEM BALLROOM 201-203

Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on chart reasoning, geometric problem solving, and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the Qwen-VL base model.

Oral

ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

Long (Tony) Lian ⋅ Sida Wang ⋅ Felix Juefei-Xu ⋅ Tsu-Jui Fu ⋅ Xiuyu Li ⋅ Adam Yala ⋅ Trevor Darrell ⋅ Alane Suhr ⋅ Yuandong Tian ⋅ Xi Victoria Lin

Jul 9, 10:15 AM - 10:30 AM HALL C

Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but their inherently sequential decoding incurs substantial latency, motivating parallelization of the generation process. However, existing parallel reasoning approaches suffer from performance degradation compared to their sequential counterparts, and often rely on specialized inference engines. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that matches the accuracy of comparably sized sequential reasoning models while significantly reducing inference latency via three key innovations: 1) a two-stage parallel trajectory generator that produces high-quality parallel chain-of-thought data for supervised fine-tuning; 2) a trie-based rollout design that enables parallel reasoning on any off-the-shelf autoregressive inference engine; and 3) a parallelization-aware reinforcement learning framework that trains the model to balance reasoning accuracy with effective parallelization. Across six challenging math reasoning benchmarks, ThreadWeaver trained on top of Qwen3-8B achieves performance on par with cutting-edge sequential reasoning models (79.9% on AIME24 and 71.9% on average) while delivering up to 1.53x speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.

Oral

MV-FGAD: Towards Efficient and Effective Federated Graph Anomaly Detection via Multi-view Learning

Junyi Yan ⋅ KE LIANG ⋅ Hao Yu ⋅ Meng Liu ⋅ Hao Tan ⋅ Tianrui Liu ⋅ Jun-Jie Huang ⋅ Xinwang Liu

Jul 9, 10:15 AM - 10:30 AM AUDITORIUM

Federated graph anomaly detection (GAD) aims to identify abnormal nodes in distributed subgraphs through federated learning. However, existing methods suffer from two limitations. 1) Their reliance on neighborhood aggregation assumes that anomalous information can be sufficiently captured, which often fails in federated learning with partitioned client subgraphs. 2) They overlook the detection bottleneck caused by weak attribute or structural anomalies. To tackle these challenges, we revisit federated GAD and reveal that weak anomalies exhibit harder-to-detect signals compared to strong anomalies. Specifically, we propose MV-FGAD, an efficient and effective federated GAD framework for mining anomalies of varying strengths. MV-FGAD introduces a federated knowledge learning module to aggregate and broadcast shared knowledge, which is further exploited to optimize local topological structures. Moreover, it designs a multi-view learning mechanism to capture diverse anomaly patterns, and adopts a Mahalanobis distance-based scoring strategy to quantify node abnormality across views. Extensive experiments on real-world datasets of varying types and scales demonstrate MV-FGAD's efficiency and effectiveness. Our code is publicly available at https://github.com/Junyi-Yan/MV-FGAD.

Oral

Non-Euclidean Gradient Descent Operates at the Edge of Stability

Rustem Islamov ⋅ Michael Crawshaw ⋅ Jeremy Cohen ⋅ Robert Gower

Jul 9, 10:15 AM - 10:30 AM GRAND BALLROOM 101-105

The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue) of the Hessian approaches and then hovers near the stability threshold $2/\eta$ during gradient descent (GD) with step size $\eta$. Despite (apparently) violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of Directional Smoothness [Mishkin et al., 2024]. This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and their normalized versions. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/\eta$. Practically, our framework provides a geometry-aware spectral diagnostic that can be applied across a broad class of non-Euclidean gradient methods.
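
The threshold $2/\eta$ mentioned above can be illustrated on a one-dimensional quadratic, where the sharpness is simply the curvature. The following is a minimal sketch for intuition only, not the paper's generalized sharpness measure; the function name `gd_on_quadratic` and the chosen constants are illustrative assumptions.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, num_steps=50):
    """Run plain gradient descent on f(x) = 0.5 * sharpness * x**2.

    Each step multiplies x by (1 - step_size * sharpness), so the iterates
    shrink only when sharpness < 2 / step_size.
    Returns |x| after num_steps steps.
    """
    x = x0
    for _ in range(num_steps):
        grad = sharpness * x          # gradient of the quadratic
        x = x - step_size * grad      # vanilla GD update
    return abs(x)


step_size = 0.1
threshold = 2.0 / step_size           # classical stability threshold 2/eta

# Sharpness slightly below the threshold: iterates contract toward 0.
below = gd_on_quadratic(0.9 * threshold, step_size)
# Sharpness slightly above the threshold: iterates blow up.
above = gd_on_quadratic(1.1 * threshold, step_size)

print(f"threshold 2/eta = {threshold}")
print(f"|x| after 50 steps, sharpness below threshold: {below:.3e}")
print(f"|x| after 50 steps, sharpness above threshold: {above:.3e}")
```

On a quadratic the boundary is sharp: curvature below $2/\eta$ converges and curvature above it diverges. The EoS phenomenon described in the abstract concerns neural networks, where the sharpness instead hovers near this value during training.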

Oral

CausalGame: Benchmarking Causal Thinking of LLM Agents in Games

Zhenhao Chen ⋅ Yongqiang Chen ⋅ Chenxi Liu ⋅ Junchi Yu ⋅ Xiangchen Song ⋅ Zijian Li ⋅ Jialin Li ⋅ Phil Torr ⋅ Bo Han ⋅ Kun Zhang

Jul 9, 10:15 AM - 10:30 AM HALL D1

Building AI Scientist agents with Large Language Models (LLMs) has recently received growing attention. Since scientific discovery fundamentally relies on uncovering causal relationships from observations, the capability of causal thinking, which distinguishes causation from correlation and hidden biases, is essential to LLM agents. Although a number of benchmarks for AI scientists exist, they do not explicitly incorporate challenges from hidden confounders, selection bias, and noisy measurements that widely exist in real-world scientific discovery. To this end, we present CausalGame, a benchmark that evaluates the causal thinking capabilities of LLM agents through interactive games. More specifically, we ask LLM agents to actively design experimental protocols, collect observation data, and derive a final solution with an explanation report. To emulate realistic scientific discovery challenges, we design 14 game settings that incorporate selection bias, noisy measurements, and hidden confounders. The results with 29 frontier LLM agents show that they consistently fail to reason about and recover the underlying causal relationships required to solve the games. CausalGame provides a controlled testbed for evaluating causal thinking of AI Scientist agents. The project is available at causalgame.github.io.

Oral

On the Difficulty of Learning a Meta-network for Training Data Selection

Zilin Du ⋅ Junqi Zhao ⋅ Albert Boyang Li

Jul 9, 10:15 AM - 10:30 AM HALL B2

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bi-level optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and a lack of informative features that correlate with data quality. We present a mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.

Oral

On the Role of Computation in Reinforcement Learning

Raj Ghugare ⋅ Michał Bortkiewicz ⋅ Alicja Ziarko ⋅ Benjamin Eysenbach

Jul 9, 10:15 AM - 10:30 AM HALL D2

How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that this architecture (1) achieves stronger performance simply by using more compute, and (2) achieves stronger generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual networks using up to 5 times more parameters.

Oral

Characterizing, Evaluating, and Optimizing Complex Reasoning

Haoran Zhang ⋅ Yafu Li ⋅ Zhi Wang ⋅ Zhilin Wang ⋅ Shunkai Zhang ⋅ Xiaoye Qu ⋅ Yu Cheng

Jul 9, 10:30 AM - 10:45 AM HALL D1

Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality at the macro and micro levels in terms of efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method, capturing complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks. Code and data are available at https://github.com/Simplified-Reasoning/TRM.

Oral

PhenoBrain: Phenotype-Conditioned Long-Range Communication for Multi-Modal Brain Network Analysis

Lingyuan Meng ⋅ KE LIANG ⋅ Hao Li ⋅ Meng Liu ⋅ Weijia Shi ⋅ Miaomiao Li ⋅ Yang Gao ⋅ Xinwang Liu

Jul 9, 10:30 AM - 10:45 AM AUDITORIUM

Multi-modal brain network analysis aims to predict neuropsychiatric status from functional connectomes with heterogeneous phenotypes. However, most existing methods treat phenotypes as auxiliary features and perform late fusion, implicitly assuming that the connectome representation should be learned in the same way regardless of phenotype. Yet in clinical neuroscience, the same functional connectivity pattern may support different conclusions under different phenotype contexts. To bridge this gap, we propose PhenoBrain, a novel framework for multi-modal brain network analysis that injects phenotype information at the mechanism level rather than only at the classifier level. Specifically, we propose a phenotype-conditioned long-range routing mechanism, which learns a subject-specific multi-hop communication kernel to model long-range connectome interactions. Furthermore, we propose a phenotype-guided attention regulation method, which uses phenotypic information as a conditional prior to regulate the learning process of attention in brain networks. To verify the effectiveness of our method, we construct two multi-modal brain network analysis datasets based on open-source image data. Extensive experiments demonstrate that PhenoBrain achieves state-of-the-art performance.

Oral

The Signal is in the Steps: Local Scoring for Reasoning Data Selection

Hoang Anh Just ⋅ Myeongseob Ko ⋅ Ruoxi Jia

Jul 9, 10:30 AM - 10:45 AM HALL B2

Distilling long-form reasoning from teacher models into smaller students requires selecting which candidate solutions to train on. Recent work argues that one should select responses the student model assigns highest probability, i.e., favoring solutions "natural" to the student. However, we find that this approach works within a single teacher but fails when scaling to long reasoning traces from multiple diverse teachers. We identify a key cause: this approach scores entire solutions, but students generalize by recombining familiar reasoning steps, not by memorizing complete solutions. Full-trajectory scoring optimizes the wrong target; it rewards global fluency while the transferable signal lies in local step transitions. We propose Local Average Log Probability (LALP), which scores each reasoning step using only a small window of preceding context, measuring whether each step is justified by its immediate premises rather than whether the full response looks natural to the student. LALP enables two practical use cases: selecting the best teacher before fine-tuning and curating training data from diverse teacher pools. Across math, coding, and science reasoning tasks, LALP consistently improves accuracy when selecting the most natural solutions by a large margin.

Oral

PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA

Shihao Wang ⋅ Xueru Zhang

Jul 9, 10:30 AM - 10:45 AM GRAND BALLROOM 101-105

Applying differential privacy (DP) via DP-SGD to Low-Rank Adaptation (LoRA) is a natural approach for privacy-preserving fine-tuning. However, LoRA's low-rank parameterization poses a fundamental challenge. In LoRA, each trainable update is represented as a low-rank matrix $Z = AB^\top$, but this factorization is inherently *non-identifiable*: many factor pairs $(A, B)$ represent the same update $Z$. As a result, applying DP-SGD directly to the factors induces *gauge-dependent* perturbations on $Z$, and we show that this naive DP-LoRA can lead to unbounded noise amplification. We propose **PRISM**, an intrinsic DP mechanism for LoRA that is gauge invariant by construction, avoids bilinear noise amplification, and admits an efficient low-dimensional noise sampler. Moreover, PRISM yields a closed-form characterization of the effective intrinsic noise induced on $Z$, enabling stable privacy–utility trade-offs through bounded, gauge-invariant perturbations. We establish standard $(\varepsilon,\delta)$-DP guarantees for PRISM and introduce a DP-aware, gauge-invariant adaptive update rule that prevents adaptive optimization from amplifying injected privacy noise, improving numerical stability in practice.
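
The non-identifiability described above can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the PRISM mechanism: it uses a scalar gauge `c` (rescaling $A$ by `c` and $B$ by `1/c`), small hand-picked matrices, and fixed perturbations `E_A` and `E_B` as a stand-in for the noise a DP-SGD step would inject into the factors. All names are hypothetical.

```python
def matmul_transpose(a, b):
    """Return a @ b^T for list-of-lists matrices a (n x r) and b (m x r)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]

def scale(m, c):
    """Multiply every entry of matrix m by the scalar c."""
    return [[c * v for v in row] for row in m]

def add(m1, m2):
    """Entrywise sum of two matrices of the same shape."""
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def frob_diff(m1, m2):
    """Frobenius norm of m1 - m2."""
    return sum((u - v) ** 2
               for r1, r2 in zip(m1, m2)
               for u, v in zip(r1, r2)) ** 0.5

# Low-rank factors (rank 2) and the update they represent, Z = A B^T.
A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # 3 x 2
B = [[2.0, 0.0], [1.0, 1.0]]                # 2 x 2
Z = matmul_transpose(A, B)

# A different factor pair for the same update: (c * A, B / c).
c = 100.0
A_g, B_g = scale(A, c), scale(B, 1.0 / c)
Z_g = matmul_transpose(A_g, B_g)
same_update_error = frob_diff(Z, Z_g)

# Identical factor-space perturbations applied to both factor pairs.
E_A = [[0.1, -0.1], [0.05, 0.0], [0.0, 0.1]]
E_B = [[0.1, 0.0], [-0.1, 0.1]]
dev = frob_diff(matmul_transpose(add(A, E_A), add(B, E_B)), Z)
dev_g = frob_diff(matmul_transpose(add(A_g, E_A), add(B_g, E_B)), Z)

print(f"||Z - Z_g||_F             = {same_update_error:.2e}")
print(f"deviation, original gauge = {dev:.3f}")
print(f"deviation, rescaled gauge = {dev_g:.3f}")
```

Both factor pairs encode the same $Z$, yet the same factor-level perturbation moves $Z$ far more under the rescaled pair, because the term $c\,A E_B^\top$ grows linearly in `c`. This is the gauge dependence the abstract refers to.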

Oral

Second-Order Smooth Planning with Optimal-Transport Bellman Smoothing

Tuan Dam

Jul 9, 10:30 AM - 10:45 AM HALL D2

Planning with a generative model aims to estimate the value of a state using as few simulator calls as possible. SmoothCruiser achieves problem-independent complexity $\widetilde O(\varepsilon^{-4})$ by exploiting the smoothness of the entropy-regularized Bellman backup, but its estimator is only first-order. We show that the sample-complexity exponent of SmoothCruiser-type planners is governed by the order $\beta$ of the local Taylor remainder, giving oracle complexity $\widetilde O(\varepsilon^{-(2+2/(\beta-1))})$: the first-order case $\beta=2$ recovers SmoothCruiser, while a second-order/cubic remainder $\beta=3$ yields $\widetilde O(\varepsilon^{-3})$. We reach this regime with an optimal-transport-smoothed Bellman backup over action distributions, which has a closed form, a policy gradient, and a Lipschitz Hessian, and whose quadratic correction admits an unbiased cross-product estimator. The resulting SecondOrderSmoothCruiser achieves $\widetilde O(\varepsilon^{-3})$ oracle complexity for fixed OT parameters, and we relate the OT, entropy-regularized, and unregularized objectives through explicit regularization-bias bounds.
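
As a quick check of the exponent $2 + 2/(\beta-1)$ quoted above, substituting the two remainder orders named in the abstract gives the stated rates:

```latex
\[
\left. 2 + \frac{2}{\beta - 1} \right|_{\beta = 2} = 2 + 2 = 4,
\qquad
\left. 2 + \frac{2}{\beta - 1} \right|_{\beta = 3} = 2 + 1 = 3 .
\]
```

So $\beta=2$ corresponds to $\widetilde O(\varepsilon^{-4})$ and $\beta=3$ to $\widetilde O(\varepsilon^{-3})$, matching the two complexities in the abstract.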

Oral

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Gül Sena Altıntaş ⋅ Malikeh Ehghaghi ⋅ Brian Lester ⋅ Fengyuan Liu ⋅ Wanru Zhao ⋅ Marco Ciccone ⋅ Colin Raffel

Jul 9, 10:30 AM - 10:45 AM HALL C

Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we release fourteen pre-trained models that use different off-the-shelf tokenizers but are otherwise identical, using the same architecture, dataset, training budget, and initialization. We also release a multilingual robustness benchmark that measures model performance under real-world perturbations in English, Chinese, Farsi, Italian, and Turkish, curated by native annotators. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.

Oral

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

yuanyuan gao ⋅ Hao Li ⋅ Yifei Liu ⋅ Xinhao Ji ⋅ Yuning Gong ⋅ Yuanjun Liao ⋅ Fangfu Liu ⋅ Manyuan Zhang ⋅ Yuchen Yang ⋅ Dan Xu ⋅ Xue Yang ⋅ Huaxi Huang ⋅ Hongjie Zhang ⋅ Ziwei Liu ⋅ Xiao Sun ⋅ Dingwen Zhang ⋅ Zhihang Zhong

Jul 9, 10:30 AM - 10:45 AM ASEM BALLROOM 201-203

The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question–answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention, using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial question–answer (QA) pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial demonstrates exceptional performance in data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks using this dataset leads to substantial improvements in model performance.

Oral

SpatioLM: Towards General Physical Spatial Intelligence in Vision-Language Models

jing wu ⋅ Jianhua Wu ⋅ Jiayi Guan ⋅ Jiahong Chen ⋅ Jinghui Lu ⋅ Hangjun Ye ⋅ Bingzhao Gao ⋅ Long Chen

Jul 9, 10:45 AM - 11:00 AM ASEM BALLROOM 201-203

Vision-Language Models (VLMs) perform well on commonsense reasoning tasks but struggle with visual spatial reasoning. Most existing solutions introduce extra 3D prior inputs or external spatial encoders, which increase complexity and degrade the underlying VLMs' general-purpose capabilities after spatial fine-tuning. To this end, we propose a parameter-efficient Spatio-vision Language Model (SpatioLM) that enhances spatial intelligence without extra 3D prior inputs or third-party spatial encoders. Concretely, we design a plug-and-play, non-invasive spatio-vision module that elicits the spatial knowledge inherent in VLMs. Furthermore, we leverage pseudo depth and camera information as supervision to guide the model in learning physically coherent representations. Extensive experiments show that SpatioLM achieves significant improvements in diverse tasks, including spatial perception and understanding, while effectively limiting the degradation of general capabilities. Notably, the model achieves a score of 71.6 on VSI-Bench (the first model to surpass 70). In addition, it attains competitive performance when transferred to embodied manipulation tasks.

Oral

WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Aiwei Liu ⋅ Minghua He ⋅ Shaoxun Zeng ⋅ Sijun Zhang ⋅ Linhao Zhang ⋅ Chuhan Wu ⋅ Wei Jia ⋅ Yuan Liu ⋅ Zhou Xiao ⋅ Jie Zhou

Jul 9, 10:45 AM - 11:00 AM HALL C

Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all observed tokens while keeping a causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3× on challenging reasoning benchmarks and up to 10× in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings.

Oral

Transforming Weather Data from Pixel to Latent Space

Sijie Zhao ⋅ Feng Liu ⋅ Xueliang Zhang ⋅ Hao Chen ⋅ Tao Han ⋅ JUNCHAO GONG ⋅ Ran Tao ⋅ Pengfeng Xiao ⋅ Xinyu Gu ⋅ LEI BAI

Jul 9, 10:45 AM - 11:00 AM HALL B2

The increasing impact of climate change and extreme weather events has spurred growing interest in deep learning for weather research. However, existing studies often rely on weather data in pixel space, which presents several challenges such as over-smoothed model outputs, limited applicability to a single pressure-variable subset (PVS), and high data storage and computational costs. To address these challenges, we propose a novel Weather Latent Autoencoder (WLA) that transforms weather data from pixel space to latent space, enabling efficient data representation. By decoupling weather reconstruction from downstream tasks, WLA improves the accuracy and sharpness of weather task model results. The incorporated Pressure-Variable Unified Module transforms multiple PVS into a unified representation, enhancing the adaptability of the model in multiple weather scenarios. Furthermore, weather tasks can be performed in the low-storage latent space of WLA rather than a high-storage pixel space, thus significantly reducing data storage and computational costs. Through extensive experimentation, we demonstrate WLA's superior compression and reconstruction performance, enabling the creation of the ERA5-Latent dataset with unified representations of multiple PVS from ERA5 data. The compressed full PVS in the ERA5-Latent dataset reduces the original 244.34 TB of data to 0.43 TB. The downstream task further demonstrates that task models can apply to multiple PVS with low data costs in latent space and achieve superior performance compared to models in pixel space.

Oral

Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic Methods

Jeong Woon Lee ⋅ Kyoleen Kwak ⋅ Daeho Kim ⋅ Hyoseok Hwang

Jul 9, 10:45 AM - 11:00 AM HALL D2

Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.

Oral

Towards Hierarchy–Uniformity Equilibrium: Recovering Semantic Depth in Hypergraph Contrastive Learning

Ruiting Zhao ⋅ Ming Li ⋅ Lixin Cui ⋅ Lu Bai ⋅ Feilong Cao ⋅ Ke Lv ⋅ Pietro Lió

Jul 9, 10:45 AM - 11:00 AM AUDITORIUM

Hypergraph contrastive learning is an effective paradigm for representation learning on higher-order relational data, yet existing methods largely ignore that hyperedges link nodes with multi-level semantics. Standard contrastive objectives emphasize instance discrimination via hyperspherical uniformity and tend to push embeddings apart in an indiscriminate manner. We show that this leads to a *Hierarchy–Uniformity Conflict*, whose geometric manifestation is *Semantic Flattening*, where the semantic depth of hyperedges collapses into a nearly flat cloud of instances. To address this issue, we introduce **HyperDepth**, a hypergraph contrastive learning framework that moves representations towards a hierarchy–uniformity equilibrium by jointly coordinating spectral and geometric signals. HyperDepth employs a decoupled spectral encoding scheme with adaptive gating so that high-frequency components focus on local instance discrimination while low-frequency components capture global hierarchical structure. On top of this, an energy-based hierarchical alignment module attaches a learnable prototype tree to the representation space and minimizes an interpretable energy functional to recover the semantic depth of hyperedges. Theoretically, under a mild frequency-separation assumption, we show that the local contrastive and global hierarchical objectives operate on orthogonal spectral components and admit equilibrium embeddings that preserve semantic depth while still retaining instance-level discrimination. Experiments on 15 hypergraph datasets and 17 supervised and self-supervised baselines, spanning homophilic and heterophilic regimes, show that HyperDepth attains strong performance with the best average rank.

Oral

Robust Contextual Optimization with Missing Covariates

Qingyuan Xu ⋅ Ruiwei Jiang

Jul 9, 10:45 AM - 11:00 AM GRAND BALLROOM 101-105

Modern decision-making increasingly relies on contextual features (covariates) to improve optimization under uncertainty. In practice, however, such covariates are often only partially observed due to, e.g., data source heterogeneity or costly data collection. Nonetheless, most existing methods assume fully observed historical data and can become unreliable when this assumption is violated. We address this gap by proposing a distributionally robust optimization approach that exploits incomplete covariates to produce robust decisions without imputing a complete dataset. Our method builds ambiguity sets from the observed partial data and incorporates the general structure of the missingness mechanism, ensuring candidate distributions remain consistent with what is observed. Across settings with discrete or continuous covariates and outcomes, we derive tractable reformulations and establish finite-sample out-of-sample performance guarantees. Empirical results across a range of contextual decision-making tasks demonstrate that the proposed integrated approach consistently outperforms state-of-the-art baselines, including various impute-then-optimize pipelines, in both out-of-sample performance and reliability.

Oral

Rare Event Analysis of Large Language Models

Jake McAllister Dorman ⋅ Edward Gillman ⋅ Dominic C Rose ⋅ Jamie Mair ⋅ Juan Garrahan

Jul 9, 10:45 AM - 11:00 AM HALL D1

Being probabilistic models, during inference large language models (LLMs) display *rare events*: behaviour that is far from typical but highly significant. By definition all rare events are hard to see, but the enormous scale of LLM usage means that events completely unobserved during development are likely to become prominent in deployment. Here we present an end-to-end framework for the systematic analysis of rare events in LLMs. We provide a practical implementation spanning theory, efficient generation strategies, probability estimation and error analysis, which we illustrate with concrete examples. We outline extensions and applications to other models and contexts, highlighting the generality of the concepts and techniques presented here.

Oral

Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models

John Cooper ⋅ Ilias Diakonikolas ⋅ Mingchen Ma ⋅ Frederic Sala

Jul 9, 4:00 PM - 4:15 PM HALL C

Hybrid sequence models—combining Transformer and state-space model layers—seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic understanding of the settings where—and underlying mechanisms through which—they offer benefits over their constituent models. In this paper, we study this question, focusing on a broad family of core synthetic tasks. For this family of tasks, we prove the existence of fundamental limitations for non-hybrid models. Specifically, any Transformer or state-space model that solves the underlying task requires either a large number of parameters or a large working memory. On the other hand, for two prototypical tasks within this family—namely selective copying and associative recall—we construct hybrid models of small size and working memory that provably solve these tasks, thus achieving the best of both worlds. Our experimental evaluation empirically validates our theoretical findings. Importantly, going beyond the settings in our theoretical analysis, we empirically show that learned—rather than constructed—hybrids outperform non-hybrid models with up to $6 \times$ as many parameters. We additionally demonstrate that hybrid models exhibit stronger length generalization and out-of-distribution robustness than non-hybrids.

Oral

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres ⋅ Honghua Dong ⋅ Soham Ray ⋅ Xujie Si ⋅ Karthik Narasimhan

Jul 9, 4:00 PM - 4:15 PM HALL B2

Existing benchmarks for conversational AI agents simulate _single-control_ environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $\tau^2$-bench, with four key contributions: (1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication; (2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity; (3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity; (4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions. Code, data, and leaderboard are available at https://taubench.com/.

Oral

Position: Irresponsible AI: big tech’s influence on AI research and associated impacts

Alex Hernandez-Garcia ⋅ Alexandra Volokhova ⋅ Ezekiel Williams ⋅ Dounia Shaaban Kabakibo ⋅ Mélisande Teng

Jul 9, 4:00 PM - 4:15 PM GRAND BALLROOM 101-105

The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing presence of big tech in the AI field. This trend has been accompanied by growing ethical concerns and intensified societal and environmental impacts. This position paper argues that irresponsible AI development is strongly driven by big tech's influence and involvement in the field. We develop this argument by laying out the factors through which this influence leads to irresponsible AI. First, we examine the growing and disproportionate influence of big tech in AI research and argue that its drive for scaling and general-purpose systems is fundamentally at odds with the responsible, ethical, and sustainable development of AI. Second, we review key current environmental and societal negative impacts of AI and trace their connections to big tech's influence. Third, we discuss the underlying economic forces driving big tech's actions. Finally, as a call to action, we highlight the need for AI researchers to counter big tech's influence, and review and propose strategies that build on the responsibility of implicated actors and collective action.

Oral

DOUBT: Decoupled Object-level Understanding and Bridging via vMF-based Trustworthiness for Hallucination Detection in MLLMs

Kaiqi Chen ⋅ Yang Qin ⋅ Changhao He ⋅ Xi Peng ⋅ Peng Hu

Jul 9, 4:00 PM - 4:15 PM HALL D2

Multimodal Large Language Models (MLLMs) frequently produce hallucinations (i.e., assertions that contradict the image or facts), undermining reliability in high-risk applications. Existing detection approaches typically feed images and texts jointly and estimate hallucination scores by measuring the consistency of model outputs. However, because the visual module often lags behind the language module in understanding and reasoning, MLLMs can repeatedly produce similar yet incorrect answers, yielding overestimated trustworthiness and missed detections. To address this, we propose a simple yet effective model-agnostic method, dubbed Decoupled Object-level Understanding and Bridging via vMF-based Trustworthiness (DOUBT). DOUBT first employs Object-level Understanding and Bridging (OUB), a two-step prompting scheme that decouples object recognition from relational reasoning by prompting the model to identify objects and then reason based on them. It further introduces a von Mises-Fisher (vMF)-based trustworthiness metric, which is more stable than semantic entropy metrics in small-sample settings. Extensive experiments and ablation studies on multiple benchmarks show that DOUBT consistently outperforms state-of-the-art baselines, demonstrating its robustness and generalizability for hallucination detection in MLLMs. The code is available at https://github.com/XLearning-SCU/2026-ICML-DOUBT.

Oral

Faster Activation Functions at the Edge for Post-Training Speedups

Anton Lydike ⋅ Jun Bi ⋅ Jackson Woodruff

Jul 9, 4:00 PM - 4:15 PM HALL D1

On-device AI has gained significant attention for enabling efficient, low-latency inference on edge devices. However, tight resource constraints on these platforms make the deployment of accurate and lightweight deep learning models challenging. In particular, advanced activation functions (AFs) like Swish and GELU often incur high inference overhead due to the lack of hardware fast-paths for exponentiation and division, which restricts edge-ML applications to simple AFs like ReLU and limits model accuracy. To address this, we propose FFCC, a compiler that automatically generates efficient approximations of AFs through floating-point reinterpretation. These functions do not require hardware fast-paths, meaning they remain fast on edge devices, but are accurate enough to be used as post-training drop-ins. FFCC takes a specification of AFs using basic floating-point operators and applies derivation rules to lower these expressions into efficient instruction sequences. Our experiments show that FFCC provides fast approximations of AFs, achieving order-of-magnitude speed-ups over accurate baselines on Arm M7, AArch64 and Intel platforms. Using ConvNeXt as an example, we demonstrate how these activation-level gains translate to end-to-end speed-ups and do not result in significant loss of model accuracy.
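
To make "floating-point reinterpretation" concrete, here is a minimal sketch of a well-known bit trick (often attributed to Schraudolph) that approximates `exp` by writing a scaled input straight into the bit pattern of an IEEE-754 binary32 value. It is an illustration of the general idea only: the names `fast_exp` and `fast_swish` are hypothetical, and nothing here is claimed to be what FFCC actually generates.

```python
import math
import struct

# IEEE-754 binary32 layout: 23 mantissa bits, exponent bias 127.
_SCALE = (1 << 23) / math.log(2.0)  # one unit of x/ln(2) shifts the exponent field by 1
_BIAS = 127 << 23                   # bit pattern of the float32 value 1.0

def fast_exp(x):
    """Approximate exp(x) by constructing float32 bits directly (no exp call).

    The integer part of x/ln(2) lands in the exponent field and the fractional
    part lands in the mantissa, i.e. 2**f is approximated linearly by 1 + f.
    """
    if not -80.0 < x < 80.0:
        raise ValueError("outside the range this sketch handles")
    bits = int(_SCALE * x + _BIAS)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def fast_swish(x):
    """Swish, x * sigmoid(x), with the exponential replaced by fast_exp."""
    return x / (1.0 + fast_exp(-x))
```

Without any correction constant, this approximation overestimates `exp` by at most roughly 6% in relative terms, since the linear mantissa `1 + f` always lies on or above `2**f` for `f` in `[0, 1)`. A real deployment would use native integer and float instructions rather than `struct`, which is used here only to show the reinterpretation step.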

Oral

Position: AI/ML Deepfake Research is Misaligned with AI Generated Non-Consensual Intimate Imagery (AIG-NCII)

Qiwei Li ⋅ Wells Lucas Santo ⋅ Sarita Schoenebeck ⋅ Eric Gilbert

Jul 9, 4:00 PM - 4:15 PM AUDITORIUM

AI-generated non-consensual intimate imagery (AIG-NCII) is not adequately addressed in AI/ML literature regarding AI-generated media, commonly referred to as "deepfakes". While research on deepfakes currently focuses on their epistemic harms (harms relating to truth and authenticity), this is misaligned with the dominant reality of generative AI abuse involving sexualized imagery. We conduct a landscape analysis of highly-cited works to demonstrate that technical interventions addressing deepfakes almost entirely ignore AIG-NCII, limiting the research ecosystem to authenticity detection tools. In this position paper, we argue that existing interventions address viewer-centric epistemic harms, such as fraud or scams, but ignore subject-centric dignity harms, such as AIG-NCII. We illustrate that knowing an image is synthetic does not mitigate harms to subjects and may, in some cases, even exacerbate them. We conclude by offering recommendations to realign the field, including updating threat models to consider subject-centered harms and addressing AIG-NCII in AI safety research. Finally, we caution that researchers should only engage in this high-risk domain if they implement safety guardrails for both subjects and researchers and establish partnerships with domain experts in sexual violence prevention.

Oral

A Random Matrix Perspective on the Consistency of Diffusion Models

Binxu Wang ⋅ Jacob A Zavatone-Veth ⋅ Cengiz Pehlevan

Jul 9, 4:00 PM - 4:15 PM ASEM BALLROOM 201-203

Diffusion models trained on different, non-overlapping subsets of a dataset often produce strikingly similar outputs when given the same noise seed. We trace this consistency to a simple linear effect: the shared Gaussian statistics across splits already predict much of the generated images. To formalize this, we develop a random matrix theory (RMT) framework that quantifies how finite datasets shape the expectation and variance of the learned denoiser and sampling map in the linear setting. For expectations, sampling variability acts as a renormalization of the noise level through a self-consistent relation $\sigma^2\to\kappa(\sigma^2)$, explaining why limited data overshrink low-variance directions and pull samples toward the dataset mean. For fluctuations, our variance formulas reveal three key factors behind cross-split disagreement: anisotropy across eigenmodes, inhomogeneity across inputs, and overall scaling with dataset size. Extending deterministic-equivalence tools to fractional matrix powers further allows us to analyze entire sampling trajectories. The theory sharply predicts the behavior of linear diffusion models, and we validate its predictions on UNet and DiT architectures in their non-memorization regime, identifying where and how samples deviate across training data splits. This provides a principled baseline for reproducibility in diffusion training, linking spectral properties of data to the stability of generative outputs.

Oral

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

Felix X.-F. Ye ⋅ Xingjie Li ⋅ An Yu ⋅ Ming-Ching Chang ⋅ LINSONG CHU ⋅ Davis Wertheimer

Jul 9, 4:15 PM - 4:30 PM HALL D1

Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce kernels with limited fusion. We present **FlashSinkhorn**, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT, and improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at https://github.com/ot-triton-lab/flash-sinkhorn.
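
The rewrite described above can be illustrated on the CPU with a small, unoptimized sketch. For squared Euclidean cost, $\|x_i - y_j\|^2 = \|x_i\|^2 + \|y_j\|^2 - 2\,x_i\cdot y_j$, so the standard log-domain update $f_i = -\varepsilon \log \sum_j b_j \exp\big((g_j - C_{ij})/\varepsilon\big)$ becomes $\|x_i\|^2$ minus $\varepsilon$ times a row-wise LogSumExp over dot-product scores plus a per-column bias. The function names and the exact form of the bias are assumptions made for this example; the released kernels may organize the computation differently.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f_update_direct(xs, ys, g, log_b, eps):
    """Textbook update: f_i = -eps * LSE_j(log b_j + (g_j - |x_i - y_j|^2) / eps)."""
    out = []
    for x in xs:
        terms = []
        for y, gj, lb in zip(ys, g, log_b):
            cost = sum((a - b) ** 2 for a, b in zip(x, y))
            terms.append(lb + (gj - cost) / eps)
        out.append(-eps * logsumexp(terms))
    return out

def f_update_attention_form(xs, ys, g, log_b, eps):
    """Same update written as a row-wise LSE of biased dot-product scores."""
    # Column-only bias: it can be computed once per iteration, like a key-side bias.
    bias = [lb + (gj - dot(y, y)) / eps for y, gj, lb in zip(ys, g, log_b)]
    out = []
    for x in xs:
        scores = [2.0 * dot(x, y) / eps + bj for y, bj in zip(ys, bias)]
        out.append(dot(x, x) - eps * logsumexp(scores))
    return out
```

Both functions return the same potentials up to floating-point rounding. The second form never materializes the cost matrix entry by entry; it only needs dot products and a running LogSumExp per row, which is the structure that attention-style tiling exploits.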

Oral

Equivalence of Context and Parameter Updates in Modern Transformer Blocks

Adrian Goldwaser ⋅ Michael Munn ⋅ Xavi Gonzalvo ⋅ Benoit Dherin

Jul 9, 4:15 PM - 4:30 PM ASEM BALLROOM 201-203

Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its MLP weights. This work extends that foundational theory to the diverse architectures of modern Large Language Models. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches on its MLP weight matrices and a patch to the RMSNorm scale. We then generalize this result, providing a constructive proof and algorithm for multi-layer models. To unify these findings, we introduce a general framework centered on two core properties: input controllability and output controllability. We prove that a perfect implicit weight patch is possible for any MLP block where the inner function is input-controllable and the outer function is output-controllable. This provides a simpler and more powerful lens for understanding how transformer models transmute prompts into effective weights. This setup generalizes to a wide range of modern LLM architectures including gating, pre-/post-norm, mixture of experts and sequential/parallel transformer blocks.
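
The linear-algebra identity behind a rank-1 weight patch can be checked in a few lines. For a weight matrix $W$, an input $x \neq 0$, and a shift $\delta$ of that input (standing in here for the effect of context), the patched matrix $W' = W + (W\delta)\,x^\top / \|x\|^2$ satisfies $W'x = W(x + \delta)$. The sketch below verifies only this generic identity; the helper names are hypothetical, and the Gemma-style and multi-layer constructions from the abstract are not reproduced.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def rank1_patch(W, x, delta):
    """Return W + (W @ delta) x^T / ||x||^2, a rank-1 update of W.

    The patch depends on the token x it is built for, which is why such
    patches are described as token-dependent.
    """
    norm_sq = sum(a * a for a in x)
    if norm_sq == 0.0:
        raise ValueError("x must be nonzero")
    u = matvec(W, delta)
    return [[w + ui * xj / norm_sq for w, xj in zip(row, x)]
            for row, ui in zip(W, u)]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -2.0]
delta = [0.5, 0.25]

patched_output = matvec(rank1_patch(W, x, delta), x)                    # patched weights, original input
shifted_output = matvec(W, [a + b for a, b in zip(x, delta)])           # original weights, shifted input
```

The two output vectors agree up to rounding, so shifting the input and patching the weights are interchangeable for this particular token.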

Oral

GoodDiffusion: Proactive Copyright Protection for Diffusion Bridge Models via Learnable Sample-specific Signatures

Shixi Qin ⋅ zhiyong yang ⋅ Shilong Bao ⋅ Zitai Wang ⋅ Qianqian Xu ⋅ Qingming Huang

Jul 9, 4:15 PM - 4:30 PM AUDITORIUM

This paper tackles the challenging problem of developing a proactive copyright protection mechanism that cuts off unauthorized use of diffusion bridge models. Existing studies largely fall into post-hoc attribution (e.g., watermarking and fingerprinting) or degradation-only defenses, which offer only indirect and limited preventive effect. We therefore propose GoodDiffusion, inspired by backdoor mechanisms, to enforce model-level use-time control by internalizing authorization into the generative process through a selectively permissive, otherwise closed behavior. Specifically, GoodDiffusion preserves high-quality generation for authorized queries carrying valid signatures, yet refuses to generate for unauthorized inputs. We further empirically show that naive static-signature designs (like conventional backdoor injection) are fundamentally fragile, since a surrogate signature can be efficiently recovered via gradient-based optimization. To strengthen security, we introduce a Learnable Signature Network (LSN) that assigns sample-specific signatures conditioned on each input. This breaks the universality of signatures and prevents a surrogate from transferring across inputs. Extensive experiments validate that GoodDiffusion effectively blocks unauthorized use while maintaining strong generation quality for authorized users.

Oral

Learning to Theorize the World from Observation

Doojin Baek ⋅ Gyubin Lee ⋅ Junyeob Baek ⋅ Hosung Lee ⋅ Sungjin Ahn

Jul 9, 4:15 PM - 4:30 PM HALL D2

What does it mean to understand the world? Is it simply to predict future video frames? Developmental cognitive science suggests that understanding the world is fundamentally the process of constructing internal theories of how it works rather than mere prediction, even before language is acquired. However, in machine learning, it remains unclear how to endow AI systems with such theory-building capability from raw, non-textual observation alone. In this paper, we introduce Learning-to-Theorize (L2T), a learning paradigm in which an AI system acquires the ability to construct theories represented as executable programs directly from observation alone. To instantiate this paradigm, we propose the Neural Language-of-Thought Programmer, a neural model that induces and executes latent programs as explanations rather than task-specific predictors or policies. In experiments, we show that this formulation enables explanation-driven generalization, allowing observations to be understood in terms of the programs that generate them.

Oral

Measuring Agents in Production

Melissa Pan ⋅ Negar Arabzadeh ⋅ Riccardo Cogo ⋅ Yuxuan Zhu ⋅ Alexander Xiong ⋅ Lakshya A Agrawal ⋅ Huanzhi Mao ⋅ Emma Shen ⋅ Sid Pallerla ⋅ Liana Patel ⋅ Shu Liu ⋅ Tianneng Shi ⋅ Xiaoyuan Liu ⋅ Jared Davis ⋅ Emmanuele Lacavalla ⋅ Alessandro Basile ⋅ Shuyi Yang ⋅ Paul Castro ⋅ Daniel Kang ⋅ Koushik Sen ⋅ Dawn Song ⋅ Joseph E Gonzalez ⋅ Ion Stoica ⋅ Matei Zaharia ⋅ Marquita Ellis

Jul 9, 4:15 PM - 4:30 PM HALL B2

LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of **M**easuring **A**gents in **P**roduction, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 deployed systems practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and underexplored research avenues.

Oral

How much can language models memorize?

John Morris ⋅ Chawin Sitawarin ⋅ Chuan Guo ⋅ Narine Kokhlikyan ⋅ Edward Suh ⋅ Alexander Rush ⋅ Kamalika Chaudhuri ⋅ Saeed Mahloujifar

Jul 9, 4:15 PM - 4:30 PM HALL C

We propose a new method for estimating how much a model knows about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from 500K to 1.5B parameters and produce a series of scaling laws relating model capacity and data size to membership inference.


Oral

Equilibrium Pricing in Oligopolistic Data Markets

Bhaskar Ray Chaudhury ⋅ Jugal Garg ⋅ Eklavya Sharma ⋅ Jiaxin Song

Jul 9, 4:15 PM - 4:30 PM GRAND BALLROOM 101-105

We study equilibrium pricing in oligopolistic data markets with budget-constrained buyers (e.g., ML companies purchasing data to improve model accuracy) and strategic data sellers. Sellers compete by setting prices for their datasets, giving rise to a pricing game whose pure Nash equilibria correspond to equilibrium prices. While equilibrium prices are guaranteed for rivalrous goods via competitive equilibrium, we show that the non-rivalry of data fundamentally alters this picture: an exact Nash equilibrium need not exist, and in fact no 1.364-approximate equilibrium exists under uniform pricing. We therefore investigate relaxed equilibrium notions. Allowing sellers to use beyond-uniform pricing—specifically, piecewise-linear convex pricing functions—guarantees approximate stability within a constant factor: there exists a pricing profile in which no seller can improve revenue by a factor of two by deviating to any uniform price (a 2-approximate Nash equilibrium). Finally, our simulations demonstrate fast convergence and empirical approximation guarantees that outperform the worst-case bound of 2.


Oral

Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

Hanlin Zhang ⋅ Jikai Jin ⋅ Vasilis Syrgkanis ⋅ Sham Kakade

Jul 9, 4:30 PM - 4:45 PM HALL C

Machine learning model performance arises from competition and application. For deployment, we consider prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations with 5k observational and 2k newly sampled data points on model performance, we estimate capability boundaries (high conditional quantiles of benchmark scores as a function of log pre-training FLOPs) via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate the temporal reliability by fitting on earlier model generations and evaluating on later releases. Across various tasks, the estimated boundaries are mostly stable, with the exception of math reasoning, which exhibits a consistently advancing boundary over time. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget. Together, our work releases Proteus-2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries move.
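
The two ingredients named in the abstract, a saturating sigmoid in log-FLOPs and a quantile objective, can be sketched as follows. This is a minimal illustration under assumptions: it uses the plain (unsmoothed) pinball loss rather than the smoothed variant the abstract mentions, and the parameter names (`lower`, `upper`, `midpoint`, `slope`) and the default quantile level are hypothetical.

```python
import math


def saturating_sigmoid(log_flops, lower, upper, midpoint, slope):
    """Curve rising from `lower` to `upper`; monotone increasing when slope > 0."""
    return lower + (upper - lower) / (1.0 + math.exp(-slope * (log_flops - midpoint)))


def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss; its expected value is minimized by the tau-quantile."""
    diff = y_true - y_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_pinball(points, params, tau=0.98):
    """Average pinball loss of the sigmoid boundary over (log_flops, score) pairs."""
    total = sum(pinball_loss(y, saturating_sigmoid(x, *params), tau) for x, y in points)
    return total / len(points)
```

With a high `tau`, scores above the curve are penalized far more than scores below it, which is what pushes the fitted curve toward an upper boundary rather than a mean trend.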


Oral

CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

Xianzhen Luo ⋅ Jingyuan Zhang ⋅ Shiqi Zhou ⋅ JinYang Huang ⋅ Chuan Xiao ⋅ Qingfu Zhu ⋅ Zhiyuan Ma ⋅ YUE XING ⋅ Yang Yue ⋅ Wencong Zeng ⋅ Wanxiang Che

Jul 9, 4:30 PM - 4:45 PM HALL B2

Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows that CVE-Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert-level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats including AI-tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security. Fine-tuned Qwen3-32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open-source all code, data, and models.


Oral

Nash Equilibria in Games with Playerwise Concave Coupling Constraints: Existence and Computation

Philip Jordan ⋅ Maryam Kamgarpour

Jul 9, 4:30 PM - 4:45 PM GRAND BALLROOM 101-105

We study the existence and computation of Nash equilibria in concave games where the players' admissible strategies are subject to shared coupling constraints. Under playerwise concavity of constraints, we prove existence of Nash equilibria. Our proof leverages topological fixed point theory and novel structural insights into the contractibility of feasible sets, and relaxes strong assumptions for existence in prior work. Having established existence, we address the question of whether in the presence of coupling constraints, playerwise independent learning dynamics have convergence guarantees. We address this positively for the class of potential games by designing a convergent algorithm. To account for the possibly nonconvex feasible region, we employ a log barrier regularized gradient ascent with adaptive stepsizes. Starting from an initial feasible strategy profile and under exact gradient feedback, the proposed method converges to an $\epsilon$-approximate constrained Nash equilibrium within $\mathcal{O}(\epsilon^{-3})$ iterations.
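
To make the idea of log-barrier regularized gradient ascent with adaptive stepsizes concrete, here is a one-dimensional, single-player toy. It is not the paper's multi-player algorithm: the objective $f(x) = -(x-2)^2$, the constraint $1 - x > 0$, the barrier weight `eta`, and the step-halving rule are all assumptions chosen only to show how the barrier keeps iterates strictly feasible.

```python
import math


def barrier_objective(x, eta):
    # Toy objective f(x) = -(x - 2)^2 plus a log barrier for the constraint 1 - x > 0.
    return -(x - 2.0) ** 2 + eta * math.log(1.0 - x)


def barrier_gradient(x, eta):
    # Derivative of barrier_objective with respect to x.
    return -2.0 * (x - 2.0) - eta / (1.0 - x)


def log_barrier_ascent(x0=0.0, eta=0.01, base_step=0.1, iterations=200):
    """Gradient ascent on the barrier objective, halving the step until the
    candidate is strictly feasible and improves the objective."""
    x = x0
    for _ in range(iterations):
        grad = barrier_gradient(x, eta)
        step = base_step
        for _halving in range(40):
            candidate = x + step * grad
            # Feasibility is checked first so the logarithm is never evaluated at <= 0.
            if candidate < 1.0 and barrier_objective(candidate, eta) > barrier_objective(x, eta):
                x = candidate
                break
            step *= 0.5
        else:
            break  # no improving feasible step found; treat as converged
    return x
```

The unconstrained maximizer of the toy objective is $x = 2$, which is infeasible; the barrier term makes the iterates settle just inside the boundary $x = 1$.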


Oral

Orthogonal Concept Erasure for Diffusion Models

Yuhao Sun ⋅ Lingyun Yu ⋅ Hao-Xiang Xu ⋅ Fengyuan Miao ⋅ Zhuoer Xu ⋅ Hongtao Xie

Jul 9, 4:30 PM - 4:45 PM AUDITORIUM

Concept erasure has emerged as a promising approach to mitigate undesired or unsafe content in diffusion models, yet existing methods still face significant limitations. While training-based methods are effective, their high computational cost limits scalability. Editing-based methods are more efficient and deployment-friendly, yet they struggle to simultaneously achieve precise concept erasure and preserve overall generative capacity. We identify this core limitation of the editing-based methods as reliance on additive parameter updates. Our empirical analysis reveals that concept semantics primarily depend on *neuron direction* rather than *neuron magnitude*, while overall generative capacity relies on the *angular geometry* of neurons. As additive updates inherently entangle direction, magnitude, and angular geometry, they inevitably introduce unintended interference between concept erasure and overall generation performance. To address this, we propose **Orthogonal Concept Erasure (OCE)**, which reformulates editing-based erasure as multiplicative parameter updates from a geometric perspective. Specifically, OCE applies layer-wise orthogonal transformations derived from a closed-form solution to the parameters, enabling precise concept erasure while preserving the neuron magnitude and angular geometry. Furthermore, to address conflicting constraints in multi-concept erasure, OCE introduces a subspace-level objective with structured subspace manipulation, yielding a more effective and scalable erasure. Extensive experiments on single- and multi-concept erasure demonstrate that OCE outperforms existing methods in concept erasure and non-target preservation, erasing up to 100 concepts in 4.3 s. Code: https://github.com/HansSunY/OCE.
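
The geometric claim behind multiplicative updates can be checked on a tiny example: an orthogonal transformation (here a 2D rotation) leaves every vector's norm and every pairwise angle unchanged, whereas an additive update generally changes both. The vectors, the rotation angle, and the additive offset below are arbitrary illustrative choices, and this sketch does not reproduce OCE's closed-form solution.

```python
import math


def norm(v):
    return math.hypot(v[0], v[1])


def cosine(u, v):
    return (u[0] * v[0] + u[1] * v[1]) / (norm(u) * norm(v))


def rotate(v, theta):
    """Apply a 2D rotation, which is an orthogonal transformation."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


neurons = [(1.0, 0.0), (1.0, 2.0)]                         # toy "neuron" weight vectors
rotated = [rotate(w, 0.7) for w in neurons]                # multiplicative (orthogonal) update
shifted = [(w[0] + 0.5, w[1] - 0.3) for w in neurons]      # additive update

if __name__ == "__main__":
    print("norms   :", [norm(w) for w in neurons], [norm(w) for w in rotated], [norm(w) for w in shifted])
    print("cosines :", cosine(*neurons), cosine(*rotated), cosine(*shifted))
```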


Oral

Focus and Dilution: The Multi-stage Learning Process of Attention

Zheng-An Chen ⋅ Pengxiao Lin ⋅ Zhi-Qin John Xu ⋅ Tao Luo

Jul 9, 4:30 PM - 4:45 PM ASEM BALLROOM 201-203

Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus–dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus–dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively dilutes this focus. Finally, small asymmetries among low-frequency tokens lift a degenerate critical point, opening new embedding directions and initiating the next cycle. Experiments on synthetic Markovian data as well as WikiText and TinyStories corroborate the predicted stages and cyclical dynamics.


Oral

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Etienne Casanova ⋅ Rafal Kocielnik ⋅ R. Michael Alvarez

Jul 9, 4:30 PM - 4:45 PM HALL D2

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial $r=+0.41$), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
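
The rescue rate defined in the abstract (the fraction of initial errors corrected by prompting) reduces to a short computation. The function name and the boolean-list input format are assumptions for illustration, and the example data is made up.

```python
def rescue_rate(zero_shot_correct, prompted_correct):
    """Fraction of zero-shot errors that become correct after extra prompting.

    Both arguments are equal-length sequences of booleans, one entry per item.
    """
    initial_errors = [i for i, ok in enumerate(zero_shot_correct) if not ok]
    if not initial_errors:
        raise ValueError("no zero-shot errors, so the rescue rate is undefined")
    rescued = sum(1 for i in initial_errors if prompted_correct[i])
    return rescued / len(initial_errors)


# Made-up example: three zero-shot errors, one of which is fixed by prompting.
zero_shot = [True, False, False, False, True]
prompted = [True, True, False, False, False]
example_rate = rescue_rate(zero_shot, prompted)
```

Items that were correct zero-shot but wrong after prompting (the last item above) do not enter the rate, since only initial errors are counted in the denominator.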


Oral

FlashSketch: Sketch-Kernel Co-Design for Fast Sparse Sketching on GPUs

Rajat Vadiraj Dwaraknath ⋅ Sungyoon Kim ⋅ Mert Pilanci

Jul 9, 4:30 PM - 4:45 PM HALL D1

Sparse sketches such as the sparse Johnson–Lindenstrauss transform are a core primitive in randomized numerical linear algebra because they leverage random sparsity to reduce the arithmetic cost of sketching, while still offering strong approximation guarantees. Their random sparsity, however, is at odds with efficient implementations on modern GPUs, since it leads to irregular memory access patterns that degrade memory bandwidth utilization. Motivated by this tension, we pursue a sketch–kernel co-design approach: we design a new family of sparse sketches, BlockPerm-SJLT, whose sparsity structure is chosen to enable FlashSketch, a corresponding optimized CUDA kernel that implements these sketches efficiently. The design of BlockPerm-SJLT introduces a tunable parameter that explicitly trades off GPU efficiency against sketching robustness. We provide theoretical guarantees for BlockPerm-SJLT under the oblivious subspace embedding (OSE) framework, and also analyze the effect of the tunable parameter on sketching quality. We empirically evaluate FlashSketch on standard RandNLA benchmarks, as well as an end-to-end ML data attribution pipeline called GraSS. FlashSketch pushes the Pareto frontier of sketching quality versus speed, across a range of regimes and tasks, and achieves a global geomean speedup of roughly $1.7 \times$ over the prior state-of-the-art GPU sketches.
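
For readers unfamiliar with the baseline, the following is a plain CPU sketch of a standard sparse Johnson–Lindenstrauss transform: each input coordinate is scattered to `s` random output rows with random signs and scale $1/\sqrt{s}$. It is not BlockPerm-SJLT or the FlashSketch kernel, and the function names and storage layout are assumptions; the scattered writes in `apply_sjlt` are the kind of irregular memory access the abstract describes.

```python
import math
import random


def make_sjlt(d, k, s, seed=0):
    """Build a k x d sparse sketch stored column-wise as lists of (row, value).

    Each column has exactly s nonzeros, each equal to +/- 1/sqrt(s).
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(s)
    columns = []
    for _ in range(d):
        rows = rng.sample(range(k), s)  # s distinct target rows
        columns.append([(r, scale if rng.random() < 0.5 else -scale) for r in rows])
    return columns


def apply_sjlt(columns, x, k):
    """Compute y = S x by scattering each input coordinate to its s rows."""
    y = [0.0] * k
    for xj, column in zip(x, columns):
        if xj == 0.0:
            continue
        for r, value in column:
            y[r] += value * xj  # irregular, data-dependent write location
    return y
```

Because every column has unit Euclidean norm, the sketch preserves the norm of any standard basis vector exactly; for general vectors the norm is preserved only approximately and with high probability.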


Oral

SoftJAX & SoftTorch: Empowering Automatic Differentiation Libraries with Informative Gradients

Anselm Paulus ⋅ Andreas René Geist ⋅ Vit Musil ⋅ Sebastian Hoffmann ⋅ Georg Martius

Jul 9, 4:45 PM - 5:00 PM HALL D1

Automatic differentiation (AD) frameworks such as JAX and PyTorch have enabled gradient-based optimization for a wide range of scientific fields. Yet many "hard" primitives in these libraries, such as thresholding, Boolean logic, discrete indexing, and sorting operations, yield zero or undefined gradients that are not useful for optimization. While numerous "soft" relaxations have been proposed that provide informative gradients, the respective implementations are fragmented across projects, making them difficult to combine and compare. This work introduces **SoftJAX** and **SoftTorch**, open-source, feature-complete libraries for *soft differentiable programming*. These libraries provide a variety of soft functions as drop-in replacements for their hard JAX and PyTorch counterparts. They include (i) elementwise operators such as *clip* or *abs*, (ii) utility methods for manipulating Booleans and indices via fuzzy logic, (iii) axiswise operators such as *sort* or *rank*, based on optimal transport or permutahedron projections, and (iv) full support for straight-through gradient estimation. Overall, SoftJAX and SoftTorch make the toolbox of soft relaxations easily accessible to differentiable programming, as demonstrated through benchmarking and a practical case study. Code is available at github.com/a-paulus/softjax and github.com/a-paulus/softtorch.
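
The difference between a hard primitive and a soft relaxation can be seen with a threshold function. The sketch below is plain Python with a hand-written derivative, and it does not use or imitate the SoftJAX or SoftTorch APIs; the sigmoid relaxation, the function names, and the `temperature` parameter are assumptions chosen as one common example.

```python
import math


def hard_step(x):
    """Hard threshold: its derivative is zero everywhere except at x = 0."""
    return 1.0 if x > 0 else 0.0


def soft_step(x, temperature=0.1):
    """Sigmoid relaxation of the threshold, written in a numerically stable form."""
    z = x / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_step_grad(x, temperature=0.1):
    """Analytic derivative of soft_step: nonzero, so it can guide optimization."""
    s = soft_step(x, temperature)
    return s * (1.0 - s) / temperature
```

Lowering `temperature` makes `soft_step` approach `hard_step` while concentrating the gradient near the threshold, which is the usual trade-off between fidelity and gradient informativeness.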


Oral

Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection

Lei Wang ⋅ Wenxiang Diao ⋅ Andrew Busch ⋅ Jun Zhou ⋅ Yongsheng Gao

Jul 9, 4:45 PM - 5:00 PM AUDITORIUM

Video anomaly detection (VAD) systems often prioritize accuracy while overlooking privacy concerns, limiting their suitability for real-world deployment. We propose the Orthogonal Projection Layer (OPL), a lightweight module that removes task-irrelevant variations to produce representations focused on anomaly-relevant cues. To address privacy risks in human-centered scenarios, we introduce Guided OPL (G-OPL), which suppresses facial attributes using weak supervision from face-presence signals while preserving non-identifying features such as pose and motion. A cosine alignment objective enforces consistent capture and removal of facial information without identity labels or adversarial training. We further present a privacy-aware evaluation framework that jointly assesses detection performance and privacy preservation, and enables analysis of how sensitive information is filtered. Experiments show that embedding privacy constraints into model design reduces sensitive information while maintaining or improving detection accuracy, supporting projection-based architectures as a principled approach for privacy-aware VAD.
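
The basic operation behind projection-based removal is projecting a feature vector onto the orthogonal complement of an unwanted direction. The snippet below shows only that linear-algebra step for a single fixed direction; it is not the OPL or G-OPL module, and how the direction would be learned (for example from face-presence signals) is outside this sketch. The helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(x, direction):
    """Remove from x its component along `direction` (need not be unit length)."""
    scale = dot(x, direction) / dot(direction, direction)
    return [xi - scale * di for xi, di in zip(x, direction)]


feature = [3.0, 4.0, 5.0]
sensitive_direction = [0.0, 2.0, 0.0]   # illustrative direction to suppress
cleaned = project_out(feature, sensitive_direction)
```

After projection the cleaned vector carries no component along the removed direction, while components orthogonal to it are left untouched.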


Oral

Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs

Wenbo Pan ⋅ Zhichao Liu ⋅ Xianlong Wang ⋅ Yu Haining ⋅ Xiaohua Jia

Jul 9, 4:45 PM - 5:00 PM HALL D2

Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face two critical challenges: (1) efficiency bottleneck, where attributing a target span of $M$ tokens within a context of length $N$ requires $\mathcal{O}(M \cdot N)$ operations, making long-context attribution prohibitively slow; and (2) faithfulness drop, where intermediate reasoning tokens absorb attribution mass, preventing importance from propagating back to the original input. To address these, we introduce FlashTrace, an efficient multi-token attribution method that employs span-wise aggregation to compute attribution over multi-token targets in a single pass, while maintaining faithfulness. Moreover, we design a recursive attribution mechanism that traces importance through intermediate reasoning chains back to source inputs. Extensive experiments on long-context retrieval (RULER) and multi-step reasoning (MATH, MorehopQA) tasks demonstrate that FlashTrace achieves over $130\times$ speedup over existing baselines while maintaining superior faithfulness. We further analyze the dynamics of recursive attribution, showing that even a single recursive hop improves faithfulness by tracing importance through the reasoning chain.


Oral

Which Algorithms Can Graph Neural Networks Learn?

Solveig Wittig ⋅ Antonis Vasileiou ⋅ Robert R. Nerem ⋅ Timo Stoll ⋅ Floris Geerts ⋅ Yusu Wang ⋅ Christopher Morris

Jul 9, 4:45 PM - 5:00 PM ASEM BALLROOM 201-203

In recent years, there has been growing interest in understanding neural architectures' ability to learn to execute discrete algorithms, a line of work often referred to as neural algorithmic reasoning. The goal is to integrate algorithmic reasoning capabilities into larger neural pipelines. Many such architectures are based on (message-passing) graph neural networks (MPNNs), owing to their permutation equivariance and ability to deal with sparsity and variable-sized inputs. However, much existing work either is largely empirical and lacks formal guarantees or focuses solely on expressivity, leaving open the question of when and how such architectures generalize beyond a finite training set. In this work, we propose a general theoretical framework that characterizes sufficient conditions under which MPNNs can learn an algorithm from a training set of small instances and provably approximate its behavior on inputs of arbitrary size with worst-case guarantees. Our framework applies to a broad class of algorithms, including single-source shortest paths, minimum spanning trees, and general dynamic programming problems, such as the $0$-$1$ knapsack problem. In addition, we establish impossibility results for a wide range of algorithmic tasks, showing that standard MPNNs cannot learn them, and derive more expressive MPNN-like architectures that overcome these limitations. Finally, we refine our analysis for the Bellman–Ford algorithm, yielding substantially smaller required training sets and significantly extending the recent work of Nerem et al. (2025) by allowing for a differentiable regularization loss. Empirical results largely support our theoretical findings.
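
For reference, this is the classical Bellman–Ford algorithm for single-source shortest paths that the abstract singles out. Each round relaxes every edge once, which has the same local, edge-wise update structure as a message-passing step. The edge-list representation and function name are illustrative choices, and nothing here reflects the paper's learned models.

```python
import math


def bellman_ford(num_nodes, edges, source):
    """Single-source shortest paths on a directed graph given as (u, v, weight) triples."""
    dist = [math.inf] * num_nodes
    dist[source] = 0.0
    for _ in range(num_nodes - 1):
        updated = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:   # relax edge (u, v)
                dist[v] = dist[u] + w
                updated = True
        if not updated:                 # early exit once distances are stable
            break
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative cycle reachable from the source")
    return dist


example_edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
example_dist = bellman_ford(5, example_edges, source=0)
```

Node 4 has no incoming edges in the example, so its distance stays infinite.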


Oral

What Preferences Can—and Cannot—Predict in Multi-Agent Online Learning

Omar Abbadi ⋅ Rida Laraki ⋅ Panayotis Mertikopoulos

Jul 9, 4:45 PM - 5:00 PM GRAND BALLROOM 101-105

We examine the interplay between ordinal, preference-based solution concepts in games and the outcomes of payoff-driven learning dynamics, asking to what extent the combinatorial data of a game—its *preference graph*—can predict the long-run behavior of no-regret dynamics such as *follow-the-regularized-leader* (FTRL). In one direction, we show that the skeleton of every *dynamically stable* set (i.e., the set of pure profiles it contains) must also be *preferentially stable*, that is, it must be closed under profitable deviations. We then ask the converse question: when are preferences sufficient to describe the long-run behavior of the players' learning dynamics? We begin by showing that preferences are indeed enough to fully characterize asymptotic stability in the case of *subgames*—i.e., subsets of pure profiles obtained by restricting players' action sets. Beyond this case however, the equivalence between dynamic and preferential stability breaks down: in particular, we construct a three-player game with a preferentially stable set whose span is dynamically *unstable*, showing that preferences are *not sufficient* to describe dynamically stable behavior in general. To restore stability, we introduce the notion of *leaklessness*, a measure of aggregate payoff drift away from a set of pure profiles, and we use it to identify a payoff-based condition guaranteeing that the span of a set of pure profiles is stable and attracting.
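
The notion of a set of pure profiles being closed under profitable deviations can be checked mechanically on a small game. The sketch below assumes that "profitable deviation" means a unilateral change of action that strictly increases the deviating player's payoff; that reading, the function name, and the Prisoner's Dilemma payoffs are assumptions for illustration rather than the paper's formal definitions.

```python
def is_closed_under_profitable_deviations(profiles, payoffs, action_sets):
    """Return True if every strictly profitable unilateral deviation from a
    profile in `profiles` lands on another profile in `profiles`."""
    profiles = set(profiles)
    for profile in profiles:
        for player, actions in enumerate(action_sets):
            for action in actions:
                deviated = profile[:player] + (action,) + profile[player + 1:]
                strictly_better = payoffs[deviated][player] > payoffs[profile][player]
                if strictly_better and deviated not in profiles:
                    return False
    return True


# Prisoner's Dilemma: payoffs[profile] = (payoff to player 0, payoff to player 1).
ACTIONS = [("C", "D"), ("C", "D")]
PD = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
```

In this example `{("D", "D")}` is closed, since no player gains by deviating, while `{("C", "C")}` is not, since either player strictly gains by switching to `"D"`.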


Oral

Procedural Pretraining: Warming Up Language Models with Abstract Data

Liangze Jiang ⋅ Zachary Shinnick ⋅ Anton Hengel ⋅ Hemanth Saratchandran ⋅ Damien Teney

Jul 9, 4:45 PM - 5:00 PM HALL C

Pretraining language models directly on web-scale corpora is the de facto paradigm. We study an alternative where the model is initially exposed to *abstract structured data* to ease the subsequent acquisition of rich semantic knowledge, much like humans learning simple logic and mathematics before higher reasoning. We focus on *procedural data*, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, the accuracy of context recall (Needle-in-a-haystack) jumps from 10% to 98% when a model is pretrained on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1–0.3% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this also enables the models to reach the same loss value with only 55/67/86% of the original data and thus a comparable reduction in FLOPs. Third, we explore the mechanisms behind the benefits and find that procedural pretraining instills non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g., code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
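
As an example of what procedural data can look like, the sketch below samples random Dyck sequences (balanced brackets) and checks balance with a stack. The three bracket types, the 50% open/close probability, and the function names are assumptions; the paper's actual data generator and tokenization may differ.

```python
import random

PAIRS = {"(": ")", "[": "]", "{": "}"}


def random_dyck(num_pairs, rng):
    """Sample a balanced bracket string containing exactly `num_pairs` pairs."""
    out, stack, opened = [], [], 0
    while opened < num_pairs or stack:
        can_open = opened < num_pairs
        if can_open and (not stack or rng.random() < 0.5):
            bracket = rng.choice(list(PAIRS))
            out.append(bracket)
            stack.append(PAIRS[bracket])   # remember which closer is owed
            opened += 1
        else:
            out.append(stack.pop())        # close the most recent open bracket
    return "".join(out)


def is_balanced(sequence):
    """Stack-based check that every bracket is closed in the right order."""
    stack = []
    for ch in sequence:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif not stack or stack.pop() != ch:
            return False
    return not stack
```

A call such as `random_dyck(20, random.Random(0))` yields a 40-character string; predicting the correct closing bracket requires tracking the most recent unmatched opener, which is the kind of structured recall the abstract associates with this data.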


Oral

OMAC: A Holistic Optimization Framework for LLM-Based Multi-Agent Collaboration

Shijun Li ⋅ Hilaf Hasson ⋅ Joydeep Ghosh

Jul 9, 4:45 PM - 5:00 PM HALL B2

Agents powered by advanced large language models (LLMs) have demonstrated impressive capabilities across diverse complex applications. Recently, Multi-Agent Systems (MAS), wherein multiple agents collaborate and communicate with each other, have exhibited enhanced capabilities in complex tasks, such as high-quality code generation and arithmetic reasoning. However, the development of such systems often relies on handcrafted methods, and the literature on systematic design and optimization of LLM-based MAS remains limited. In this work, we introduce OMAC, a general framework designed for holistic optimization of LLM-based MAS. Specifically, we identify five key optimization dimensions for MAS, encompassing both agent functionality and collaboration structure. Building upon these dimensions, we first propose a general algorithm, utilizing two actors termed the Semantic Initializer and the Contrastive Comparator, to optimize any single dimension. Then, we present an algorithm for joint optimization across multiple dimensions. Extensive experiments demonstrate the superior performance of OMAC on diverse tasks against recent approaches. Code is available at: https://github.com/xiwenchao/OMAC.
