Poster
Jiahai Feng · Stuart Russell · Jacob Steinhardt
[ East Exhibition Hall A-B ]
Abstract
Pretrained language models (LMs) can generalize to implications of facts that they are finetuned on. For example, if finetuned on "John Doe lives in Tokyo," LMs correctly answer "What language do the people in John Doe's city speak?" with "Japanese". However, little is known about the mechanisms that enable this generalization or how they are learned during pretraining. We introduce extractive structures as a framework for describing how components in LMs (e.g., MLPs or attention heads) coordinate to enable this generalization. The structures consist of informative components that store training facts as weight changes, and upstream and downstream extractive components that query and process the stored information to produce the correct implication. We hypothesize that extractive structures are learned during pretraining when encountering implications of previously known facts. This yields two predictions: a data ordering effect where extractive structures can be learned only if facts precede their implications, and a weight grafting effect where extractive structures can be grafted to predict counterfactual implications. We empirically show these effects in the OLMo-7b, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results also indicate that fact learning can occur at both early and late layers, which lead to different forms of …
Poster
Jing Huang · Junyi Tao · Thomas Icard · Diyi Yang · Christopher Potts
[ East Exhibition Hall A-B ]
Abstract
Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks—including symbol manipulation, knowledge retrieval, and instruction following—we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model’s behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
Poster
Stephan Rabanser · Ali Shahin Shamsabadi · Olive Franzese · Xiao Wang · Adrian Weller · Nicolas Papernot
[ East Exhibition Hall A-B ]
Abstract
Cautious predictions—where a machine learning model abstains when uncertain—are crucial for limiting harmful errors in safety-critical applications. In this work, we identify a novel threat: a dishonest institution can exploit these mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage, which deliberately reduces confidence in targeted input regions, thereby covertly disadvantaging specific individuals. At the same time, Mirage maintains high predictive performance across all data points. To counter this threat, we propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence. Additionally, it employs zero-knowledge proofs of verified inference to ensure that reported confidence scores genuinely originate from the deployed model. This prevents the provider from fabricating arbitrary model confidence values while protecting the model’s proprietary details. Our results confirm that Confidential Guardian effectively prevents the misuse of cautious predictions, providing verifiable assurances that abstention reflects genuine model uncertainty rather than malicious intent.
Poster
Shelvia Wongso · Rohan Ghosh · Mehul Motani
[ East Exhibition Hall A-B ]
Abstract
Estimating the confidence of deep neural network predictions is crucial for safe deployment in high-stakes applications. While softmax probabilities are commonly used, they are often poorly calibrated, and existing calibration methods have been shown to be detrimental to failure prediction. In this paper, we propose using information-theoretic measures to estimate prediction confidence in a post-hoc manner, without modifying network architecture or training. Specifically, we compare three pointwise information (PI) measures: pointwise mutual information (PMI), pointwise V-information (PVI), and the recently proposed pointwise sliced mutual information (PSI). These measures are theoretically grounded in their relevance to predictive uncertainty, with properties such as invariance, convergence rates, and sensitivity to geometric attributes like margin and intrinsic dimensionality. Through extensive experiments on benchmark computer vision models and datasets, we find that PVI consistently outperforms or matches existing baselines in post-hoc confidence estimation, excelling over PMI and PSI in both failure prediction and calibration. This aligns with our theoretical insights, which suggest that PVI offers the most balanced trade-offs. Finally, we show that replacing softmax outputs with PVI in existing confidence estimation methods can further enhance performance.
Poster
Hayden McTavish · Zachery Boner · Jon Donnelly · Margo Seltzer · Cynthia Rudin
[ East Exhibition Hall A-B ]
Abstract
Decision trees are widely used for interpretable machine learning due to their clearly structured reasoning process. However, this structure belies a challenge we refer to as predictive equivalence: a given tree's decision boundary can be represented by many different decision trees. The presence of models with identical decision boundaries but different evaluation processes makes model selection challenging. The models will have different variable importance and behave differently in the presence of missing values, but most optimization procedures will arbitrarily choose one such model to return. We present a boolean logical representation of decision trees that does not exhibit predictive equivalence and is faithful to the underlying decision boundary. We apply our representation to several downstream machine learning tasks. Using our representation, we show that decision trees are surprisingly robust to test-time missingness of feature values; we address predictive equivalence's impact on quantifying variable importance; and we present an algorithm to optimize the cost of reaching predictions.
Poster
Xinyang Li · Siqi Liu · Bochao Zou · Jiansheng Chen · Huimin Ma
[ East Exhibition Hall A-B ]
Abstract
As large language models evolve, there is growing anticipation that they will emulate human-like Theory of Mind (ToM) to assist with routine tasks. However, existing methods for evaluating machine ToM focus primarily on unimodal models and largely treat these models as black boxes, lacking an interpretative exploration of their internal mechanisms. In response, this study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of ToM in multimodal large language models (MLLMs). Specifically, we first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities. Furthermore, we present a lightweight, training-free approach that significantly enhances the model’s exhibited ToM by adjusting in the direction of the attention head.
Poster
Felix Hofstätter · Teun van der Weij · Jayden Teoh · Rada Djoneva · Henning Bartsch · Francis Rhys Ward
[ East Exhibition Hall A-B ]
Abstract
Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system's capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms – language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit-breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in an MCQA setting, while steering fails to do so. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.
Poster
Jianwei Li · Jung-Eun Kim
[ East Exhibition Hall A-B ]
Abstract
Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle with drawing a firm safety-conscious decision boundary when confronted with adversarial attacks. Based on this observation, by explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity and allow models to respond more responsibly to malicious queries. We emphasize that, with less than 0.2x overhead cost, our approach enables LLMs to assess the safety of both the query and the previously generated tokens at each necessary generation step. Extensive experiments demonstrate that our method significantly improves the resilience of LLMs against various adversarial attacks, offering a promising pathway toward more robust generative AI systems.
Poster
Haruka Kiyohara · Fan Yao · Sarah Dean
[ East Exhibition Hall A-B ]
Abstract
In two-sided platforms (e.g., video streaming or e-commerce), viewers and providers engage in interactive dynamics: viewers benefit from increases in provider populations, while providers benefit from increases in viewer population. Despite the importance of such “population effects” on long-term platform health, recommendation policies do not generally take the participation dynamics into account. This paper thus studies the dynamics and recommender policy design on two-sided platforms under the population effects for the first time. Our control- and game-theoretic findings warn against the use of the standard “myopic-greedy” policy and shed light on the importance of provider-side considerations (i.e., effectively distributing exposure among provider groups) to improve social welfare via population growth. We also present a simple algorithm to optimize long-term social welfare by taking the population effects into account, and demonstrate its effectiveness in synthetic and real-data experiments. Our experiment code is available at https://github.com/sdean-group/dynamics-two-sided-market.
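The participation dynamics above are easy to see in a toy simulation. The following sketch is not the paper's model: the growth rates, utility functions, and welfare proxy are all hypothetical. It illustrates why a myopic exposure allocation can starve one provider group and depress long-term welfare.

```python
import numpy as np

def simulate(alloc_to_small, steps=500):
    """Toy two-sided dynamics: viewers gain utility from provider
    diversity; each provider group gains utility from its exposure
    share times the viewer population. Each population grows toward
    the utility it receives (hypothetical rates and costs)."""
    viewers, prov_big, prov_small = 1.0, 1.0, 0.2
    for _ in range(steps):
        viewer_util = np.log1p(prov_big) + np.log1p(prov_small)
        util_big = (1 - alloc_to_small) * viewers
        util_small = alloc_to_small * viewers
        viewers += 0.05 * (viewer_util - 0.5 * viewers)
        prov_big += 0.05 * (util_big - 0.5 * prov_big)
        prov_small += 0.05 * (util_small - 0.5 * prov_small)
    return viewers + prov_big + prov_small  # crude welfare proxy

# A myopic allocation (alloc ~ 0) starves the small provider group;
# spreading exposure grows both sides and the welfare proxy.
for alloc in (0.0, 0.3, 0.5):
    print(f"alloc={alloc}: welfare={simulate(alloc):.3f}")
```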
Poster
Kunwoong Kim · Jihu Lee · Sangchul Park · Yongdai Kim
[ East Exhibition Hall A-B ]
Abstract
Algorithmic fairness in clustering aims to balance the proportions of instances assigned to each cluster with respect to a given sensitive attribute. While recently developed fair clustering algorithms optimize clustering objectives under specific fairness constraints, their inherent complexity or approximation often results in suboptimal clustering utility or numerical instability in practice. To resolve these limitations, we propose a new fair clustering algorithm based on a novel decomposition of the fair $K$-means clustering objective function. The proposed algorithm, called Fair Clustering via Alignment (FCA), operates by alternately (i) finding a joint probability distribution to align the data from different protected groups, and (ii) optimizing cluster centers in the aligned space. A key advantage of FCA is that it theoretically guarantees approximately optimal clustering utility for any given fairness level without complex constraints, thereby enabling high-utility fair clustering in practice. Experiments show that FCA outperforms existing methods by (i) attaining a superior trade-off between fairness level and clustering utility, and (ii) achieving near-perfect fairness without numerical instability.
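A rough illustration of the alternating structure described above (a simplified sketch, not the authors' implementation: it assumes two equally sized protected groups so the alignment step reduces to a one-to-one matching, and the cost and update rules are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def fca_sketch(X0, X1, k, iters=20, seed=0):
    """X0, X1: (n, d) arrays for two equally sized protected groups.
    (i) Align the groups with a min-cost one-to-one matching, then
    (ii) run Lloyd-style K-means updates on midpoints of matched
    pairs; matched pairs share a cluster, so every cluster contains
    the two groups in equal proportion by construction."""
    row, col = linear_sum_assignment(cdist(X0, X1, "sqeuclidean"))
    mid = 0.5 * (X0[row] + X1[col])            # the "aligned" space
    rng = np.random.default_rng(seed)
    centers = mid[rng.choice(len(mid), k, replace=False)]
    for _ in range(iters):
        assign = cdist(mid, centers).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = mid[assign == j].mean(axis=0)
    return centers, assign

rng = np.random.default_rng(1)
centers, assign = fca_sketch(rng.normal(0.0, 1.0, (200, 2)),
                             rng.normal(0.5, 1.0, (200, 2)), k=3)
print(np.bincount(assign))  # cluster sizes; groups balanced within each
```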
Poster
Xinyue Xu · Yueying Hu · Hui Tang · Yi Qin · Lu Mi · Hao Wang · Xiaomeng Li
[ East Exhibition Hall A-B ]
Abstract
Concept Bottleneck Models (CBMs) enhance interpretability by explaining predictions through human-understandable concepts but typically assume that training and test data share the same distribution. This assumption often fails under domain shifts, leading to degraded performance and poor generalization. To address these limitations and improve the robustness of CBMs, we propose the Concept-based Unsupervised Domain Adaptation (CUDA) framework. CUDA is designed to: (1) align concept representations across domains using adversarial training, (2) introduce a relaxation threshold to allow minor domain-specific differences in concept distributions, thereby preventing performance drop due to over-constraints of these distributions, (3) infer concepts directly in the target domain without requiring labeled concept data, enabling CBMs to adapt to diverse domains, and (4) integrate concept learning into conventional domain adaptation (DA) with theoretical guarantees, improving interpretability and establishing new benchmarks for DA. Experiments demonstrate that our approach significantly outperforms the state-of-the-art CBM and DA methods on real-world datasets.
Poster
Jen-Tse Huang · Jiaxu Zhou · Tailin Jin · Xuhui Zhou · Zixi Chen · Wenxuan Wang · Youliang Yuan · Michael Lyu · Maarten Sap
[ East Exhibition Hall A-B ]
Abstract
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents, each focusing on a specific domain. However, the impact of clumsy or even malicious agents—those who frequently make errors in their tasks—on the overall performance of the system remains underexplored. This paper investigates: (1) What is the resilience of various system structures (e.g., A$\rightarrow$B$\rightarrow$C, A$\leftrightarrow$B$\leftrightarrow$C) under faulty agents, on different downstream tasks? (2) How can we increase system resilience to defend against these agents? To simulate faulty agents, we propose two approaches—AutoTransform and AutoInject—which introduce mistakes into the agents' responses. Experiments on four downstream tasks using six systems show that the "hierarchical" structure, i.e., A$\rightarrow$(B$\leftrightarrow$C), exhibits superior resilience with the lowest performance drop of 5.5%, compared to 10.5% and 23.7% for the other two structures. To further improve resilience, we introduce (1) Challenger, which introduces a mechanism for each agent to challenge others' outputs, and (2) Inspector, an additional agent to review and correct messages, recovering up to 96.4% of errors made by faulty agents. Our code and data are available at https://github.com/CUHK-ARISE/MAS-Resilience.
Poster
Elliot Myunghoon Kim · Avi Garg · Kenny Peng · Nikhil Garg
[ East Exhibition Hall A-B ]
Abstract
Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ \textit{meaningfully}. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors---on one leaderboard dataset, models agree 60\% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring---the latter reflecting theoretical predictions regarding algorithmic monoculture.
Poster
Saksham Rastogi · Pratyush Maini · Danish Pruthi
[ East Exhibition Hall A-B ]
Abstract
Given how large parts of publicly available text are crawled to pretrain large language models (LLMs), data creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test-sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership—i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves first generating multiple rephrases, each embedding a watermark with a unique secret key. One version is to be released publicly, while others are to be kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that STAMP preserves both the semantic meaning and utility of the original data. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
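The paired-test step can be illustrated with simulated log-likelihoods (a minimal sketch: the effect size, score scale, and choice of a paired t-test are assumptions, not the paper's exact procedure):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n = 200
# Simulated per-document log-likelihoods under a suspect model: if the
# public rephrase was in the training corpus, it scores systematically
# higher than the held-back private rephrase of the same document.
ll_private = rng.normal(-120.0, 10.0, n)
ll_public = ll_private + rng.normal(0.8, 2.0, n)  # contamination effect

# One-sided paired test: membership predicts ll_public > ll_private.
stat, p_two_sided = ttest_rel(ll_public, ll_private)
p_one_sided = p_two_sided / 2 if stat > 0 else 1 - p_two_sided / 2
print(f"t={stat:.2f}, one-sided p={p_one_sided:.4f}")
```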
Poster
Kian Ming Chai · Edwin V. Bonilla
[ East Exhibition Hall A-B ]
Abstract
We introduce a novel one-parameter variational objective that lower bounds the data evidence and enables the estimation of approximate fractional posteriors. We extend this framework to hierarchical constructions and Bayes posteriors, offering a versatile tool for probabilistic modelling. We demonstrate two cases where gradients can be obtained analytically, and present a simulation study on mixture models showing that our fractional posteriors can be used to achieve better calibration than posteriors from the conventional variational bound. When applied to variational autoencoders (VAEs), our approach attains higher evidence bounds and enables learning of high-performing approximate Bayes posteriors jointly with fractional posteriors. We show that VAEs trained with fractional posteriors produce decoders that are better aligned for generation from the prior.
Poster
Chuanhui Liu · Xiao Wang
[ East Exhibition Hall A-B ]
Abstract
This paper provides a comprehensive analysis of variational inference in latent variable models for survival analysis, emphasizing the distinctive challenges associated with applying variational methods to survival data. We identify a critical weakness in the existing methodology, demonstrating how a poorly designed variational distribution may hinder the objective of survival analysis tasks—modeling time-to-event distributions. We prove that the optimal variational distribution, which perfectly bounds the log-likelihood, may depend on the censoring mechanism. To address this issue, we propose censor-dependent variational inference (CDVI), tailored for latent variable models in survival analysis. More practically, we introduce CD-CVAE, a V-structure Variational Autoencoder (VAE) designed for the scalable implementation of CDVI. Further discussion extends some existing theories and training techniques to survival analysis. Extensive experiments validate our analysis and demonstrate significant improvements in the estimation of individual survival distributions.
Poster
Zhengming Chen · Yewei Xia · Feng Xie · Jie Qiao · Zhifeng Hao · Ruichu Cai · Kun Zhang
[ East Exhibition Hall A-B ]
Abstract
We study the problem of learning discrete latent variable causal structures from mixed-type observational data. Traditional methods, such as those based on the tensor rank condition, are designed to identify discrete latent structure models and provide robust identification bounds for discrete causal models. However, when observed variables—specifically, those representing the children of latent variables—are collected at various levels with continuous data types, the tensor rank condition is not applicable, limiting further causal structure learning for latent variables. In this paper, we consider a more general case where observed variables can be either continuous or discrete, and further allow for scenarios where multiple latent parents cause the same set of observed variables. We show that, under the completeness condition, it is possible to discretize the data in a way that satisfies the full-rank assumption required by the tensor rank condition. This enables the identifiability of discrete latent structure models within mixed-type observational data. Moreover, we introduce the two-sufficient measurement condition, a more general structural assumption under which the tensor rank condition holds and the underlying latent causal structure is identifiable by a proposed two-stage identification algorithm. Extensive experiments on both simulated and real-world data validate the effectiveness of our method.
Spotlight Poster
Hugh Dance · Pierre Glaser · Peter Orbanz · Ryan P. Adams
[ East Exhibition Hall A-B ]
Abstract
With the advent of automatic vectorization tools (e.g., JAX's vmap), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem---loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids synchronization overheads when vectorizing with tools like vmap, by using the framework of finite state machines (FSMs). Using a simplified model, we derive an exact theoretical form of the obtainable speed-ups using our approach, and use it to make principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, demonstrating speed-ups of up to an order of magnitude in experiments.
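A toy illustration of the idea (hypothetical names and transition rule; the "refine" state stands in for data-dependent inner loops such as NUTS tree doubling or delayed-rejection stages): each vmapped chain advances exactly one FSM transition per call, so no chain idles while another finishes a long inner loop.

```python
import jax
import jax.numpy as jnp

def refine(carry):
    # State 0: keep refining a proposal; amount of work is data-dependent.
    x, key, k = carry
    key, sub = jax.random.split(key)
    x = x + 0.1 * jax.random.normal(sub)
    done = (jax.random.uniform(key) < jax.nn.sigmoid(x)).astype(jnp.int32)
    return (x, key, k), done          # move to state 1 when done

def emit(carry):
    # State 1: emit a sample, count it, and restart in state 0.
    x, key, k = carry
    return (0.0 * x, key, k + 1), jnp.int32(0)

def fsm_step(state, carry):
    carry, new_state = jax.lax.switch(state, [refine, emit], carry)
    return new_state, carry

keys = jax.random.split(jax.random.PRNGKey(0), 8)          # 8 chains
states = jnp.zeros(8, dtype=jnp.int32)
carries = (jnp.zeros(8), keys, jnp.zeros(8, dtype=jnp.int32))
step = jax.jit(jax.vmap(fsm_step))
for _ in range(100):
    states, carries = step(states, carries)
print("samples emitted per chain:", carries[2])
```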
Poster
Michael Albergo · Eric Vanden-Eijnden
[ East Exhibition Hall A-B ]
Abstract
We introduce the Non-Equilibrium Transport Sampler (NETS), an algorithm for sampling from unnormalized probability distributions. NETS builds on non-equilibrium sampling strategies that transport a simple base distribution into the target distribution in finite time, as pioneered in Neal's annealed importance sampling (AIS). In the continuous-time setting, this transport is accomplished by evolving walkers using Langevin dynamics with a time-dependent potential, while simultaneously evolving importance weights to debias their solutions following Jarzynski's equality. The key innovation of NETS is to add to the dynamics a learned drift term that offsets the need for these corrective weights by minimizing their variance through an objective that can be estimated without backpropagation and provably bounds the Kullback-Leibler divergence between the estimated and target distributions. NETS provides unbiased samples and features a tunable diffusion coefficient that can be adjusted after training to maximize the effective sample size. In experiments on standard benchmarks, high-dimensional Gaussian mixtures, and statistical lattice field theory models, NETS shows compelling performance.
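For orientation, here is the AIS/Jarzynski baseline that NETS builds on, without the learned drift term (a minimal sketch: the interpolating potential, step count, and step size are toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def U(x, t):
    # Anneal from a standard Gaussian (t=0) to a double well (t=1).
    return (1 - t) * 0.5 * x**2 + t * 0.25 * (x**2 - 4.0)**2

def gradU(x, t):
    return (1 - t) * x + t * x * (x**2 - 4.0)

n, steps = 4000, 1000
dt = 1.0 / steps
x = rng.normal(size=n)          # walkers start from the base
logw = np.zeros(n)              # Jarzynski importance weights

for k in range(steps):
    t = k * dt
    # Unadjusted Langevin step under the current potential U(., t).
    x = x - gradU(x, t) * dt + np.sqrt(2 * dt) * rng.normal(size=n)
    # Weight update debiases the finite-speed annealing:
    # accumulate -(U(x, t+dt) - U(x, t)).
    logw += U(x, t) - U(x, t + dt)

w = np.exp(logw - logw.max())
ess = w.sum()**2 / (w**2).sum()
print(f"effective sample size: {ess:.0f} / {n}")
```

NETS's learned drift is trained precisely to shrink the variance of these weights, driving the effective sample size back toward n.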
Poster
Severi Rissanen · RuiKang OuYang · Jiajun He · Wenlin Chen · Markus Heinonen · Arno Solin · Jose Miguel Hernandez-Lobato
[ East Exhibition Hall A-B ]
Abstract
Recent research has focused on designing neural samplers that amortize the process of sampling from unnormalized densities. However, despite significant advancements, they still fall short of the state-of-the-art MCMC approach, Parallel Tempering (PT), when it comes to the efficiency of target evaluations. On the other hand, unlike a well-trained neural sampler, PT yields only dependent samples and needs to be rerun---at considerable computational cost---whenever new samples are required. To address these weaknesses, we propose the Progressive Tempering Sampler with Diffusion (PTSD), which trains diffusion models sequentially across temperatures, leveraging the advantages of PT to improve the training of neural samplers. We also introduce a novel method to combine high-temperature diffusion models to generate approximate lower-temperature samples, which are minimally refined using MCMC and used to train the next diffusion model. PTSD enables efficient reuse of sample information across temperature levels while generating well-mixed, uncorrelated samples. Our method significantly improves target evaluation efficiency, outperforming diffusion-based neural samplers.
Poster
Alex Kokot · Octavian-Vlad Murad · Marina Meila
[ East Exhibition Hall A-B ]
Abstract
In this paper, we clarify the effect of noise on common spectrally motivated algorithms such as Diffusion Maps (DM) for dimension reduction. Empirically, these methods are much more robust to noise than current work suggests. Specifically, existing consistency results require that either the noise amplitude or dimensionality must vary with the sample size $n$. We provide new theoretical results demonstrating that low-frequency eigenpairs reliably capture the geometry of the underlying manifold under a constant noise level, up to a dimension independent threshold $O(r^{-2})$, where $r$ is the noise amplitude. Our results rely on a decomposition of the manifold Laplacian in the Sasaki metric, a technique not used before in this area, to our knowledge. We experimentally validate our theoretical predictions. Additionally, we observe similar robust behavior for other manifold learning algorithms which are not based on computing the Laplacian, namely LTSA and VAE.
Poster
Yifan Fang · Yifei Fang · Ruizhe Chen · Haote Xu · Xinghao Ding · Yue Huang
[ East Exhibition Hall A-B ]
Abstract
Frequency-domain image anomaly detection methods can substantially enhance anomaly detection performance; however, they still lack an interpretable theoretical framework to guarantee the effectiveness of the detection process. We propose a novel test to detect anomalies in structural images via a Demeaned Fourier transform (DFT) under a factor model framework, and we prove its effectiveness. We also briefly present the asymptotic theory of our test, which explains why the test can detect anomalies at both the image and pixel levels within the theoretical lower bound. Based on our test, we derive a module called Demeaned Fourier Sparse (DFS) that effectively enhances detection performance in unsupervised anomaly detection tasks; it constructs masks in the Fourier domain and utilizes a distribution-free sampling method similar to the bootstrap. The experimental results indicate that this module can accurately and efficiently generate effective masks for reconstruction-based anomaly detection tasks, thereby enhancing the performance of anomaly detection methods and validating the effectiveness of the theoretical framework.
Poster
Alessandro Licciardi · Davide Leo · Eros Fanì · Barbara Caputo · Marco Ciccone
[ East Exhibition Hall A-B ]
Abstract
Federated Learning (FL) emerged as a decentralized paradigm to train models while preserving privacy. However, conventional FL struggles with data heterogeneity and class imbalance, which degrade model performance. Clustered FL balances personalization and decentralized training by grouping clients with analogous data distributions, enabling improved accuracy while adhering to privacy constraints. This approach effectively mitigates the adverse impact of heterogeneity in FL. In this work, we propose a novel clustering method for FL, **FedGWC** (Federated Gaussian Weighting Clustering), which groups clients based on their data distribution, allowing training of a more robust and personalized model on the identified clusters. **FedGWC** identifies homogeneous clusters by transforming individual empirical losses to model client interactions with a Gaussian reward mechanism. Additionally, we introduce the *Wasserstein Adjusted Score*, a new clustering metric for FL to evaluate cluster cohesion with respect to the individual class distribution. Our experiments on benchmark datasets show that **FedGWC** outperforms existing FL algorithms in cluster quality and classification accuracy, validating the efficacy of our approach.
Poster
Meshi Bashari · Matteo Sesia · Yaniv Romano
[ East Exhibition Hall A-B ]
Abstract
Conformal prediction is a flexible framework for calibrating machine learning predictions, providing distribution-free statistical guarantees. In outlier detection, this calibration relies on a reference set of labeled inlier data to control the type-I error rate. However, obtaining a perfectly labeled inlier reference set is often unrealistic, and a more practical scenario involves access to a contaminated reference set containing a small fraction of outliers. This paper analyzes the impact of such contamination on the validity of conformal methods. We prove that under realistic, non-adversarial settings, calibration on contaminated data yields conservative type-I error control, shedding light on the inherent robustness of conformal methods. This conservativeness, however, typically results in a loss of power. To alleviate this limitation, we propose a novel, active data-cleaning framework that leverages a limited labeling budget and an outlier detection model to selectively annotate data points in the contaminated reference set that are suspected as outliers. By removing only the annotated outliers in this ``suspicious'' subset, we can effectively enhance power while mitigating the risk of inflating the type-I error rate, as supported by our theoretical analysis. Experiments on real datasets validate the conservative behavior of conformal methods under contamination and show that the proposed data-cleaning …
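The conservative behavior under contamination can be reproduced in a few lines (an illustrative sketch with an assumed 2% contamination rate and Gaussian scores, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_pvalues(cal_scores, test_scores):
    """Standard conformal p-values from a reference set of inlier
    (non-conformity) scores; larger score = more outlying.
    p = (1 + #{calibration scores >= test score}) / (n + 1)."""
    n = len(cal_scores)
    counts = (cal_scores[None, :] >= test_scores[:, None]).sum(axis=1)
    return (1 + counts) / (n + 1)

# Clean inliers ~ N(0,1); contaminate 2% of the reference set with
# outliers ~ N(4,1). The contamination inflates calibration scores,
# which makes p-values, and hence type-I error, conservative.
cal = rng.normal(size=1000)
cal[:20] = rng.normal(4.0, 1.0, 20)
test_inliers = rng.normal(size=5000)
alpha = 0.1
pvals = conformal_pvalues(np.abs(cal), np.abs(test_inliers))
print("empirical type-I error:", (pvals <= alpha).mean(), "target:", alpha)
```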
Spotlight Poster
Guiomar Pescador-Barrios · Sarah Filippi · Mark van der Wilk
[ East Exhibition Hall A-B ]
Abstract
Many machine learning models require setting a parameter that controls their size before training, e.g. number of neurons in DNNs, or inducing points in GPs. Increasing capacity typically improves performance until all the information from the dataset is captured. After this point, computational cost keeps increasing, without improved performance. This leads to the question "How big is big enough?" We investigate this problem for Gaussian processes (single-layer neural networks) in continual learning. Here, data becomes available incrementally, and the final dataset size will therefore not be known before training, preventing the use of heuristics for setting a fixed model size. We develop a method to automatically adjust model size while maintaining near-optimal performance. Our experimental procedure follows the constraint that any hyperparameters must be set without seeing dataset properties, and we show that our method performs well across diverse datasets without the need to adjust its hyperparameter, showing it requires less tuning than others.
Poster
Arik Reuter · Tim G. J. Rudner · Vincent Fortuin · David Rügamer
[ East Exhibition Hall A-B ]
Abstract
Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context—without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows and enables us to infer complex posterior distributions for models such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods that do not operate in context. The source code for this paper is available at https://github.com/ArikReuter/ICL_for_Full_Bayesian_Inference
Poster
Stefano Cortinovis · Francois Caron
[ East Exhibition Hall A-B ]
Abstract
Prediction-powered inference (PPI) enables valid statistical inference by combining experimental data with machine learning predictions. When a sufficient number of high-quality predictions is available, PPI results in more accurate estimates and tighter confidence intervals than traditional methods. In this paper, we propose to inform the PPI framework with prior knowledge on the quality of the predictions. The resulting method, which we call frequentist, assisted by Bayes, PPI (FAB-PPI), improves over PPI when the observed prediction quality is likely under the prior, while maintaining its frequentist guarantees. Furthermore, when using heavy-tailed priors, FAB-PPI adaptively reverts to standard PPI in low prior probability regions. We demonstrate the benefits of FAB-PPI in real and synthetic examples.
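For reference, the standard PPI mean estimator that FAB-PPI augments with a prior on prediction quality looks as follows (a minimal sketch on simulated data; the data-generating choices are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Goal: estimate E[Y]. Few gold labels, many model predictions.
n, N = 100, 10_000
y = rng.normal(1.0, 1.0, n)             # labeled outcomes
f_lab = y + rng.normal(0.2, 0.5, n)     # predictions on labeled points
f_unl = rng.normal(1.2, 1.1, N)         # predictions on unlabeled points

# PPI point estimate: prediction mean plus a "rectifier" term that
# corrects the predictions' bias using the labeled pairs.
theta_ppi = f_unl.mean() + (y - f_lab).mean()

# Classical vs PPI standard errors (Gaussian approximation).
se_classical = y.std(ddof=1) / np.sqrt(n)
se_ppi = np.sqrt(f_unl.var(ddof=1) / N + (y - f_lab).var(ddof=1) / n)
print(f"theta_ppi={theta_ppi:.3f}, SE classical={se_classical:.3f}, "
      f"SE PPI={se_ppi:.3f}")
```

FAB-PPI, as described above, places prior knowledge on this rectifier; with heavy-tailed priors it falls back to the plain estimator when the prior looks wrong.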
Spotlight Poster
Emilia Magnani · Marvin Pförtner · Tobias Weber · Philipp Hennig
[ East Exhibition Hall A-B ]
Abstract
Neural operators generalize neural networks to learn mappings between function spaces from data. They are commonly used to learn solution operators of parametric partial differential equations (PDEs) or propagators of time-dependent PDEs. However, to make them useful in high-stakes simulation scenarios, their inherent predictive error must be quantified reliably. We introduce LUNO, a novel framework for approximate Bayesian uncertainty quantification in trained neural operators. Our approach leverages model linearization to push (Gaussian) weight-space uncertainty forward to the neural operator's predictions. We show that this can be interpreted as a probabilistic version of the concept of currying from functional programming, yielding a function-valued (Gaussian) random process belief. Our framework provides a practical yet theoretically sound way to apply existing Bayesian deep learning methods such as the linearized Laplace approximation to neural operators. Just as the underlying neural operator, our approach is resolution-agnostic by design. The method adds minimal prediction overhead, can be applied post-hoc without retraining the network, and scales to large models and datasets. We evaluate these aspects in a case study on Fourier neural operators.
Spotlight Poster
Tian-Zuo Wang · Wen-Bo Du · Zhi-Hua Zhou
[ East Exhibition Hall A-B ]
Abstract
A maximal ancestral graph (MAG) is widely used to characterize the causal relations among observable variables in the presence of latent variables. However, given observational data, only a partial ancestral graph representing a Markov equivalence class (MEC) of MAGs is identifiable, which generally contains uncertain causal relations. Due to the uncertainties, \emph{MAG listing}, \emph{i.e.}, listing all the MAGs in the MEC, is critical for many downstream tasks. In this paper, we present the first \emph{polynomial-delay} MAG listing method, where delay refers to the time for outputting each MAG, through introducing enumerated structural knowledge in the form of \emph{singleton background knowledge (BK)}. To incorporate such knowledge, we propose the \emph{sound} and \emph{locally complete} orientation rules. By recursively introducing singleton BK and applying the rules, our method can output all and only MAGs in the MEC with polynomial delay. Additionally, while the proposed novel rules enable more efficient MAG listing, for the goal of incorporating general BK, we present two counterexamples to show that existing rules, including ours, are not yet \emph{complete}, which motivates two more rules. Experimental results validate the efficiency of the proposed MAG listing method.
Poster
Bin Yang · Xiaojie Wang
[ East Exhibition Hall A-B ]
Abstract
Generating samples from a high dimensional probability distribution is a fundamental task with wide-ranging applications in the area of scientific computing, statistics and machine learning. This article revisits the popular Langevin Monte Carlo (LMC) sampling algorithms and provides a non-asymptotic error analysis in $\mathcal{W}_2$-distance in a non-convex setting. In particular, we prove an error bound $O(\sqrt{d} h)$, which guarantees a mixing time $\tilde{O}(\sqrt{d}\,\epsilon^{-1})$ to achieve the accuracy tolerance $\epsilon$, under certain log-smooth conditions and the assumption that the target distribution satisfies a log-Sobolev inequality, as opposed to the strongly log-concave condition used in (Li et al., 2019; 2022). This bound matches the best one in the strongly log-concave case and improves upon the best-known convergence rates in non-convex settings. To prove it, we establish a new framework of uniform-in-time convergence for discretizations of SDEs. Distinct from (Li et al., 2019; 2022), we start from the finite-time mean-square fundamental convergence theorem, which combined with uniform-in-time moment bounds of LMC and the exponential ergodicity of SDEs in the non-convex setting gives the desired uniform-in-time convergence. Our framework also applies to the case when the gradient of the potential $U$ is non-globally Lipschitz with superlinear growth, for which modified LMC …
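The algorithm under analysis is the plain LMC iteration $x_{k+1} = x_k - h \nabla U(x_k) + \sqrt{2h}\,\xi_k$. A minimal sketch on a toy non-convex double-well potential (the potential and step size are illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U(x):
    # Toy non-convex potential U(x) = x^4/4 - x^2/2 (double well),
    # log-smooth and satisfying a log-Sobolev inequality.
    return x**3 - x

h, n_iter, n_chains = 1e-2, 20_000, 1000
x = rng.normal(size=n_chains)
for _ in range(n_iter):
    x = x - h * grad_U(x) + np.sqrt(2 * h) * rng.normal(size=n_chains)

# x now approximates pi proportional to exp(-U); the result above says
# the W2 discretization bias scales as O(sqrt(d) h).
print("sample mean/variance:", x.mean(), x.var())
```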
Spotlight Poster
Aditya Gorla · Ryan Wang · Zhengtong Liu · Ulzee An · Sriram Sankararaman
[ East Exhibition Hall A-B ]
Abstract
We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features — captured by column names and text descriptions — to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods — an average $R^2$ gain of 7.8\% over the next best method (13.4\%, 6.1\%, and 5.3\% under missing not at random, at random, and completely at random, respectively) — across a diverse range of datasets and missingness conditions. Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
Poster
Bowen Deng · Lele Fu · Jialong Chen · Sheng Huang · Tianchi Liao · Zhang Tao · Chuan Chen
[ East Exhibition Hall A-B ]
Abstract
Generalized Category Discovery (GCD) aims to identify both known and novel categories in unlabeled data by leveraging knowledge from old classes. However, existing methods are limited to non-graph data and lack theoretical foundations to answer *when and how known classes can help GCD*. We introduce the Graph GCD task and provide the first rigorous theoretical analysis of *parametric GCD*. By quantifying the relationship between old and new classes in the embedding space using the Wasserstein distance W, we derive the first provable GCD loss bound based on W. This analysis highlights two necessary conditions for effective GCD. However, we uncover, through a Pairwise Markov Random Field perspective, that popular graph contrastive learning (GCL) methods inherently violate these conditions. To address this limitation, we propose SWIRL, a novel GCL method for GCD. Experimental results validate our (theoretical) findings and demonstrate SWIRL's effectiveness.
Poster
Sagar Shrestha · Xiao Fu
[ East Exhibition Hall A-B ]
Abstract
Diversified distribution matching (DDM) finds a unified translation function mapping a diverse collection of conditional source distributions to their target counterparts. DDM was proposed to resolve content misalignment issues in unpaired domain translation, achieving translation identifiability. However, DDM has only been implemented using GANs due to its constraints on the translation function. GANs are often unstable to train and do not provide the transport trajectory information---yet such trajectories are useful in applications such as single-cell evolution analysis and robot route planning. This work introduces *diversified flow matching* (DFM), an ODE-based framework for DDM. Adapting flow matching (FM) to enforce a unified translation function as in DDM is challenging, as FM learns the translation function's velocity rather than the translation function itself. A custom bilevel optimization-based training loss, a nonlinear interpolant, and a structural reformulation are proposed to address these challenges, offering a tangible implementation. To our knowledge, DFM is the first ODE-based approach guaranteeing translation identifiability. Experiments on synthetic and real-world datasets validate the proposed method.
Poster
Jinyu Cai · Yunhe Zhang · Jicong Fan
[ East Exhibition Hall A-B ]
Abstract
Identifying anomalous graphs is essential in real-world scenarios such as molecular and social network analysis, yet anomalous samples are generally scarce and unavailable. This paper proposes a Self-Discriminative Modeling (SDM) framework that trains a deep neural network only on normal graphs to detect anomalous graphs. The neural network simultaneously learns to construct pseudo-anomalous graphs from normal graphs and learns an anomaly detector to recognize these pseudo-anomalous graphs. As a result, these pseudo-anomalous graphs interpolate between normal graphs and real anomalous graphs, which leads to a reliable decision boundary of anomaly detection. In this framework, we develop three algorithms with different computational efficiencies and stabilities for anomalous graph detection. Extensive experiments on 12 different graph benchmarks demonstrate that the three variants of SDM consistently outperform the state-of-the-art GLAD baselines. The success of our methods stems from the integration of the discriminative classifier and the well-posed pseudo-anomalous graphs, which provides new insights into graph-level anomaly detection.
Poster
Harit Vishwakarma · Yi Chen · Satya Sai Srinath Namburi GNVV · Sui Jiet Tay · Ramya Vinayak · Frederic Sala
[ East Exhibition Hall A-B ]
Abstract
Modern semi-supervised learning (SSL) methods rely on pseudolabeling and consistency regularization. Pseudolabeling is typically performed by comparing the model's confidence scores and a predefined threshold. While several heuristics have been proposed to improve threshold selection, the underlying issues of overconfidence and miscalibration in confidence scores remain largely unaddressed, leading to inaccurate pseudolabels, degraded test accuracy, and prolonged training. We take a first-principles approach to learn confidence scores and thresholds with an explicit knob for error. This flexible framework addresses the fundamental question of optimal scores and threshold selection in pseudolabeling. Moreover, it gives practitioners a principled way to control the quality and quantity of pseudolabels. Such control is vital in SSL, where balancing pseudolabel quality and quantity directly affects model performance and training efficiency. Our experiments show that, by integrating this framework with modern SSL methods, we achieve significant improvements in accuracy and training efficiency. In addition, we provide novel insights on the trade-offs between the choices of the error parameter and the end model's performance.
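One way to realize an explicit error knob is to pick, on held-out data, the most permissive confidence threshold whose selected subset keeps pseudolabel error below a target (an illustrative sketch, not the paper's estimator; the toy calibration model is an assumption):

```python
import numpy as np

def pick_threshold(conf_val, correct_val, eps):
    """Smallest confidence threshold such that the selected subset of
    a held-out set has pseudolabel error <= eps (the explicit knob)."""
    order = np.argsort(-conf_val)                # high confidence first
    err = np.cumsum(~correct_val[order]) / np.arange(1, len(order) + 1)
    ok = np.where(err <= eps)[0]
    if len(ok) == 0:
        return np.inf                            # abstain entirely
    return conf_val[order][ok[-1]]               # largest valid prefix

# Usage: pseudolabel only unlabeled points clearing the threshold.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 5000)
correct = rng.uniform(size=5000) < conf          # toy: calibrated scores
tau = pick_threshold(conf, correct, eps=0.05)
print("threshold:", tau, "coverage:", (conf >= tau).mean())
```

The trade-off named in the abstract is visible here: tightening eps raises the threshold, improving pseudolabel quality while reducing quantity.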
Poster
Vincent P. Grande · Michael Schaub
[ East Exhibition Hall A-B ]
Abstract
Topological Data Analysis (TDA) allows us to extract powerful topological and higher-order information on the global shape of a data set or point cloud. Tools like Persistent Homology give a single complex description of the global structure of the point cloud. However, common machine learning applications like classification require point-level information and features. In this paper, we bridge this gap and propose a novel method to extract node-level topological features from complex point clouds using discrete variants of concepts from algebraic topology and differential geometry. We verify the effectiveness of these topological point features (TOPF) on both synthetic and real-world data and study their robustness under noise and heterogeneous sampling.
Poster
Hanmo Liu · Shimin Di · Haoyang LI · Xun Jian · Yue Wang · Lei Chen
[ East Exhibition Hall A-B ]
Abstract
Node classification is a key task in temporal graph learning (TGL). Real-life temporal graphs often introduce new node classes over time, but existing TGL methods assume a fixed set of classes. This assumption brings limitations, as updating models with full data is costly, while focusing only on new classes results in forgetting old ones. Graph continual learning (GCL) methods mitigate forgetting using old-class subsets but fail to account for their evolution. We define this novel problem as temporal graph continual learning (TGCL), which focuses on efficiently maintaining up-to-date knowledge of old classes. To tackle TGCL, we propose a selective learning framework that substitutes the old-class data with its subsets, Learning Towards the Future (LTF). We derive an upper bound on the error caused by such replacement and transform it into objectives for selecting and learning subsets that minimize classification error while preserving the distribution of the full old-class data. Experiments on three real-world datasets show that LTF effectively addresses the TGCL challenge.
Poster
Atsutoshi Kumagai · Tomoharu Iwata · Hiroshi Takahashi · Taishi Nishiyama · Kazuki Adachi · Yasuhiro Fujiwara
[ East Exhibition Hall A-B ]
Abstract
Maximizing the area under the receiver operating characteristic curve (AUC) is a standard approach to imbalanced binary classification tasks. Existing AUC maximization methods typically assume that training and test distributions are identical. However, this assumption is often violated due to {\it a covariate shift}, where the input distribution can vary but the conditional distribution of the class label given the input remains unchanged. Importance weighting is a common approach to the covariate shift: it minimizes the test risk with importance-weighted training data. However, it cannot maximize the AUC. In this paper, to achieve this, we theoretically derive two estimators of the test AUC risk under the covariate shift by using positive and unlabeled (PU) data in the training distribution and unlabeled data in the test distribution. Our first estimator is calculated from importance-weighted PU data in the training distribution, and the second one is calculated from importance-weighted positive data in the training distribution and unlabeled data in the test distribution. We train classifiers by minimizing a weighted sum of the two AUC risk estimators that approximates the test AUC risk. Unlike existing importance weighting methods, our method does not require negative labels and class-priors. We show the effectiveness of …
Poster
Daniel Marczak · Simone Magistri · Sebastian Cygert · Bartłomiej Twardowski · Andrew Bagdanov · Joost van de Weijer
[ East Exhibition Hall A-B ]
Abstract
Model merging integrates the weights of multiple task-specific models into a single multi-task model. Despite recent interest in the problem, a significant performance gap between the combined and single-task models remains. In this paper, we investigate the key characteristics of task matrices -- weight update matrices applied to a pre-trained model -- that enable effective merging. We show that alignment between singular components of task-specific and merged matrices strongly correlates with performance improvement over the pre-trained model. Based on this, we propose an isotropic merging framework that flattens the singular value spectrum of task matrices, enhances alignment, and reduces the performance gap. Additionally, we incorporate both common and task-specific subspaces to further improve alignment and performance. Our proposed approach achieves state-of-the-art performance on vision and language tasks across various sets of tasks and model scales. This work advances the understanding of model merging dynamics, offering an effective methodology to merge models without requiring additional training.
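The core operation, flattening the singular value spectrum of the summed task matrices, is compact (a per-layer sketch; the uniform scale chosen here, the mean singular value, is an assumption rather than the paper's exact normalization):

```python
import numpy as np

def isotropic_merge(task_deltas):
    """Merge per-task weight updates (each d_out x d_in) by summing
    them, then flattening the singular value spectrum of the sum so
    every retained direction contributes equally."""
    M = np.sum(task_deltas, axis=0)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_iso = np.full_like(s, s.mean())    # flatten the spectrum
    return U @ np.diag(s_iso) @ Vt

rng = np.random.default_rng(0)
deltas = np.stack([0.01 * rng.normal(size=(64, 64)) for _ in range(4)])
merged = isotropic_merge(deltas)
# merged is then added back to the pre-trained weights: W = W_pre + merged
print(np.linalg.svd(merged, compute_uv=False)[:5])  # ~equal values
```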
Poster
Chen Huang · Skyler Seto · Hadi Pouransari · Mehrdad Farajtabar · Raviteja Vemulapalli · Fartash Faghri · Oncel Tuzel · Barry-John Theobald · Joshua M Susskind
[ East Exhibition Hall A-B ]
Abstract
Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue of *concept forgetting* on other tasks. Recent methods of robust fine-tuning aim to mitigate forgetting of prior knowledge without affecting the fine-tuning performance. Knowledge is often preserved by matching the original and fine-tuned model weights or feature pairs. However, such point-wise matching can be too strong, without explicit awareness of the feature neighborhood structures that encode rich knowledge as well. We propose a novel regularization method **Proxy-FDA** that explicitly preserves the structural knowledge in feature space. Proxy-FDA performs Feature Distribution Alignment (using nearest neighbor graphs) between the pre-trained and fine-tuned feature spaces, and the alignment is further improved by informative proxies that are generated dynamically to increase data diversity. Experiments show that Proxy-FDA significantly reduces concept forgetting during fine-tuning, and we find a strong correlation between forgetting and a distributional distance metric (in comparison to L2 distance). We further demonstrate Proxy-FDA's benefits in various fine-tuning settings (end-to-end, few-shot and continual tuning) and across different tasks like image classification, captioning and VQA.
Poster
İlker Işık · Ramazan Gokberk Cinbis · Ebru Gol
[ East Exhibition Hall A-B ]
Abstract
Language models lack the notion of interchangeable tokens: symbols that are semantically equivalent yet distinct, such as bound variables in formal logic. This limitation prevents generalization to larger vocabularies and hinders the model's ability to recognize alpha-equivalence, where renaming bound variables preserves meaning. We formalize this machine learning problem and introduce alpha-covariance, a metric for evaluating robustness to such transformations. To tackle this task, we propose a dual-part token embedding strategy: a shared component ensures semantic consistency, while a randomized component maintains token distinguishability. Compared to a baseline that relies on alpha-renaming for data augmentation, our approach demonstrates improved generalization to unseen tokens in linear temporal logic solving, propositional logic assignment prediction, and copying with an extendable vocabulary, while introducing a favorable inductive bias for alpha-equivalence. Our findings establish a foundation for designing language models that can learn interchangeable token representations, a crucial step toward more flexible and systematic reasoning in formal domains. Our code and project page are available at https://necrashter.github.io/interchangeable-token-embeddings
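A minimal sketch of the dual-part embedding idea (the dimensions, resampling granularity, and module shape are assumptions, not the paper's exact design):

```python
import torch

class DualPartEmbedding(torch.nn.Module):
    """Embedding for interchangeable symbols: a learned shared part
    (the semantic role, e.g. 'bound variable') plus a random part
    resampled per sequence, so symbols stay distinguishable without
    tying meaning to any particular identity."""
    def __init__(self, d_shared=32, d_random=32):
        super().__init__()
        self.shared = torch.nn.Parameter(torch.randn(d_shared))
        self.d_random = d_random

    def forward(self, var_ids):
        # var_ids: (seq_len,) integer ids of variables in one sequence.
        n_vars = int(var_ids.max()) + 1
        rand_part = torch.randn(n_vars, self.d_random)  # fresh each call
        shared_part = self.shared.expand(len(var_ids), -1)
        return torch.cat([shared_part, rand_part[var_ids]], dim=-1)

emb = DualPartEmbedding()
tokens = torch.tensor([0, 1, 0, 2])      # e.g. variables x, y, x, z
print(emb(tokens).shape)                  # (4, 64)
```

Because the random part is resampled, any consistent renaming of the variables yields an equally likely input, which is exactly the alpha-covariance property the abstract measures.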
Poster
Ben Dai
[ East Exhibition Hall A-B ]
Abstract
Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the convexity and calibration (CC) properties of a loss function ensure consistency of ERM in maximizing accuracy, thereby offering a wide range of options for surrogate losses. In this article, we propose a novel ensemble method, namely *EnsLoss*, which extends the ensemble learning concept to combine loss functions within the ERM framework. A key feature of our method is the consideration on preserving the "legitimacy" of the combined losses, i.e., ensuring the CC properties. Specifically, we first transform the CC conditions of losses into loss-derivatives, thereby bypassing the need for explicit loss functions and directly generating calibrated loss-derivatives. Therefore, inspired by Dropout, *EnsLoss* enables loss ensembles through one training process with doubly stochastic gradient descent (i.e., random batch samples and random calibrated loss-derivatives). We theoretically establish the statistical consistency of our approach and provide insights into its benefits. The numerical effectiveness of *EnsLoss* compared to fixed loss methods is demonstrated through experiments on a broad range of 45 pairs of CIFAR10 datasets, the PCam image dataset, and 14 OpenML tabular datasets and with various deep learning architectures. Python repository and source code …
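The doubly stochastic step can be sketched as follows (illustrative: the derivative family and the detached-derivative surrogate trick are one plausible realization, not the released implementation):

```python
import torch

def sample_loss_derivative(margin):
    """Sample one calibrated loss-derivative phi'(margin) per batch
    (illustrative family: logistic, exponential, squared hinge).
    Each is <= 0 and nondecreasing, mimicking the convexity and
    calibration (CC) conditions stated in derivative form."""
    choice = int(torch.randint(3, (1,)))
    if choice == 0:
        return -torch.sigmoid(-margin)                # logistic
    if choice == 1:
        return -torch.exp(-margin.clamp(min=-10.0))   # exponential
    return -2.0 * torch.clamp(1.0 - margin, min=0.0)  # squared hinge

model = torch.nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
X = torch.randn(256, 20)
y = torch.where(X[:, 0] > 0, torch.tensor(1.0), torch.tensor(-1.0))
for _ in range(200):
    idx = torch.randint(len(X), (32,))                # random batch ...
    margin = y[idx] * model(X[idx]).squeeze(-1)
    d = sample_loss_derivative(margin).detach()       # ... random derivative
    # Surrogate whose gradient w.r.t. the margin equals the sampled
    # derivative d, since d is detached: d/dm (m * d) = d.
    loss = (margin * d).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("train acc:", ((model(X).squeeze(-1) * y) > 0).float().mean().item())
```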
Poster
Bing Liu · wenjun Miao · Boyu Zhang · Qiankun Zhang · Bin Yuan · Wang · Shenghao Liu · Xianjun Deng
[ East Exhibition Hall A-B ]
Abstract
Network quantization, one of the most widely studied model compression methods, effectively quantizes a floating-point model to obtain a fixed-point one with negligible accuracy loss. Although great success was achieved in reducing the model size, it may exacerbate the unfairness in model accuracy across different groups of datasets. This paper considers two widely used algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), with an attempt to understand how they cause this critical issue. Theoretical analysis with empirical verifications reveals two responsible factors, as well as how they influence a metric of fairness in depth. A comparison between PTQ and QAT is then made, explaining an observation that QAT behaves even worse than PTQ in fairness, although it often preserves a higher accuracy at lower bit-widths in quantization. Finally, the paper finds that several simple data augmentation methods can be adopted to alleviate the disparate impacts of quantization, based on a further observation that class imbalance produces distinct values of the aforementioned factors among different attribute classes. We experiment on either imbalanced (UTK-Face and FER2013) or balanced (CIFAR-10 and MNIST) datasets using ResNet and VGG models for empirical evaluation.
Poster
Yuchen Guan · Runxi Cheng · Kang Liu · Chun Yuan
[ East Exhibition Hall A-B ]
Abstract
Knowledge distillation typically minimizes the Kullback–Leibler (KL) divergence between teacher and student logits. However, optimizing the KL divergence can be challenging for the student and often leads to sub-optimal solutions. We further show that gradients induced by KL divergence scale with the magnitude of the teacher logits, thereby diminishing updates on low-probability channels. This imbalance weakens the transfer of inter-class information and in turn limits the performance improvements achievable by the student. To mitigate this issue, we propose a plug-and-play auxiliary ranking loss based on Kendall’s $\tau$ coefficient that can be seamlessly integrated into any logit-based distillation framework. It supplies inter-class relational information while rebalancing gradients toward low-probability channels. We demonstrate that the proposed ranking loss is largely invariant to channel scaling and optimizes an objective aligned with that of KL divergence, making it a natural complement rather than a replacement. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that our plug-and-play ranking loss consistently boosts the performance of multiple distillation baselines.
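One differentiable surrogate of Kendall's $\tau$ that exhibits the stated scale-invariance (a sketch; the paper's exact loss may differ):

```python
import torch

def soft_kendall_tau_loss(student_logits, teacher_logits, tau=1.0):
    """Differentiable surrogate of Kendall's tau rank correlation:
    for every class pair (i, j), reward the student for matching the
    sign of the teacher's logit difference. The teacher enters only
    through signs, so the target is invariant to channel scaling."""
    s = student_logits.unsqueeze(-1) - student_logits.unsqueeze(-2)  # (B,C,C)
    t = teacher_logits.unsqueeze(-1) - teacher_logits.unsqueeze(-2)
    concord = torch.tanh(s / tau) * torch.sign(t)    # in [-1, 1]
    C = student_logits.shape[-1]
    mask = 1.0 - torch.eye(C, device=s.device)
    # 1 - mean pairwise concordance: 0 when the orders fully agree.
    return 1.0 - (concord * mask).sum(dim=(-1, -2)) / mask.sum()

student = torch.randn(8, 100, requires_grad=True)
teacher = torch.randn(8, 100)
loss = soft_kendall_tau_loss(student, teacher).mean()  # add to KL term
loss.backward()
print(float(loss))
```

Unlike the KL gradient, the pairwise terms here weight low-probability channels on the same footing as dominant ones, which is the rebalancing effect described above.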
Poster
Anqi Mao · Mehryar Mohri · Yutao Zhong
[ East Exhibition Hall A-B ]
Abstract
In applications with significant class imbalance or asymmetric costs, metrics such as the $F_\beta$-measure, AM measure, Jaccard similarity coefficient, and weighted accuracy offer more suitable evaluation criteria than the standard binary classification loss. However, optimizing these metrics presents significant computational and statistical challenges. Existing approaches often rely on the characterization of the Bayes-optimal classifier and use threshold-based methods that first estimate class probabilities and then seek an optimal threshold. This leads to algorithms that are not tailored to restricted hypothesis sets and lack finite-sample performance guarantees. In this work, we introduce principled algorithms for optimizing generalized metrics, supported by $H$-consistency and finite-sample generalization bounds. Our approach reformulates metric optimization as a generalized cost-sensitive learning problem, enabling the design of novel surrogate loss functions with provable $H$-consistency guarantees. Leveraging this framework, we develop new algorithms, METRO (*Metric Optimization*), with strong theoretical performance guarantees. We report the results of experiments demonstrating the effectiveness of our methods compared to prior baselines.
Poster
Yang Li · Jiale Ma · Yebin Yang · Qitian Wu · Hongyuan Zha · Junchi Yan
[ East Exhibition Hall A-B ]
Abstract
Predicting labels directly from data has been the standard in label learning tasks, e.g., supervised learning, where models often prioritize feature compression and extraction from inputs under the assumption that label information is less complex. However, recent prediction tasks often involve complex labels, exacerbating the challenge of learning mappings from learned features to high-fidelity label representations. To this end, we draw inspiration from the consistency training concept in generative consistency models and propose predictive consistency learning (PCL), a novel learning paradigm that decomposes the full label information into a progressive learning procedure, mitigating the label capture challenge. Besides the data input, PCL receives noise-perturbed labels as an additional reference, pursuing predictive consistency across different noise levels. It simultaneously learns the relationship between latent features and a spectrum of label information, which enables progressive learning for complex predictions and allows multi-step inference analogous to gradual denoising, thereby enhancing prediction quality. Experiments on vision, text, and graph tasks show the superiority of PCL over conventional supervised training in complex label prediction tasks.
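A schematic reading of the training step described above, with a hypothetical model signature `model(x, noisy_label, noise_level)` and MSE terms standing in for the paper's actual objectives (this is our interpretation of the abstract, not the authors' code):

```python
import torch
import torch.nn.functional as F

def pcl_step(model, x, y_onehot):
    # Perturb the label at two random noise levels; the model sees the data
    # plus a noisy label reference and must be consistent across levels.
    t_hi, t_lo = sorted(torch.rand(2).tolist(), reverse=True)
    y_hi = y_onehot + t_hi * torch.randn_like(y_onehot)
    y_lo = y_onehot + t_lo * torch.randn_like(y_onehot)
    pred_hi = model(x, y_hi, t_hi)
    pred_lo = model(x, y_lo, t_lo)
    # Predictive consistency across noise levels + fidelity to the clean label.
    return F.mse_loss(pred_hi, pred_lo.detach()) + F.mse_loss(pred_lo, y_onehot)
```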
Poster
Yeseul Cho · Baekrok Shin · Changmin Kang · Chulhee Yun
[ East Exhibition Hall A-B ]
Abstract
Recent advances in deep learning rely heavily on massive datasets, leading to substantial storage and training costs. Dataset pruning aims to alleviate this demand by discarding redundant examples. However, many existing methods require training a model on the full dataset for a large number of epochs before they can prune it, which ironically makes the pruning process more expensive than simply training the model on the entire dataset. To overcome this limitation, we introduce the **Difficulty and Uncertainty-Aware Lightweight (DUAL)** score, which identifies important samples from the early training stage by considering both example difficulty and prediction uncertainty. To address the catastrophic accuracy drop at extreme pruning ratios, we further propose pruning-ratio-adaptive sampling using a Beta distribution. Experiments on various datasets and learning scenarios, such as image classification with label noise and image corruption, and model architecture generalization, demonstrate the superiority of our method over previous state-of-the-art (SOTA) approaches. Specifically, on ImageNet-1k, our method reduces the time cost for pruning to 66% of that of previous methods while achieving a SOTA 60% test accuracy at a 90% pruning ratio. On CIFAR datasets, the time cost is reduced to just 15% while maintaining SOTA performance.
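A hypothetical rendering of the two ingredients, with illustrative Beta parameters (the paper adapts them to the pruning ratio, and its score definition may differ in detail):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def dual_score(prob_history):
    # prob_history: (n_checkpoints, n_samples) probabilities of the true class
    # recorded over early training. Difficulty ~ low mean probability;
    # uncertainty ~ fluctuation of the prediction across checkpoints.
    difficulty = 1.0 - prob_history.mean(axis=0)
    uncertainty = prob_history.std(axis=0)
    return difficulty * uncertainty

def ratio_adaptive_subset(scores, keep_ratio, a=2.0, b=2.0, seed=0):
    # Rather than keeping only the top-scoring examples (which collapses at
    # extreme pruning ratios), sample by rank with Beta-distributed weights.
    rng = np.random.default_rng(seed)
    n, k = len(scores), int(keep_ratio * len(scores))
    ranks = (np.argsort(np.argsort(-scores)) + 0.5) / n   # hardest -> rank near 0
    w = beta_dist.pdf(ranks, a, b)
    return rng.choice(n, size=k, replace=False, p=w / w.sum())
```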
Poster
Wenju Sun · Qingyong Li · Yangliao Geng · Boyang Li
[ East Exhibition Hall A-B ]
Abstract
Multi-task model merging offers a promising paradigm for integrating multiple expert models into a unified system without additional training. Existing state-of-the-art techniques, such as Task Arithmetic and its variants, merge models by accumulating task vectors—defined as the parameter differences between pre-trained and fine-tuned models. However, task vector accumulation is often hindered by knowledge conflicts, where conflicting components across different task vectors can lead to performance degradation during the merging process. To address this challenge, we propose **Conflict-Aware Task Merging (CAT Merging)**, a novel training-free framework that selectively trims conflict-prone components from the task vectors. CAT Merging introduces several parameter-specific strategies, including projection for linear weights and masking for scaling and shifting parameters in normalization layers. Extensive experiments on vision and vision-language tasks demonstrate that CAT Merging effectively suppresses knowledge conflicts, achieving average accuracy improvements of up to 4.7% (ViT-B/32) and 2.0% (ViT-L/14) over state-of-the-art methods.
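The projection strategy for linear weights can be sketched as follows (an illustrative nearest neighbour of the idea: remove from each task vector its negatively aligned component along the others; the paper's criterion for identifying conflict-prone components is more refined, and normalization-layer parameters are handled by masking instead):

```python
import torch

def trim_conflicts(task_vectors, eps=1e-8):
    # task_vectors: list of flattened linear-weight deltas (fine-tuned minus pre-trained).
    trimmed = []
    for k, tau_k in enumerate(task_vectors):
        tau = tau_k.clone()
        for j, tau_j in enumerate(task_vectors):
            coeff = (tau @ tau_j) / (tau_j @ tau_j + eps)
            if j != k and coeff < 0:        # destructive interference
                tau = tau - coeff * tau_j   # after this, tau is orthogonal to tau_j
            # note: with several conflicting vectors this is only approximate
        trimmed.append(tau)
    return trimmed

# Task Arithmetic-style merge with trimmed vectors (scaling `lam` hypothetical):
# theta_merged = theta_pre + lam * sum(trim_conflicts(task_vectors))
```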
Poster
Ming Zhang · Qixin Zhang · Xiao Luo · Junyu Luo · Wei Ju · Zhiping Xiao · Ming Zhang
[ East Exhibition Hall A-B ]
Abstract
Test-time adaptation aims to adapt a well-trained model using test data only, without accessing training data. It is a crucial topic in machine learning, enabling a wide range of applications in the real world, especially when it comes to data privacy. While existing works on test-time adaptation primarily focus on Euclidean data, research on non-Euclidean graph data remains scarce. Prevalent graph neural network methods could encounter serious performance degradation in the face of test-time domain shifts. In this work, we propose a novel method named Adaptive Subgraph-based Selection and Regularized Prototype Supervision (ASSESS) for reliable test-time adaptation on graphs. Specifically, to achieve flexible selection of reliable test graphs, ASSESS adopts an adaptive selection strategy based on fine-grained individual-level subgraph mutual information. Moreover, to utilize the information from both training and test graphs, ASSESS constructs semantic prototypes from the well-trained model as prior knowledge from the unknown training graphs and optimizes the posterior given the unlabeled test graphs. We also provide a theoretical analysis of the proposed algorithm. Extensive experiments verify the effectiveness of ASSESS against various baselines.
Poster
Andong Wang · Yuning Qiu · Zhong Jin · Guoxu Zhou · Qibin Zhao
[ East Exhibition Hall A-B ]
Abstract
Tensor regression is a powerful tool for analyzing complex multi-dimensional data in fields such as neuroimaging and spatiotemporal analysis, but its effectiveness is often hindered by insufficient sample sizes. To overcome this limitation, we adopt a transfer learning strategy that leverages knowledge from related source tasks to improve performance in data-scarce target tasks. This approach, however, introduces additional challenges including model shifts, covariate shifts, and decentralized data management. We propose the Low-Rank Tensor Transitions (LoRT) framework, which incorporates a novel fusion regularizer and a two-step refinement to enable robust adaptation while preserving low-tubal-rank structure. To support decentralized scenarios, we extend LoRT to D-LoRT, a distributed variant that maintains statistical efficiency with minimal communication overhead. Theoretical analysis and experiments on tensor regression tasks, including compressed sensing and completion, validate the robustness and versatility of the proposed methods. These findings indicate the potential of LoRT as a robust method for tensor regression in settings with limited data and complex distributional structures.
Poster
Fabio Feser · Marina Evangelou
[ East Exhibition Hall A-B ]
Abstract
The sparse-group lasso performs both variable and group selection, simultaneously drawing on the strengths of the lasso and group lasso. Its sparse-group penalty, which exploits grouping information, has found widespread use in genetics, a field that regularly involves the analysis of high-dimensional data. However, the sparse-group lasso can be computationally expensive due to the added shrinkage complexity and the additional hyperparameter that requires tuning. This paper presents a novel feature reduction method, Dual Feature Reduction (DFR), that uses strong screening rules for the sparse-group lasso and the adaptive sparse-group lasso to reduce their input space before optimization, without affecting solution optimality. DFR applies two layers of screening through the application of dual norms and subdifferentials. Through synthetic and real data studies, it is shown that DFR drastically reduces the computational cost under many different scenarios.
Poster
Gabriele Visentin · Patrick Cheridito
[ East Exhibition Hall A-B ]
Abstract
We present a novel method for efficiently computing optimal transport maps and Wasserstein barycenters in high-dimensional spaces. Our approach uses conditional normalizing flows to approximate the input distributions as invertible pushforward transformations from a common latent space. This makes it possible to directly solve the primal problem using gradient-based minimization of the transport cost, unlike previous methods that rely on dual formulations and complex adversarial optimization. We show how this approach can be extended to compute Wasserstein barycenters by solving a conditional variance minimization problem. A key advantage of our conditional architecture is that it enables the computation of barycenters for hundreds of input distributions, which was computationally infeasible with previous methods. Our numerical experiments illustrate that our approach yields accurate results across various high-dimensional tasks and compares favorably with previous state-of-the-art methods.
Poster
Maricela Best Mckay · Avleen Kaur · Chen Greif · Brian Wetton
[ East Exhibition Hall A-B ]
Abstract
Natural gradient methods for PINNs have achieved state-of-the-art performance, with errors several orders of magnitude smaller than those achieved by standard optimizers such as ADAM or L-BFGS. However, computing natural gradients for PINNs is computationally prohibitive and memory-intensive for all but small neural network architectures. We develop a randomized algorithm for natural gradient descent for PINNs that uses sketching to approximate the natural gradient descent direction. We prove that the change-of-coordinates Gram matrix used in a natural gradient descent update has rapidly decaying eigenvalues for a one-layer, one-dimensional neural network, and empirically demonstrate that this structure holds for four different example problems. Under this structure, our sketching algorithm is guaranteed to provide a near-optimal low-rank approximation of the Gramian. Our algorithm dramatically speeds up computation and reduces memory overhead. Additionally, in our experiments, the sketched natural gradient outperforms the original natural gradient in terms of accuracy, often achieving an error that is an order of magnitude smaller. Training time for a network with around 5,000 parameters is reduced from several hours to under two minutes. Training can be practically scaled to large network sizes; we optimize a PINN for a network with over a million parameters within …
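A minimal sketch of the core linear-algebra step, assuming a Gaussian sketch and a Woodbury solve (the paper's algorithm exploits the decaying spectrum with approximation guarantees; the names and damping parameter here are ours):

```python
import numpy as np

def sketched_natural_gradient(J, g, sketch_dim=128, damping=1e-6, rng=None):
    # J: (m, p) Jacobian of residuals w.r.t. parameters; g: (p,) loss gradient.
    # The exact direction solves (J^T J + damping*I) d = g; we approximate J
    # by a random sketch, which is near-optimal when the Gramian spectrum
    # decays rapidly, and never form a p x p matrix.
    rng = rng or np.random.default_rng(0)
    m, p = J.shape
    S = rng.standard_normal((sketch_dim, m)) / np.sqrt(sketch_dim)
    Js = S @ J                                     # (sketch_dim, p) sketched Jacobian
    # Woodbury identity on (damping*I + Js^T Js)^{-1} g:
    K = Js @ Js.T + damping * np.eye(sketch_dim)   # small (s, s) system
    return (g - Js.T @ np.linalg.solve(K, Js @ g)) / damping
```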
Poster
Jinbin Zhang · Nasib Ullah · Erik Schultheis · Rohit Babbar
[ East Exhibition Hall A-B ]
Abstract
Large output spaces, also referred to as extreme multi-label classification (XMC), arise in settings such as large-scale tagging and product-to-product recommendation, and are characterized by label counts ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, becomes the main driver of compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we show can be unstable as well as inefficient in terms of memory usage and computational overhead. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose ELMO, a pure low-precision training framework for XMC models using BFloat16 and Float8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in Float8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations, gradient fusion and chunking, enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method Renee, without compromising accuracy.
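Kahan summation, the trick that makes master-weight-free low-precision training viable, compensates each rounded addition with a stored error term. A minimal PyTorch sketch (our own, with a plain SGD update standing in for the paper's optimizer):

```python
import torch

@torch.no_grad()
def kahan_sgd_step(param, comp, grad, lr):
    # param and comp are low-precision (e.g., bfloat16) tensors of equal shape;
    # comp accumulates the rounding error lost by each low-precision addition.
    update = (-lr * grad).to(param.dtype)
    y = update - comp              # re-inject previously lost bits
    t = param + y                  # this addition rounds in low precision
    comp.copy_((t - param) - y)    # newly lost bits
    param.copy_(t)

# Usage sketch:
# w = torch.zeros(1000, dtype=torch.bfloat16); c = torch.zeros_like(w)
# kahan_sgd_step(w, c, grad=torch.randn(1000, dtype=torch.bfloat16), lr=1e-3)
```

The compensation buffer costs one extra low-precision tensor per parameter, still far cheaper than keeping FP32 master weights.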
Poster
Zhaolu Liu · Robert Peach · Mauricio Barahona
[ East Exhibition Hall A-B ]
Abstract
Kernel-based hypothesis tests offer a flexible, non-parametric tool to detect high-order interactions in multivariate data, beyond pairwise relationships. Yet the scalability of such tests is limited by the computationally demanding permutation schemes used to generate null approximations. Here we introduce a family of permutation-free high-order tests for joint independence and partial factorisations of $d$ variables. Our tests eliminate the need for permutation-based approximations by leveraging V-statistics and a novel cross-centring technique to yield test statistics with a standard normal limiting distribution under the null. We present implementations of the tests and showcase their efficacy and scalability through synthetic datasets. We also show applications inspired by causal discovery and feature selection, which highlight both the importance of high-order interactions in data and the need for efficient computational methods.
Poster
Hengrui Luo · Jeremy E. Purvis · Didong Li
[ East Exhibition Hall A-B ]
Abstract
Modern datasets often exhibit high dimensionality, yet the data reside in low-dimensional manifolds that can reveal underlying geometric structures critical for data analysis. A prime example of such a dataset is a collection of cell cycle measurements, where the inherently cyclical nature of the process can be represented as a circle or sphere. Motivated by the need to analyze these types of datasets, we propose a nonlinear dimension reduction method, Spherical Rotation Component Analysis (SRCA), that incorporates geometric information to better approximate low-dimensional manifolds. SRCA is a versatile method designed to work in both high-dimensional and small-sample-size settings. By employing spheres or ellipsoids, SRCA provides a low-rank spherical representation of the data with general theoretical guarantees, effectively retaining the geometric structure of the dataset during dimensionality reduction. A comprehensive simulation study, along with a successful application to human cell cycle data, further highlights the advantages of SRCA compared to state-of-the-art alternatives, demonstrating its superior performance in approximating the manifold while preserving inherent geometric structures.
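The geometric subroutine at the heart of such a method, fitting a sphere by least squares and projecting data onto it, can be sketched as follows (a minimal illustration of the sphere-fitting step only, not the full SRCA procedure):

```python
import numpy as np

def fit_sphere(X):
    # Least-squares sphere fit: ||x - c||^2 = R^2 rearranges to the linear
    # system ||x||^2 = 2 x^T c + t with t = R^2 - ||c||^2, solvable by lstsq.
    A = np.hstack([2 * X, np.ones((len(X), 1))])
    b = (X ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    c, t = sol[:-1], sol[-1]
    return c, np.sqrt(t + c @ c)           # center, radius

def project_to_sphere(X, c, R):
    d = X - c
    return c + R * d / np.linalg.norm(d, axis=1, keepdims=True)
```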
Spotlight Poster
Neil Mallinar · Daniel Beaglehole · Libin Zhu · Adityanarayanan Radhakrishnan · Parthe Pandit · Misha Belkin
[ East Exhibition Hall A-B ]
Abstract
Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where test accuracy starts improving long after the model achieves 100% training accuracy. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near-zero test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM …
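The AGOP at the center of RFM is straightforward to compute with autograd; a minimal sketch for a scalar-output model (our own rendering; RFM additionally alternates this with kernel regression using the learned feature matrix):

```python
import torch

def agop(model_fn, X):
    # Average Gradient Outer Product  M = (1/n) * sum_i grad f(x_i) grad f(x_i)^T.
    # For a scalar-output model_fn, summing the outputs lets one autograd call
    # return all per-sample input gradients at once.
    X = X.clone().requires_grad_(True)
    grads = torch.autograd.grad(model_fn(X).sum(), X)[0]   # (n, d)
    return grads.T @ grads / len(X)

# Schematic RFM iteration: fit a kernel machine with the Mahalanobis-type kernel
# k_M(x, x') = k((x - x')^T M (x - x')), recompute M = agop(f, X), and repeat.
```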
Poster
Jingyao Wang · Siyu Zhao · Wenwen Qiang · Jiangmeng Li · Changwen Zheng · Fuchun Sun · Hui Xiong
[ East Exhibition Hall A-B ]
Abstract
Multi-Modal Learning (MML) aims to learn effective representations across modalities for accurate predictions. Existing methods typically focus on modality consistency and specificity to learn effective representations. However, from a causal perspective, they may lead to representations that contain insufficient and unnecessary information. To address this, we propose that effective MML representations should be causally sufficient and necessary. Considering practical issues like spurious correlations and modality conflicts, we relax the exogeneity and monotonicity assumptions prevalent in prior works and explore a concept specific to MML, the Causal Complete Cause ($C^3$). We begin by defining $C^3$, which quantifies the probability of representations being causally sufficient and necessary. We then discuss the identifiability of $C^3$ and introduce an instrumental variable to support identifying $C^3$ under non-exogeneity and non-monotonicity. Building on this, we construct a $C^3$ measurement, the $C^3$ risk, and propose a twin network to estimate it through (i) a real-world branch, which utilizes the instrumental variable for sufficiency, and (ii) a hypothetical-world branch, which applies gradient-based counterfactual modeling for necessity. Theoretical analyses confirm its reliability. Based on these results, we propose $C^3$ Regularization, a plug-and-play method that enforces the causal completeness of the learned representations by minimizing the $C^3$ risk. Extensive experiments demonstrate its effectiveness.
Poster
Ruitao Pu · Yang Qin · Xiaomin Song · Dezhong Peng · Zhenwen Ren · Yuan Sun
[ East Exhibition Hall A-B ]
Abstract
Recently, numerous cross-modal hashing (CMH) methods have been proposed, yielding remarkable progress. As a static learning paradigm, existing CMH methods often implicitly assume that all modalities are prepared before processing. However, in practical applications (such as multi-modal medical diagnosis), it is very challenging to collect paired multi-modal data simultaneously. Specifically, the data are collected chronologically, forming streaming-media data (SMA). To handle such data, all previous CMH methods require retraining on data from all modalities, which inevitably limits the scalability and flexibility of the model. In this paper, we propose a novel CMH paradigm named Streaming-media Hashing rEtrieval (SHE) that enables parallel training on each modality. Specifically, we first propose a knowledge library mining module (KLM) that extracts a prototype knowledge library for each modality, thereby revealing the commonality distribution of the instances from each modality. Then, we propose a knowledge library transfer module (KLT) that updates and aligns the new knowledge by utilizing the historical knowledge library, ensuring semantic consistency. Finally, to enhance intra-class semantic relevance and inter-class semantic disparity, we develop a discriminative hashing learning module (DHL). Comprehensive experiments on four benchmark datasets demonstrate the superiority of our SHE compared to 14 competitors.
Poster
Arnas Uselis · Andrea Dittadi · Seong Joon Oh
[ East Exhibition Hall A-B ]
Abstract
Compositional understanding is crucial for human intelligence, yet it remains unclear whether contemporary vision models exhibit it. The dominant machine learning paradigm is built on the premise that scaling data and model sizes will improve out-of-distribution performance, including compositional generalization. We test this premise through controlled experiments that systematically vary data scale, concept diversity, and combination coverage. We find that compositional generalization is driven by data diversity, not mere data scale. Increased combinatorial coverage forces models to discover a linearly factored representational structure, where concepts decompose into additive components. We prove this structure is key to efficiency, enabling perfect generalization from few observed combinations. Evaluating pretrained models (DINO, CLIP), we find above-random yet imperfect performance, suggesting partial presence of this structure. Our work motivates stronger emphasis on constructing diverse datasets for compositional generalization, and considering the importance of representational structure that enables efficient compositional learning.
Poster
Yegon Kim · Hyunsu Kim · Gyeonghoon Ko · Juho Lee
[ East Exhibition Hall A-B ]
Abstract
Accurately solving partial differential equations (PDEs) is critical to understanding complex scientific and engineering phenomena, yet traditional numerical solvers are computationally expensive. Surrogate models offer a more efficient alternative, but their development is hindered by the cost of generating sufficient training data from numerical solvers. In this paper, we present a novel framework for active learning (AL) in PDE surrogate modeling that reduces this cost. Unlike the existing AL methods for PDEs that always acquire entire PDE trajectories, our approach strategically generates only the most important time steps with the numerical solver, while employing the surrogate model to approximate the remaining steps. This dramatically reduces the cost incurred by each trajectory and thus allows the active learning algorithm to try out a more diverse set of trajectories given the same budget. To accommodate this novel framework, we develop an acquisition function that estimates the utility of a set of time steps by approximating its resulting variance reduction. We demonstrate the effectiveness of our method on several benchmark PDEs, including the Burgers' equation, Korteweg–De Vries equation, Kuramoto–Sivashinsky equation, the incompressible Navier-Stokes equation, and the compressible Navier-Stokes equation. Experiments show that our approach improves performance by large margins over the best existing method. …
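A cheap proxy for the acquisition idea, scoring candidate time steps by the predictive variance of a surrogate ensemble (illustrative only; the paper's acquisition function estimates the variance reduction achieved by a *set* of steps):

```python
import numpy as np

def time_step_utility(ensemble_rollouts):
    # ensemble_rollouts: (n_members, n_steps, *state_dims) surrogate predictions
    # for one candidate trajectory. A time step is most worth handing to the
    # numerical solver where the ensemble disagrees most, i.e., where solving
    # it would remove the most predictive variance.
    var = ensemble_rollouts.var(axis=0)                 # (n_steps, *state_dims)
    return var.reshape(var.shape[0], -1).mean(axis=1)   # one utility per step

# Pick the top-k steps for the solver; let the surrogate fill in the rest:
# steps = np.argsort(-time_step_utility(rollouts))[:k]
```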
Poster
Rohan Deb · Kiran Thekumparampil · Kousha Kalantari · Gaurush Hiranandani · Shoham Sabach · Branislav Kveton
[ East Exhibition Hall A-B ]
Abstract
Supervised fine-tuning (SFT) is the most common way of adapting large language models (LLMs) to a new domain. In this paper, we improve the efficiency of SFT by selecting an informative subset of training examples. Specifically, for a fixed budget of training examples, which determines the computational cost of fine-tuning, we select those that maximize information gain, as measured by the Fisher information matrix of the SFT objective. We approximate it efficiently by linearization at the last layer of the LLM. Our approach is computationally efficient, analyzable, and performs well empirically. We demonstrate this on several problems, with both quantitative results and LLM-as-a-judge evaluations.
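The selection step can be sketched as a D-optimal-design style greedy rule on last-layer gradient features (our own rendering under a diagonal-damping assumption; `feats` and the greedy criterion are illustrative):

```python
import numpy as np

def select_by_fisher_gain(feats, budget, damping=1e-3):
    # feats: (n, d) per-example last-layer gradient features, i.e., the
    # linearization mentioned in the abstract. Greedily add the example with
    # the largest information gain log(1 + x^T A^{-1} x) on the running Fisher A.
    n, d = feats.shape
    A_inv = np.eye(d) / damping
    chosen = []
    for _ in range(budget):
        gains = np.einsum('nd,dk,nk->n', feats, A_inv, feats)
        gains[chosen] = -np.inf                 # no repeats
        i = int(np.argmax(gains))
        chosen.append(i)
        Ax = A_inv @ feats[i]
        A_inv -= np.outer(Ax, Ax) / (1.0 + feats[i] @ Ax)  # Sherman-Morrison
    return chosen
```

Keeping the inverse updated with Sherman-Morrison makes each greedy step O(n d) after an O(d^2) update, so the whole selection stays far cheaper than the fine-tuning run it prunes.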
Poster
Ramya Ramalingam · Shayan Kiyani · Aaron Roth
[ East Exhibition Hall A-B ]
Abstract
Existing algorithms for online conformal prediction---guaranteeing marginal coverage in adversarial settings---are variants of online gradient descent (OGD), but their analyses of worst-case coverage do not follow from the regret guarantee of OGD. What is the relationship between no-regret learning and online conformal prediction? We observe that although standard regret guarantees imply marginal coverage in i.i.d. settings, this connection fails as soon as we either move to adversarial environments or ask for group conditional coverage. On the other hand, we show a tight connection between *threshold calibrated* coverage and swap-regret in adversarial settings, which extends to group-conditional (multi-valid) coverage. We also show that algorithms in the *follow the regularized leader* family of no regret learning algorithms (which includes online gradient descent) can be used to give group-conditional coverage guarantees in adversarial settings for arbitrary grouping functions. Via this connection we analyze and conduct experiments using a multi-group generalization of the ACI algorithm of Gibbs & Candes (2021).
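For reference, the ACI update analyzed in the experiments is a one-line online gradient step on the working miscoverage level (the multi-group generalization would maintain one such level per group, which is our paraphrase of the construction):

```python
def aci_update(alpha_t, covered, target=0.1, gamma=0.01):
    # Adaptive Conformal Inference (Gibbs & Candes, 2021). Over-covering
    # (covered=True) raises alpha, narrowing future prediction sets; a miss
    # lowers alpha, widening them, so miscoverage tracks the target on average.
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target - err)
```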
Poster
Saumya Gaurang Shah · Abishek Sankararaman · Balakrishnan Narayanaswamy · Vikramank Singh
[ East Exhibition Hall A-B ]
Abstract
Can we efficiently choose the best Anomaly Detection (AD) algorithm for a data-stream without requiring anomaly labels? Streaming anomaly detection is hard. SOTA AD algorithms are sensitive to their hyperparameters, and no single method works well on all datasets. The best algorithm/hyperparameter combination for a given data-stream can change over time with data drift. 'What is an anomaly?' is often application-, context-, and dataset-dependent. We propose SEAD (Streaming Ensemble of Anomaly Detectors), the first model selection algorithm for streaming, unsupervised AD. All prior AD model selection algorithms are either supervised or only work in the offline setting, when all data from the test set is available upfront. We show that SEAD is {\em (i)} unsupervised, i.e., requires no true anomaly labels, {\em (ii)} efficiently implementable in a streaming setting, {\em (iii)} agnostic to the choice of base algorithms it selects from, and {\em (iv)} adaptive to non-stationarity in the data-stream. Experiments on 14 non-trivial public datasets and an internal dataset corroborate our claims.
Poster
Shaoan Xie · Lingjing Kong · Yujia Zheng · Zeyu Tang · Eric Xing · Guangyi Chen · Kun Zhang
[ East Exhibition Hall A-B ]
Abstract
Concept learning seeks to extract semantic and interpretable representations of atomic concepts from high-dimensional data such as images and text, which can be instrumental to a variety of downstream tasks (e.g., image generation/editing). Despite its importance, the theoretical foundations for learning atomic concepts and their interactions, especially from multimodal distributions, remain underexplored. In this work, we establish fundamental conditions for learning atomic multimodal concepts and their underlying interactions with identifiability guarantees. We formulate concept learning as a latent variable identification problem, representing atomic concepts in each modality as latent variables, with a graphical model to specify their interactions across modalities. Our theoretical contribution is to provide component-wise identifiability of atomic concepts under flexible, nonparametric conditions that accommodate both continuous and discrete modalities. Building on these theoretical insights, we demonstrate the practical utility of our theory in a downstream task: text-to-image (T2I) generation. We develop a principled T2I model that explicitly learns atomic textual and visual concepts with sparse connections between them, allowing us to achieve image generation and editing at the atomic concept level. Empirical evaluations show that our model outperforms existing methods in T2I generation tasks, offering superior controllability and interpretability.
Poster
Haidong Kang
[ East Exhibition Hall A-B ]
Abstract
Neural Architecture Search (NAS) has recently outperformed hand-designed networks in various artificial intelligence areas. However, previous works only target a pre-defined task. For a new task in few-shot learning (FSL) scenarios, the architecture is either searched from scratch, which is neither efficient nor flexible, or borrowed from architectures obtained on other tasks, which may lead to sub-optimal performance. Can we select the best neural architecture without any training and eliminate a significant portion of the search cost for new tasks in FSL? In this work, we provide an affirmative answer by proposing a novel information bottleneck (IB) theory-driven \textit{Few-shot Neural Architecture Search} (dubbed IBFS) framework to address this issue. We first derive that the global convergence of Model-Agnostic Meta-Learning (MAML) can be guaranteed by only considering the first-order loss landscape. Moreover, motivated by the observation that IB provides a unified view toward understanding machine learning models, we propose a novel zero-cost method tailored to FSL that ranks and selects architectures based on their \textit{expressivity} obtained by IB mechanisms. Extensive experiments show that IBFS achieves state-of-the-art performance in FSL without training, which demonstrates the effectiveness of our IBFS.
Spotlight Poster
Juan L. Gamella · Simon Bing · Jakob Runge
[ East Exhibition Hall A-B ]
Abstract
We evaluate methods for causal representation learning (CRL) on a simple, real-world system where these methods are expected to work. The system consists of a controlled optical experiment specifically built for this purpose, which satisfies the core assumptions of CRL and where the underlying causal factors---the inputs to the experiment---are known, providing a ground truth. We select methods representative of different approaches to CRL and find that they all fail to recover the underlying causal factors. To understand the failure modes of the evaluated algorithms, we perform an ablation on the data by substituting the real data-generating process with a simpler synthetic equivalent. The results reveal a reproducibility problem, as most methods already fail on this synthetic ablation despite its simple data-generating process. Additionally, we observe that common assumptions on the mixing function are crucial for the performance of some of the methods but do not hold in the real data. Our efforts highlight the contrast between the theoretical promise of the state of the art and the challenges in its application. We hope the benchmark serves as a simple, real-world sanity check to further develop and validate methodology, bridging the gap towards CRL methods that work in practice. We …
Poster
Siyuan Duan · Yuan Sun · Dezhong Peng · Guiduo Duan · Xi Peng · Peng Hu
[ East Exhibition Hall A-B ]
Abstract
Multi-view learning methods primarily focus on enhancing decision accuracy but often neglect the uncertainty arising from the intrinsic drawbacks of data, such as noise, conflicts, etc. To address this issue, several trusted multi-view learning approaches based on the Evidential Theory have been proposed to capture uncertainty in multi-view data. However, their performance is highly sensitive to conflicting views, and their uncertainty estimates, which depend on the total evidence and the number of categories, often underestimate uncertainty for conflicting multi-view instances due to the neglect of inherent conflicts between belief masses. To accurately classify conflicting multi-view instances and precisely estimate their intrinsic uncertainty, we present a novel Deep \underline{Fu}zzy \underline{M}ulti-View \underline{L}earning (\textbf{FUML}) method. Specifically, FUML leverages Fuzzy Set Theory to model the outputs of a classification neural network as fuzzy memberships, incorporating both possibility and necessity measures to quantify category credibility. A tailored loss function is then proposed to optimize the category credibility. To further enhance uncertainty estimation, we propose an entropy-based uncertainty estimation method leveraging category credibility. Additionally, we develop a Dual Reliable Multi-view Fusion (DRF) strategy that accounts for both view-specific uncertainty and inter-view conflict to mitigate the influence of conflicting views in multi-view fusion. Extensive experiments demonstrate that …
Poster
Yifan Wang · Hourun Li · Ling Yue · Zhiping Xiao · Jia Yang · Changling Zhou · Wei Ju · Ming Zhang · Xiao Luo
[ East Exhibition Hall A-B ]
Abstract
Graph neural networks (GNNs) have shown strong performance in graph fairness learning, which aims to ensure that predictions are unbiased with respect to sensitive attributes. However, existing approaches usually assume that training and test data share the same distribution, which rarely holds in the real world. To tackle this challenge, we propose a novel approach named Dual Unbiased Expansion with Group-acquired Alignment (DANCE) for graph fairness learning under distribution shifts. The core idea of our DANCE is to synthesize challenging yet unbiased virtual graph data in both graph and hidden spaces, simulating distribution shifts from a data-centric view. Specifically, we introduce the unbiased Mixup in the hidden space, prioritizing minor groups to address the potential imbalance of sensitive attributes. Simultaneously, we conduct fairness-aware adversarial learning in the graph space to focus on challenging samples and improve model robustness. To further bridge the domain gap, we propose a group-acquired alignment objective that prioritizes negative pair groups with identical sensitive labels. Additionally, a representation disentanglement objective is adopted to decorrelate sensitive attributes and target representations for enhanced fairness. Extensive experiments demonstrate the superior effectiveness of the proposed DANCE.
Poster
Angel Villar-Corrales · Sven Behnke
[ East Exhibition Hall A-B ]
Abstract
Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on videos and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot allows the generation of multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Videos and code are available on our project website.
Poster
Yuhang Li · Guoxu Zhou · Zhenhao Huang · Xinqi Chen · Yuning Qiu · Qibin Zhao
[ East Exhibition Hall A-B ]
Abstract
Class-Incremental Learning (CIL) has gained considerable attention due to its capacity to accommodate new classes during learning. Replay-based methods demonstrate state-of-the-art performance in CIL but suffer from high memory consumption to save a set of old exemplars for revisiting. To address this challenge, many memory-efficient replay methods have been developed by exploiting image compression techniques. However, the gains are often bittersweet when pixel-level compression methods are used. Here, we present a simple yet efficient approach that employs tensor decomposition to address these limitations. This method fully exploits the low intrinsic dimensionality and pixel correlation of images to achieve high compression efficiency while preserving sufficient discriminative information, significantly enhancing performance. We also introduce a hybrid exemplar selection strategy to improve the representativeness and diversity of stored exemplars. Extensive experiments across datasets with varying resolutions consistently demonstrate that our approach substantially boosts the performance of baseline methods, showcasing strong generalization and robustness.
Poster
Chengbo Zang · Mehmet Turkcan · Gil Zussman · Zoran Kostic · Javad Ghaderi
[ East Exhibition Hall A-B ]
Abstract
We propose a framework for adaptive data collection aimed at robust learning in multi-distribution scenarios under a fixed data collection budget. In each round, the algorithm selects a distribution source to sample from for data collection and updates the model parameters accordingly. The objective is to find the model parameters that minimize the expected loss across all the data sources. Our approach integrates upper-confidence-bound (UCB) sampling with online gradient descent (OGD) to dynamically collect and annotate data from multiple sources. By bridging online optimization and multi-armed bandits, we provide theoretical guarantees for our UCB-OGD approach, demonstrating that it achieves a minimax regret of $O(T^{\frac{1}{2}}(K\ln T)^{\frac{1}{2}})$ over $K$ data sources after $T$ rounds. We further provide a lower bound showing that the result is optimal up to a $\ln T$ factor. Extensive evaluations on standard datasets and a real-world testbed for object detection in smart-city intersections validate the consistent performance improvements of our method compared to baselines such as random sampling and various active learning methods.
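A minimal sketch of the UCB-OGD loop (the interfaces for the data sources and the loss/gradient oracle are hypothetical, and the confidence-bonus constant is illustrative):

```python
import numpy as np

def ucb_ogd(sources, model_loss_grad, theta0, T, eta=0.1):
    # sources[k]() yields one sample from source k; model_loss_grad(theta, x)
    # returns (loss, grad). UCB steers sampling toward the source on which the
    # model currently looks worst, matching the minimax (worst-source) objective.
    K = len(sources)
    theta = theta0
    counts, means = np.zeros(K), np.zeros(K)
    for t in range(1, T + 1):
        ucb = means + np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
        ucb[counts == 0] = np.inf            # visit every source once first
        k = int(np.argmax(ucb))
        loss, grad = model_loss_grad(theta, sources[k]())
        theta = theta - eta * grad           # OGD step on the sampled loss
        counts[k] += 1
        means[k] += (loss - means[k]) / counts[k]
    return theta
```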
Poster
Jiujia Zhang · Ashok Cutkosky
[ East Exhibition Hall A-B ]
Abstract
This paper addresses online learning with "corrupted" feedback. Our learner is provided with potentially corrupted gradients $\tilde g_t$ instead of the "true" gradients $g_t$. We make no assumptions about how the corruptions arise: they could be the result of outliers, mislabeled data, or even malicious interference. We focus on the difficult "unconstrained" setting in which our algorithm must maintain low regret with respect to any comparison point $u \in \mathbb{R}^d$. The unconstrained setting is significantly more challenging, as existing algorithms suffer extremely high regret even with very tiny amounts of corruption (which is not true in the case of a bounded domain). Our algorithms guarantee regret $\|u\|G (\sqrt{T} + k)$ when $G \ge \max_t \|g_t\|$ is known, where $k$ is a measure of the total amount of corruption. When $G$ is unknown, we incur an extra additive penalty of $(\|u\|^2+G^2) k$.
Poster
Nikhil Pratap Ghanathe · Steve Wilton
[ East Exhibition Hall A-B ]
Abstract
Uncertainty quantification (UQ) provides a resource-efficient solution for on-device monitoring of tinyML models deployed remotely without access to true labels. However, existing UQ methods impose significant memory and compute demands, making them impractical for ultra-low-power, KB-sized tinyML devices. Prior work has attempted to reduce overhead by using early-exit ensembles to quantify uncertainty in a single forward pass, but these approaches still carry prohibitive costs. To address this, we propose QUTE, a novel resource-efficient early-exit-assisted ensemble architecture optimized for tinyML models. QUTE introduces additional output blocks at the final exit of the base network, distilling early-exit knowledge into these blocks to form a diverse yet lightweight ensemble. We show that QUTE delivers superior uncertainty quality on tiny models, achieving comparable performance on larger models with 59% smaller model sizes than the closest prior work. When deployed on a microcontroller, QUTE demonstrates a 31% reduction in latency on average. In addition, we show that QUTE excels at detecting accuracy-drop events, outperforming all prior works.
Poster
Alessandro Pierro · Steven Abreu · Jonathan Timcheck · Philipp Stratmann · Andreas Wild · Sumit Shrestha
[ East Exhibition Hall A-B ]
Abstract
Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and time-per-token during inference. These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption. Unstructured sparsity offers a compelling solution, enabling substantial reductions in compute and memory requirements when accelerated by compatible hardware platforms. In this paper, we conduct a scaling study to investigate the Pareto front of performance and efficiency across inference compute budgets. We find that highly sparse linear RNNs *consistently* achieve better efficiency-performance trade-offs than dense baselines, with $2\times$ less compute and $36$\% less memory at iso-accuracy. Our models achieve state-of-the-art results on a real-time streaming task for audio denoising. By quantizing our sparse models to fixed-point arithmetic and deploying them on the Intel Loihi 2 neuromorphic chip for real-time processing, we translate model compression into tangible gains of $42\times$ lower latency and $149\times$ lower energy consumption compared to a dense model on an edge GPU. Our findings showcase the transformative potential of unstructured sparsity, paving the way for highly efficient recurrent neural networks in real-world, resource-constrained environments.
Poster
Shuowei Jin · Xueshen Liu · Qingzhao Zhang · Zhuoqing Morley Mao
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) are increasingly deployed in large-scale online services, enabling sophisticated applications. However, the computational overhead of generating key-value (KV) caches in the prefill stage presents a major bottleneck, particularly for long-context inputs. Prefix caching mitigates this issue by storing KV caches for reuse, reducing redundant computation. Despite its advantages, prefix caching suffers from high latency due to the limited I/O bandwidth of storage devices, constraining inference efficiency. To address this challenge, we introduce Cake, a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel. Cake employs a bidirectional scheduling strategy that dynamically balances KV cache computation and loading, ensuring efficient resource utilization. Additionally, Cake incorporates an adaptive scheduling mechanism that seamlessly integrates with non-prefix caching requests, improving system throughput and adapting to fluctuating resource availability. Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves on average 2.6× reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. Our findings highlight Cake as an effective and practical solution for optimizing long-context LLM inference, bridging the gap between computation and I/O efficiency in large-scale AI deployments.
Poster
Shimin Zhang · Ziyuan Ye · Yinsong Yan · Zeyang Song · Yujie Wu · Jibin Wu
[ East Exhibition Hall A-B ]
Abstract
Determining the similarity between dynamical systems remains a long-standing challenge in both machine learning and neuroscience. Recent works based on Koopman operator theory have proven effective in analyzing dynamical similarity by examining discrepancies in the Koopman spectrum. Nevertheless, existing similarity metrics can be severely constrained when systems exhibit complex nonlinear behaviors across multiple temporal scales. In this work, we propose **KoopSTD**, a dynamical similarity measurement framework that precisely characterizes the underlying dynamics by approximating the Koopman spectrum with explicit timescale decoupling and spectral residual control. We show that KoopSTD maintains invariance under several common representation-space transformations, which ensures robust measurements across different coordinate systems. Our extensive experiments on physical and neural systems validate the effectiveness, scalability, and robustness of KoopSTD compared to existing similarity metrics. We also apply KoopSTD to explore two open-ended research questions in neuroscience and large language models, highlighting its potential to facilitate future scientific and engineering discoveries. Code is available at [link](https://github.com/ZhangShimin1/KoopSTD).
Poster
Jennifer Hsia · Afreen Shaikh · Zhiruo Wang · Graham Neubig
[ East Exhibition Hall A-B ]
Abstract
Retrieval-augmented generation (RAG) enhances language models by integrating external knowledge, but its effectiveness is highly dependent on system configuration. Improper retrieval settings can degrade performance, making RAG less reliable than closed-book generation. In this work, we introduce RAGGED, a framework for systematically evaluating RAG systems across diverse retriever-reader configurations, retrieval depths, and datasets. Our analysis reveals that reader robustness to noise is the key determinant of RAG stability and scalability. Some readers benefit from increased retrieval depth, while others degrade due to their sensitivity to distracting content. Through large-scale experiments on open-domain, multi-hop, and specialized-domain datasets, we show that retrievers, rerankers, and prompts influence performance but do not fundamentally alter these reader-driven trends. By providing a principled framework and new metrics to assess RAG stability and scalability, RAGGED enables systematic evaluation of retrieval-augmented generation systems, guiding future research on optimizing retrieval depth and model robustness.
Poster
Evan Frick · Connor Chen · Joseph Tennyson · Tianle Li · Wei-Lin Chiang · Anastasios Angelopoulos · Ion Stoica
[ East Exhibition Hall A-B ]
Abstract
Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt or set of prompts. The core idea is to train an LLM taking natural language prompts as input to output a vector of Bradley-Terry coefficients which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the \#1 spot on the Chatbot Arena leaderboard. Our code is available at this GitHub link: https://github.com/lmarena/p2l.
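The core prediction step is a Bradley-Terry readout of the prompt-conditional coefficients; a schematic sketch (function and variable names are ours, not the released P2L code):

```python
import torch

def preference_prob(bt_coeffs, i, j):
    # Bradley-Terry win probability of model i over model j, given the vector
    # of prompt-conditional coefficients produced by the P2L head.
    return torch.sigmoid(bt_coeffs[i] - bt_coeffs[j])

# Schematic training signal for one Arena vote (indices hypothetical):
# bt = p2l_model(prompt)                       # (n_models,) coefficients
# loss = binary cross-entropy between preference_prob(bt, a, b) and the vote
```

Because the coefficients are a function of the prompt, sorting them yields a per-prompt leaderboard, and routing a query to argmax is immediate.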
Spotlight Poster
Zirui Liu · Jiatong Li · Yan Zhuang · Qi Liu · Shuanghong Shen · Jie Ouyang · Mingyue Cheng · Shijin Wang
[ East Exhibition Hall A-B ]
Abstract
Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). Existing frameworks based on the Elo rating system suffer from an inevitable instability problem, due to ranking inconsistency and the lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework that addresses these issues by enhancing the Elo rating system. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we propose am-ELO, which modifies the Elo rating's probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, proving that this framework offers a more robust, accurate, and stable evaluation method for LLMs.
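The m-ELO idea, replacing iterative Elo updates with a joint maximum-likelihood fit of the Bradley-Terry model, can be sketched in a few lines (illustrative; the am-ELO extension additionally estimates a per-annotator ability parameter, which is not shown):

```python
import numpy as np
from scipy.optimize import minimize

def m_elo(wins, n_models, scale=400.0):
    # MLE Elo: maximize the likelihood of all observed votes jointly instead
    # of applying order-dependent per-game updates. wins = [(winner, loser)].
    c = np.log(10) / scale                   # Elo's logistic uses base 10
    def nll(r):
        z = np.array([c * (r[w] - r[l]) for w, l in wins])
        return np.sum(np.log1p(np.exp(-z)))  # -sum log sigmoid(z)
    res = minimize(nll, np.zeros(n_models), method="L-BFGS-B")
    return res.x - res.x.mean()              # scores identified only up to a shift
```

Unlike the iterative update, the MLE solution is invariant to the ordering of the votes, which is the source of the stability the abstract describes.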
Poster
Sirui Lin · Zijun Gao · Jose Blanchet · Peter Glynn
[ East Exhibition Hall A-B ]
Abstract
Causal estimands can vary significantly depending on the relationship between outcomes in treatment and control groups, leading to wide partial identification (PI) intervals that impede decision making. Incorporating covariates can substantially tighten these bounds, but requires determining the range of PI over probability models consistent with the joint distributions of observed covariates and outcomes in treatment and control groups. This problem is known to be equivalent to a conditional optimal transport (COT) optimization task, which is more challenging than standard optimal transport (OT) due to the additional conditioning constraints. In this work, we study a tight relaxation of COT that effectively reduces it to standard OT, leveraging its well-established computational and theoretical foundations. Our relaxation incorporates covariate information and ensures narrower PI intervals for any value of the penalty parameter, while becoming asymptotically exact as the penalty parameter increases to infinity. This approach preserves the benefits of covariate adjustment in PI and results in a data-driven estimator for the PI set that is easy to implement using existing OT packages. We analyze the convergence rate of our estimator and demonstrate the effectiveness of our approach through extensive simulations, highlighting its practical use and superior performance compared to existing methods.
Poster
Jennifer Brennan · Sébastien Lahaie · Adel Javanmard · Nick Doudchenko · Jean Pouget-Abadie
[ East Exhibition Hall A-B ]
Abstract
In experimental causal inference, we distinguish between two sources of uncertainty: design uncertainty, due to the treatment assignment mechanism, and sampling uncertainty, when the sample is drawn from a super-population. This distinction matters in settings with small fixed samples and heterogeneous treatment effects, as in geographical experiments. The standard bootstrap procedure most often used by practitioners primarily estimates sampling uncertainty, and the causal bootstrap procedure, which accounts for design uncertainty, was developed for the completely randomized design and the difference-in-means estimator, whereas non-standard designs and estimators are often used in these low-power regimes. We address this gap by proposing an integer program which numerically computes the worst-case copula used as an input to the causal bootstrap method, in a wide range of settings. Specifically, we prove the asymptotic validity of our approach for unconfounded, conditionally unconfounded, and individualistic with bounded confoundedness assignments, as well as generalizing to any linear-in-treatment and quadratic-in-treatment estimators. We demonstrate the refined confidence intervals achieved through simulations of small geographical experiments.
Poster
Mingxuan Li · Junzhe Zhang · Elias Bareinboim
[ East Exhibition Hall A-B ]
Abstract
Reward shaping has been demonstrated to be an effective technique for accelerating the learning process of reinforcement learning (RL) agents. While successful in empirical applications, the design of a good shaping function is less well understood in principle and thus often relies on domain expertise and manual design. To overcome this limitation, we propose a novel automated approach for designing reward functions from offline data, possibly contaminated by unobserved confounding bias. We propose to use causal state value upper bounds calculated from offline datasets as a conservative optimistic estimate of the optimal state value, which is then used as state potentials in Potential-Based Reward Shaping (PBRS). When applying our shaping function to a model-free learner based on UCB principles, we show that it enjoys a better gap-dependent regret bound than the learner without shaping. To the best of our knowledge, this is the first gap-dependent regret bound for PBRS in model-free learning with online exploration. Simulations support the theoretical findings.
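For reference, potential-based shaping itself is a one-line transformation; in this construction the potential would be the causal upper bound on the optimal state value estimated from the confounded offline data (the bound estimation itself is not sketched here):

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    # Potential-based reward shaping (Ng et al.): r' = r + gamma*phi(s') - phi(s).
    # Shaping of this form provably leaves the optimal policy unchanged; phi(s)
    # plays the role of the causal state-value upper bound described above.
    return r + gamma * phi(s_next) - phi(s)

# Usage inside a UCB-style Q-learning update (schematic):
# q[s, a] += eta * (shaped_reward(r, s, s2, phi) + gamma * q[s2].max() - q[s, a])
```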
Spotlight Poster
Quang-Duy Tran · Bao Duong · Phuoc Nguyen · Thin Nguyen
[ East Exhibition Hall A-B ]
Abstract
Telling apart the cause and effect between two random variables with purely observational data is a challenging problem that finds applications in various scientific disciplines. A key principle utilized in this task is the algorithmic Markov condition, which postulates that the joint distribution, when factorized according to the causal direction, yields a more succinct codelength compared to the anti-causal direction. Previous approaches approximate these codelengths by relying on simple functions or Gaussian processes (GPs) with easily evaluable complexity, compromising between model fitness and computational complexity. To overcome these limitations, we propose leveraging the variational Bayesian learning of neural networks as an interpretation of the codelengths. Consequently, we can enhance the model fitness while promoting the succinctness of the codelengths, and avoid the significant computational complexity of the GP-based approaches. Extensive experiments on both synthetic and real-world benchmarks in cause-effect identification demonstrate the effectiveness of our proposed method, surpassing the overall performance of related complexity-based and structural causal model regression-based approaches.
Poster
Ke Zhu · Jianing Chu · Ilya Lipkovich · Wenyu Ye · Shu Yang
[ East Exhibition Hall A-B ]
Abstract
Individualized treatment rules/recommendations (ITRs) aim to improve patient outcomes by tailoring treatments to the characteristics of each individual. However, in high-dimensional treatment settings, existing methods face significant challenges due to data sparsity within treatment groups and highly unbalanced covariate distributions across groups. To address these challenges, we propose a novel calibration-weighted treatment fusion procedure that robustly balances covariates across treatment groups and fuses similar treatments using a penalized working model. The fusion procedure ensures the recovery of latent treatment group structures when either the calibration model or the outcome model is correctly specified. In the fused treatment space, practitioners can seamlessly apply state-of-the-art ITR learning methods with the flexibility to utilize a subset of covariates, thereby achieving robustness while addressing practical concerns such as fairness. We establish theoretical guarantees, including consistency, the oracle property of treatment fusion, and regret bounds when integrated with multi-armed ITR learning methods such as policy trees. Simulation studies show superior group recovery and policy value compared to existing approaches. We illustrate the practical utility of our method using EHR-derived data from patients with Chronic Lymphocytic Leukemia and Small Lymphocytic Lymphoma.
Poster
Emanuele Palumbo · Moritz Vandenhirtz · Alain Ryser · Imant Daunhawer · Julia Vogt
[ East Exhibition Hall A-B ]
Abstract
The hierarchical structure inherent in many real-world datasets makes the modeling of such hierarchies a crucial objective in both unsupervised and supervised machine learning. While recent advancements have introduced deep architectures specifically designed for hierarchical clustering, we adopt a critical perspective on this line of research. Our findings reveal that these methods face significant limitations in scalability and performance when applied to realistic datasets. Given these findings, we present an alternative approach and introduce a lightweight method that builds on pre-trained non-hierarchical clustering models. Remarkably, our approach outperforms specialized deep models for hierarchical clustering, and it is broadly applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our approach, we extend its application to a supervised setting, demonstrating its ability to recover meaningful hierarchies from a pre-trained ImageNet classifier. Our results establish a practical and effective alternative to existing deep hierarchical clustering methods, with significant advantages in efficiency, scalability and performance.
Spotlight Poster
Xinyue Zheng · Haowei Lin · Kaichen He · Zihao Wang · Qiang Fu · Haobo Fu · Zilong Zheng · Yitao Liang
[ East Exhibition Hall A-B ]
Abstract
Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, with current benchmarks facing scalability limitations. To address this, we introduce \textit{Minecraft Universe} (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompasses 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating infinite diverse tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5\% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark to drive progress in AI agent development within open-ended environments. Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU.
Poster
Mara Finkelstein · Daniel Deutsch · Parker Riley · Juraj Juraska · Geza Kovacs · Markus Freitag
[ East Exhibition Hall A-B ]
Abstract
As LLMs continue to become more powerful and versatile, human evaluation has become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These *Autoraters* are typically designed so that they generalize to new systems *and* test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure the capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our *Specialist* method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
Poster
Weijian Luo · colin zhang · Debing Zhang · Zhengyang Geng
[ East Exhibition Hall A-B ]
Abstract
We propose Diff-Instruct*(DI*), a data-efficient post-training approach to one-step text-to-image generative models to improve its human preferences without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF), which optimizes a human reward function while regularizing the generator to stay close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization that substantially improves performance. Although such a score-based RLHF objective seems intractable when optimizing, we derive a strictly equivalent tractable loss function in theory that can efficiently compute its gradient for optimizations. Building upon this framework, we train DI*-SDXL-1step, a 1-step text-to-image model based on Stable Diffusion-XL (2.6B parameters), capable of generating 1024x1024 resolution images in a single step. The 2.6B DI*-SDXL-1step model outperforms the 12B FLUX-dev model in ImageReward, PickScore, and CLIP score on the Parti prompts benchmark while using only 1.88% of the inference time. This result strongly supports the thought that with proper post-training, the small one-step model is capable of beating huge multi-step models. We will open-source our industry-ready model to the community.
Poster
Pierre Boyeau · Anastasios Angelopoulos · Tianle Li · Nir Yosef · Jitendra Malik · Michael Jordan
[ East Exhibition Hall A-B ]
Abstract
The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased.
Poster
Xinnuo Xu · Rachel Lawrence · Kshitij Dubey · Atharva Pandey · Risa Ueno · Fabian Falck · Aditya Nori · Rahul Sharma · Amit Sharma · Javier Gonzalez
[ East Exhibition Hall A-B ]
Abstract
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true “reasoning” or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE: a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
Poster
Chenyin Gao · Peter Gilbert · Larry Han
[ East Exhibition Hall A-B ]
Abstract
Standard conformal prediction ensures marginal coverage but consistently undercovers underrepresented groups, limiting its reliability for fair uncertainty quantification. Group fairness requires prediction sets to achieve a user-specified coverage level within each protected group. While group-wise conformal inference meets this requirement, it often produces excessively wide prediction sets due to limited sample sizes in underrepresented groups, highlighting a fundamental tradeoff between fairness and efficiency. To bridge this gap, we introduce Surrogate-Assisted Group-Clustered Conformal Inference (SAGCCI), a framework that improves efficiency through two key innovations: (1) clustering protected groups with similar conformal score distributions to enhance precision while maintaining fairness, and (2) deriving an efficient influence function that optimally integrates surrogate outcomes to construct tighter prediction sets. Theoretically, SAGCCI guarantees approximate group-conditional coverage in a doubly robust manner under mild convergence conditions, enabling flexible nuisance model estimation. Empirically, through simulations and an analysis of the phase 3 Moderna COVE COVID-19 vaccine trial, we demonstrate that SAGCCI outperforms existing methods, producing narrower prediction sets while maintaining valid group-conditional coverage, effectively balancing fairness and efficiency in uncertainty quantification.
Poster
Muhammed Yusuf Kocyigit · Eleftheria Briakou · Daniel Deutsch · Jiaming Luo · Colin Cherry · Markus Freitag
[ East Exhibition Hall A-B ]
Abstract
Data contamination—the accidental consumption of evaluation examples within the pre-training data—can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B compared to 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of contaminated samples influence performance over-estimation across languages with varying degrees of data resources.
Poster
Xinshuai Dong · Ignavier Ng · Boyang Sun · Haoyue Dai · Guangyuan Hao · Shunxing Fan · Peter Spirtes · Yumou Qiu · Kun Zhang
[ East Exhibition Hall A-B ]
Abstract
Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of certain personality dimensions of a person can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, our method is validated by extensive experiments on synthetic data and real-world data to demonstrate its effectiveness as well as applicability in causal discovery (code will be available at https://github.com/dongxinshuai/scm-identify).
Poster
Chris Hays · Manish Raghavan
[ East Exhibition Hall A-B ]
Abstract
Researchers and practitioners often wish to measure treatment effects in settings where units interact via markets and recommendation systems. In these settings, units are affected by certain *shared states*, like prices, algorithmic recommendations or social signals. We formalize this structure, calling it shared-state interference, and argue that our formulation captures many relevant applied settings. Our key modeling assumption is that individuals' potential outcomes are independent conditional on the shared state. We then prove an extension of a double machine learning (DML) theorem providing conditions for achieving efficient inference under shared-state interference. We also instantiate our general theorem in several models of interest where it is possible to efficiently estimate the average direct effect (ADE) or global average treatment effect (GATE).
Spotlight Poster
Georgy Noarov · Riccardo Fogliato · Martin A Bertran · Aaron Roth
[ East Exhibition Hall A-B ]
Abstract
We study the design of adaptive, sequential experiments for unbiased average treatment effect (ATE) estimation in the design-based potential outcomes setting. Our goal is to develop adaptive designs offering *sublinear Neyman regret*, meaning their efficiency must approach that of the hindsight-optimal nonadaptive design.Recent work [Dai et al, 2023] introduced ClipOGD, the first method achieving $\widetilde{O}(\sqrt{T})$ expected Neyman regret under mild conditions. In this work, we propose adaptive designs with substantially stronger Neyman regret guarantees. In particular, we modify ClipOGD to obtain anytime $\widetilde{O}(\log T)$ Neyman regret under natural boundedness assumptions. Further, in the setting where experimental units have pre-treatment covariates, we introduce and study a class of contextual ``multigroup'' Neyman regret guarantees: Given a set of possibly overlapping groups based on the covariates, the adaptive design outperforms each group's best non-adaptive designs. In particular, we develop a contextual adaptive design with $\widetilde{O}(\sqrt{T})$ anytime multigroup Neyman regret. We empirically validate the proposed designs through an array of experiments.
Poster
Undral Byambadalai · Tomu Hirata · Tatsushi Oka · Shota Yasui
[ East Exhibition Hall A-B ]
Abstract
This paper focuses on the estimation of distributional treatment effects in randomized experiments that use covariate-adaptive randomization (CAR). These include designs such as Efron's biased-coin design and stratified block randomization, where participants are first grouped into strata based on baseline covariates and assigned treatments within each stratum to ensure balance across groups. In practice, datasets often contain additional covariates beyond the strata indicators. We propose a flexible distribution regression framework that leverages off-the-shelf machine learning methods to incorporate these additional covariates, enhancing the precision of distributional treatment effect estimates. We establish the asymptotic distribution of the proposed estimator and introduce a valid inference procedure. Furthermore, we derive the semiparametric efficiency bound for distributional treatment effects under CAR and demonstrate that our regression-adjusted estimator attains this bound. Simulation studies and analyses of real-world datasets highlight the practical advantages of our method.
Spotlight Poster
Lifu Liu · Shiyuan He · Jianhua Guo
[ East Exhibition Hall A-B ]
Abstract
Efficiently counting Markov equivalent directed acyclic graphs (DAGs) is crucial in graphical causal analysis. Wienöbst et al. (2023) introduced a polynomial-time algorithm, known as the Clique-Picking algorithm, to count the number of Markov equivalent DAGs for a given completed partially directed acyclic graph (CPDAG). This algorithm iteratively selects a root clique, determines fixed orientations with outgoing edges from the clique, and generates the unresolved undirected connected components (UCCGs). In this work, we propose a more efficient approach to UCCG generation by utilizing previously computed results for different root cliques. Our method introduces the concept of super cliques within rooted clique trees, enabling their efficient transfer between trees with different root cliques. The proposed algorithm effectively reduces the computational complexity of the Clique-Picking method, particularly when the number of cliques is substantially smaller than the number of vertices and edges.
Poster
Shriya Bhatija · Paul-David Zuercher · Jakob Thumm · Thomas Bohné
[ East Exhibition Hall A-B ]
Abstract
In decision-making problems, the outcome of an intervention often depends on the causal relationships between system components and is highly costly to evaluate. In such settings, causal Bayesian optimization (CBO) exploits the causal relationships between the system variables and sequentially performs interventions to approach the optimum with minimal data. Extending CBO to the multi-outcome setting, we propose *multi-objective Causal Bayesian optimization* (MO-CBO), a paradigm for identifying Pareto-optimal interventions within a known multi-target causal graph. Our methodology first reduces the search space by discarding sub-optimal interventions based on the structure of the given causal graph. We further show that any MO-CBO problem can be decomposed into several traditional multi-objective optimization tasks. Our proposed MO-CBO algorithm is designed to identify Pareto-optimal interventions by iteratively exploring these underlying tasks, guided by relative hypervolume improvement. Experiments on synthetic and real-world causal graphs demonstrate the superiority of our approach over non-causal multi-objective Bayesian optimization in settings where causal information is available.
Poster
Muyang Cao · Jiajun Yu · Xin Du · Gang Pan · Wei Wang
[ East Exhibition Hall A-B ]
Abstract
The Metric Nearness Problem involves restoring a non-metric matrix to its closest metric-compliant form, addressing issues such as noise, missing values, and data inconsistencies. Ensuring metric properties, particularly the $O(N^3)$ triangle inequality constraints, presents significant computational challenges, especially in large-scale scenarios where traditional methods suffer from high time and space complexity. We propose a novel solution based on the tropical inner product (max-plus operation), which we prove satisfies the triangle inequality for non-negative real matrices. By transforming the problem into a continuous optimization task, our method directly minimizes the distance to the target matrix. This approach not only restores metric properties but also generates metric-preserving embeddings, enabling real-time updates and reducing computational and storage overhead for downstream tasks. Experimental results demonstrate that our method achieves up to 60× speed improvements over state-of-the-art approaches, and efficiently scales from $1e4 \times 1e4$ to $1e5 \times 1e5$ matrices with significantly lower memory usage.
Poster
Zhihan Zhu · Yanhao Zhang · Yong Xia
[ East Exhibition Hall A-B ]
Abstract
This paper introduces two novel criteria: one for feature selection and another for feature elimination in the context of best subset selection, which is a benchmark problem in statistics and machine learning. From the perspective of optimization, we revisit the classical selection and elimination criteria in traditional best subset selection algorithms, revealing that these classical criteria capture only partial variations of the objective function after the entry or exit of features. By formulating and solving optimization subproblems for feature entry and exit exactly, new selection and elimination criteria are proposed, proved as the optimal decisions for the current entry-and-exit process compared to classical criteria. Replacing the classical selection and elimination criteria with the proposed ones generates a series of enhanced best subset selection algorithms. These generated algorithms not only preserve the theoretical properties of the original algorithms but also achieve significant meta-gains without increasing computational cost across various scenarios and evaluation metrics on multiple tasks such as compressed sensing and sparse regression.
Poster
Yuanjie Shi · Hooman Shahrokhi · Xuesong Jia · Xiongzhi Chen · Jana Doppa · Yan Yan
[ East Exhibition Hall A-B ]
Abstract
Conformal prediction (CP) is a promising uncertainty quantification framework which works as a wrapper around a black-box classifier to construct prediction sets (i.e., subset of candidate classes) with provable guarantees. However, standard calibration methods for CP tend to produce large prediction sets which makes them less useful in practice. This paper considers the problem of integrating conformal principles into the training process of deep classifiers to directly minimize the size of prediction sets. We formulate conformal training as a bilevel optimization problem and propose the {\em Direct Prediction Set Minimization (DPSM)} algorithm to solve it. The key insight behind DPSM is to minimize a measure of the prediction set size (upper level) that is conditioned on the learned quantile of conformity scores (lower level). We analyze that DPSM has a learning bound of $O(1/\sqrt{n})$ (with $n$ training samples),while prior conformal training methods based on stochastic approximation for the quantile has a bound of $\Omega(1/s)$ (with batch size $s$ and typically $s \ll \sqrt{n}$).Experiments on various benchmark datasets and deep models show that DPSM significantly outperforms the best prior conformal training baseline with $20.46\\%\downarrow$ in the prediction set size and validates our theory.
Poster
Jiale Ma · Wenzheng Pan · Yang Li · Junchi Yan
[ East Exhibition Hall A-B ]
Abstract
Despite rapid progress in neural combinatorial optimization (NCO) for solving CO problems (COPs), as the problem scale grows, several bottlenecks persist: 1) solvers in the Global Prediction (GP) paradigm struggle in long-range decisions where the overly smooth intermediate heatmaps impede effective decoding, and 2) solvers in the Local Construction (LC) paradigm are time-consuming and incapable of tackling large instances due to the onerous auto-regressive process. Observing these challenges, we propose a new paradigm named Adaptive Expansion AE with its instantiation COExpander, positioned to leverage both advantages of GP and LC. COExpander utilizes informative heatmaps generated by a global predictor, which is learned under the guidance of locally determined partial solutions, to in turn direct the expansion of determined decision variables with adaptive step-sizes. To ensure transparent evaluation, we further take the lead to canonicalize 29 benchmarks spanning 6 popular COPs (MIS, MCl, MVC, MCut, TSP, ATSP) and various scales (50-10K nodes), upon which experiments demonstrate concrete SOTA performance of COExpander over these tasks. Source code and our standardized datasets will be made public.
Poster
Valentin De Bortoli · Alexandre Galashov · J Swaroop Guntupalli · Guangyao Zhou · Kevin Murphy · Arthur Gretton · Arnaud Doucet
[ East Exhibition Hall A-B ]
Abstract
Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively ``denoises" a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to speed up sample generation by learning the posterior distribution of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.
Spotlight Poster
Sumin Cho · Dongwon Kim · Kwangsu Kim
[ East Exhibition Hall A-B ]
Abstract
Domain Generalization (DG) aims to train models that generalize to unseen target domains but often overfit to domain-specific features, known as undesired correlations. Gradient-based DG methods typically guide gradients in a dominant direction but often inadvertently reinforce spurious correlations. Recent work has employed dropout to regularize overconfident parameters, but has not explicitly adjusted gradient alignment or ensured balanced parameter updates. We propose GENIE (Generalization-ENhancing Iterative Equalizer), a novel optimizer that leverages the One-Step Generalization Ratio (OSGR) to quantify each parameter's contribution to loss reduction and assess gradient alignment. By dynamically equalizing OSGR via a preconditioning factor, GENIE prevents a small subset of parameters from dominating optimization, thereby promoting domain-invariant feature learning. Theoretically, GENIE balances convergence contribution and gradient alignment among parameters, achieving higher OSGR while retaining SGD's convergence rate. Empirically, it outperforms existing optimizers and enhances performance when integrated with various DG and single-DG methods.
Spotlight Poster
Zhijing Wan · Zhixiang Wang · Zheng Wang · Xin Xu · Shin'ichi Satoh
[ East Exhibition Hall A-B ]
Abstract
One-shot subset selection serves as an effective tool to reduce deep learning training costs by identifying an informative data subset based on the information extracted by an information extractor (IE). Traditional IEs, typically pre-trained on the target dataset, are inherently dataset-dependent. Foundation models (FMs) offer a promising alternative, potentially mitigating this limitation. This work investigates two key questions: (1) Can FM-based subset selection outperform traditional IE-based methods across diverse datasets? (2) Do all FMs perform equally well as IEs for subset selection? Extensive experiments uncovered surprising insights: FMs consistently outperform traditional IEs on fine-grained datasets, whereas their advantage diminishes on coarse-grained datasets with noisy labels. Motivated by these finding, we propose RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a method tailored for fine-grained image datasets. RAM-APL leverages multiple FMs to enhance subset selection by exploiting their complementary strengths. Our approach achieves state-of-the-art performance on fine-grained datasets, including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011.
Poster
Aditya Vardhan Varre · Gizem Yüce · Nicolas Flammarion
[ East Exhibition Hall A-B ]
Abstract
In this article, we explore the loss landscape of next-token prediction with transformers. Specifically, we focus on learning in-context n-gram language models with cross-entropy loss using a simplified two-layer transformer. We design a series of transformers that represent $k$-grams (for $k \leq n$) for which the gradient of the population loss approaches zero in the limit of both infinite sequence length and infinite parameter norm. This construction reveals a key property of the loss landscape: \emph{$k$-grams are stationary points of the population cross-entropy loss}, offering theoretical insights for widely observed empirical phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by comprehensive numerical experiments that illustrate the dynamics of learning $n$-grams, characterized by jumps between stationary points.
Spotlight Poster
Marvin F, da Silva · Felix Dangel · Sageev Oore
[ East Exhibition Hall A-B ]
Abstract
The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work reported weak correlation between flatness and generalization. We argue that existing sharpness measures fail for transformers, because they have much richer symmetries in their attention mechanism that induce directions in parameter space along which the network or its loss remain identical. We posit that sharpness must account fully for these symmetries, and thus we redefine it on a quotient manifold that results from quotienting out the transformer symmetries, thereby removing their ambiguities. Leveraging tools from Riemannian geometry, we propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold. In practice, we need to resort to approximating the geodesics. Doing so up to first order yields existing adaptive sharpness measures, and we demonstrate that including higher-order terms is crucial to recover correlation with generalization. We present results on diagonal networks with synthetic data, and show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.
Poster
Shivam Kumar · Yun Yang · Lizhen Lin
[ East Exhibition Hall A-B ]
Abstract
In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings.
Poster
Chen Liu · Zhichao Huang · Mathieu Salzmann · Tong Zhang · Sabine Süsstrunk
[ East Exhibition Hall A-B ]
Abstract
Adversarial training is a popular method to robustify models against adversarial attacks. However, it exhibits much more severe overfitting than training on clean inputs. In this work, we investigate this phenomenon from the perspective of training instances, i.e., training input-target pairs. Based on a quantitative metric measuring the relative difficulty of an instance in the training set, we analyze the model's behavior on training instances of different difficulty levels. This lets us demonstrate that the decay in generalization performance of adversarial training is a result of fitting hard adversarial instances. We theoretically verify our observations for both linear and general nonlinear models, proving that models trained on hard instances have worse generalization performance than ones trained on easy instances, and that this generalization gap increases with the size of the adversarial budget. Finally, we investigate solutions to mitigate adversarial overfitting in several scenarios, including fast adversarial training and fine-tuning a pretrained model with additional data. Our results demonstrate that using training data adaptively improves the model's robustness.
Spotlight Poster
Anshuman Chhabra · Bo Li · Jian Chen · Prasant Mohapatra · Hongfu Liu
[ East Exhibition Hall A-B ]
Abstract
A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, their high computational cost associated with calculating the inverse of the Hessian matrix pose constraints, particularly when analyzing large-sized deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.
Poster
Edward Milsom · Ben Anson · Laurence Aitchison
[ East Exhibition Hall A-B ]
Abstract
We consider layerwise function-space learning rates, which measure the magnitude of the change in a neural network's output function in response to an update to a parameter tensor. This contrasts with traditional learning rates, which describe the magnitude of changes in parameter space. We develop efficient methods to measure and set function-space learning rates in arbitrary neural networks, requiring only minimal computational overhead through a few additional backward passes that can be performed at the start of, or periodically during, training. We demonstrate two key applications: (1) analysing the dynamics of standard neural network optimisers in function space, rather than parameter space, and (2) introducing FLeRM (Function-space Learning Rate Matching), a novel approach to hyperparameter transfer across model scales. FLeRM records function-space learning rates while training a small, cheap base model, then automatically adjusts parameter-space layerwise learning rates when training larger models to maintain consistent function-space updates. FLeRM gives hyperparameter transfer across model width, depth, initialisation scale, and LoRA rank in various architectures including MLPs with residual connections and transformers with different layer normalisation schemes.
Poster
Deepak Ravikumar · Efstathia Soufleri · Abolfazl Hashemi · Kaushik Roy
[ East Exhibition Hall A-B ]
Abstract
Deep learning has become the de facto approach in nearly all learning tasks. It has been observed that deep models tend to memorize and sometimes overfit data, which can lead to compromises in performance, privacy, and other critical metrics.In this paper, we explore the theoretical foundations that connect memorization to sample loss, focusing on learning dynamics to understand what and how deep models memorize. To this end, we introduce a novel proxy for memorization: Cumulative Sample Loss (CSL).CSL represents the accumulated loss of a sample throughout the training process.CSL exhibits remarkable similarity to stability-based memorization, as evidenced by considerably high cosine similarity scores. We delve into the theory behind these results, demonstrating that low CSL leads to nontrivial bounds on the extent of stability-based memorization and learning time. The proposed proxy, CSL, is four orders of magnitude less computationally expensive than the stability-based method and can be obtained with zero additional overhead during training. We demonstrate the practical utility of the proposed proxy in identifying mislabeled samples and detecting duplicates where our metric achieves state-of-the-art performance.
Poster
Yuan Feng · Yukun Cao · Hairu Wang · Xike Xie · S Kevin Zhou
[ East Exhibition Hall A-B ]
Abstract
Sketches, probabilistic structures for estimating item frequencies in infinite data streams with limited space, are widely used across various domains. Recent studies have shifted the focus from handcrafted sketches to neural sketches, leveraging memory-augmented neural networks (MANNs) to enhance the streaming compression capabilities and achieve better space-accuracy trade-offs.However, existing neural sketches struggle to scale across different data domains and space budgets due to inflexible MANN configurations. In this paper, we introduce a scalable MANN architecture that brings to life the Lego sketch, a novel sketch with superior scalability and accuracy.Much like assembling creations with modular Lego bricks, the Lego sketch dynamically coordinates multiple memory bricks to adapt to various space budgets and diverse data domains.Theoretical analysis and empirical studies demonstrate its scalability and superior space-accuracy trade-offs, outperforming existing handcrafted and neural sketches.
Poster
Mingyu Xing · Lechao Cheng · Shengeng Tang · Yaxiong Wang · Zhun Zhong · Meng Wang
[ East Exhibition Hall A-B ]
Abstract
We introduce Knowledge Swapping, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of knock-on feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction—starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of Learning Before Forgetting. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at https://github.com/xingmingyu123456/KnowledgeSwapping.
Poster
Jingbang Chen · Xinyuan Cao · Alicia Stepin · Li Chen
[ East Exhibition Hall A-B ]
Abstract
We study learning-augmented binary search trees (BSTs) via Treaps with carefully designed priorities.The result is a simple search tree in which the depth of each item $x$ is determined by its predicted weight $w_x$.Specifically, each item $x$ is assigned a composite priority of $-\lfloor\log\log(1/w_x)\rfloor + U(0, 1)$ where $U(0, 1)$ is the uniform random variable. By choosing $w_x$ as the relative frequency of $x$, the resulting search trees achieve static optimality.This approach generalizes the recent learning-augmented BSTs [Lin-Luo-Woodruff ICML`22], which only work for Zipfian distributions, by extending them to arbitrary input distributions.Furthermore, we demonstrate that our method can be generalized to a B-Tree data structure using the B-Treap approach [Golovin ICALP'09]. Our search trees are also capable of leveraging localities in the access sequence through online self-reorganization, thereby achieving the working-set property. Additionally, they are robust to prediction errors and support dynamic operations, such as insertions, deletions, and prediction updates. We complement our analysis with an empirical study, demonstrating that our method outperforms prior work and classic data structures.
Poster
Yingpeng Tang · Chao Ren · Xiaoli Tang · Sheng-Jun Huang · Lizhen Cui · Han Yu
[ East Exhibition Hall A-B ]
Abstract
Federated Active Learning (FAL) aims to learn an effective global model, while minimizing label queries. Owing to privacy requirements, it is challenging to design effective active data selection schemes due to the lack of cross-client query information. In this paper, we bridge this important gap by proposing the \underline{F}ederated \underline{A}ctive data selection by \underline{LE}verage score sampling (FALE) method. It is designed for regression tasks in the presence of non-i.i.d. client data to enable the server to select data globally in a privacy-preserving manner. Based on FedSVD, FALE aims to estimate the utility of unlabeled data and perform data selection via leverage score sampling. Besides, a secure model learning framework is designed for federated regression tasks to exploit supervision. FALE can operate without requiring an initial labeled set and select the instances in a single pass, significantly reducing communication overhead. Theoretical analyze establishes the query complexity for FALE to achieve constant factor approximation and relative error approximation. Extensive experiments on 11 benchmark datasets demonstrate significant improvements of FALE over existing state-of-the-art methods.
Poster
Gauthier Thurin · Kimia Nadjahi · Claire Boyer
[ East Exhibition Hall A-B ]
Abstract
Conformal Prediction (CP) is a principled framework for quantifying uncertainty in black-box learning models, by constructing prediction sets with finite-sample coverage guarantees. Traditional approaches rely on scalar nonconformity scores, which fail to fully exploit the geometric structure of multivariate outputs, such as in multi-output regression or multiclass classification. Recent methods addressing this limitation impose predefined convex shapes for the prediction sets, potentially misaligning with the intrinsic data geometry. We introduce a novel CP procedure handling multivariate score functions through the lens of optimal transport. Specifically, we leverage Monge-Kantorovich vector ranks and quantiles to construct prediction region with flexible, potentially non-convex shapes, better suited to the complex uncertainty patterns encountered in multivariate learning tasks. We prove that our approach ensures finite-sample, distribution-free coverage properties, similar to typical CP methods. We then adapt our method for multi-output regression and multiclass classification, and also propose simple adjustments to generate adaptive prediction regions with asymptotic conditional coverage guarantees. Finally, we evaluate our method on practical regression and classification problems, illustrating its advantages in terms of (conditional) coverage and efficiency.
Poster
Xiaoli Tang · Han Yu · Zengxiang Li · Xiaoxiao Li
[ East Exhibition Hall A-B ]
Abstract
Auction-based Federated Learning (AFL) has emerged as an important research field in recent years. The prevailing strategies for FL data consumers (DCs) assume that the entire team of the required data owners (DOs) for an FL task must be assembled before training can commence. In practice, a DC can trigger the FL training process multiple times. DOs can thus be gradually recruited over multiple FL model training sessions. Existing bidding strategies for AFL DCs are not designed to handle such scenarios. Therefore, the problem of multi-session AFL remains open. To address this problem, we propose the Multi-session Budget Optimization Strategy for forward Auction-based Federated Learning (MBOS-AFL). Based on hierarchical reinforcement learning, MBOS-AFL jointly optimizes intersession budget pacing and intra-session bidding for AFL DCs, with the objective of maximizing the total utility. Extensive experiments on six benchmark datasets show that it significantly outperforms seven state-of-the-art approaches. On average, MBOS-AFL achieves 12.28% higher utility, 14.52% more data acquired through auctions for a given budget, and 1.23% higher test accuracy achieved by the resulting FL model compared to the best baseline. To the best of our knowledge, it is the first budget optimization decision support method with budget pacing capability designed for DCs …
Poster
Qiufu Li · Huibin Xiao · Linlin Shen
[ East Exhibition Hall A-B ]
Abstract
When training classification models, it expects that the learned features are compact within classes, and can well separate different classes. As the dominant loss function for training classification models, minimizing cross-entropy (CE) loss maximizes the compactness and distinctiveness, i.e., reaching neural collapse (NC). The recent works show that binary CE (BCE) performs also well in multi-class tasks. In this paper, we compare BCE and CE in deep feature learning. For the first time, we prove that BCE can also maximize the intra-class compactness and inter-class distinctiveness when reaching its minimum, i.e., leading to NC. We point out that CE measures the relative values of decision scores in the model training, implicitly enhancing the feature properties by classifying samples one-by-one. In contrast, BCE measures the absolute values of decision scores and adjust the positive/negative decision scores across all samples to uniformly high/low levels. Meanwhile, the classifier biases in BCE present a substantial constraint on the decision scores to explicitly enhance the feature properties in the training. The experimental results are aligned with above analysis, and show that BCE could improve the classification and leads to better compactness and distinctiveness among sample features. The codes have be released.
Poster
Divyat Mahajan · Mohammad Pezeshki · Charles Arnal · Ioannis Mitliagkas · Kartik Ahuja · Pascal Vincent
[ East Exhibition Hall A-B ]
Abstract
Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirms the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.
Poster
Letong Hong · Zhangyang “Atlas” Wang
[ East Exhibition Hall A-B ]
Abstract
Maximal Update Parameterization ($\mu$P) has shown significant promise in allowing zero-shot hyperparameter transfer across neural network scales, reducing the prohibitive cost of hyperparameter tuning for large models. However, the theoretical foundation behind the observed approximate transferability of hyperparameters remains underexplored. Relying on a width-dominance regime, which ensures that as width grows, certain terms of the learning dynamics dominate, we establish the first fundamental separation of scales in $\mu$P between macro-variables (e.g. loss landscapes) and micro-variables (e.g. individual weights). Our formulation explains why hyperparameter tuning can be effectively performed in early training stages, i.e., \textit{early statistics effectively approximate global hyperparameter optima}, implying the potential to further reduce the training costs required for searching optimal hyperparameters. We further apply our main theory to explain an empirical deep learning phenomenon discovered independently by prior work.
Poster
Jeremy Bernstein · Laker Newhouse
[ East Exhibition Hall A-B ]
Abstract
An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We derive GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers—the latter two methods are based on a Newton-Schulz iteration. We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer.
Poster
Binghui Li · Yuanzhi Li
[ East Exhibition Hall A-B ]
Abstract
Similar to surprising performance in the standard deep learning, deep nets trained by adversarial training also generalize well for unseen clean data (natural data). However, despite adversarial training can achieve low robust training error, there exists a significant robust generalization gap. We call this phenomenon the Clean Generalization and Robust Overfitting (CGRO). In this work, we study the CGRO phenomenon in adversarial training from two views: representation complexity and training dynamics. Specifically, we consider a binary classification setting with $N$ separated training data points. First, we prove that, based on the assumption that we assume there is $\operatorname{poly}(D)$-size clean classifier (where $D$ is the data dimension), ReLU net with only $O(N D)$ extra parameters is able to leverages robust memorization to achieve the CGRO, while robust classifier still requires exponential representation complexity in worst case. Next, we focus on a structured-data case to analyze training dynamics, where we train a two-layer convolutional network with $O(N D)$ width against adversarial perturbation. We then show that a three-stage phase transition occurs during learning process and the network provably converges to robust memorization regime, which thereby results in the CGRO. Besides, we also empirically verify our theoretical analysis by experiments in real-image recognition …
Poster
Jaeho Kim · Seulki Lee
[ East Exhibition Hall A-B ]
Abstract
Unsupervised domain adaptation (UDA) for time series data remains a critical challenge in deep learning, with traditional pseudo-labeling strategies failing to capture temporal patterns and channel-wise shifts between domains, producing sub-optimal pseudo labels. As such, we introduce TransPL, a novel approach that addresses these limitations by modeling the joint distribution $P(X,y)$ of the source domain through code transition matrices, where the codes are derived from vector quantization (VQ) of time series patches. Our method constructs class- and channel-wise code transition matrices from the source domain and employs Bayes' rule for target domain adaptation, generating pseudo-labels based on channel-wise weighted class-conditional likelihoods. TransPL offers three key advantages: explicit modeling of temporal transitions and channel-wise shifts between different domains, versatility towards different UDA scenarios (e.g., weakly-supervised UDA), and explainable pseudo-label generation. We validate TransPL's effectiveness through extensive analysis on four time series UDA benchmarks and confirm that it consistently outperforms state-of-the-art pseudo-labeling methods by a strong margin (6.1\% accuracy improvement, 4.9\% F1 improvement), while providing interpretable insights into the domain adaptation process through its learned code transition matrices.
Poster
Ricardo Buitrago Ruiz · Albert Gu
[ East Exhibition Hall A-B ]
Abstract
Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths---i.e. they fail to length generalize.In this work, we provide comprehensive empirical and theoretical analysis to support the \textit{unexplored states hypothesis}, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all \textit{attainable} states (i.e. states that would be attained if the recurrence was applied to long sequences).Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps ($\sim 0.1\%$ of the pre-training budget), these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g. $2k\longrightarrow 128k$) and show improved performance in long context tasks, thus presenting a simple and efficient way to enable robust length generalization in general recurrent models.
Poster
Abrar Majeedi · Viswanatha Reddy Gajjala · Satya Sai Srinath Namburi GNVV · Nada Elkordi · Yin Li
[ East Exhibition Hall A-B ]
Abstract
Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise future prediction. While deep learning has achieved major success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens' theorem, DeepEDM presents a novel deep model that learns a latent space from time-delayed embeddings, and employs kernel regression to approximate the underlying dynamics, while leveraging efficient implementation of softmax attention and allowing for accurate prediction of future time steps. To evaluate our method, we conduct comprehensive experiments on synthetic data of nonlinear dynamical systems as well as real-world time series across domains. Our results show that DeepEDM is robust to input noise, and outperforms state-of-the-art methods in forecasting accuracy. Our code is available at: https://abrarmajeedi.github.io/deep_edm.
Poster
Shubham Gupta · Thibaut Durand · Graham Taylor · Lilian Bialokozowicz
[ East Exhibition Hall A-B ]
Abstract
We present a novel prompt design for Large Language Models (LLMs) tailored to **Asynchronous Time Series**. Unlike regular time series, which assume values at evenly spaced time points, asynchronous time series consist of timestamped events occurring at irregular intervals, each described in natural language. Our approach effectively utilizes the rich natural language of event descriptions, allowing LLMs to benefit from their broad world knowledge for reasoning across different domains and tasks. This allows us to extend the scope of asynchronous time series analysis beyond forecasting to include tasks like anomaly detection and data imputation. We further introduce **Stochastic Soft Prompting**, a novel prompt-tuning mechanism that significantly improves model performance, outperforming existing finetuning methods such as QLORA. Through extensive experiments on real-world datasets, we demonstrate that our approach achieves state-of-the-art performance across different tasks and datasets.
Poster
Wenzhe Niu · Zongxia Xie · Yanru Sun · Wei He · Man Xu · Chao Hao
[ East Exhibition Hall A-B ]
Abstract
Recent research has shown an increasing interest in utilizing pre-trained large language models (LLMs) for a variety of time series applications. However, there are three main challenges when using LLMs as foundational models for time series forecasting: (1) Cross-domain generalization. (2) Cross-modality alignment. (3) Error accumulation in autoregressive frameworks. To address these challenges, we proposed **LangTime**, a **lan**guage-**g**uided unified model for **time** series forecasting that incorporates cross-domain pre-training with reinforcement learning-based fine-tuning. Specifically, LangTime constructs Temporal Comprehension Prompts (TCPs), which include dataset-wise and channel-wise instructions, to facilitate domain adaptation and condense time series into a single token, enabling LLMs to understand better and align temporal data. To improve autoregressive forecasting, we introduce TimePPO, a reinforcement learning-based fine-tuning algorithm. TimePPO mitigates error accumulation by leveraging a multidimensional rewards function tailored for time series and a repeat-based value estimation strategy. Extensive experiments demonstrate that LangTime achieves state-of-the-art cross-domain forecasting performance, while TimePPO fine-tuning effectively enhances the stability and accuracy of autoregressive forecasting.
Poster
Yihang Wang · Yuying Qiu · Peng Chen · Yang Shu · Zhongwen Rao · Lujia Pan · Bin Yang · Chenjuan Guo
[ East Exhibition Hall A-B ]
Abstract
Existing works on general time series forecasting build foundation models with heavy model parameters through large-scale multi-source pretraining. These models achieve superior generalization ability across various datasets at the cost of significant computational burdens and limitations in resource-constrained scenarios. This paper introduces LightGTS, a lightweight general time series forecasting model designed from the perspective of consistent periodical modeling. To handle diverse scales and intrinsic periods in multi-source pre-training, we introduce Periodical Tokenization, which extracts consistent periodic patterns across different datasets with varying scales. To better utilize the periodicity in the decoding process, we further introduce Periodical Parallel Decoding, which leverage historical tokens to improve forecasting. Based on the two techniques above which fully leverage the inductive bias of periods inherent in time series, LightGTS uses a lightweight model to achieve outstanding performance on general time series forecasting. It achieves state-of-the-art forecasting performance on 9 real-world benchmarks in both zero-shot and full-shot setting with much better efficiency compared with existing time series foundation models
Spotlight Poster
Attila Szász · Balázs Bánhelyi · Mark Jelasity
[ East Exhibition Hall A-B ]
Abstract
The ultimate goal of verification is to guarantee the safety of deployed neural networks. Here, we claim that all the state-of-the-art verifiers we are aware of fail to reach this goal. Our key insight is that theoretical soundness (bounding the full-precision output while computing with floating point) does not imply practical soundness (bounding the floating point output in a potentially stochastic environment). We prove this observation for the approaches that are currently used to achieve provable theoretical soundness, such as interval analysis and its variants. We also argue that achieving practical soundness is significantly harder computationally. We support our claims empirically as well by evaluating several well-known verification methods. To mislead the verifiers, we create adversarial networks that detect and exploit features of the deployment environment, such as the order and precision of floating point operations. We demonstrate that all the tested verifiers are vulnerable to our new deployment-specific attacks, which proves that they are not practically sound.
Poster
Renpu Liu · Hao Wu · Jiawei Zhang · Xin Cheng · Xiangyang Luo · Bin Ma · Jinwei Wang
[ East Exhibition Hall A-B ]
Abstract
Adversarial examples have been shown to deceive Deep Neural Networks (DNNs), raising widespread concerns about this security threat. More seriously, as different DNN models share critical features, feature-level attacks can generate transferable adversarial examples, thereby deceiving black-box models in real-world scenarios. Nevertheless, we have theoretically discovered the principle behind the limited transferability of existing feature-level attacks: their attack effectiveness is essentially equivalent to perturbing features in one step along the direction of feature importance in the feature space, despite performing multiple perturbations in the pixel space. This finding indicates that existing feature-level attacks are inefficient in disrupting features through multiple pixel-space perturbations. To address this problem, we propose P2FA, an attack that efficiently perturbs features multiple times. Specifically, we directly shift the perturbed space from pixel to feature space. Then, we perturb the features multiple times rather than just once in the feature space with the guidance of feature importance to enhance the efficiency of disrupting critical shared features. Finally, we invert the perturbed features to the pixels to generate more transferable adversarial examples. Numerous experimental results strongly demonstrate the superior transferability of P2FA over State-Of-The-Art (SOTA) attacks.
Poster
Yunjuan Wang · Raman Arora
[ East Exhibition Hall A-B ]
Abstract
Despite the remarkable success of large foundation models across a range of tasks, they remain susceptible to security threats such as backdoor attacks. By injecting poisoned data containing specific triggers during training, adversaries can manipulate model predictions in a targeted manner. While prior work has focused on empirically designing and evaluating such attacks, a rigorous theoretical understanding of when and why they succeed is lacking. In this work, we analyze backdoor attacks that exploit the token selection process within attention mechanisms--a core component of transformer-based architectures. We show that single-head self-attention transformers trained via gradient descent can interpolate poisoned training data. Moreover, we prove that when the backdoor triggers are sufficiently strong but not overly dominant, attackers can successfully manipulate model predictions. Our analysis characterizes how adversaries manipulate token selection to alter outputs and identifies the theoretical conditions under which these attacks succeed. We validate our findings through experiments on synthetic datasets.
Poster
Xinpeng Dong · Min Zhang · Didi Zhu · Ye Jian · zhang keli · Aimin Zhou · Fei Wu · Kun Kuang
[ East Exhibition Hall A-B ]
Abstract
Pre-trained vision-language models (VLMs) have revolutionized the field of machine learning, demonstrating exceptional performance across a wide range of tasks. However, their robustness remains vulnerable to the spurious-correlation problem. Existing works often involve fine-tuning the model with labeled data or relying on large language models (LLMs) to generate more complex prompts. Although effective to some extent, these methods introduce new challenges, including additional computational costs and dependence on the quality of prompts without fully utilizing the vision modality. To address these limitations, we propose a novel method named ERICT to Enhance model Robustness by Identifying Concept Tokens. ERICT mitigates spurious correlation directly in the inference stage and comprises two key steps: (1) Identify concept tokens capturing invariant features through auxiliary prompts to generate a token-level mask. (2) Apply the mask to the attention weights of the CLS token in the vision encoder to help the model focus on the relevant image region. Extensive experiments show that ERICT significantly improves the overall performance including that of the worst group, and achieves new state-of-the-art results.
Poster
Mohamad Chehade · Wenting Li · Brian Bell · Russell Bent · Saif Kazi · Hao Zhu
[ East Exhibition Hall A-B ]
Abstract
The robustness of neural networks is crucial in safety-critical applications, where identifying a reliable input space is essential for effective model selection, robustness evaluation, and the development of reliable control strategies. Most existing robustness verification methods assess the worst-case output under the assumption that the input space is known. However, precisely identifying a verifiable input space $ \mathcal{C} $, where no adversarial examples exist, is challenging due to the possible high dimensionality, discontinuity, and non-convex nature of the input space. To address this challenge, we propose a novel framework, **LEVIS**, comprising **LEVIS-$\alpha$** and **LEVIS-$\beta$**. **LEVIS-$\alpha$** identifies a single, large verifiable ball that intersects at least two boundaries of a bounded region $ \mathcal{C} $, while **LEVIS-$\beta$** systematically captures the entirety of the verifiable space by integrating multiple verifiable balls. Our contributions are fourfold: we introduce a verification framework, **LEVIS**, incorporating two optimization techniques for computing nearest and directional adversarial points based on mixed-integer programming (MIP); to enhance scalability, we integrate complementary constrained (CC) optimization with a reduced MIP formulation, achieving up to a 17-fold reduction in runtime by approximating the verifiable region in a principled way; we provide a theoretical analysis characterizing the properties of the verifiable balls obtained through …
Poster
Spandan Pyakurel · Qi Yu
[ East Exhibition Hall A-B ]
Abstract
Fine-Grained Openset Detection (FGOD) poses a fundamental challenge due to the similarity between the openset classes and the closed-set ones. Since real-world objects/entities tend to form a hierarchical structure, the fine-grained relationship among the closed-set classes as captured by the hierarchy could potentially improve the FGOD performance. Intuitively, the hierarchical dependency among different classes allows the model to recognize their subtle differences, which in turn makes it better at differentiating similar open-set classes even when they share the same parent. However, simply performing openset detection in a top-down fashion by building a local detector for each node may result in poor detection performance. Our theoretical analysis also reveals that maximizing the probability of the path leading to the ground-truth leaf node results in a sub-optimal training process. To systematically address this issue, we propose to formulate a novel state-based node representation, which constructs a state space based upon the entire hierarchical structure. We prove that the state-based representation is guaranteed to maximize the probability of the path leading to the ground-truth leaf node. Extensive experiments on multiple real-world hierarchical datasets clearly demonstrate the superior performance of the proposed method.
Poster
Nikita Durasov · Assaf Shocher · Doruk Oner · Gal Chechik · Alexei Efros · Pascal Fua
[ East Exhibition Hall A-B ]
Abstract
Deep learning models often struggle when deployed in real-world settings due to distribution shifts between training and test data. While existing approaches like domain adaptation and test-time training (TTT) offer partial solutions, they typically require additional data or domain-specific auxiliary tasks. We present Idempotent Test-Time Training (IT3), a novel approach that enables on-the-fly adaptation to distribution shifts using only the current test instance, without any auxiliary task design. Our key insight is that enforcing idempotence---where repeated applications of a function yield the same result---can effectively replace domain-specific auxiliary tasks used in previous TTT methods. We theoretically connect idempotence to prediction confidence and demonstrate that minimizing the distance between successive applications of our model during inference leads to improved out-of-distribution performance. Extensive experiments across diverse domains (including image classification, aerodynamics prediction, and aerial segmentation) and architectures (MLPs, CNNs, GNNs) show that IT3 consistently outperforms existing approaches while being simpler and more widely applicable. Our results suggest that idempotence provides a universal principle for test-time adaptation that generalizes across domains and architectures.
Poster
Yuhao Sun · Jiacheng Zhang · Zesheng Ye · Chaowei Xiao · Feng Liu
[ East Exhibition Hall A-B ]
Abstract
*Diffusion-based purification* (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected into the input sample is key to the success of DBP methods, which is controlled by a constant noise level $t^*$ for all samples in existing methods. In this paper, we discover that the optimal $t^*$ for each sample can indeed be different. Intuitively, the cleaner a sample is, the less noise should be injected, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate *distinct noise levels to different samples* in …
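As a concrete illustration of the reweighting idea, the sketch below maps batch-normalized score norms to per-sample noise levels. The linear rescaling and the `score_net` interface are assumptions for illustration; SSNI's actual reweighting function may differ.

```python
import torch

def sample_specific_t(x, score_net, t_base: int, t_min: int, t_max: int):
    """Illustrative SSNI-style noise-level reweighting (assumption: a
    linear map from relative score norm to noise level)."""
    with torch.no_grad():
        # Score norm as a proxy for how far each sample is from clean data.
        t = torch.full((x.size(0),), t_base, device=x.device)
        scores = score_net(x, t)
        norms = scores.flatten(1).norm(dim=1)
        # Normalize within the batch, then rescale into [t_min, t_max].
        w = (norms - norms.min()) / (norms.max() - norms.min() + 1e-8)
        return (t_min + w * (t_max - t_min)).round().long()  # one t per sample
```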
Spotlight Poster
Hong-Ming Chiu · Hao Chen · Huan Zhang · Richard Zhang
[ East Exhibition Hall A-B ]
Abstract
Neural network verifiers based on linear bound propagation scale impressively to massive models but can be surprisingly loose when neuron coupling is crucial. Conversely, semidefinite programming (SDP) verifiers capture inter-neuron coupling naturally, but their cubic complexity restricts them to only small models. In this paper, we propose SDP-CROWN, a novel hybrid verification framework that combines the tightness of SDP relaxations with the scalability of bound-propagation verifiers. At the core of SDP-CROWN is a new linear bound---derived via SDP principles---that explicitly captures $\ell_{2}$-norm-based inter-neuron coupling while adding only one extra parameter per layer. This bound can be integrated seamlessly into any linear bound-propagation pipeline, preserving the inherent scalability of such methods yet significantly improving tightness. In theory, we prove that our inter-neuron bound can be up to a factor of $\sqrt{n}$ tighter than traditional per-neuron bounds. In practice, when incorporated into the state-of-the-art $\alpha$-CROWN verifier, we observe markedly improved verification performance on large models with up to 65 thousand neurons and 2.47 million parameters, achieving tightness that approaches that of costly SDP-based methods.
Poster
Xiaoyu Yang · Lijian Xu · Hongsheng Li · Shaoting Zhang
[ East Exhibition Hall A-B ]
Abstract
This paper proposes a scalable and straightforward pre-training paradigm for efficient visual conceptual representation called occluded image contrastive learning (OCL). Our OCL approach is simple: we randomly mask patches to generate different views within an image and contrast them among a mini-batch of images. The core idea behind OCL consists of two designs. First, masked tokens have the potential to significantly diminish the conceptual redundancy inherent in images, and create distinct views with substantial fine-grained differences on the semantic concept level instead of the instance level. Second, contrastive learning is adept at extracting high-level semantic conceptual features during the pre-training, circumventing the high-frequency interference and additional costs associated with image reconstruction. Importantly, OCL learns highly semantic conceptual representations efficiently without relying on hand-crafted data augmentations or additional auxiliary modules. Empirically, OCL demonstrates high scalability with Vision Transformers, as the ViT-L/16 can complete pre-training in 133 hours using only 4 A100 GPUs, achieving 85.8\% accuracy in downstream fine-tuning tasks. Code is available at https://github.com/XiaoyuYoung/OCL.
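A minimal rendering of the OCL objective, under the simplifying assumptions that per-pixel masking stands in for the paper's patch masking and that `encoder` maps a masked image to an embedding; two independent maskings of the same image form a positive pair for a standard InfoNCE loss.

```python
import torch
import torch.nn.functional as F

def ocl_loss(encoder, images, mask_ratio: float = 0.5, temp: float = 0.1):
    """Illustrative occluded-image contrastive loss (per-pixel masking is
    a simplification of the paper's patch masking)."""
    def masked_view(x):
        mask = (torch.rand_like(x[:, :1]) > mask_ratio).float()
        return encoder(x * mask)                    # (batch, dim)

    z1 = F.normalize(masked_view(images), dim=1)    # view 1
    z2 = F.normalize(masked_view(images), dim=1)    # view 2
    logits = z1 @ z2.T / temp                       # contrast across the batch
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```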
Poster
Mengzhu Wang · houcheng su · Jiao Li · Chuan Li · Nan Yin · Li Shen · Jingcai Guo
[ East Exhibition Hall A-B ]
Abstract
Semi-supervised learning (SSL) has made notable advancements in medical image segmentation (MIS), particularly in scenarios with limited labeled data, significantly enhancing data utilization efficiency. Previous methods primarily focus on complex training strategies to utilize unlabeled data but neglect the importance of graph structural information. Different from existing methods, we propose a graph-based clustering for semi-supervised medical image segmentation (GraphCL) by jointly modeling graph data structure in a unified deep model. The proposed GraphCL model enjoys several advantages. Firstly, to the best of our knowledge, this is the first work to model the data structure information for semi-supervised medical image segmentation (SSMIS). Secondly, to get the clustered features across different graphs, we integrate both pairwise affinities between local image features and raw features as inputs. Extensive experimental results on three standard benchmarks show that the proposed GraphCL algorithm outperforms state-of-the-art semi-supervised medical image segmentation methods.
Poster
Samar Khanna · Medhanie Irgau · David Lobell · Stefano Ermon
[ East Exhibition Hall A-B ]
Abstract
Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming full pre-training and fine-tuning of ViTs. Using the DinoV2 training objective, we demonstrate up to 8% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach …
Poster
Jiaqi Yan · Changping Wang · De Ma · Huajin Tang · Qian Zheng · Gang Pan
[ East Exhibition Hall A-B ]
Abstract
Spiking Neural Networks (SNNs) are considered promising energy-efficient models due to their dynamic capability to process spatial-temporal spike information. Existing work has demonstrated that SNNs exhibit temporal heterogeneity, which leads to diverse outputs of SNNs at different time steps and has the potential to enhance their performance. Although SNNs obtained by direct training methods achieve state-of-the-art performance, current methods introduce limited temporal heterogeneity through the dynamics of spiking neurons or network structures; they do not improve temporal heterogeneity through the lens of the gradient. In this paper, we first conclude that the diversity of the temporal logit gradients in current methods is limited. This leads to insufficient temporal heterogeneity and results in temporally miscalibrated SNNs with degraded performance. Based on the above analysis, we propose a Temporal Model Calibration (TMC) method, which can be seen as a logit gradient rescaling mechanism across time steps. Experimental results show that our method can improve the temporal logit gradient diversity and generate temporally calibrated SNNs with enhanced performance. In particular, our method achieves state-of-the-art accuracy on ImageNet, DVSCIFAR10, and N-Caltech101. Codes are available at https://github.com/zju-bmi-lab/TMC.
Poster
Zhiyao Ren · Siyuan Liang · Aishan Liu · Dacheng Tao
[ East Exhibition Hall A-B ]
Abstract
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs) due to its adaptability and parameter-free nature. However, it also introduces a critical vulnerability to backdoor attacks, where adversaries can manipulate LLM behaviors by simply poisoning a few ICL demonstrations. In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts within poisoned demonstrations, jointly influencing the probability of model outputs. Through theoretical analysis, we derive an upper bound for ICL backdoor effects, revealing that the vulnerability is dominated by the concept preference ratio between the task and the backdoor. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio. Our method encourages LLMs to select clean demonstrations during the ICL phase by leveraging confidence and similarity scores, effectively mitigating susceptibility to backdoor attacks. Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02\% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (*e.g.*, GPT-4).
Poster
Guangtao Zheng · Wenqian Ye · Aidong Zhang
[ East Exhibition Hall A-B ]
Abstract
Deep neural networks often develop spurious bias, a reliance on correlations between non-essential features and classes for predictions. For example, a model may identify objects based on frequently co-occurring backgrounds rather than intrinsic features, resulting in degraded performance on data lacking these correlations. Existing mitigation approaches typically depend on external annotations of spurious correlations, which may be difficult to obtain and may not reflect the spurious bias actually present in a model. In this paper, we take a step towards self-guided mitigation of spurious bias by proposing NeuronTune, a post hoc method that directly intervenes in a model's internal decision process. Our method probes a model's latent embedding space to identify and regulate neurons that lead to spurious prediction behaviors. We theoretically justify our approach and show that it brings the model closer to an unbiased one. Unlike previous methods, NeuronTune operates without requiring spurious correlation annotations, making it a practical and effective tool for improving model robustness. Experiments across different architectures and data modalities demonstrate that our method significantly mitigates spurious bias in a self-guided way.
Spotlight Poster
Siyuan Duan · Wenyuan Wu · Peng Hu · Zhenwen Ren · Dezhong Peng · Yuan Sun
[ East Exhibition Hall A-B ]
Abstract
Physics-informed neural networks (PINNs) aim to constrain the outputs and gradients of deep learning models to satisfy specified governing physics equations, which have demonstrated significant potential for solving partial differential equations (PDEs). Although existing PINN methods have achieved pleasing performance, they treat easy and hard sample points indiscriminately, especially those near physical boundaries. This easily causes the PINN model to fall into undesirable local minima and unstable learning, thereby resulting in an Unbalanced Prediction Problem (UPP). To deal with this daunting problem, we propose a novel framework named Cognitive Physical Informed Neural Network (CoPINN) that imitates the human cognitive learning manner from easy to hard. Specifically, we first employ separable subnetworks to encode independent one-dimensional coordinates and apply an aggregation scheme to generate multi-dimensional predicted physical variables. Then, during the training phase, we dynamically evaluate the difficulty of each sample according to the gradient of the PDE residuals. Finally, we propose a cognitive training scheduler to progressively optimize the entire sampling regions from easy to hard, thereby improving robustness and generalization when predicting physical boundary regions. Extensive experiments demonstrate that our CoPINN achieves state-of-the-art performance, particularly significantly reducing prediction errors in stubborn regions.
Poster
Nico Pelleriti · Max Zimmer · Elias Wirth · Sebastian Pokutta
[ East Exhibition Hall A-B ]
Abstract
Deep neural networks have reshaped modern machine learning by learning powerful latent representations that often align with the manifold hypothesis: high-dimensional data lie on lower-dimensional manifolds. In this paper, we establish a connection between manifold learning and computational algebra by demonstrating how vanishing ideals can characterize the latent manifolds of deep networks. To that end, we propose a new neural architecture that (i) truncates a pretrained network at an intermediate layer, (ii) approximates each class manifold via polynomial generators of the vanishing ideal, and (iii) transforms the resulting latent space into linearly separable features through a single polynomial layer. The resulting models have significantly fewer layers than their pretrained baselines, while maintaining comparable accuracy, achieving higher throughput, and utilizing fewer parameters. Furthermore, drawing on spectral complexity analysis, we derive sharper theoretical guarantees for generalization, showing that our approach can in principle offer tighter bounds than standard deep networks. Numerical experiments confirm the effectiveness and efficiency of the proposed approach.
Poster
Xiangheng Wang · Ziquan Fang · Chenglong Huang · Danlei Hu · Lu Chen · Yunjun Gao
[ East Exhibition Hall A-B ]
Abstract
Trajectory representation learning aims to transform raw trajectory data into compact and low-dimensional vectors that are suitable for downstream analysis. However, most existing methods adopt either a free-space view or a road-network view during the learning process, which limits their ability to capture the complex, multi-view spatiotemporal features inherent in trajectory data. Moreover, these approaches rely on task-specific model training, restricting their generalizability and effectiveness for diverse analysis tasks. To this end, we propose GTR, a general, multi-view, and dynamic Trajectory Representation framework built on a pre-train and fine-tune architecture. Specifically, GTR introduces a multi-view encoder that captures the intrinsic multi-view spatiotemporal features. Based on the pre-train and fine-tune architecture, we design spatio-temporal fusion pre-training with a spatio-temporal mixture of experts to dynamically combine spatial and temporal features, enabling seamless adaptation to diverse trajectory analysis tasks. Furthermore, we propose an online frozen-hot updating strategy to efficiently update the representation model, accommodating the dynamic nature of trajectory data. Extensive experiments on two real-world datasets demonstrate that GTR consistently outperforms 15 state-of-the-art methods across 6 mainstream trajectory analysis tasks. All source code and data are available at https://github.com/ZJU-DAILY/GTR.
Poster
yu chen · Jing Lian · Zhaofei Yu · Jizhao Liu · Jisheng Dang · Gang Wang
[ East Exhibition Hall A-B ]
Abstract
Event cameras are bio-inspired vision sensors that encode visual information with high dynamic range, high temporal resolution, and low latency. Current state-of-the-art event stream processing methods rely on end-to-end deep learning techniques. However, these models are heavily dependent on data structures, limiting their stability and generalization capabilities across tasks, thereby hindering their deployment in real-world scenarios. To address this issue, we propose a chaotic dynamics event signal processing framework inspired by the dorsal visual pathway of the brain. Specifically, we utilize Continuous-coupled Neural Network (CCNN) to encode the event stream. CCNN encodes polarity-invariant event sequences as periodic signals and polarity-changing event sequences as chaotic signals. We then use continuous wavelet transforms to analyze the dynamical states of CCNN neurons and establish the high-order mappings of the event stream. The effectiveness of our method is validated through integration with conventional classification networks, achieving state-of-the-art classification accuracy on the N-Caltech101 and N-CARS datasets, with results of 84.3% and 99.9%, respectively. Our method improves the accuracy of event camera-based object classification while significantly enhancing the generalization and stability of event representation.
Poster
Zhixuan Chen · Xing Hu · Dawei Yang · Zukang Xu · XUCHEN · Zhihang Yuan · Sifan Zhou · JiangyongYu
[ East Exhibition Hall A-B ]
Abstract
Mixture-of-Experts (MoE) large language models (LLMs), which leverage dynamic routing and sparse activation to enhance efficiency and scalability, have achieved higher performance while reducing computational costs. However, these models face significant memory overheads, limiting their practical deployment and broader adoption. Post-training quantization (PTQ), a widely used method for compressing LLMs, encounters severe accuracy degradation and diminished generalization performance when applied to MoE models. This paper investigates the impact of MoE's sparse and dynamic characteristics on quantization and identifies two primary challenges: (1) Inter-expert imbalance, referring to the uneven distribution of samples across experts, which leads to insufficient and biased calibration for less frequently utilized experts; (2) Intra-expert imbalance, arising from MoE's unique aggregation mechanism, which leads to varying degrees of correlation between different samples and their assigned experts. To address these challenges, we propose MoEQuant, a novel quantization framework tailored for MoE LLMs. MoEQuant includes two novel techniques: 1) Expert-Balanced Self-Sampling (EBSS) is a sampling method that efficiently constructs a calibration set with balanced expert distributions by leveraging the cumulative probabilities of tokens and expert balance metrics as guiding factors. 2) Affinity-Guided Quantization (AGQ), which incorporates affinities between experts and samples into the quantization process, thereby accurately assessing the …
Poster
Yuli Chen · Bo Cheng · Jiale Han · Yingying Zhang · Yingting Li · Shuhao Zhang
[ East Exhibition Hall A-B ]
Abstract
Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code\footnote{The code is available at: \url{https://github.com/ironartisan/DLP}.} to facilitate future research.
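The abstract's recipe (layer importance from weights plus input activations, with pruning rates assigned accordingly) can be sketched as follows; the specific importance statistic and the inverse-proportional allocation are illustrative assumptions, not DLP's exact rule.

```python
import torch

def layerwise_pruning_rates(weights, activations, target_sparsity: float):
    """Illustrative DLP-style allocation: layers with lower |W|*|X|
    importance receive higher pruning rates (heuristic assumption)."""
    importance = torch.tensor([
        (w.abs().mean() * a.abs().mean()).item()
        for w, a in zip(weights, activations)
    ])
    inverse = 1.0 / (importance + 1e-8)
    # Scale so the mean rate equals the target sparsity, then cap.
    rates = inverse / inverse.sum() * target_sparsity * len(weights)
    return rates.clamp(max=0.99)
```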
Poster
Disen Lan · Weigao Sun · Jiaxi Hu · Jusen Du · Yu Cheng
[ East Exhibition Hall A-B ]
Abstract
Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents \textbf{Liger}, short for \underline{\textbf{Li}}nearizing LLMs to \underline{\textbf{g}}at\underline{\textbf{e}}d \underline{\textbf{r}}ecurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which significantly recovers 93\% of the Transformer-based LLM performance at 0.02\% pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to …
Poster
Siqi Guo · Ilgee Hong · Vicente Balmaseda · Changlong Yu · Liang Qiu · Xin Liu · Haoming Jiang · Tuo Zhao · Tianbao Yang
[ East Exhibition Hall A-B ]
Abstract
Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs) using supervised datasets of input-output pairs. However, despite being supervised, SFT is inherently limited by its generative training objective. To address its limitations, the existing common strategy is to follow SFT with a separate phase of preference optimization (PO), which relies on either human-labeled preference data or a strong reward model to guide the learning process. In this paper, we address the limitations of SFT by exploring one of the most successful techniques in conventional supervised learning: discriminative learning. We introduce **Discriminative Fine-Tuning (DFT)**, an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data or training strong reward models. Unlike SFT that employs a generative approach and overlooks negative data, DFT adopts a **discriminative paradigm** that increases the probability of positive answers while suppressing potentially negative ones, aiming for **data prediction** instead of token prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT's effectiveness, achieving performance better than …
Poster
Dan Busbridge · Amitis Shidani · Floris Weers · Jason Ramapuram · Etai Littwin · Russell Webb
[ East Exhibition Hall A-B ]
Abstract
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
Poster
Zhen Xiang · Linzhi Zheng · Yanjie Li · Junyuan Hong · Qinbin Li · Han Xie · Jiawei Zhang · Zidi Xiong · Chulin Xie · Carl Yang · Dawn Song · Bo Li
[ East Exhibition Hall A-B ]
Abstract
The rapid advancement of large language model (LLM) agents has raised new concerns regarding their safety and security. In this paper, we propose GuardAgent, the first guardrail agent to protect target agents by dynamically checking whether their actions satisfy given safety guard requests. Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then maps this plan into guardrail code for execution. By performing the code execution, GuardAgent can deterministically follow the safety guard request and safeguard target agents. In both steps, an LLM is utilized as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module storing experiences from previous tasks. In addition, we propose two novel benchmarks: EICU-AC benchmark to assess the access control for healthcare agents and Mind2Web-SC benchmark to evaluate the safety policies for web agents. We show that GuardAgent effectively moderates the violation actions for different types of agents on these two benchmarks with over 98% and 83% guardrail accuracies, respectively. Project page: https://guardagent.github.io/
Poster
Jingtan Wang · Xiaoqiang Lin · Rui Qiao · Pang Wei Koh · Chuan-Sheng Foo · Bryan Kian Hsiang Low
[ East Exhibition Hall A-B ]
Abstract
Curating data for instruction tuning is crucial for enhancing the performance of large language models (LLMs). This work aims to select training data for instruction tuning to improve the LLM performance on specific tasks. Existing methods often rely on next-token prediction (NTP) loss as a proxy for target task performance due to the non-differentiable nature of performance evaluation metrics. They select training data points that are most helpful in reducing validation loss. However, there is a discrepancy between minimizing NTP loss and maximizing performance (e.g., code pass rate in code generation). To remedy this, we introduce a novel Non-differentiable evaluation metric-based InfluenCe Estimation (NICE), which leverages the policy gradient to select the training data that improves the performance. Moreover, NICE can perform data selection in the absence of labels (ground-truth responses) when the evaluation metrics do not require labels (e.g., a reward model can output reward scores without supervision from labels). Experimental results show that our approach outperforms existing data selection baselines that use NTP loss in diverse and realistic scenarios. Notably, subsets selected by NICE often produce models that outperform those trained on the full dataset. Our code is available at https://github.com/JTWang2000/NICE.
Poster
Zhichen Dong · Zhanhui Zhou · Zhixuan Liu · Chao Yang · Chaochao Lu
[ East Exhibition Hall A-B ]
Abstract
In this work, we argue that large language models (LLMs), though trained to predict only the next token, exhibit emergent planning behaviors: $\textbf{their hidden representations encode future outputs beyond the next token}$. Through simple probing, we demonstrate that LLM prompt representations encode global attributes of their entire responses, including $\textit{structure attributes}$ (e.g., response length, reasoning steps), $\textit{content attributes}$ (e.g., character choices in storywriting, multiple-choice answers at the end of response), and $\textit{behavior attributes}$ (e.g., answer confidence, factual consistency). In addition to identifying response planning, we explore how it scales with model size across tasks and how it evolves during generation. The findings that LLMs plan ahead for the future in their hidden representations suggest potential applications for improving transparency and generation control.
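The probing described here is simple enough to state directly. A minimal sketch, assuming `prompt_reps` holds hidden states of prompts (e.g., at the last prompt token) and `labels` the corresponding global attribute of the eventual responses, such as the final answer choice:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_attribute(prompt_reps, labels):
    """Linear probe: above-chance held-out accuracy indicates the prompt
    representation already encodes the future response attribute."""
    X_tr, X_te, y_tr, y_te = train_test_split(prompt_reps, labels, test_size=0.2)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```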
Poster
Zhenni Bi · Kai Han · Chuanjian Liu · Yehui Tang · Yunhe Wang
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) have demonstrated remarkable abilities across various language tasks, but solving complex reasoning problems remains a significant challenge. While existing methods, such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT), enhance reasoning by decomposing problems or structuring prompts, they typically perform a single pass of reasoning and may fail to revisit flawed paths, compromising accuracy. To address this limitation, we propose a novel reasoning framework called Forest-of-Thought (FoT), which integrates multiple reasoning trees to leverage collective decision-making for solving complex logical problems. FoT employs sparse activation strategies to select the most relevant reasoning paths, improving both efficiency and accuracy. Additionally, we introduce a dynamic self-correction strategy that enables real-time error correction, along with consensus-guided decision-making strategies to optimize both correctness and computational resources. Experimental results demonstrate that the FoT framework, combined with these strategies, significantly enhances the reasoning capabilities of LLMs, enabling them to solve complex tasks with greater precision and efficiency.
Poster
Baohao Liao · Yuhui Xu · Hanze Dong · Junnan Li · Christof Monz · Silvio Savarese · Doyen Sahoo · Caiming Xiong
[ East Exhibition Hall A-B ]
Abstract
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains against decoding with the target model only (up to 4.4X fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.
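The control flow of RSD can be summarized in a few lines. Everything below is a hedged sketch: `next_step`, `prm.score`, and the fixed acceptance threshold are hypothetical stand-ins for the paper's step-level APIs and threshold-based mixture strategy.

```python
def rsd_generate(prompt, draft_model, target_model, prm,
                 threshold: float = 0.7, max_steps: int = 32) -> str:
    """Illustrative reward-guided speculative decoding loop."""
    steps = []
    for _ in range(max_steps):
        candidate = draft_model.next_step(prompt, steps)   # cheap draft step
        if prm.score(prompt, steps + [candidate]) >= threshold:
            step = candidate                               # high reward: keep it
        else:
            step = target_model.next_step(prompt, steps)   # fall back to target
        steps.append(step)
        if step.endswith("<eos>"):
            break
    return "".join(steps)
```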
Poster
Jasper Dekoninck · Maximilian Baader · Martin Vechev
[ East Exhibition Hall A-B ]
Abstract
The availability of a wide range of large language models (LLMs) embedded in various agentic systems has significantly increased the potential of model selection strategies to improve the cost-performance tradeoff. Existing strategies involve either routing, where a single model is chosen per query, or cascading, which sequentially runs increasingly larger models until a satisfactory answer is found. However, current approaches face three key limitations: they (1) lack formal proofs of optimality, (2) fail to identify the conditions under which these strategies are most effective to improve the cost-performance tradeoff, and (3) are unable to combine both paradigms for further improvements. To address these issues, we first derive a novel optimal strategy for cascading and prove the optimality of an existing routing strategy. Further, we propose *cascade routing*, a unified framework that integrates routing and cascading into a theoretically optimal strategy. Through our analysis, we identify good quality estimators as the critical factor for the success of model selection paradigms. Finally, in our experiments, we show that cascade routing consistently outperforms the individual approaches by a large margin and we analyze quality estimators to determine when routing and/or cascading are useful paradigms for model selection.
Poster
Archiki Prasad · Weizhe Yuan · Richard Yuanzhe Pang · Jing Xu · Maryam Fazel-Zarandi · Mohit Bansal · Sainbayar Sukhbaatar · JASON WESTON · Jane Dwivedi-Yu
[ East Exhibition Hall A-B ]
Abstract
Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
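One way to picture ScPO's data construction: sample many solutions, take the most consistent (majority) answer as chosen and the least consistent as rejected. The `model.sample` interface returning (reasoning, answer) pairs is an assumption for illustration.

```python
from collections import Counter

def scpo_preference_pairs(problem, model, n_samples: int = 16):
    """Illustrative ScPO pair construction from self-consistency votes."""
    samples = [model.sample(problem) for _ in range(n_samples)]
    counts = Counter(answer for _, answer in samples)
    ranked = counts.most_common()
    best, worst = ranked[0][0], ranked[-1][0]
    if best == worst:
        return []  # all answers agree: no preference signal
    chosen = next(r for r, a in samples if a == best)
    rejected = next(r for r, a in samples if a == worst)
    return [(problem, chosen, rejected)]
```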
Poster
Lifan Yuan · Wendi Li · Huayu Chen · Ganqu Cui · Ning Ding · Kaiyan Zhang · Bowen Zhou · Zhiyuan Liu · Hao Peng
[ East Exhibition Hall A-B ]
Abstract
Different from its counterpart outcome reward models (ORMs), which evaluate entire responses, a process reward model (PRM) scores a reasoning trajectory step by step, providing denser and more fine-grained rewards. However, training a PRM requires labels annotated at every intermediate step, presenting significant challenges for both manual and automatic data collection. This paper aims to address this challenge. Both theoretically and empirically, we show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratio of the policy and reference models, $r_\phi(y) = \beta \log \frac{\pi_\phi(y)}{\pi_{\text{ref}}(y)}$, which can be optimized regardless of the specific choice of loss objectives. In experiments, we instantiate our implicit PRMs with various objectives and evaluate their performance on MATH. We show that our implicit PRM outperforms a strong MCTS-based baseline à la Math-Shepherd (Wang et al., 2023) using less than 1/38 of the training data. Its performance can be further improved with majority voting. We further find that scaling up instructions and responses benefits our implicit PRM, and the latter brings a larger gain. Particularly, we find that our …
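Under this parameterization, per-step rewards fall out of token log-probabilities alone. A minimal sketch, assuming per-token log-probabilities from the policy and reference models and an index of each step's final token (the bookkeeping is illustrative):

```python
import torch

def implicit_step_rewards(policy_logprobs: torch.Tensor,
                          ref_logprobs: torch.Tensor,
                          step_ends: torch.Tensor,
                          beta: float = 0.05) -> torch.Tensor:
    """Step rewards implied by r(y) = beta * log(pi(y) / pi_ref(y)):
    the reward of step k is the increment of the cumulative log-ratio
    up to that step's final token."""
    log_ratio = (policy_logprobs - ref_logprobs).cumsum(dim=-1)
    cumulative = beta * log_ratio[step_ends]       # reward up to each step
    previous = torch.cat([cumulative.new_zeros(1), cumulative[:-1]])
    return cumulative - previous                   # per-step increments
```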
Spotlight Poster
Weizhong Huang · Yuxin Zhang · Xiawu Zheng · Fei Chao · Rongrong Ji
[ East Exhibition Hall A-B ]
Abstract
In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of **"reconstruction error explosion"** in existing LLMs sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction …
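The allocation scheme is simple enough to state exactly: a monotonically increasing arithmetic progression of layer sparsities whose mean matches the overall target, leaving the common difference as the single hyperparameter. A sketch (function name assumed):

```python
def arithmetic_sparsity(num_layers: int, avg_sparsity: float, d: float):
    """Layer sparsities s_i = start + i * d whose mean is avg_sparsity;
    e.g. arithmetic_sparsity(32, 0.7, 0.005)."""
    start = avg_sparsity - d * (num_layers - 1) / 2
    rates = [start + i * d for i in range(num_layers)]
    assert all(0.0 <= r <= 1.0 for r in rates), "choose a smaller d"
    return rates
```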
Spotlight Poster
Xinyu Guan · Li Lyna Zhang · Yifei Liu · Ning Shang · Youran Sun · Yi Zhu · Fan Yang · Mao Yang
[ East Exhibition Hall A-B ]
Abstract
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising ``deep thinking'' through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids na\"ive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0%, surpassing o1-preview by +4.5%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. …
Poster
Ruibo Chen · Yihan Wu · Junfeng Guo · Heng Huang
[ East Exhibition Hall A-B ]
Abstract
Watermarking techniques offer a promising way to identify machine-generated content via embedding covert information into the contents generated from language models (LMs). However, the robustness of the watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method utilizes a novel querying strategy, termed random selection probing, which aids in assessing the strength of the watermark and identifying the red-green list within the n-gram watermark. Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark removal and exploitation tasks.
Spotlight Poster
Utkarsh Saxena · Sayeh Sharify · Kaushik Roy · Xin Wang
[ East Exhibition Hall A-B ]
Abstract
Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantization of all weight, activation and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes further the state-of-the-art. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g.~8-bit, while quantizing the rest to 4-bit. Within each subspace, invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed precision PTQ methods on a variety of benchmarks, achieving up to 33\% lower perplexity on Wikitext than the next best method SpinQuant, and up to 3X speedup over the 16-bit baseline. Anonymous code repository available at https://anonymous.4open.science/r/project-resq-2142.
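The PCA step can be sketched in a few lines. This is a hedged illustration, assuming plain PCA on calibration activations and omitting ResQ's within-subspace random rotations and the quantizers themselves:

```python
import torch

def resq_split(calib_acts: torch.Tensor, keep_frac: float = 0.125):
    """Split activation space into a high-variance subspace kept in high
    precision and its complement quantized to 4-bit (illustrative)."""
    centered = calib_acts - calib_acts.mean(dim=0, keepdim=True)
    # Rows of Vh are principal directions, sorted by explained variance.
    _, _, Vh = torch.linalg.svd(centered, full_matrices=False)
    k = max(1, int(Vh.size(0) * keep_frac))
    return Vh[:k], Vh[k:]   # (high-precision basis, 4-bit basis)
```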
Poster
Patrick Leask · Neel Nanda · Noura Al Moubayed
[ East Exhibition Hall A-B ]
Abstract
Sparse Autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents; however, they have a substantial training cost, and SAEs learned on different models are not directly comparable. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activation models (ITDAs). ITDAs are constructed by greedily sampling activations into a dictionary based on an error threshold on their matching pursuit reconstruction. ITDAs can be trained in 1% of the time of SAEs, allowing us to cheaply train them on Llama-3.1 70B and 405B. ITDA dictionaries also enable cross-model comparisons, and outperform existing methods like CKA, SVCCA, and a relative representation method on a benchmark of representation similarity. Code available at https://github.com/pleask/itda.
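The greedy construction can be pictured as follows; the matching-pursuit depth and the relative-error test are illustrative assumptions about the paper's error threshold.

```python
import torch

def build_itda_dictionary(activations, threshold: float, n_iters: int = 8):
    """Greedy ITDA-style dictionary: add an activation when the current
    dictionary reconstructs it poorly under matching pursuit."""
    dictionary = []
    for x in activations:                       # x: (d,) activation vector
        if dictionary:
            D = torch.stack(dictionary)         # (m, d), unit-norm rows
            residual = x.clone()
            for _ in range(n_iters):            # matching pursuit
                coeffs = D @ residual
                j = coeffs.abs().argmax()
                residual = residual - coeffs[j] * D[j]
            if residual.norm() / x.norm() <= threshold:
                continue                        # well reconstructed: skip
        dictionary.append(x / x.norm())
    return torch.stack(dictionary)
```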
Poster
Simin Chen · Pranav Pusarla · Baishakhi Ray
[ East Exhibition Hall A-B ]
Abstract
The rapid advancement of code large language models (Code LLMs) underscores the critical need for effective and transparent benchmarking methods. However, current benchmarking predominantly relies on publicly available, human-created datasets. The widespread use of these static benchmark datasets makes the evaluation process particularly susceptible to data contamination—an unavoidable consequence of the extensive data collection processes employed during LLM training. Existing methods for addressing data contamination typically face significant limitations, including reliance on substantial human effort and difficulty in managing class imbalances. To overcome these challenges, we propose DyCodeEval, a novel benchmarking suite specifically designed to evaluate Code LLMs under realistic contamination scenarios. Given an initial seed programming problem, DyCodeEval utilizes multiple agents to systematically extract and modify contextual information without changing the core logic, generating semantically equivalent variations. We introduce a dynamic data generation method and conduct extensive empirical studies on two seed datasets involving 18 Code LLMs. The results demonstrate that DyCodeEval effectively assesses the reasoning capabilities of Code LLMs under contamination conditions while producing diverse problem variants, thereby ensuring robust and consistent benchmarking outcomes.
Poster
Peiyuan Zhang · Yongqi Chen · Runlong Su · Hangliang Ding · Ion Stoica · Zhengzhong Liu · Hao Zhang
[ East Exhibition Hall A-B ]
Abstract
Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 950 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over local spatial-temporal regions, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being \emph{hardware-efficient}. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79\% MFU -- 7.17× faster than prior art methods. On the leading video DiT model, Hunyuan, it accelerates attention by 1.6–10x over FlashAttention-3, yielding a 1.36–3.53× end-to-end speedup with no or minimal quality loss.
Poster
Yilun Zhou · Austin Xu · PeiFeng Wang · Caiming Xiong · Shafiq Joty
[ East Exhibition Hall A-B ]
Abstract
Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.
Poster
Lakshmi Nair · Ian Trase · J. Kim
[ East Exhibition Hall A-B ]
Abstract
We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). Flow-of-Options enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic framework developed for autonomously solving Machine Learning (ML) tasks. FoO enforces diversity in LLM solutions through compressed and interpretable task representations, resulting in improvements of 38.2% - 69.2% on standard data science tasks, and 37.4% - 47.9% on therapeutic chemistry tasks, as compared to state-of-the-art baselines. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Going beyond tabular classification and regression, we show the broader applicability of our FoO-based agentic system to tasks such as reinforcement learning and image generation. Our code is open-sourced at: https://github.com/flagshippioneering/Flow-of-Options.
Spotlight Poster
Xin Chen · Yarden As · Andreas Krause
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP (short for Safety Polytope), a geometric approach to LLM safety that learns and enforces multiple safety constraints directly in the model's representation space. We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post-hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs and reduce adversarial attack success rates while maintaining performance on standard tasks, highlighting the importance of having an explicit geometric model for safety. Analysis of the learned polytope facets reveals emergence of specialization in detecting different semantic notions of safety, providing interpretable insights into how safety is captured in LLMs' representation space.
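A minimal sketch of the polytope picture, assuming learned facets are available as half-space constraints: an activation is unsafe if it violates any facet, and steering projects it back. The random facets and the single greedy projection pass below are stand-ins, not the paper's learned geometry.

```python
# Minimal sketch of polytope-style safety checks in representation space.
# W, b are random stand-ins for learned facets (assumptions), and the
# projection is plain geometry rather than the paper's steering method.
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8                      # hidden size, number of facets (assumptions)
W = rng.normal(size=(k, d)); W /= np.linalg.norm(W, axis=1, keepdims=True)
b = np.ones(k)                    # safe region: {h : W @ h <= b}

def violated_facets(h):
    return np.where(W @ h > b)[0]

def steer(h):
    """Project h back inside each violated half-space (one greedy pass)."""
    for i in violated_facets(h):
        h = h - (W[i] @ h - b[i]) * W[i]   # move along the unit facet normal
    return h

h = rng.normal(size=d) * 3
print("violations before:", violated_facets(h).size,
      "after:", violated_facets(steer(h)).size)
```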
Poster
Linhao Luo · Zicheng Zhao · Reza Haffari · Yuan-Fang Li · Chen Gong · Shirui Pan
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) have demonstrated impressive reasoning abilities, but they still struggle with faithful reasoning due to knowledge gaps and hallucinations. To address these issues, knowledge graphs (KGs) have been utilized to enhance LLM reasoning through their structured knowledge. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this work, we introduce graph-constrained reasoning (GCR), a novel framework that bridges structured knowledge in KGs with unstructured reasoning in LLMs. To eliminate hallucinations, GCR ensures faithful KG-grounded reasoning by integrating KG structure into the LLM decoding process through KG-Trie, a trie-based index that encodes KG reasoning paths. KG-Trie constrains the decoding process, allowing LLMs to directly reason on graphs and generate faithful reasoning paths grounded in KGs. Additionally, GCR leverages a lightweight KG-specialized LLM for graph-constrained reasoning alongside a powerful general LLM for inductive reasoning over multiple reasoning paths, resulting in accurate reasoning with zero reasoning hallucination. Extensive experiments on several KGQA benchmarks demonstrate that GCR achieves state-of-the-art performance and exhibits strong zero-shot generalizability to unseen KGs without additional training.
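A minimal sketch of trie-constrained decoding in the spirit of KG-Trie: KG paths populate a trie, and at each step only tokens that extend a valid path keep their logits. The toy paths, vocabulary, and masking function are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of trie-constrained decoding. Toy paths stand in for
# tokenized KG triples (an assumption); a real system would build the trie
# over the LLM's token ids.
import math

def build_trie(paths):
    root = {}
    for path in paths:
        node = root
        for tok in path:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie, prefix):
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()           # prefix left the KG: nothing is allowed
        node = node[tok]
    return set(node)

def constrain(logits, allowed, vocab):
    """Mask out tokens that would step off every KG path."""
    return {t: (x if t in allowed else -math.inf) for t, x in zip(vocab, logits)}

trie = build_trie([("Tokyo", "country", "Japan"), ("Tokyo", "language", "Japanese")])
vocab, logits = ["country", "language", "Japan"], [0.2, 0.5, 0.9]
print(constrain(logits, allowed_next(trie, ("Tokyo",)), vocab))  # 'Japan' masked
```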
Poster
Qinsi Wang · Hancheng Ye · Ming-Yu Chung · Yudong Liu · Yueqian Lin · Martin Kuo · Mingyuan Ma · Jianyi Zhang · Yiran Chen
[ East Exhibition Hall A-B ]
Abstract
Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we found that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieved 5$\times$ FLOPs reduction and a 10$\times$ overall speedup. Code is released at [https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main](https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main).
Spotlight Poster
Vaishnavh Nagarajan · Chen Wu · Charles Ding · Aditi Raghunathan
[ East Exhibition Hall A-B ]
Abstract
We design a suite of minimal algorithmic tasks that are a loose abstraction of _open-ended_ real-world tasks. This allows us to cleanly and controllably quantify the creative limits of present-day language models. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended _stochastic_ planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic and memorizes excessively; multi-token approaches, namely teacherless training and diffusion models, comparatively excel in producing diverse and original output. Secondly, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (dubbed _seed-conditioning_) works surprisingly well, matching (and in some conditions, beating) temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and temperature sampling. We make part of the code available at https://github.com/chenwu98/algorithmic-creativity
Poster
Rylan Schaeffer · Hailey Schoelkopf · Brando Miranda · Gabriel Mukobi · Varun Madan · Adam Ibrahim · Herbie Bradley · Stella Biderman · Sanmi Koyejo
[ East Exhibition Hall A-B ]
Abstract
Predictable behavior from scaling advanced AI systems is an extremely desirable property for engineers, companies, economists and governments alike, and while a well-established literature exists on how pretraining performance scales, predictable scaling behavior on downstream capabilities remains elusive. While many factors are certainly responsible, this paper shines a light on a significant factor that makes predicting scaling behavior on widely used multiple-choice question answering benchmarks challenging and illuminates a path towards making such downstream evaluations predictable with scale. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrades the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for \textit{incorrect} choices might be achievable. Our work also …
Poster
Yue Liu · Xiaoxin He · Miao Xiong · Jinlan Fu · Shumin Deng · YINGWEI MA · Jiaheng Zhang · Bryan Hooi
[ East Exhibition Hall A-B ]
Abstract
This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, drawing on their autoregressive nature, we reveal that LLMs tend to understand text from left to right and struggle to comprehend it when perturbations are added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing a left-side perturbation merely based on the prompt itself, then generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task and then develop 4 variants to guide LLMs to understand and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves $\sim$78.97\% attack success rate across 8 LLMs on average and $\sim$98\% bypass rate against 5 guard models on average.
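A minimal sketch of flipping transformations of the kind the abstract describes; the exact mode definitions here are assumptions inferred from the description, shown only to illustrate left-side perturbation by flipping.

```python
# Minimal sketch of text-flipping transformations (mode definitions are
# assumptions based on the abstract, not the paper's exact four modes).
def flip_chars_in_words(s):        # reverse characters inside each word
    return " ".join(w[::-1] for w in s.split())

def flip_word_order(s):            # reverse word order
    return " ".join(reversed(s.split()))

def flip_whole_sentence(s):        # reverse the entire character sequence
    return s[::-1]

prompt = "an example prompt"
for mode in (flip_chars_in_words, flip_word_order, flip_whole_sentence):
    print(mode.__name__, "->", mode(prompt))
```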
Spotlight Poster
John Schultz · Jakub Adamek · Matej Jusup · Marc Lanctot · Michael Kaisers · Sarah Perrin · Daniel Hennes · Jeremy Shar · Cannada Lewis · Anian Ruoss · Tom Zahavy · Petar Veličković · Laurel Prince · Satinder Singh · Eric Malmi · Nenad Tomasev
[ East Exhibition Hall A-B ]
Abstract
Advancing planning and reasoning capabilities of Large Language Models (LLMs) is one of the key prerequisites towards unlocking their potential for performing reliably in complex and impactful domains. In this paper, we aim to demonstrate this across board games (Chess, Fischer Random / Chess960, Connect Four, and Hex), and we show that search-based planning can yield significant improvements in LLM game-playing strength. We introduce, compare and contrast two major approaches: In *external search*, the model guides Monte Carlo Tree Search (MCTS) rollouts and evaluations without calls to an external game engine, and in *internal search*, the model is trained to generate in-context a linearized tree of search and a resulting final choice. Both build on a language model pre-trained on relevant domain knowledge, reliably capturing the transition and value functions in the respective environments, with minimal hallucinations. We evaluate our LLM search implementations against game-specific state-of-the-art engines, showcasing substantial improvements in strength over the base model, and reaching Grandmaster-level performance in chess while operating closer to the human search budget. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.
Poster
Aman Gupta · Shao Tang · Qingquan Song · Sirou Zhu · Jiwoo Hong · Ankan Saha · Viral Gupta · Noah Lee · Eunki Kim · Siyu Zhu · Parag Agrawal · Natesh Pillai · Sathiya Keerthi
[ East Exhibition Hall A-B ]
Abstract
Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Some popular examples of DAAs include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon by which the probabilities of preferred responses are often reduced undesirably. In this paper, we argue that, for DAAs, the shape of the reward function matters. We introduce \textbf{AlphaPO}, a new DAA method that leverages an $\alpha$-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7\% to 10\% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B while achieving 15\% to 50\% relative improvement over DPO on the same models. The analysis and results presented highlight the importance of the reward shape and how one …
Poster
Zihan Chen · Song Wang · Zhen Tan · Jundong Li · Cong Shen
[ East Exhibition Hall A-B ]
Abstract
In-Context Learning (ICL) empowers Large Language Models (LLMs) to tackle diverse tasks by incorporating multiple input-output examples, known as demonstrations, into the input of LLMs. More recently, advancements in the expanded context windows of LLMs have led to many-shot ICL, which uses hundreds of demonstrations and outperforms few-shot ICL, which relies on fewer examples. However, this approach is often hindered by the high cost of obtaining large amounts of labeled data. To address this challenge, we propose **M**any-Shot **A**daptive **P**seudo-**L**ab**E**ling, namely **MAPLE**, a novel influence-based many-shot ICL framework that utilizes pseudo-labeled samples to compensate for the lack of label information. We first identify a subset of impactful unlabeled samples and perform pseudo-labeling on them by querying LLMs. These pseudo-labeled samples are then adaptively selected and tailored to each test query as input to improve the performance of many-shot ICL, without significant labeling costs. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework, showcasing its ability to enhance LLM adaptability and performance with limited labeled data. Our code is provided at https://github.com/Chen-1031/MAPLE_ICL.
Poster
Karina Zainullina · Aleksandr Golubev · Maria Trofimova · Sergei Polezhaev · Ibragim Badertdinov · Daria Litvintseva · Simon Karasik · Filipp Fisin · Sergei Skvortsov · Maksim Nekrashevich · Anton Shevtsov · Boris Yangel
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) have recently achieved remarkable results in complex multi-step tasks, such as mathematical reasoning and agentic software engineering. However, they often struggle to maintain consistent performance across multiple solution attempts. One effective approach to narrow the gap between average-case and best-case performance is guided test-time search, which explores multiple solution paths to identify the most promising one. Unfortunately, effective search techniques (e.g. MCTS) are often unsuitable for *non-serializable* RL environments, such as Docker containers, where intermediate environment states cannot be easily saved and restored. We investigate two complementary search strategies applicable to such environments: 1-step lookahead and trajectory selection, both guided by a learned action-value function estimator. On the SWE-bench Verified benchmark, a key testbed for agentic software engineering, we find these methods to double the average success rate of a fine-tuned Qwen-72B model, achieving $40.8$\%, the new state-of-the-art for open-weights models. Additionally, we show that these techniques are transferable to more advanced closed models, yielding similar improvements with GPT-4o.
Poster
Kaixuan Huang · Jiacheng Guo · Zihao Li · Xiang Ji · Jiawei Ge · Wenzhe Li · Yingqing Guo · Tianle Cai · Hui Yuan · Runzhe Wang · Yue Wu · Ming Yin · Shange Tang · Yangsibo Huang · Chi Jin · Xinyun Chen · Chiyuan Zhang · Mengdi Wang
[ East Exhibition Hall A-B ]
Abstract
Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To investigate this question, prior work has constructed mathematical benchmarks in which questions undergo simple perturbations -- modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models. The project is available at https://math-perturb.github.io/.
Poster
Alexander Wettig · Kyle Lo · Sewon Min · Hannaneh Hajishirzi · Danqi Chen · Luca Soldaini
[ East Exhibition Hall A-B ]
Abstract
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
Poster
Jingyue Gao · Runji Lin · Keming Lu · Bowen Yu · Junyang Lin · Jianyu Chen
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) exhibit strong potential in mathematical reasoning, yet their effectiveness is often limited by a shortage of high-quality queries. This limitation necessitates scaling up computational responses through self-generated data, yet current methods struggle due to spuriously correlated data caused by ineffective exploration across all reasoning stages. To address this challenge, we introduce **MARGE**: Improving **Ma**th **R**easoning with **G**uided **E**xploration, a novel method that enhances mathematical reasoning through hit-guided exploration. MARGE systematically explores intermediate reasoning states derived from self-generated solutions, enabling adequate exploration and improved credit assignment throughout the reasoning process. Notably, MARGE improves both single-shot accuracy and exploration diversity, mitigating a common trade-off in alignment methods. These results demonstrate MARGE's effectiveness in enhancing mathematical reasoning capabilities and unlocking the potential of scaling self-generated training data.
Poster
Hee Suk Yoon · Eunseop Yoon · Mark Hasegawa-Johnson · Sungwoong Kim · Chang Yoo
[ East Exhibition Hall A-B ]
Abstract
We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy's confidence, without requiring any auxiliary models or compute. Unlike prior Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO), which uniformly adjust all token probabilities regardless of their relevance to preference, ConfPO focuses optimization on the most impactful tokens. This targeted approach improves alignment quality while mitigating overoptimization (i.e., reward hacking) by using the KL divergence budget more efficiently. In contrast to recent token-level methods that rely on credit-assignment models or AI annotators, raising concerns about scalability and reliability, ConfPO is simple, lightweight, and model-free. Experimental results on challenging alignment benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO consistently outperforms uniform DAAs across various LLMs, delivering better alignment with zero additional computational overhead.
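A minimal sketch of confidence-gated token selection in a DPO-style objective, assuming the rule is to keep only tokens where the policy is not yet confident; reference-model terms are omitted for brevity, so this is not ConfPO's exact loss.

```python
# Minimal sketch of confidence-gated token selection for a DPO-style loss.
# Shapes and the selection rule (keep only low-confidence tokens) are
# assumptions inferred from the abstract, not the paper's exact criterion.
import torch
import torch.nn.functional as F

def selected_logprob(logits, tokens, threshold=0.7):
    """Sum log-probs only over tokens where the policy is not yet confident."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)   # (B, T)
    keep = (tok_logp.exp() <= threshold).float()                   # gate tokens
    return (tok_logp * keep).sum(dim=-1)

B, T, V = 2, 8, 100
logits_w, logits_l = torch.randn(B, T, V), torch.randn(B, T, V)
y_w, y_l = torch.randint(V, (B, T)), torch.randint(V, (B, T))
beta = 0.1
margin = selected_logprob(logits_w, y_w) - selected_logprob(logits_l, y_l)
loss = -F.logsigmoid(beta * margin).mean()   # reference terms omitted for brevity
print(loss.item())
```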
Poster
Wonsuk Jang · Thierry Tambe
[ East Exhibition Hall A-B ]
Abstract
The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
Poster
Vignav Ramesh · Kenneth Li
[ East Exhibition Hall A-B ]
Abstract
Communication between multiple language model (LM) agents has been shown to scale up the reasoning ability of LMs. While natural language has been the dominant medium for inter-LM communication, it is not obvious this should be the standard: not only does natural language communication incur high inference costs that scale quickly with the number of both agents and messages, but also the decoding process abstracts away too much rich information that could be otherwise accessed from the internal activations. In this work, we propose a simple technique whereby LMs communicate via *activations*; concretely, we pause an LM $B$'s computation at an intermediate layer, combine its current activation with another LM $A$'s intermediate activation via some function $f$, then pass $f$'s output into the next layer of $B$ and continue the forward pass till decoding is complete. This approach scales up LMs on new tasks with *zero* additional parameters and data, and saves a *substantial amount of compute* over natural language communication. We test our method with various functional forms $f$ on two experimental setups—multi-player coordination games and reasoning benchmarks—and find that it achieves up to $27$% improvement over natural language communication across datasets with $<$$1/4$ the compute, illustrating the …
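A minimal sketch of the mechanism with toy layer stacks: run A up to layer `ell`, pause B at the same depth, combine the two activations with a simple f (a mean here, one of several choices the setup allows), and resume B's forward pass. The models, the combine function, and the layer index are illustrative assumptions.

```python
# Minimal sketch of activation-level communication between two toy "LMs"
# represented as stacks of layers. f = mean is one simple combine choice
# (an assumption); real models would be transformers with hooks.
import torch
import torch.nn as nn

d = 32
make_lm = lambda: nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.ReLU())
                                 for _ in range(4)])
lm_a, lm_b = make_lm(), make_lm()

def run(layers, x, upto=None):
    for layer in layers[:upto]:
        x = layer(x)
    return x

x = torch.randn(1, d)
ell = 2                               # communication layer (assumption)
h_a = run(lm_a, x, upto=ell)          # A's intermediate activation
h_b = run(lm_b, x, upto=ell)          # pause B at the same depth
h = (h_a + h_b) / 2                   # f: combine the two activations
out = run(lm_b[ell:], h)              # resume B's forward pass to the end
print(out.shape)
```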
Poster
Zhepei Wei · Wei-Lin Chen · Xinyu Zhu · Yu Meng
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential token generation process, where each token must be generated before the next can be processed. This sequential dependency restricts the ability to fully leverage modern hardware’s parallel processing capabilities. Existing methods like speculative decoding and layer skipping offer potential speedups but have notable drawbacks: speculative decoding relies on an auxiliary “drafter” model, which can be challenging to acquire and increases memory overhead, while layer skipping may introduce discrepancies in the generated outputs due to the missing key-value cache at skipped layers. In this work, we propose AdaDecode, which accelerates LLM decoding without requiring auxiliary models or changes to the original model parameters, while ensuring output consistency. AdaDecode leverages the insight that many tokens—particularly simple or highly-predictable ones—can be accurately generated at intermediate layers, as further layers often do not significantly alter predictions once the model reaches a certain confidence. By adaptively generating tokens at intermediate layers when confidence is high, AdaDecode enables the next token’s computation to begin immediately. The remaining layer computations for early-predicted tokens are deferred and executed in …
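A minimal sketch of confidence-based early prediction at intermediate layers, using a toy layer stack and a shared head; the threshold and the idea of deferring the remaining layers follow the abstract, while AdaDecode's verification step that guarantees output consistency is omitted.

```python
# Minimal sketch of early prediction at intermediate layers. The toy "model"
# is a layer list plus a shared head (assumptions); deferred layers are
# returned rather than executed, and verification is omitted.
import torch
import torch.nn as nn

d, V = 32, 50
layers = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.Tanh()) for _ in range(6)])
head = nn.Linear(d, V)

def adaptive_predict(h, threshold=0.9):
    deferred = []
    for i, layer in enumerate(layers):
        h = layer(h)
        probs = head(h).softmax(-1)
        conf, tok = probs.max(-1)
        if conf.item() >= threshold:           # confident: emit the token early
            deferred = list(layers[i + 1:])    # and defer the remaining layers
            return tok.item(), i, deferred
    return tok.item(), len(layers) - 1, deferred  # full depth was needed

tok, exit_layer, deferred = adaptive_predict(torch.randn(1, d))
print(f"token {tok} emitted at layer {exit_layer}, {len(deferred)} layers deferred")
```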
Poster
XUCHEN · Yuxuan Yue · Zukang Xu · Xing Hu · JiangyongYu · Zhixuan Chen · Sifan Zhou · Zhihang Yuan · Dawei Yang
[ East Exhibition Hall A-B ]
Abstract
RWKV is a modern RNN architecture with performance comparable to Transformers, but it still faces challenges when deployed to resource-constrained devices. Post-Training Quantization (PTQ), an essential technique to reduce model size and inference latency, has been widely used in Transformer models. However, it suffers significant performance degradation when applied to RWKV. This paper investigates and identifies two key constraints inherent in the properties of RWKV: (1) Non-linear operators hinder the parameter-fusion of both smooth- and rotation-based quantization, introducing extra computation overhead. (2) The larger amount of uniformly distributed weights poses challenges for cluster-based quantization, leading to reduced accuracy. To this end, we propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques: (1) a coarse-to-fine proxy capable of adaptively selecting different quantization approaches by assessing the uniformity and identifying outliers in the weights, and (2) a codebook optimization algorithm that enhances the performance of cluster-based quantization methods for element-wise multiplication in RWKV. Experiments show that RWKVQuant can quantize RWKV-6-14B into about 3-bit with less than 1\% accuracy loss and 2.14$\times$ speed up.
Poster
Yifan Sun · Han Wang · Dongbai Li · Gang Wang · Huan Zhang
[ East Exhibition Hall A-B ]
Abstract
Benchmark Data Contamination (BDC)—the inclusion of benchmark testing samples in the training set—has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics—*fidelity* and *contamination resistance*—to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing *question-level* evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy effectively balances fidelity and contamination resistance. No semantic-preserving strategy yields a significant improvement in resistance over the vanilla case (*i.e.*, no benchmark update) across *all* benchmarks, while semantic-altering strategies sacrifice fidelity for resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies. Our code repository …
Poster
Jiwoo Hong · Noah Lee · Eunki Kim · Guijin Son · Woojin Chung · Aman Gupta · Shao Tang · James Thorne
[ East Exhibition Hall A-B ]
Abstract
The Bradley-Terry (BT) model is widely used in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with BT model loss as one-way classifiers are prone to over-optimization, losing generalizability to unseen inputs. In this paper, we study the cause of over-optimization and its downstream effects on the RLHF procedure, highlighting the importance of robustness in RMs. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Correspondingly, we propose batch-wise sum-to-zero regularization (BSR) that enforces the reward sum for each batch to be zero-centered, constraining the rewards with abnormally large magnitudes. We assess the impact of BSR in improving robustness in RMs through four scenarios of over-optimization, where BSR consistently manifests better robustness on unseen inputs. Then, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, which surpasses state-of-the-art RMs in the 8B scale by adding more than 5\% in complex preference prediction tasks. By conducting RLOO training with 8B RMs, AlpacaEval 2.0, with reducing generation length by 40\% while …
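A minimal sketch of a BT reward-model loss with a batch-wise sum-to-zero penalty; the exact penalty form (squared, normalized batch sum) is an assumption consistent with "zero-centered" reward sums, not necessarily the paper's formulation.

```python
# Minimal sketch of a Bradley-Terry loss plus a batch-wise sum-to-zero
# penalty. The squared-batch-sum form and lam are assumptions.
import torch
import torch.nn.functional as F

def bt_loss_with_bsr(r_chosen, r_rejected, lam=0.1):
    bt = -F.logsigmoid(r_chosen - r_rejected).mean()       # standard BT loss
    batch_sum = (r_chosen + r_rejected).sum()              # all rewards in batch
    return bt + lam * batch_sum.pow(2) / r_chosen.numel()  # push sums to zero

r_c = torch.randn(16, requires_grad=True)
r_r = torch.randn(16)
loss = bt_loss_with_bsr(r_c, r_r)
loss.backward()
print(loss.item())
```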
Poster
Ruhan Wang · Zhiyong Wang · Chengkai Huang · Rui Wang · Tong Yu · Lina Yao · John C. S. Lui · Dongruo Zhou
[ East Exhibition Hall A-B ]
Abstract
For question-answering (QA) tasks, in-context learning (ICL) enables language models (LMs) to generate responses without modifying their parameters by leveraging examples provided in the input. However, the effectiveness of ICL heavily depends on the availability of high-quality examples, which are often scarce due to data privacy constraints, annotation costs, and distribution disparities. A natural solution is to utilize examples stored on client devices, but existing approaches either require transmitting model parameters—incurring significant communication overhead—or fail to fully exploit local datasets, limiting their effectiveness. To address these challenges, we propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process. Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters. We establish theoretical guarantees for the convergence of Fed-ICL and conduct extensive experiments on standard QA benchmarks, demonstrating that our proposed approach achieves strong performance while maintaining low communication costs.
Poster
Alexandre Verine · Florian Le Bronnec · Kunhao Zheng · Alexandre Allauzen · yann CHEVALEYRE · benjamin negrevergne
[ East Exhibition Hall A-B ]
Abstract
Increasing diversity in language models is a challenging yet essential objective. A common approach is to raise the decoding temperature. In this work, we investigate this approach through a simplistic yet common case to provide insights into why decreasing temperature can improve quality (Precision), while increasing it often fails to boost coverage (Recall). Our analysis reveals that for a model to be effectively tunable through temperature adjustments, it must be trained toward coverage. To address this, we propose rethinking loss functions in language models by leveraging the Precision-Recall framework. Our results demonstrate that this approach achieves a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. These findings offer a pathway toward more versatile and robust language modeling techniques.
Poster
Chaochen Gao · Xing W · Zijia Lin · Debing Zhang · Songlin Hu
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) with extended context windows have made significant strides, yet training them remains a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models trained on non-synthetic long documents. These findings highlight NExtLong's ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
Poster
Wei Huang · Haotong Qin · Yangdong Liu · Yawei Li · Qinshuo Liu · Xianglong Liu · Luca Benini · Michele Magno · Shiming Zhang · XIAOJUAN QI
[ East Exhibition Hall A-B ]
Abstract
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise with high accuracy. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience, retaining essential information. With its structured group-wise partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while significantly improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM.
Poster
Seongho Son · William Bankes · Sayak Ray Chowdhury · Brooks Paige · Ilija Bogunovic
[ East Exhibition Hall A-B ]
Abstract
Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose **Non-Stationary Direct Preference Optimisation (NS-DPO)** that models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO proposes a computationally efficient solution by introducing only a single discount parameter in the loss function, which is used for exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. Using scenarios where various levels of preference drift are introduced, with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
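A minimal sketch of the single-discount-parameter weighting, assuming per-pair DPO-style margins and integer collection times: older datapoints are exponentially down-weighted by gamma.

```python
# Minimal sketch of exponential time-weighting in a DPO-style loss.
# `margin` stands in for per-pair policy-vs-reference log-ratio margins
# (an assumption about shapes, not the paper's full loss).
import torch
import torch.nn.functional as F

def ns_dpo_loss(margin, timestamps, gamma=0.95, beta=0.1):
    """margin: (B,) pairwise margins; timestamps: larger = more recent."""
    weights = gamma ** (timestamps.max() - timestamps).float()  # discount old data
    per_pair = -F.logsigmoid(beta * margin)
    return (weights * per_pair).sum() / weights.sum()

margin = torch.randn(8, requires_grad=True)
t = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7])
print(ns_dpo_loss(margin, t).item())
```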
Poster
Mohamad Chehade · Soumya Suvra Ghosal · Souradip Chakraborty · Avinash Reddy · Dinesh Manocha · Hao Zhu · Amrit Singh Bedi
[ East Exhibition Hall A-B ]
Abstract
Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision-making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
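A minimal sketch of satisficing selection at inference time: maximize a primary reward over candidates that clear a threshold on a secondary one. Candidates and reward functions are toy stand-ins, and SITAlign's decoding-level machinery is not reproduced.

```python
# Minimal sketch of threshold-constrained (satisficing) selection.
# Candidate responses and reward scores are toy stand-ins (assumptions).
def satisficing_select(candidates, primary, secondary, threshold):
    feasible = [c for c in candidates if secondary(c) >= threshold]
    pool = feasible or candidates              # fall back if nothing satisfices
    return max(pool, key=primary)

candidates = ["resp_a", "resp_b", "resp_c"]
helpfulness = {"resp_a": 0.9, "resp_b": 0.7, "resp_c": 0.5}
harmlessness = {"resp_a": 0.2, "resp_b": 0.8, "resp_c": 0.95}
best = satisficing_select(candidates, helpfulness.get, harmlessness.get, 0.6)
print(best)  # resp_b: most helpful among candidates that clear the floor
```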
Poster
Kumar Ashutosh · Yossi Gandelsman · Xinlei Chen · Ishan Misra · Rohit Girdhar
[ East Exhibition Hall A-B ]
Abstract
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbuing multimodal capabilities into your favorite LLM. Leveraging LLMs' innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
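A minimal sketch of the generate-score-feedback loop; `generate` and `score` here are toy placeholders, where MILS would use an LLM to propose candidates and an off-the-shelf scorer (e.g., a CLIP-style model) to grade them.

```python
# Minimal sketch of an iterative generate-score-feedback loop.
# The generator and scorer below are toy stand-ins (assumptions).
import random

def mils_loop(generate, score, rounds=5, k=4):
    feedback, best = [], None
    for _ in range(rounds):
        candidates = generate(feedback, k)
        scored = sorted(((score(c), c) for c in candidates), reverse=True)
        best = max(best or scored[0], scored[0])
        feedback = scored            # scores flow back into the next prompt
    return best

random.seed(0)
generate = lambda fb, k: [f"candidate-{random.randint(0, 99)}" for _ in range(k)]
score = lambda c: int(c.split("-")[1]) / 99   # toy scorer
print(mils_loop(generate, score))
```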
Spotlight Poster
Yi Xu · Laura Ruis · Tim Rocktäschel · Robert Kirk
[ East Exhibition Hall A-B ]
Abstract
Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0\% $\rightarrow$ 96.4\% and 82.1\% $\rightarrow$ 86.3\% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
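As a sketch of the aggregation step, Bradley-Terry strengths can be fit from a round-robin win matrix with the classic iterative (MM/Zermelo) update; the toy win counts below are assumptions.

```python
# Minimal sketch of fitting Bradley-Terry strengths from round-robin win
# counts; ranking models by these strengths is the aggregation the abstract
# advocates over raw pairwise comparisons.
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i, j] = number of times model i beat model j."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / np.maximum(denom, 1e-12)   # MM update
        p /= p.sum()                                      # fix the scale
    return p

wins = np.array([[0, 7, 9], [3, 0, 6], [1, 4, 0]])  # toy 3-model tournament
strengths = bradley_terry(wins)
print(np.argsort(-strengths))                        # ranking: strongest first
```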
Poster
Stéphane Aroca-Ouellette · Natalie Mackraz · Barry-John Theobald · Katherine Metcalf
[ East Exhibition Hall A-B ]
Abstract
Accommodating human preferences is essential for creating aligned LLM agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs acting as writing agents to infer a description of user preferences. Agent alignment then comes from conditioning on the inferred preference description. However, existing methods often produce generic preference descriptions that fail to capture the unique and individualized nature of human preferences. This paper introduces PROSE, a method designed to enhance the precision of preference descriptions inferred from user writing samples. PROSE incorporates two key elements: (1) iterative refinement of inferred preferences, and (2) verification of inferred preferences across multiple user writing samples. We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT-mini, and GPT-4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent's generations over CIPHER (a state-of-the-art method for inferring preferences) by 33\%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9\% improvement over ICL alone. Code: https://github.com/apple/ml-predict
Poster
Xin Zou · Yizhou WANG · Yibo Yan · Yuanhuiyi Lyu · Kening Zheng · Sirui Huang · Junkai Chen · Peijie Jiang · Jia Liu · Chang Tang · Xuming Hu
[ East Exhibition Hall A-B ]
Abstract
Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are prone to hallucinations, i.e., generated content that is nonsensical or unfaithful to input sources. Unlike in LLMs, hallucinations in MLLMs often stem from the text decoder's sensitivity to visual tokens, leading to a phenomenon akin to "amnesia" about visual information. To address this issue, we propose MemVR, a novel decoding paradigm inspired by common cognition: when the memory of an image seen a moment before fades, people look at it again for factual answers. Following this principle, we treat visual tokens as supplementary evidence, re-injecting them into the MLLM through the Feed Forward Network (FFN) as “key-value memory” at the middle trigger layer. This look-twice mechanism occurs when the model exhibits high uncertainty during inference, effectively enhancing factual alignment. Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucination across various MLLMs and excels in general benchmarks without incurring additional time overhead.
Poster
Quan Wei · Chung-Yiu Yau · Hoi To Wai · Yang Zhao · Dongyeop Kang · Youngsuk Park · Mingyi Hong
[ East Exhibition Hall A-B ]
Abstract
Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has been recently studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines would first fine-tune the pre-trained models, followed by post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least square quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia, Qwen and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performances across various tasks and different LLM architectures. Our code is available at https://github.com/OptimAI-Lab/RoSTE.
Poster
Changdae Oh · zhen fang · Shawn Im · Xuefeng Du · Sharon Li
[ East Exhibition Hall A-B ]
Abstract
Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.
Poster
Louis Béthune · David Grangier · Dan Busbridge · Eleonora Gualdoni · Marco Cuturi · Pierre Ablin
[ East Exhibition Hall A-B ]
Abstract
A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain.Finetuning presents two challenges: \textit{(i)} if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and \textit{(ii)} the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it.Our goal is to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales.We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting.A key practical takeaway from our study is that injecting as little as $1\%$ of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
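A minimal sketch of the takeaway, assuming token-level streams can be mixed by source: sample from the pretraining set with small probability (here 1%) while otherwise drawing finetuning data.

```python
# Minimal sketch of injecting a small fraction of pretraining data into a
# finetuning stream. The datasets are toy lists (assumptions); real use
# would stream tokenized shards.
import random

def mixed_stream(finetune_data, pretrain_data, p_pretrain=0.01, seed=0):
    rng = random.Random(seed)
    while True:
        source = pretrain_data if rng.random() < p_pretrain else finetune_data
        yield rng.choice(source)

stream = mixed_stream(["ft_ex"] * 10, ["pt_ex"] * 10)
batch = [next(stream) for _ in range(1000)]
print(batch.count("pt_ex"), "pretraining examples out of 1000")  # roughly 10
```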
Poster
Xiangkun Hu · Lemin Kong · Tong He · David Wipf
[ East Exhibition Hall A-B ]
Abstract
The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a separate reward model is independently learned and then later applied to LLM policy updates, ongoing research effort has targeted more straightforward alternatives. In this regard, direct preference optimization (DPO) and its many offshoots circumvent the need for a separate reward training step. Instead, through the judicious use of a reparameterization trick that induces an implicit reward, DPO and related methods consolidate learning to the minimization of a single loss function. And yet despite demonstrable success in some real-world settings, we prove that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive interpolation behaviors, underappreciated artifacts of the reparameterizations upon which they are based. To this end, we introduce an explicit preference optimization framework termed EXPO that requires no analogous reparameterization to achieve an implicit reward. Quite differently, we merely posit intuitively-appealing regularization factors from scratch that transparently avoid the potential pitfalls of key DPO variants, provably satisfying regularization desiderata that prior methods do not. Empirical results serve to corroborate our analyses and showcase the efficacy …
Poster
Sean I. Young
[ East Exhibition Hall A-B ]
Abstract
In recent years, the compression of large language models (LLMs) has emerged as a key problem in facilitating LLM deployment on resource-limited devices, reducing compute costs, and mitigating the environmental footprint due to large-scale AI infrastructure. Here, we establish the foundations of LLM quantization from a rate–distortion theory perspective and propose a quantization technique based on simple rate–distortion optimization. Our technique scales to models containing hundreds of billions of weight parameters and offers users the flexibility to compress models, post-training, to a model size or accuracy specified by the user.
Poster
Bairu Hou · Qibin Chen · Jianyu Wang · Guoli Yin · Chong Wang · Nan Du · Ruoming Pang · Shiyu Chang · Tao Lei
[ East Exhibition Hall A-B ]
Abstract
With the rapid scaling of large language models (LLMs), structured pruning has become a widely used technique to learn efficient, smaller models from larger ones, delivering superior performance compared to training similarly sized models from scratch. In this paper, we move beyond the traditional static pruning approach of determining a fixed pruning mask for a model, and propose a dynamic approach to structured pruning. In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction. Our approach, termed "instruction-following pruning'', introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task. To identify and activate effective parameters, we jointly optimize the sparse mask predictor and the LLM, leveraging both instruction-following data and the pre-training corpus. Experimental results demonstrate the effectiveness of our approach on a wide range of evaluation benchmarks. For example, our 3B activated model improves over the 3B dense model by 5-8 points of absolute margin on domains such as math and coding, and rivals the performance of a 9B model.
Poster
Shiqi Chen · Tongyao Zhu · Ruochen Zhou · Jinghan Zhang · Siyang Gao · Juan Carlos Niebles · Mor Geva · Junxian He · Jiajun Wu · Manling Li
[ East Exhibition Hall A-B ]
Abstract
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing “under” or “behind” relationships between only two objects, pose significant challenges for current VLMs. We believe it is crucial to use the lens of mechanistic interpretability, opening up the model and diving into its internal states to examine the interactions between image and text tokens during spatial reasoning. Our analysis of attention behaviors reveals significant differences in how VLMs allocate attention to image versus text. By tracing the areas of images that receive the highest attention scores throughout intermediate layers, we observe a notable pattern: errors often coincide with attention being misdirected towards irrelevant objects within the image. Moreover, such attention patterns exhibit substantial differences between familiar (e.g., “on the left side of”) and unfamiliar (e.g., “in front of”) spatial relationships. Motivated by these findings, we propose ADAPTVIS based on inference-time confidence scores to sharpen the attention on highly relevant regions when the model exhibits high confidence, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50 absolute point improvement) …
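A minimal sketch of confidence-adaptive attention temperatures: sharpen attention when confidence is high, smooth and broaden it otherwise. The two temperatures and the scalar confidence gate are illustrative assumptions rather than ADAPTVIS's exact schedule.

```python
# Minimal sketch of confidence-adaptive attention temperature. The gate,
# thresholds, and temperatures are illustrative assumptions.
import torch

def adaptive_attention(scores, confidence, conf_threshold=0.5,
                       t_sharp=0.5, t_smooth=2.0):
    """scores: raw attention logits (..., seq); confidence: scalar in [0, 1]."""
    temp = t_sharp if confidence >= conf_threshold else t_smooth
    return torch.softmax(scores / temp, dim=-1)

scores = torch.randn(1, 8)
sharp = adaptive_attention(scores, confidence=0.9)    # concentrates attention
smooth = adaptive_attention(scores, confidence=0.2)   # broadens the window
print(sharp.max().item() > smooth.max().item())       # sharpened peaks higher
```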
Poster
Konstantinos Vergopoulos · Mark Müller · Martin Vechev
[ East Exhibition Hall A-B ]
Abstract
Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench -- a benchmark that challenges code agents to generate patches addressing GitHub issues given the full repository as context. The correctness of generated patches is then evaluated by executing a human-written test suite extracted from the repository after the issue's resolution. However, constructing benchmarks like SWE-Bench requires substantial manual effort to set up historically accurate execution environments for testing. Crucially, this severely limits the number of considered repositories, e.g., just 12 for SWE-Bench. Considering so few repositories, selected for their popularity, risks a distributional mismatch, i.e., the measured performance may not be representative of real-world scenarios and may misguide development efforts. In this work, we address this challenge and introduce SetUpAgent, a fully automated system capable of historically accurate dependency setup, test execution, and result parsing. Using SetUpAgent, we generate two new datasets: (i) SWEE-Bench an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench a benchmark focusing on applications rather than libraries. Comparing these datasets to …
Poster
Saaketh Narayan · Abhay Gupta · Mansheej Paul · Davis Blalock
[ East Exhibition Hall A-B ]
Abstract
Large language model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently possible to train in FP8 only if one is willing to tune various hyperparameters, reduce model scale, or accept the overhead of computing dynamic scale factors. We demonstrate simple, scalable FP8 training that requires no dynamic scaling factors or special hyperparameters, even at large model sizes. Our method, \textit{µnit Scaling (µS)}, also enables simple hyperparameter transfer across model widths, matched numerics across training and inference, and other desirable properties. µnit Scaling is straightforward to implement, consisting of a set of minimal interventions based on a first-principles analysis of transformer operations. We validate our method by training models with parameters ranging from 1B to 13B, performing all hidden linear layer computations in FP8. We achieve quality equal to higher-precision baselines while also training up to 33% faster.
Poster
Zican Hu · Wei Liu · Xiaoye Qu · Xiangyu Yue · Chunlin Chen · Zhi Wang · Yu Cheng
[ East Exhibition Hall A-B ]
Abstract
While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework **GLIDER** (**G**rounding **L**anguage Models as Eff**I**cient **D**ecision-Making Agents via Offline Hi**E**rarchical **R**einforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
Poster
Jiawei Zhang · Shuang Yang · Bo Li
[ East Exhibition Hall A-B ]
Abstract
Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements amplify the risks of adversarial attacks, especially when agents can access sensitive external functionalities. Nevertheless, manipulating LLM agents into performing targeted malicious actions or invoking specific tools remains challenging, as these agents extensively reason or plan before executing final actions. In this work, we present UDora, a unified red teaming framework designed for LLM agents that dynamically hijacks the agent's reasoning processes to compel malicious behavior. Specifically, UDora first generates the model’s reasoning trace for the given task, then automatically identifies optimal points within this trace to insert targeted perturbations. The resulting perturbed reasoning is then used as a surrogate response for optimization. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets. The code is available at https://github.com/AI-secure/UDora.
Poster
Ian Magnusson · Tai Nguyen · Ben Bogin · David Heineman · Jena Hwang · Luca Soldaini · Akshita Bhagia · Jiacheng Liu · Dirk Groeneveld · Oyvind Tafjord · Noah Smith · Pang Wei Koh · Jesse Dodge
[ East Exhibition Hall A-B ]
Abstract
Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide—the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (∼ 80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval > 80% predictable at the target 1B scale with just 0.01% of the compute.
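A minimal sketch of the single-scale decision rule described above, with hypothetical corpus names and scores: rank candidate corpora by small-scale benchmark accuracy and count how many pairwise comparisons agree with the large-scale ranking.

```python
from itertools import combinations

# Hypothetical benchmark accuracies per candidate pretraining corpus.
small = {"c4": 0.41, "dclm": 0.47, "fineweb": 0.45, "dolma": 0.43}  # 150M
large = {"c4": 0.52, "dclm": 0.61, "fineweb": 0.58, "dolma": 0.55}  # 1B

def decision_accuracy(small_scores, large_scores):
    """Fraction of pairwise data decisions the small scale gets right."""
    correct = total = 0
    for a, b in combinations(small_scores, 2):
        correct += (small_scores[a] > small_scores[b]) == \
                   (large_scores[a] > large_scores[b])
        total += 1
    return correct / total

print(decision_accuracy(small, large))  # 1.0 on this toy example
```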
Poster
Xinyue Zeng · Haohui Wang · Junhong Lin · Jun Wu · Tyler Cody · Dawei Zhou
[ East Exhibition Hall A-B ]
Abstract
The proliferation of open-sourced Large Language Models (LLMs) and diverse downstream tasks necessitates efficient model selection, given the impracticality of fine-tuning all candidates due to computational constraints. Despite the recent advances in LLM selection, a fundamental research question largely remains nascent: *how can we model the dynamic behaviors of LLMs during fine-tuning, thereby enhancing our understanding of their generalization performance across diverse downstream tasks?* In this work, we propose a novel theoretical framework that provides a proper lens to assess the generalization capabilities of LLMs, thereby enabling accurate and efficient LLM selection for downstream applications. In particular, we first derive a *PAC-Bayesian Generalization Bound* that unveils fine-tuning dynamics of LLMs and then introduce *LensLLM*, a Neural Tangent Kernel (NTK)-based Rectified Scaling Model that enables accurate performance predictions across diverse tasks while maintaining computational efficiency. Extensive empirical results on 3 large-scale benchmarks demonstrate that our model achieves up to 91.1% accuracy and reduces up to 88.5% computational cost in LLM selection, outperforming 5 state-of-the-art methods. We open-source our proposed *LensLLM* model and corresponding results at [LensLLM.io](https://github.com/Susan571/LENSLLM.git).
Poster
Jiawei Zhang · Xuan Yang · Taiqi Wang · Yu Yao · Aleksandr Petiushko · Bo Li
[ East Exhibition Hall A-B ]
Abstract
Traditional autonomous driving systems often struggle to connect high-level reasoning with low-level control, leading to suboptimal and sometimes unsafe behaviors. Recent advances in multimodal large language models (MLLMs), which process both visual and textual data, offer an opportunity to unify perception and reasoning. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. First, we introduce a Position-Dependent Cross-Entropy (PDCE) loss to improve low-level control signal predictions when values are represented as text. Second, to explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic (e.g., "red light => stop") and embeds them into a probabilistic graphical model (e.g., Markov Logic Network) to verify predicted actions using recognized environmental attributes. Additionally, our Multimodal Retrieval-Augmented Generation (RAG) model leverages video, control signals, and environmental attributes to learn from past driving experiences. Integrating PDCE, MLN, and Multimodal RAG, SafeAuto outperforms existing baselines across multiple datasets, enabling more accurate, reliable, and safer autonomous driving. The code is available at https://github.com/AI-secure/SafeAuto.
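As a toy illustration of the rule-verification step (a hard-logic simplification; SafeAuto itself uses a Markov Logic Network, which weights rules probabilistically), the sketch below checks a predicted action against first-order traffic rules over recognized attributes. The rule contents are hypothetical.

```python
# Hypothetical rules: (preconditions, allowed actions) pairs encoding
# implications such as "red light => stop".
RULES = [({"red_light"}, {"stop"}),
         ({"stop_sign"}, {"stop"})]

def violates(attributes: set, action: str, rules=RULES) -> bool:
    """True if any rule fires (preconditions hold) but the predicted
    action is not among the actions the rule permits."""
    return any(pre <= attributes and action not in allowed
               for pre, allowed in rules)

print(violates({"red_light", "pedestrian"}, "accelerate"))  # True -> reject
print(violates({"green_light"}, "accelerate"))              # False
```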
Poster
Wanyun Xie · Francesco Tonin · Volkan Cevher
[ East Exhibition Hall A-B ]
Abstract
Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To address this, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements over three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in finetuning, consistently improving test perplexity on all finetuning domains over the uniform mixture. Our code is available at https://github.com/LIONS-EPFL/Chameleon.
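The core computation is compact enough to sketch. Below is a hypothetical implementation of leverage scores over domain embeddings: each domain's score is the corresponding diagonal entry of the hat matrix, normalized into a mixture. The exact affinity construction, regularization, and the mapping from scores to the final mixture follow the paper and may differ from this toy version.

```python
import numpy as np

def domain_mixture(E: np.ndarray, reg: float = 1e-6) -> np.ndarray:
    """Leverage scores of domain embeddings (rows of E), as weights.

    score_i = [E (E^T E + reg I)^{-1} E^T]_{ii}, normalized to sum to 1.
    """
    G = E.T @ E + reg * np.eye(E.shape[1])
    scores = np.einsum("ij,jk,ik->i", E, np.linalg.inv(G), E)
    return scores / scores.sum()

E = np.random.randn(7, 32)   # 7 domains, 32-dim learned embeddings
print(domain_mixture(E))     # mixture weights; recompute for new domains
```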
Poster
Tong Wu · Junzhe Shen · Zixia Jia · Yuxuan Wang · Zilong Zheng
[ East Exhibition Hall A-B ]
Abstract
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management, and repetitive generation. To address these issues, we introduce TokenSwift, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model’s inherent quality. Experimental results demonstrate that TokenSwift achieves over $3 \times$ speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TokenSwift as a scalable and effective solution at unprecedented lengths.
Poster
Roman Abramov · Felix Steinbauer · Gjergji Kasneci
[ East Exhibition Hall A-B ]
Abstract
Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns -- yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95--100\% accuracy on 2WikiMultiHopQA -- substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more …
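To make the ratio $\phi_r$ concrete, here is a toy computation on a hypothetical knowledge graph: atomic facts are (head, relation, tail) triples, inferred facts are their 2-hop compositions, and synthetic facts would be added until the ratio crosses the grokking threshold.

```python
# Hypothetical atomic facts (head, relation, tail).
atomic = {("Tokyo", "country", "Japan"),
          ("Japan", "language", "Japanese"),
          ("Paris", "country", "France"),
          ("France", "language", "French")}

def inferred_two_hop(facts):
    """All 2-hop compositions derivable by chaining atomic facts."""
    return {(h, f"{r1}/{r2}", t)
            for h, r1, m in facts
            for m2, r2, t in facts if m == m2}

phi_r = len(inferred_two_hop(atomic)) / len(atomic)
print(phi_r)  # 0.5 here; augmentation raises it past the threshold
```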
Poster
Samira Abnar · Harshay Shah · Dan Busbridge · Alaaeldin Ali · Joshua M Susskind · Vimal Thilak
[ East Exhibition Hall A-B ]
Abstract
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Expert models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects model performance in terms of both pretraining and downstream performance. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.
Poster
Oliver Sieberling · Denis Kuznedelev · Eldar Kurtic · Dan Alistarh
[ East Exhibition Hall A-B ]
Abstract
The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the "importance" of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths.
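A stylized (1+1) evolutionary loop conveys the flavor of the search, though EvoPress's actual operators and its fitness evaluation on calibration data are more sophisticated. Here `importance` is a made-up stand-in for measured loss sensitivity, and swap mutations preserve the global sparsity budget exactly.

```python
import random

random.seed(0)
N_LAYERS = 32
importance = [random.random() for _ in range(N_LAYERS)]  # hypothetical

def fitness(profile):
    """Stand-in for (negated) calibration loss of the compressed model."""
    return -sum(imp * s for imp, s in zip(importance, profile))

profile = [0.3, 0.7] * (N_LAYERS // 2)   # mean sparsity 0.5 = budget
for _ in range(2000):
    child = list(profile)
    i, j = random.sample(range(N_LAYERS), 2)
    child[i], child[j] = child[j], child[i]  # swap keeps the budget
    if fitness(child) >= fitness(profile):
        profile = child
# Sensitive layers end up lightly pruned, insensitive ones heavily,
# without assuming layers contribute independently to the error.
```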
Spotlight Poster
Thomas Zeng · Shuibai Zhang · Shutong Wu · Christian Classen · Daewon Chae · Ethan Ewer · Minjae Lee · Heeju Kim · Wonjun Kang · Jackson Kunde · Ying Fan · Jungtaek Kim · HYUNG IL KOO · Kannan Ramchandran · Dimitris Papailiopoulos · Kangwook Lee
[ East Exhibition Hall A-B ]
Abstract
Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce ***VersaPRM***, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
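Weighted majority voting, used in the Law example above, is simple to state in code. The sketch below assumes each sampled chain of thought has already been reduced to a final answer and a scalar PRM score (e.g., an aggregate of step-level rewards); the data is hypothetical.

```python
from collections import defaultdict

def weighted_majority_vote(samples):
    """samples: (final_answer, prm_score) pairs for one question."""
    weight = defaultdict(float)
    for answer, score in samples:
        weight[answer] += score          # PRM score weights each vote
    return max(weight, key=weight.get)

samples = [("A", 0.91), ("B", 0.42), ("A", 0.77), ("C", 0.30), ("B", 0.55)]
print(weighted_majority_vote(samples))   # "A": 1.68 beats "B": 0.97
```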
Poster
chengqian gao · Haonan Li · Liu Liu · Zeke Xie · Peilin Zhao · Zhiqiang Xu
[ East Exhibition Hall A-B ]
Abstract
The alignment of large language models (LLMs) often assumes that using more clean data yields better outcomes, overlooking the match between model capacity and example difficulty. Challenging this, we propose a new principle: *Preference data vary in difficulty, and overly difficult examples hinder alignment, by exceeding the model's capacity*. Through systematic experimentation, we validate this principle with three key findings: (1) preference examples vary in difficulty, as evidenced by consistent learning orders across alignment runs; (2) overly difficult examples significantly degrade performance across four LLMs and two datasets; and (3) the capacity of a model dictates its threshold for handling difficult examples, underscoring a critical relationship between data selection and model capacity. Building on this principle, we introduce *Selective DPO*, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9-16\% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, surpassing a series of DPO variants with different algorithmic adjustments. These results together illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs. Code is available at https://github.com/glorgao/SelectiveDPO
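The filtering step itself is a one-liner; the substance lies in the difficulty score. A minimal sketch, with a hypothetical per-example difficulty proxy (e.g., a reference model's loss on the pair, or how late the pair is learned during training):

```python
def select_easy_examples(pairs, difficulty, keep_frac=0.7):
    """Keep the easiest fraction of preference pairs, then run DPO."""
    ranked = sorted(pairs, key=difficulty)       # easy -> hard
    return ranked[: int(len(ranked) * keep_frac)]

pairs = [("p1", "chosen1", "rejected1"),
         ("p2", "chosen2", "rejected2"),
         ("p3", "chosen3", "rejected3")]
proxy = {"p1": 0.2, "p2": 1.5, "p3": 0.7}        # hypothetical scores
print(select_easy_examples(pairs, lambda p: proxy[p[0]]))  # drops "p2"
```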
Poster
Fangxu Yu · Lai Jiang · Haoqiang Kang · Shibo Hao · Lianhui Qin
[ East Exhibition Hall A-B ]
Abstract
The ability to generate diverse solutions to a given problem is a hallmark of human creativity. This divergent reasoning is also crucial for machines, enhancing their robustness and enabling them to assist humans in many applications such as scientific discovery. However, existing approaches to multi-step reasoning with large language models (LLMs) have mostly focused only on reasoning accuracy, without further discovering more diverse valid solutions. For example, supervised fine-tuning improves reasoning quality but requires vast labeled data, while reward-maximizing reinforcement learning finds top-reward solutions while neglecting the solution diversity. To fill this gap, we propose Flow of Reasoning (FoR), an efficient diversity-seeking LLM finetuning method aimed at improving reasoning quality and diversity with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow on a DAG-structured reasoning graph. This formulation allows us to incorporate and adapt principled GFlowNet approaches, for finetuning LLMs to sample divergent paths with probabilities proportional to the (unnormalized) reward of target problems. Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across six challenging reasoning tasks, including BlocksWorld (embodied reasoning), Game24 (math puzzle …
Poster
David Reber · Sean Richardson · Todd Nief · Cristina Garbacea · Victor Veitch
[ East Exhibition Hall A-B ]
Abstract
Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are actually rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE) as an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the *causal* effect of an attribute on the reward. RATE uses LLMs to rewrite responses to produce imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting *twice*. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
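A schematic of the double-rewrite estimator (the function names and the `rewrite(x, target=...)` signature are placeholders): for responses that exhibit the attribute, both terms of the contrast pass through the rewriter, so the bias of imperfect rewrites largely cancels.

```python
def rate_att(examples, reward, rewrite):
    """Sketch of RATE's average treatment effect on the treated.

    For each response x with the attribute (W=1), compare the reward of
    a double rewrite (1 -> 0 -> 1) against a single rewrite (1 -> 0),
    rather than comparing R(x) with R(rewrite(x)) directly.
    """
    effects = [
        reward(rewrite(rewrite(x, target=0), target=1))
        - reward(rewrite(x, target=0))
        for x in examples
    ]
    return sum(effects) / len(effects)
```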
Poster
Yuchen Lin · Ronan Le Bras · Kyle Richardson · Ashish Sabharwal · Radha Poovendran · Peter Clark · Yejin Choi
[ East Exhibition Hall A-B ]
Abstract
We investigate the logical reasoning capabilities of Large Language Models (LLMs) and their scalability across complex deductive tasks. Using ZebraLogic, a newly developed benchmark dataset of logic grid puzzles derived from constraint satisfaction problems (CSPs), we systematically evaluate LLM performance. ZebraLogic spans a broad range of search space complexities and incorporates diverse logical constraints, providing a controlled environment to assess reasoning abilities. Our results reveal a significant decline in accuracy as problem complexity increases—a phenomenon we term the “curse of complexity.” Notably, this limitation persists even with scaling model size and inference-time computation, suggesting fundamental constraints in current LLM reasoning capabilities. Additionally, we explore strategies such as Best-of-N sampling, backtracking mechanisms, and self-verification prompts to enhance logical reasoning performance. Our findings provide critical insights into the scaling behavior of LLMs, highlight their limitations, and outline potential directions for advancing their reasoning capabilities.
Poster
Dujian Ding · Ankur Mallick · Shaokun Zhang · Chi Wang · Daniel Madrigal · Mirian Hipolito Garcia · Menglin Xia · Laks Lakshmanan · Qingyun Wu · Victor Ruehle
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired tradeoff. Prior query routing approaches generate only one response from the selected model; because a single response from a small (inexpensive) model is often not good enough to beat a response from a large (expensive) model, they end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.
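A hedged sketch of the routing policy: `difficulty`, `judge`, `small`, and `large` are placeholder callables (a query-difficulty estimator, a response-quality scorer, and the two models), and the thresholding logic is illustrative rather than the paper's trained router.

```python
def best_route(query, small, large, judge, difficulty, n_max=8, tau=0.8):
    """Pick a model and a number of small-model samples per query."""
    d = difficulty(query)
    if d > tau:                   # hardest queries: pay for the large model
        return large(query)
    n = 1 + int(d * (n_max - 1))  # sample more as difficulty grows
    candidates = [small(query) for _ in range(n)]
    return max(candidates, key=lambda r: judge(query, r))  # best-of-n
```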
Poster
Jinlong Pang · Na Di · Zhaowei Zhu · Jiaheng Wei · Hao Cheng · Chen Qian · Yang Liu
[ East Exhibition Hall A-B ]
Abstract
Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant, uninformative, or even harmful. Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning.
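The fixed-reference-model variant can be sketched directly: score each token by how much the reference model improves its log-likelihood over the base model, then mask low-influence tokens out of the SFT loss. The shapes and scoring rule here are illustrative, not the paper's exact estimator.

```python
import torch.nn.functional as F

def token_influence(logits_base, logits_ref, targets):
    """Per-token influence: reference-model log-prob gain over the base.

    logits_*: (batch, seq, vocab); targets: (batch, seq).
    """
    lp_base = -F.cross_entropy(logits_base.transpose(1, 2), targets,
                               reduction="none")
    lp_ref = -F.cross_entropy(logits_ref.transpose(1, 2), targets,
                              reduction="none")
    return lp_ref - lp_base                       # (batch, seq)

def cleaned_sft_loss(logits, targets, influence, threshold=0.0):
    """Cross-entropy over tokens whose influence clears the threshold."""
    mask = (influence > threshold).float()
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (ce * mask).sum() / mask.sum().clamp(min=1)
```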
Poster
Fan Nie · Xiaotian Hou · Shuhang Lin · James Zou · Huaxiu Yao · Linjun Zhang
[ East Exhibition Hall A-B ]
Abstract
The propensity of large language models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains, where rigorous control over Type I errors (the conditional probability of incorrectly classifying hallucinations as truthful content) is essential. Despite its importance, formal verification of LLM factuality with such guarantees remains largely unexplored. In this paper, we introduce FactTest, a novel framework that statistically assesses whether an LLM can provide correct answers to given questions with high-probability correctness guarantees. We formulate hallucination detection as a hypothesis testing problem to enforce an upper bound on Type I errors at user-specified significance levels. Notably, we prove that FactTest also ensures strong Type II error control under mild conditions and can be extended to maintain its effectiveness when covariate shifts exist. Our approach is distribution-free and works for any number of human-annotated samples. It is model-agnostic and applies to any black-box or white-box LM. Extensive experiments on question-answering (QA) benchmarks demonstrate that FactTest effectively detects hallucinations and enables LLMs to abstain from answering unknown questions, leading to an over 40% accuracy improvement.
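The abstention rule can be illustrated with a conformal-style calibration sketch (simplified; the paper's construction includes finite-sample corrections and the Type II guarantees). Here `cal_scores` are hypothetical certainty scores on questions the model answered incorrectly:

```python
import numpy as np

def acceptance_threshold(cal_scores, alpha=0.05):
    """Empirical (1 - alpha) quantile of certainty on wrong answers.

    Answering only above this threshold keeps the Type I error
    (answering while hallucinating) at roughly alpha.
    """
    n = len(cal_scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    return np.sort(cal_scores)[k]

cal_scores = np.random.rand(500)          # hypothetical calibration set
thr = acceptance_threshold(cal_scores, alpha=0.05)
answer = lambda score: "answer" if score > thr else "abstain"
print(thr, answer(0.99), answer(0.10))
```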
Spotlight Poster
Yunzhuo Hao · Jiawei Gu · Huichen Wang · Linjie Li · Zhengyuan Yang · Lijuan Wang · Yu Cheng
[ East Exhibition Hall A-B ]
Abstract
The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks: even advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperform. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
Poster
Hui Dai · Ryan Teehan · Mengye Ren
[ East Exhibition Hall A-B ]
Abstract
Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of a static set of questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates. Code and data are available at https://agenticlearning.ai/daily-oracle.
Poster
Yaolun Zhang · Xiaogeng Liu · Chaowei Xiao
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) have demonstrated the ability to solve a wide range of practical tasks within multi-agent systems. However, existing human-designed multi-agent frameworks are typically limited to a small set of pre-defined scenarios, while current automated design methods suffer from several limitations, such as the lack of tool integration, dependence on external training data, and rigid communication structures. In this paper, we propose \textbf{MetaAgent}, a \textbf{finite state machine} based framework that can automatically generate a multi-agent system. Given a task description, MetaAgent will design a multi-agent system and polish it through an optimization algorithm. When the multi-agent system is deployed, the finite state machine controls the agents' actions and the state transitions. To evaluate our framework, we conduct experiments on both text-based tasks and practical tasks. The results indicate that the generated multi-agent system surpasses other auto-designed methods and can achieve performance comparable to human-designed multi-agent systems that are optimized for those specific tasks.
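In miniature, the deployed controller is just a finite state machine whose states are bound to agents. Everything below (states, transition rules, context fields) is hypothetical, since MetaAgent designs the machine per task.

```python
# Hypothetical FSM: state -> transition rule over the shared context.
TRANSITIONS = {
    "plan":   lambda ctx: "search" if ctx["needs_info"] else "solve",
    "search": lambda ctx: "solve",
    "solve":  lambda ctx: "verify",
    "verify": lambda ctx: "done" if ctx["passed"] else "plan",
}

def run(ctx, agents, state="plan"):
    """Each step: the agent bound to the state acts, then the FSM moves."""
    while state != "done":
        agents[state](ctx)              # e.g., a tool-using LLM agent
        state = TRANSITIONS[state](ctx)
    return ctx
```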
Poster
Shuqing Luo · Pingzhi Li · Jie Peng · Yang Zhao · Yu Cao · Yu Cheng · Tianlong Chen
[ East Exhibition Hall A-B ]
Abstract
Mixture-of-experts (MoE) architectures can achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over 40% of the runtime in large-scale training). In this paper, we first define $\textit{collaborative communication}$ to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them $\textit{collaborated}$, which comprises two cases, $\textit{intra-}$ and $\textit{inter-collaboration}$, depending on whether they are kept on the same device. Our pilot investigations reveal that increasing the proportion of intra-collaboration can accelerate expert parallelism at scale. This motivates us to strategically $\underline{\texttt{o}}$ptimize $\underline{\texttt{c}}$ollaborative $\underline{\texttt{c}}$omm$\underline{\texttt{u}}$nication for acce$\underline{\texttt{l}}$era$\underline{\texttt{t}}$ed MoE training and inference, dubbed $\textbf{\texttt{Occult}}$. Our designs are capable of $\underline{either}$ delivering exact results with reduced communication cost $\underline{or}$ controllably minimizing the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that $\texttt{Occult}$ can be faster than popular state-of-the-art inference or training frameworks (over 50% speedup across multiple tasks and models) with comparable or superior quality compared to the …
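The quantity driving the design, the share of intra-device collaboration, is easy to measure. A toy sketch with a hypothetical routing table and expert placement:

```python
def intra_collaboration_rate(routing, device_of):
    """Fraction of co-activated expert pairs living on the same device."""
    intra = total = 0
    for experts in routing:              # experts co-activated by a token
        for i in range(len(experts)):
            for j in range(i + 1, len(experts)):
                total += 1
                intra += device_of[experts[i]] == device_of[experts[j]]
    return intra / max(total, 1)

routing = [[0, 3], [1, 2], [0, 1]]       # top-2 experts per token
device_of = {0: 0, 1: 0, 2: 1, 3: 1}     # expert -> device placement
print(intra_collaboration_rate(routing, device_of))  # 1/3; a placement
# that raises this rate turns all-to-all traffic into local compute
```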
Spotlight Poster
Linxi Zhao · Yihe Deng · Weitong Zhang · Quanquan Gu
[ East Exhibition Hall A-B ]
Abstract
The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existing objects in the images. To address this issue, previous works focused on using specially curated datasets or powerful LLMs to rectify the outputs of LVLMs. However, these approaches require either costly training or fine-tuning, or API access to proprietary LLMs for post-generation correction. In response to these limitations, we propose Mitigating hallucinAtion via image-gRounded guIdaNcE (MARINE), a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. This is achieved by leveraging open-source vision models to extract object-level information, thereby enhancing the precision of LVLM-generated content. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance. Through comprehensive evaluations across 5 popular LVLMs with diverse evaluation metrics and benchmarks, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it reduces hallucinations consistently in GPT-4V-assisted evaluation while maintaining the detailedness of LVLMs' generations. We release our code at https://github.com/Linxi-ZHAO/MARINE.
Poster
Xixi Wu · Yifei Shen · Fangzhou Ge · Caihua Shan · Yizhu Jiao · Xiangguo Sun · Hong Cheng
[ East Exhibition Hall A-B ]
Abstract
Node classification is a fundamental task in graph analysis, with broad applications across various fields. Recent breakthroughs in Large Language Models (LLMs) have enabled LLM-based approaches for this task. Although many studies demonstrate the impressive performance of LLM-based methods, the lack of clear design guidelines may hinder their practical application. In this work, we aim to establish such guidelines through a fair and systematic comparison of these algorithms. As a first step, we developed LLMNodeBed, a comprehensive codebase and testbed for node classification using LLMs. It includes 10 homophilic datasets, 4 heterophilic datasets, 8 LLM-based algorithms, 8 classic baselines, and 3 learning paradigms. Subsequently, we conducted extensive experiments, training and evaluating over 2,700 models, to determine the key settings (e.g., learning paradigms and homophily) and components (e.g., model size and prompt) that affect performance. Our findings uncover 8 insights, e.g., (1) LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting; (2) Graph Foundation Models can beat open-source LLMs but still fall short of strong LLMs like GPT-4o in a zero-shot setting. We hope that the release of LLMNodeBed, along with our insights, will facilitate reproducible research and inspire future …
Poster
Langzhang Liang · Fanchen Bu · Zixing Song · Zenglin Xu · Shirui Pan · Kijung Shin
[ East Exhibition Hall A-B ]
Abstract
The message-passing paradigm of Graph Neural Networks often struggles with exchanging information across distant nodes typically due to structural bottlenecks in certain graph regions, a limitation known as over-squashing. To reduce such bottlenecks, graph rewiring, which modifies graph topology, has been widely used. However, existing graph rewiring techniques often overlook the need to preserve critical properties of the original graph, e.g., spectral properties. Moreover, many approaches rely on increasing edge count to improve connectivity, which introduces significant computational overhead and exacerbates the risk of over-smoothing. In this paper, we propose a novel graph-rewiring method that leverages spectral graph sparsification for mitigating over-squashing. Specifically, our method generates graphs with enhanced connectivity while maintaining sparsity and largely preserving the original graph spectrum, effectively balancing structural bottleneck reduction and graph property preservation. Experimental results validate the effectiveness of our approach, demonstrating its superiority over strong baseline methods in classification accuracy and retention of the Laplacian spectrum.
Poster
Gleb Rodionov · Liudmila Prokhorenkova
[ East Exhibition Hall A-B ]
Abstract
Neural algorithmic reasoning aims to capture computations with neural networks by training models to imitate the execution of classic algorithms. While common architectures are expressive enough to contain the correct model in the weights space, current neural reasoners struggle to generalize well on out-of-distribution data. On the other hand, classic computations are not affected by distributional shifts as they can be described as transitions between discrete computational states. In this work, we propose to force neural reasoners to maintain the execution trajectory as a combination of finite predefined states. To achieve this, we separate discrete and continuous data flows and describe the interaction between them. Trained with supervision on the algorithm's state transitions, such models are able to perfectly align with the original algorithm. To show this, we evaluate our approach on multiple algorithmic problems and achieve perfect test scores both in single-task and multitask setups. Moreover, the proposed architectural choice allows us to prove the correctness of the learned algorithms for any test data.
Poster
Zebin Wang · Menghan Lin · Bolin Shen · Ken Anderson · Molei Liu · Tianxi Cai · Yushun Dong
[ East Exhibition Hall A-B ]
Abstract
Graph Neural Networks (GNNs) have demonstrated remarkable utility across diverse applications, and their growing complexity has made Machine Learning as a Service (MLaaS) a viable platform for scalable deployment. However, this accessibility also exposes GNNs to serious security threats, most notably model extraction attacks (MEAs), in which adversaries strategically query a deployed model to construct a high-fidelity replica. In this work, we evaluate the vulnerability of GNNs to MEAs and explore their potential for cost-effective model acquisition in non-adversarial research settings. Importantly, adaptive node querying strategies can also serve a critical role in research, particularly when labeling data is expensive or time-consuming. By selectively sampling informative nodes, researchers can train high-performing GNNs with minimal supervision, which is particularly valuable in domains such as biomedicine, where annotations often require expert input. To address this, we propose a node querying strategy tailored to a highly practical yet underexplored scenario, where bulk queries are prohibited and only a limited set of initial nodes is available. Our approach iteratively refines the node selection mechanism over multiple learning cycles, leveraging historical feedback to improve extraction efficiency. Extensive experiments on benchmark graph datasets demonstrate our superiority over comparable baselines in accuracy, fidelity, and F1 score under …
Poster
Dominik Fuchsgruber · Tom Wollschläger · Johannes Bordne · Stephan Günnemann
[ East Exhibition Hall A-B ]
Abstract
While uncertainty estimation for graphs recently gained traction, most methods rely on homophily and deteriorate in heterophilic settings. We address this by analyzing message passing neural networks from an information-theoretic perspective and developing a suitable analog to the data processing inequality to quantify information throughout the model's layers. In contrast to non-graph domains, information about the node-level prediction target can *increase* with model depth if a node's features are semantically different from its neighbors. Therefore, on heterophilic graphs, the latent embeddings of an MPNN each provide different information about the data distribution, unlike in homophilic settings. This reveals that considering all node representations simultaneously is a key design principle for epistemic uncertainty estimation on graphs beyond homophily. We empirically confirm this with a simple post-hoc density estimator on the joint node embedding space that provides state-of-the-art uncertainty on heterophilic graphs. At the same time, it matches prior work on homophilic graphs without explicitly exploiting homophily through post-processing.
Poster
Shuo Wang · Bokui Wang · Zhixiang Shen · Boyan Deng · zhao kang
[ East Exhibition Hall A-B ]
Abstract
Recent advances in CV and NLP have inspired researchers to develop general-purpose graph foundation models through pre-training across diverse domains. However, a fundamental challenge arises from the substantial differences in graph topologies across domains. Additionally, real-world graphs are often sparse and prone to noisy connections and adversarial attacks. To address these issues, we propose the Multi-Domain Graph Foundation Model (MDGFM), a unified framework that aligns and leverages cross-domain topological information to facilitate robust knowledge transfer. MDGFM bridges different domains by adaptively balancing features and topology while refining original graphs to eliminate noise and align topological structures. To further enhance knowledge transfer, we introduce an efficient prompt-tuning approach. By aligning topologies, MDGFM not only improves multi-domain pre-training but also enables robust knowledge transfer to unseen domains. Theoretical analyses provide guarantees of MDGFM's effectiveness and domain generalization capabilities. Extensive experiments on both homophilic and heterophilic graph datasets validate the robustness and efficacy of our method.
Poster
Saman Forouzandeh · Parham Moradi · Mahdi Jalili
[ East Exhibition Hall A-B ]
Abstract
This paper proposes SHARP-Distill (\textbf{S}peedy \textbf{H}ypergraph \textbf{A}nd \textbf{R}eview-based \textbf{P}ersonalised \textbf{Distill}ation), a novel knowledge distillation approach based on the teacher-student framework that combines Hypergraph Neural Networks (HGNNs) with language models to enhance recommendation quality while significantly improving inference time. The teacher model leverages HGNNs to generate user and item embeddings from interaction data, capturing high-order and group relationships, and employs a pre-trained language model to extract rich semantic features from textual reviews. We utilize a contrastive learning mechanism to ensure structural consistency between the various representations. The student model is a shallow, lightweight GCN called CompactGCN, designed to inherit high-order relationships while reducing computational complexity. Extensive experiments on real-world datasets demonstrate that SHARP-Distill achieves up to 68× faster inference time compared to HGNN and 40× faster than LightGCN while maintaining competitive recommendation accuracy.
Poster
Wei Zhuo · Han Yu · Guang Tan · Xiaoxiao Li
[ East Exhibition Hall A-B ]
Abstract
Graph Neural Networks (GNNs) have shown remarkable success in learning from graph-structured data. However, their application to directed graphs (digraphs) presents unique challenges, primarily due to the inherent asymmetry in node relationships. Traditional GNNs are adept at capturing unidirectional relations but fall short in encoding the mutual path dependencies between nodes, such as asymmetrical shortest paths typically found in digraphs. Recognizing this gap, we introduce Commute Graph Neural Networks (CGNN), an approach that seamlessly integrates node-wise commute time into the message passing scheme. The cornerstone of CGNN is an efficient method for computing commute time using a newly formulated digraph Laplacian. Commute time is then integrated into the neighborhood aggregation process, with neighbor contributions weighted according to their respective commute time to the central node in each layer. It enables CGNN to directly capture the mutual, asymmetric relationships in digraphs. Extensive experiments on 8 benchmarking datasets confirm the superiority of CGNN against 13 state-of-the-art methods.
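For intuition, the classical (undirected) commute time that the aggregation weights build on can be computed from the Laplacian pseudoinverse; CGNN's contribution is an efficient digraph analog via its newly formulated digraph Laplacian, which this sketch does not reproduce.

```python
import numpy as np

def commute_times(A: np.ndarray) -> np.ndarray:
    """C(u, v) = vol(G) * (L+_uu + L+_vv - 2 L+_uv) for undirected A."""
    d = A.sum(1)
    L = np.diag(d) - A
    Lp = np.linalg.pinv(L)               # Moore-Penrose pseudoinverse
    diag = np.diag(Lp)
    return d.sum() * (diag[:, None] + diag[None, :] - 2 * Lp)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(commute_times(A).round(2))         # symmetric, zero diagonal
```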
Poster
Chendi Ge · Xin Wang · Zeyang Zhang · Hong Chen · Jiapei Fan · Longtao Huang · Hui Xue' · Wenwu Zhu
[ East Exhibition Hall A-B ]
Abstract
Continual multimodal instruction tuning is crucial for adapting Multimodal Large Language Models (MLLMs) to evolving tasks. However, most existing methods adopt a fixed architecture, struggling with adapting to new tasks due to static model capacity. We propose to evolve the architecture under parameter budgets for dynamic task adaptation, which remains unexplored and imposes two challenges: 1) task architecture conflict, where different tasks require varying layer-wise adaptations, and 2) modality imbalance, where different tasks rely unevenly on modalities, leading to unbalanced updates. To address these challenges, we propose a novel Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method, which automatically evolves MLLM's architecture with controlled parameter budgets to continually adapt to new tasks while retaining previously learned knowledge. Specifically, we propose a dynamic layer-wise expert allocator, which automatically allocates LoRA experts across layers to resolve architecture conflicts, and routes instructions layer-wisely to facilitate knowledge sharing among experts. Then, we propose a gradient-based inter-modal continual curriculum, which adjusts the update ratio of each module in MLLM based on the difficulty of each modality within the task to alleviate the modality imbalance problem. Extensive experiments show that D-MoLE significantly outperforms state-of-the-art baselines, achieving a 15 percent average improvement over the best baseline. To …
Poster
Hailong Jiang · Jianfeng Zhu · Yao Wan · Bo Fang · Hongyu Zhang · Ruoming Jin · Qiang Guan
[ East Exhibition Hall A-B ]
Abstract
Intermediate Representations (IRs) play a critical role in compiler design and program analysis, yet their comprehension by *Large Language Models* (LLMs) remains underexplored. In this paper, we present an explorative empirical study evaluating the capabilities of six state-of-the-art LLMs—GPT-4, GPT-3, DeepSeek, Gemma 2, Llama 3, and Code Llama—in understanding IRs. Specifically, we assess model performance across four core tasks: *control flow graph reconstruction*, *decompilation*, *code summarization*, and *execution reasoning*. While LLMs exhibit competence in parsing IR syntax and identifying high-level structures, they consistently struggle with instruction-level reasoning, especially in control flow reasoning, loop handling, and dynamic execution. Common failure modes include misinterpreting branching instructions, omitting critical operations, and relying on heuristic reasoning rather than on precise instruction-level logic. Our findings highlight the need for IR-specific enhancements in LLM design. We recommend fine-tuning on structured IR datasets and integrating control-flow-sensitive architectures to improve the models’ effectiveness on IR-related tasks. All the experimental data and source code are publicly available at [https://github.com/hjiang13/LLM4IR](https://github.com/hjiang13/LLM4IR).
Poster
Tianhao Wu · Janice Lan · Weizhe Yuan · Jiantao Jiao · JASON WESTON · Sainbayar Sukhbaatar
[ East Exhibition Hall A-B ]
Abstract
LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of explicit thinking before answering. Thinking is important for complex questions that require reasoning and planning -- but can be applied to *any* task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without use of additional human data. We achieve this by an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model to evaluate their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning & problem-solving tasks.
Poster
Guangzhi Sun · Yudong Yang · Jimin Zhuang · Changli Tang · Yixuan Li · Wei Li · Zejun MA · Chao Zhang
[ East Exhibition Hall A-B ]
Abstract
While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. Besides, pDPO achieves 6-8% improvements compared to the supervised fine-tuning model on RivaBench. Enhanced reasoning also equips video-SALMONN-o1 with zero-shot synthetic video detection capabilities.
Poster
Hyeong Kyu Choi · Maxim Khanov · Hongxin Wei · Sharon Li
[ East Exhibition Hall A-B ]
Abstract
Dataset contamination, where evaluation datasets overlap with pre-training corpora, inflates performance metrics and undermines the reliability of model evaluations. Measuring dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model's ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose the Kernel Divergence Score (KDS), a novel method that evaluates dataset contamination by computing the divergence between the kernel similarity matrix of sample embeddings before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings. Code is released at https://github.com/deeplearning-wisc/kernel-divergence-score.
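A compact sketch of the score's skeleton: build kernel similarity matrices of the same sample embeddings before and after fine-tuning and measure how much the kernel moved. The divergence used here is a plain normalized Frobenius distance; the paper's exact divergence and normalization may differ.

```python
import numpy as np

def kernel_divergence_score(emb_before, emb_after, gamma=1.0):
    """Divergence between RBF kernel matrices before/after fine-tuning."""
    def rbf(X):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    Kb, Ka = rbf(emb_before), rbf(emb_after)
    return np.linalg.norm(Kb - Ka) / Kb.shape[0]

emb_before = np.random.randn(100, 16)          # hypothetical embeddings
emb_after = emb_before + 0.05 * np.random.randn(100, 16)
print(kernel_divergence_score(emb_before, emb_after))
# Seen (contaminated) benchmarks shift the kernel less than unseen ones.
```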
Poster
Shahaf E. Finder · Ron Shapira Weber · Moshe Eliasof · Oren Freifeld · Eran Treister
[ East Exhibition Hall A-B ]
Abstract
Message-Passing Neural Networks (MPNNs) have become a cornerstone for processing and analyzing graph-structured data. However, their effectiveness is often hindered by phenomena such as over-squashing, where long-range dependencies or interactions are inadequately captured and expressed in the MPNN output. This limitation mirrors the challenges of the Effective Receptive Field (ERF) in Convolutional Neural Networks (CNNs), where the theoretical receptive field is underutilized in practice. In this work, we show and theoretically explain the limited ERF problem in MPNNs. Furthermore, inspired by recent advances in ERF augmentation for CNNs, we propose an Interleaved Multiscale Message-Passing Neural Networks (IM-MPNN) architecture to address these problems in MPNNs. Our method incorporates a hierarchical coarsening of the graph, enabling message-passing across multiscale representations and facilitating long-range interactions without excessive depth or parameterization. Through extensive evaluations on benchmarks such as the Long-Range Graph Benchmark (LRGB), we demonstrate substantial improvements over baseline MPNNs in capturing long-range dependencies while maintaining computational efficiency.
Poster
Mathilde Papillon · Guillermo Bernardez · Claudio Battiloro · Nina Miolane
[ East Exhibition Hall A-B ]
Abstract
Graph Neural Networks (GNNs) effectively learn from relational data by leveraging graph symmetries. However, many real-world systems---such as biological or social networks---feature multi-way interactions that GNNs fail to capture. Topological Deep Learning (TDL) addresses this by modeling and leveraging higher-order structures, with Combinatorial Complex Neural Networks (CCNNs) offering a general and expressive approach that has been shown to outperform GNNs. However, TDL lacks the principled and standardized frameworks that underpin GNN development, restricting its accessibility and applicability. To address this issue, we introduce Generalized CCNNs (GCCNs), a simple yet powerful family of TDL models that can be used to systematically transform any (graph) neural network into its TDL counterpart. We prove that GCCNs generalize and subsume CCNNs, while extensive experiments on a diverse class of GCCNs show that these architectures consistently match or outperform CCNNs, often with less model complexity. In an effort to accelerate and democratize TDL, we introduce TopoTune, a lightweight software for defining, building, and training GCCNs with unprecedented flexibility and ease.
Poster
Yuankai Luo · Lei Shi · Xiao-Ming Wu
[ East Exhibition Hall A-B ]
Abstract
Message-passing Graph Neural Networks (GNNs) are often criticized for their limited expressiveness, issues like over-smoothing and over-squashing, and challenges in capturing long-range dependencies. Conversely, Graph Transformers (GTs) are regarded as superior due to their employment of global attention mechanisms, which potentially mitigate these challenges. Literature frequently suggests that GTs outperform GNNs in graph-level tasks, especially for graph classification and regression on small molecular graphs. In this study, we explore the untapped potential of GNNs through an enhanced framework, GNN+, which integrates six widely used techniques: edge feature integration, normalization, dropout, residual connections, feed-forward networks, and positional encoding, to effectively tackle graph-level tasks. We conduct a systematic re-evaluation of three classic GNNs—GCN, GIN, and GatedGCN—enhanced by the GNN+ framework across 14 well-known graph-level datasets. Our results reveal that, contrary to prevailing beliefs, these classic GNNs consistently match or surpass the performance of GTs, securing top-three rankings across all datasets and achieving first place in eight. Furthermore, they demonstrate greater efficiency, running several times faster than GTs on many datasets. This highlights the potential of simple GNN architectures, challenging the notion that complex mechanisms in GTs are essential for superior graph-level performance. Our source code is available at https://github.com/LUOyk1999/GNNPlus.
Spotlight Poster
Artem Moskalev · Mangal Prakash · Junjie Xu · Tianyu Cui · Rui Liao · Tommaso Mansi
[ East Exhibition Hall A-B ]
Abstract
Processing global geometric context while preserving equivariance is crucial when modeling biological, chemical, and physical systems. Yet, this is challenging due to the computational demands of equivariance and global context at scale. Standard methods such as equivariant self-attention suffer from quadratic complexity, while local methods such as distance-based message passing sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, we introduce Geometric Hyena, the first equivariant long-convolutional model for geometric systems. Geometric Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on all-atom property prediction of large RNA molecules and full protein molecular dynamics, Geometric Hyena outperforms existing equivariant models while requiring significantly less memory and compute than equivariant self-attention. Notably, our model processes the geometric context of $30k$ tokens $20 \times$ faster than the equivariant transformer and allows $72 \times$ longer context within the same budget.
Poster
Shuo Wang · Shunyang Huang · Jinghui Yuan · Zhixiang Shen · zhao kang
[ East Exhibition Hall A-B ]
Abstract
Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the **Cooperation of Experts (CoE)** framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By transcending modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel **large margin** mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework’s feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability.
Poster
Gregoire Fournier · Sourav Medya
[ East Exhibition Hall A-B ]
Abstract
Graph neural networks (GNNs) have been widely used in various domains such as social networks, molecular biology, or recommendation systems. Concurrently, different explanation methods for GNNs have arisen to complement their black-box nature. Explanations of GNNs’ predictions can be categorized into two types: factual and counterfactual. Given a GNN trained on binary classification into “accept” and “reject” classes, a global counterfactual explanation consists of generating a small set of “accept” graphs relevant to all of the input “reject” graphs. The transformation of a “reject” graph into an “accept” graph is called a recourse. A common recourse explanation is a small set of recourses from which every “reject” graph can be turned into an “accept” graph. Although local counterfactual explanations have been studied extensively, the problem of finding common recourses for global counterfactual explanations remains unexplored, particularly for GNNs. In this paper, we formalize the common recourse explanation problem and design an effective algorithm, COMRECGC, to solve it. We benchmark our algorithm against strong baselines on four different real-world graph datasets and demonstrate the superior performance of COMRECGC against the competitors. We also compare the common recourse explanations to the graph counterfactual explanation, showing that common recourse explanations are either comparable …
Poster
Zheng Gong · Ying Sun
[ East Exhibition Hall A-B ]
Abstract
Discrete Graph Diffusion Models (DGDMs) mark a pivotal advancement in graph generation, effectively preserving sparsity and structural integrity, thereby enhancing the learning of graph data distributions for diverse generative applications. Despite their potential, DGDMs are computationally intensive due to the numerous low-parameter yet high-computation operations, thereby increasing the need for inference acceleration. A promising solution to mitigate this issue is model quantization. However, existing quantization techniques for Image Diffusion Models (IDMs) face limitations in DGDMs due to differing diffusion processes, while Large Language Model (LLM) quantization focuses on reducing the memory access latency of loading large parameters; in DGDMs, by contrast, the inference bottleneck is computation, owing to smaller model sizes. To fill this gap, we introduce Bit-DGDM, a post-training quantization framework for DGDMs that incorporates two novel ideas: (i) sparse-dense activation quantization, which models activation outliers in full precision via adaptively selected, data-free thresholds and quantizes the remaining activations to low-bit; and (ii) ill-conditioned low-rank decomposition, which decomposes the weights into a low-rank component that enables faster inference and an $\alpha$-sparsity matrix that models outliers. Extensive experiments demonstrate that Bit-DGDM not only reduces the memory usage from the FP32 baseline by up to $2.8\times$ and achieves up to $2.5\times$ speedup, but also achieves comparable performance to the …
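Idea (i) in isolation, sketched with a simple quantile threshold (Bit-DGDM derives its thresholds data-free and pairs this with the low-rank weight decomposition, which is omitted here):

```python
import torch

def sparse_dense_quantize(x: torch.Tensor, outlier_frac=0.01, bits=4):
    """Keep top-magnitude activations in full precision; quantize the rest.

    Returns a dequantized approximation: low-bit dense part + FP outliers.
    """
    thr = x.abs().flatten().quantile(1 - outlier_frac)
    outliers = torch.where(x.abs() > thr, x, torch.zeros_like(x))
    dense = x - outliers                      # bounded-range remainder
    qmax = 2 ** (bits - 1) - 1
    scale = dense.abs().max().clamp(min=1e-8) / qmax
    q = (dense / scale).round().clamp(-qmax - 1, qmax)
    return q * scale + outliers

x = torch.randn(4096)
x[0] = 25.0                                   # inject an outlier
print((sparse_dense_quantize(x) - x).abs().max())  # small reconstruction error
```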
Poster
Alex Vitvitskyi · João Madeira Araujo · Marc Lackenby · Petar Veličković
[ East Exhibition Hall A-B ]
Abstract
As implied by the plethora of literature on graph rewiring, the choice of computational graph employed by a neural network can make a significant impact on its downstream performance. Certain effects related to the computational graph, such as under-reaching and over-squashing, may even render the model incapable of learning certain functions. Most of these effects have only been thoroughly studied in the domain of undirected graphs; however, recent years have seen a significant rise in interest in feedforward computational graphs: directed graphs without any back edges. In this paper, we study the desirable properties of a feedforward computational graph, discovering two important complementary measures: fidelity and mixing time, and evaluating a few popular choices of graphs through the lens of these measures. Our study is backed by both theoretical analyses of the metrics' asymptotic behaviour for various graphs, as well as correlating these metrics to the performance of trained neural network models using the corresponding graphs.
Poster
Ziyue Qiao · Qianyi Cai · Hao Dong · Jiawei Gu · Pengyang Wang · Meng Xiao · Xiao Luo · Hui Xiong
[ East Exhibition Hall A-B ]
Abstract
This paper addresses the challenge of graph domain adaptation on evolving, multiple out-of-distribution (OOD) graphs. Conventional graph domain adaptation methods are confined to single-step adaptation, making them ineffective at handling continuous domain shifts and prone to catastrophic forgetting. This paper introduces the Graph Continual Adaptive Learning (GCAL) method, designed to enhance model sustainability and adaptability across various graph domains. GCAL employs a bilevel optimization strategy. The "adapt" phase uses an information-maximization approach to fine-tune the model on new graph domains while re-adapting past memories to mitigate forgetting. Concurrently, the "generate memory" phase, guided by a theoretical lower bound derived from information bottleneck theory, uses a variational memory graph generation module to condense original graphs into memories. Extensive experimental evaluations demonstrate that GCAL substantially outperforms existing methods in terms of adaptability and knowledge retention.
Spotlight Poster
Reyhane Askari Hemmat · Mohammad Pezeshki · Elvis Dohmatob · Florian Bordes · Pietro Astolfi · Melissa Hall · Jakob Verbeek · Michal Drozdzal · Adriana Romero-Soriano
[ East Exhibition Hall A-B ]
Abstract
Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.
Poster
Peter Pao-Huang · Mitchell Black · Xiaojie Qiu
[ East Exhibition Hall A-B ]
Abstract
Generative modeling of graphs with spatial structure is essential across many applications from computer graphics to spatial genomics. Recent flow-based generative models have achieved impressive results by gradually adding and then learning to remove noise from these graphs. Existing models, however, use graph neural network architectures that are independent of the noise level, limiting their expressiveness. To address this issue, we introduce *Noise-Conditioned Graph Networks* (NCGNs), a class of graph neural networks that dynamically modify their architecture according to the noise level during generation. Our theoretical and empirical analysis reveals that as noise increases, (1) graphs require information from increasingly distant neighbors and (2) graphs can be effectively represented at lower resolutions. Based on these insights, we develop Dynamic Message Passing (DMP), a specific instantiation of NCGNs that adapts both the range and resolution of message passing to the noise level. DMP consistently outperforms noise independent architectures on a variety of domains including $3$D point clouds, spatiotemporal transcriptomics, and images.
Poster
Wenhao SUN · Rong-Cheng Tu · Jingyi Liao · Zhao Jin · Dacheng Tao
[ East Exhibition Hall A-B ]
Abstract
Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. Existing video diffusion sampling acceleration methods often rely on costly fine-tuning or exhibit limited generalization capabilities. We propose Asymmetric Reduction and Restoration (**AsymRnR**), **a training-free and model-agnostic method to accelerate video DiTs**. It builds on the observation that the redundancy of feature tokens in DiTs varies significantly across model blocks, denoising steps, and feature types. AsymRnR asymmetrically reduces redundant tokens in the attention operation, achieving acceleration with negligible degradation in output quality and, in some cases, even improving it. We also tailor a reduction schedule to distribute the reduction across components adaptively. To further accelerate this process, we introduce a matching cache for more efficient reduction. Backed by theoretical foundations and extensive experimental validation, AsymRnR integrates into state-of-the-art video DiTs and offers substantial speedup.
Poster
Sophia Hager · Aleem Khan · Andrew Wang · Nicholas Andrews
[ East Exhibition Hall A-B ]
Abstract
Most successful applications of deep learning involve similar training and test conditions. However, tasks such as biological sequence design involve searching for sequences that improve desirable properties beyond previously known values, which requires novel hypotheses that \emph{extrapolate} beyond training data. In these settings, extrapolation may be achieved by using random search methods such as Markov chain Monte Carlo (MCMC), which, given an initial state, sample local transformations to approximate a target density that rewards states with the desired properties. However, even with a well-designed proposal, MCMC may struggle to explore large structured state spaces efficiently. Rather than relying on stochastic search, it would be desirable to have a model that greedily optimizes the properties of interest, successfully extrapolating in as few steps as possible. We propose to learn such a model from the Markov chains resulting from MCMC search. Specifically, our approach uses selected states from Markov chains as a source of training data for an autoregressive model, which is then able to efficiently generate novel sequences that extrapolate along the sequence-level properties of interest. The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the autoregressive model can extrapolate …
Spotlight Poster
Thibaut Issenhuth · Sangchul Lee · Ludovic Dos Santos · Jean-Yves Franceschi · Chansoo Kim · alain rakotomamonjy
[ East Exhibition Hall A-B ]
Abstract
Consistency models imitate the multi-step sampling of score-based diffusion in a single forward pass of a neural network. They can be learned in two ways: consistency distillation and consistency training. The former relies on the true velocity field of the corresponding differential equation, approximated by a pre-trained neural network. In contrast, the latter uses a single-sample Monte Carlo estimate of this velocity field. The related estimation error induces a discrepancy between consistency distillation and training that, we show, still holds in the continuous-time limit. To alleviate this issue, we propose a novel flow that transports noisy data towards their corresponding outputs derived from a consistency model. We prove that this flow reduces the previously identified discrepancy and the noise-data transport cost. Consequently, our method not only accelerates consistency training convergence but also enhances its overall performance. The code is available at https://github.com/thibautissenhuth/consistency_GC.
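A compact way to see the distinction between the two training modes, assuming the common VE parameterization $x_t = x_0 + t\varepsilon$, $\varepsilon \sim \mathcal{N}(0, I)$ (our notation for illustration, not necessarily the paper's):

$$
v(x_t, t) \;=\; \mathbb{E}\!\left[\frac{x_t - x_0}{t} \,\Big|\, x_t\right],
$$

where consistency distillation approximates this conditional expectation with a pre-trained network, while consistency training substitutes the single-sample estimate $(x_t - x_0)/t$ from the sampled pair; the variance of that estimate is the source of the discrepancy studied above.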
Poster
Leo Brunswic · Mateo Clémente · Rui Heng Yang · Adam Sigal · Amir Rasouli · Yinchuan Li
[ East Exhibition Hall A-B ]
Abstract
Generative Flow Networks (GFNs) were initially introduced on directed acyclic graphs to sample from an unnormalized distribution density. Recent works have extended the theoretical framework for generative methods, allowing more flexibility and enhancing the application range. However, many challenges remain in training GFNs in continuous settings and for imitation learning (IL), including intractability of the flow-matching loss, limited tests of non-acyclic training, and the need for a separate reward model in imitation learning. The present work proposes a family of generative flows called Ergodic Generative Flows (EGFs) which are used to address the aforementioned issues. First, we leverage ergodicity to build simple generative flows with finitely many globally defined transformations (diffeomorphisms) with universality guarantees and tractable flow-matching loss (FM loss). Second, we introduce a new loss involving cross-entropy coupled to weak flow-matching control, coined KL-weakFM loss. It is designed for IL training without a separate reward model. We evaluate IL-EGFs on toy 2D tasks and real-world datasets from NASA on the sphere, using the KL-weakFM loss. Additionally, we conduct toy 2D reinforcement learning experiments with a target reward, using the FM loss.
Poster
Kulin Shah · Alkis Kalavasis · Adam Klivans · Giannis Daras
[ East Exhibition Hall A-B ]
Abstract
There is strong empirical evidence that the state-of-the-art diffusion modeling paradigm leads to models that memorize the training set, especially when the training set is small. Prior methods to mitigate the memorization problem often lead to a decrease in image quality. Is it possible to obtain strong and creative generative models, i.e., models that achieve high generation quality and low memorization? Despite the current pessimistic landscape of results, we make significant progress in pushing the trade-off between fidelity and memorization. We first provide theoretical evidence that memorization in diffusion models is only necessary for denoising problems at low noise scales (usually used in generating high-frequency details). Using this theoretical insight, we propose a simple, principled method to train diffusion models using noisy data at large noise scales. We show that our method significantly reduces memorization without decreasing image quality, for both text-conditional and unconditional models and for a variety of data availability settings.
Poster
Xiaojun Xu · jinghan jia · Yuanshun Yao · Yang Liu · Hang Li
[ East Exhibition Hall A-B ]
Abstract
We propose an imperceptible multi-bit text watermark embedded by paraphrasing with LLMs. We fine-tune a pair of LLM paraphrasers that are designed to behave differently, so that their paraphrasing difference, reflected in the text semantics, can be identified by a trained decoder. To embed our multi-bit watermark, we use the two paraphrasers alternately to encode a pre-defined binary code at the sentence level. We then use a text classifier as the decoder to decode each bit of the watermark. Through extensive experiments, we show that our watermarks can achieve over 99.99% detection AUC with small (1.1B) text paraphrasers while preserving the semantic information of the original sentence. More importantly, our pipeline is robust under word-substitution and sentence-paraphrasing perturbations and generalizes well to out-of-distribution data. We also show the stealthiness of our watermark with LLM-based evaluation.
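A minimal sketch of the sentence-level encode/decode loop described above; `paraphraser_0`, `paraphraser_1`, and `bit_classifier` are hypothetical callables standing in for the fine-tuned models, not the paper's actual API:

```python
def embed_watermark(sentences, bits, paraphraser_0, paraphraser_1):
    """Encode one watermark bit per sentence by choosing which paraphraser rewrites it."""
    assert len(sentences) == len(bits)
    return [
        (paraphraser_1 if bit else paraphraser_0)(sent)
        for sent, bit in zip(sentences, bits)
    ]

def decode_watermark(sentences, bit_classifier):
    """Recover the bit string by classifying which paraphraser produced each sentence."""
    return [int(bit_classifier(sent) > 0.5) for sent in sentences]
```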
Poster
Ruofeng Yang · Bo Jiang · Cheng Chen · Shuai Li
[ East Exhibition Hall A-B ]
Abstract
Consistency models, a new class of one-step generative models, have shown competitive performance with multi-step diffusion models. The most challenging part of consistency models is the training process, which discretizes the continuous diffusion process into $K$ steps and trains a one-step mapping function on these discretized timepoints. Despite the empirical success, only a few works focus on the discretization complexity $K$, and their setting is far from that of empirical works. More specifically, the current theoretical works analyze the variance preserving (VP) diffusion process with a uniform stepsize, while empirical works adopt a variance exploding (VE) process with a decay discretization stepsize. As a result, these works suffer from large discretization complexity and fail to explain the empirical success of consistency models. To close the gap between theory and application, we analyze consistency models with (1) VE process and (2) decay stepsize and prove the state-of-the-art discretization complexity for consistency models. This result is competitive with the results of diffusion models and shows the potential of consistency models. To balance the computation and performance, previous empirical work further proposes a $2$-step consistency algorithm. In this work, we also analyze the role of $2$-step sampling and show that it improves the …
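As a concrete instance of the VE-with-decaying-stepsize setting contrasted above with uniform VP grids, empirical consistency-model works commonly use a Karras-style discretization (shown here for illustration; the paper's exact schedule may differ):

$$
t_i = \Bigl(t_{\max}^{1/\rho} + \tfrac{i}{K-1}\bigl(t_{\min}^{1/\rho} - t_{\max}^{1/\rho}\bigr)\Bigr)^{\rho}, \qquad i = 0, \dots, K-1, \quad \rho \approx 7,
$$

which concentrates discretization points near $t_{\min}$, where the one-step mapping changes fastest.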
Poster
David Hayden · Mao Ye · Timur Garipov · Gregory Meyer · Carl Vondrick · Zhao Chen · Yuning Chai · Eric M. Wolff · Siddhartha Srinivasa
[ East Exhibition Hall A-B ]
Abstract
It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a VLM to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model.
Poster
Shanchuan Lin · Xin Xia · Yuxi Ren · Ceyuan Yang · Xuefeng Xiao · Lu Jiang
[ East Exhibition Hall A-B ]
Abstract
Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data, following diffusion pre-training, for one-step video generation. To improve training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarially post-trained model can generate two-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
Poster
Muskan Dosi · Chiranjeev Chiranjeev · Kartik Thakral · Mayank Vatsa · Richa Singh
[ East Exhibition Hall A-B ]
Abstract
Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce \textbf{HyperSphereDiff} to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty. We demonstrate both theoretically and empirically that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold.
Poster
Kuan Po Huang · Shu-wen Yang · Huy Phan · Bo-Ru Lu · Byeonggeun Kim · Sashank Macha · Qingming Tang · Shalini Ghosh · Hung-yi Lee · Chieh-Chi Kao · Chao Wang
[ East Exhibition Hall A-B ]
Abstract
Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and the AudioLDM series, represent the state-of-the-art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models. The project website is available at https://audio-impact.github.io/.
Poster
Alexander Kolesov · S. Manukhov · Vladimir Palyulin · Aleksandr Korotin
[ East Exhibition Hall A-B ]
Abstract
We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modelling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. We then learn the capacitor's electrostatic field using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments.
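A minimal sketch of the transport step, assuming a trained network `field(x)` that returns the learned electrostatic field at `x`; the explicit Euler integrator, the normalization, and the step counts are illustrative choices rather than the paper's exact procedure:

```python
import torch

def transport(x, field, n_steps=100, step_size=0.01):
    """Move samples from one 'capacitor plate' to the other along learned field lines."""
    for _ in range(n_steps):
        E = field(x)                                           # learned field E(x)
        direction = E / (E.norm(dim=-1, keepdim=True) + 1e-8)  # field *lines*: direction only
        x = x + step_size * direction                          # unit-speed Euler step
    return x
```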
Poster
Eric Frankel · Sitan Chen · Jerry Li · Pang Wei Koh · Lillian Ratliff · Sewoong Oh
[ East Exhibition Hall A-B ]
Abstract
Diffusion models (DMs) create samples from a data distribution by starting from random noise and iteratively solving a reverse-time ordinary differential equation (ODE). Because each step in the iterative solution requires an expensive neural function evaluation (NFE), there has been significant interest in approximately solving these diffusion ODEs with only a few NFEs without modifying the underlying model. However, in the few-NFE regime, we observe that tracking the true ODE evolution is fundamentally impossible using traditional ODE solvers. In this work, we propose a new method that learns a good solver for the DM, which we call **S**olving **for** the **S**olver (**S4S**). S4S directly optimizes a solver to obtain good generation quality by learning to match the output of a strong teacher solver. We evaluate S4S on six different pre-trained DMs, including pixel-space and latent-space DMs for both conditional and unconditional sampling. In all settings, S4S uniformly improves the sample quality relative to traditional ODE solvers. Moreover, our method is lightweight, data-free, and can be plugged in black-box on top of any discretization schedule or architecture to improve performance. Building on top of this, we also propose **S4S-Alt**, which optimizes both the solver and the discretization schedule. By exploiting the full design space of DM solvers, with …
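A minimal sketch of the idea of learning a solver against a teacher, assuming a linear-multistep parameterization with learnable coefficients `coeffs`; this solver family and the helpers `eps_model` and `teacher_solve` are illustrative stand-ins, not S4S's actual design:

```python
import torch

def s4s_loss(x_T, eps_model, teacher_solve, timesteps, coeffs):
    """Few-NFE student solver trained to match a strong many-NFE teacher's output."""
    x, history = x_T, []
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        history.append(eps_model(x, t))               # one NFE per student step
        k = len(coeffs[i])                            # learned combination of past NFEs
        drift = sum(c * h for c, h in zip(coeffs[i], history[-k:]))
        x = x + (t_next - t) * drift
    return ((x - teacher_solve(x_T)) ** 2).mean()     # distill the teacher's endpoint
```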
Poster
Dongya Jia · Zhuo Chen · Jiawei Chen · Chenpeng Du · Jian Wu · Jian Cong · Xiaobin Zhuang · Chumin Li · Zhen Wei · Yuping Wang · Yuxuan Wang
[ East Exhibition Hall A-B ]
Abstract
Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in an extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
Spotlight Poster
Zian Li · Cai Zhou · Xiyuan Wang · Xingang Peng · Muhan Zhang
[ East Exhibition Hall A-B ]
Abstract
Recent advances in molecular generative models have demonstrated great promise for accelerating scientific discovery, particularly in drug design. However, these models often struggle to generate high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework to improve molecular generative models by integrating geometric representation conditions with provable theoretical guarantees. We decompose the generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on the representation. Compared with single-stage generation, the easy-to-generate representation in the first stage guides the second stage generation toward a high-quality molecule in a goal-oriented way. Leveraging EDM and SemlaFlow as base generators, we observe significant quality improvements in unconditional molecule generation on the widely used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecular generation task, our framework achieves an average 50\% performance improvement over state-of-the-art approaches, highlighting the superiority of conditioning on semantically rich geometric representations. Furthermore, with such representation guidance, the number of diffusion steps can be reduced to as small as 100 while largely preserving the generation quality achieved with 1,000 steps, thereby significantly reducing the generation iterations needed.
Poster
Zhuofan Zong · Dongzhi Jiang · Bingqi Ma · Guanglu Song · Hao Shao · Dazhong Shen · Yu Liu · Hongsheng Li
[ East Exhibition Hall A-B ]
Abstract
Significant achievements in the personalization of diffusion models have been witnessed. Conventional tuning-free methods mostly encode multiple reference images by averaging or concatenating their image embeddings as the injection condition, but such an image-independent operation cannot perform interaction among images to capture consistent visual elements within the multiple references. Although tuning-based approaches can effectively extract consistent elements within multiple images through the training process, they necessitate test-time finetuning for each distinct image group. This paper introduces EasyRef, a plug-and-play adaptation method that empowers diffusion models to condition on consistent visual elements (e.g., style and human facial identity) across multiple reference images under instruction control. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of a multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Besides, injecting the MLLM's representations into the diffusion process through adapters can easily generalize to unseen domains. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate EasyRef surpasses both tuning-free and tuning-based methods, achieving superior aesthetic quality and …
Poster
Chao Li · Jiawei Fan · Anbang Yao
[ East Exhibition Hall A-B ]
Abstract
In this paper, we present $Morse$, a simple dual-sampling framework for accelerating diffusion models losslessly. The key insight of Morse is to reformulate the iterative generation (from noise to data) process by taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models called $Dash$ and $Dot$ that interact with each other. The Dash model is just the pre-trained diffusion model of any type, but operates in a jump sampling regime, creating sufficient space for sampling efficiency improvement. The Dot model is significantly faster than the Dash model; it is learnt to generate residual feedback conditioned on the observations at the current jump sampling point on the trajectory of the Dash model, lifting the noise estimate to easily match the next-step estimate of the Dash model without jump sampling. By chaining the outputs of the Dash and Dot models run in a time-interleaved fashion, Morse exhibits the merit of flexibly attaining desired image generation performance while improving overall runtime efficiency. With our proposed weight-sharing strategy between the Dash and Dot models, Morse is efficient for training and inference. Our method shows a lossless speedup of 1.78$\times$ to 3.31$\times$ on average over a wide range …
Poster
Hohyun Kim · Seunggeun Lee · Min-hwan Oh
[ East Exhibition Hall A-B ]
Abstract
Generative Flow Networks (GFlowNets) offer a powerful framework for sampling graphs in proportion to their rewards. However, existing approaches suffer from systematic biases due to inaccuracies in state transition probability computations. These biases, rooted in the inherent symmetries of graphs, impact both atom-based and fragment-based generation schemes. To address this challenge, we introduce Symmetry-Aware GFlowNets (SA-GFN), a method that incorporates symmetry corrections into the learning process through reward scaling. By integrating bias correction directly into the reward structure, SA-GFN eliminates the need for explicit state transition computations. Empirical results show that SA-GFN enables unbiased sampling while enhancing diversity and consistently generating high-reward graphs that closely match the target distribution.
Poster
Weilun Feng · Chuanguang Yang · Haotong Qin · Xiangqi Li · Yu Wang · Zhulin An · Libo Huang · Boyu Diao · Zixiang Zhao · Yongjun Xu · Michele Magno
[ East Exhibition Hall A-B ]
Abstract
Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present **Q-VDiT**, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the *Token-aware Quantization Estimator* (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce *Temporal Maintenance Distillation* (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency score of 23.40, setting a new benchmark and outperforming the current state-of-the-art quantization methods by **1.9$\times$**.
Poster
Zheng Fang · Lichuan Xiang · Xu Cai · Kaicheng Zhou · Hongkai Wen
[ East Exhibition Hall A-B ]
Abstract
Spatial conditioning control offers a powerful way to guide diffusion-based generative models. Yet most implementations (e.g., ControlNet) rely on ad-hoc heuristics to choose which network blocks to control, an approach that varies unpredictably across tasks. To address this gap, we propose FlexControl, a novel framework that equips all diffusion blocks with control signals during training and employs a trainable gating mechanism to dynamically select which control signals to activate at each denoising step. By introducing a computation-aware loss, we encourage a control signal to activate only when it benefits generation quality. By eliminating manual control unit selection, FlexControl enhances adaptability across diverse tasks and streamlines the design pipeline, trained end-to-end with the computation-aware loss. Through comprehensive experiments on both UNet and DiT architectures with different control methods, we show that our method can upgrade existing controllable generative models in certain key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while also reducing computational overhead by selectively activating the most relevant blocks to control. These results underscore the potential of a flexible, data-driven approach for controlled diffusion and open new avenues for efficient generative model …
Poster
Grigoriy Ksenofontov · Aleksandr Korotin
[ East Exhibition Hall A-B ]
Abstract
The Schrödinger Bridge (SB) is a powerful framework for solving generative modeling tasks such as unpaired domain translation. Most SB-related research focuses on the continuous data space $\mathbb{R}^{D}$ and leaves open theoretical and algorithmic questions about applying SB methods to discrete data, e.g., on finite spaces $\mathbb{S}^{D}$. Notable examples of such sets $\mathbb{S}$ are codebooks of vector-quantized (VQ) representations of modern autoencoders, tokens in texts, categories of atoms in molecules, etc. In this paper, we provide a theoretical and algorithmic foundation for solving SB in discrete spaces using the recently introduced Iterative Markovian Fitting (IMF) procedure. Specifically, we theoretically justify the convergence of discrete-time IMF (D-IMF) to SB in discrete spaces. This enables us to develop a practical computational algorithm for SB, which we call Categorical Schrödinger Bridge Matching (CSBM). We show the performance of CSBM via a series of experiments with synthetic data and VQ representations of images. The code of CSBM is available at [this repository](https://github.com/gregkseno/csbm).
Poster
Pengsheng Guo · Alex Schwing
[ East Exhibition Hall A-B ]
Abstract
We study Variational Rectified Flow Matching, a framework that enhances classic rectified flow matching by modeling multi-modal velocity vector fields. At inference time, classic rectified flow matching 'moves' samples from a source distribution to the target distribution by solving an ordinary differential equation via integration along a velocity vector field. At training time, the velocity vector field is learnt by linearly interpolating between coupled samples, one drawn at random from the source and one from the target distribution. This leads to "ground-truth" velocity vector fields that point in different directions at the same location, i.e., the velocity vector fields are multi-modal/ambiguous. However, since training uses a standard mean-squared-error loss, the learnt velocity vector field averages the "ground-truth" directions and isn't multi-modal. In contrast, variational rectified flow matching learns and samples from multi-modal flow directions. We show on synthetic data, MNIST, CIFAR-10, and ImageNet that variational rectified flow matching leads to compelling results.
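To make the averaging effect concrete, recall the standard rectified-flow setup (conventional notation, assumed here):

$$
x_t = (1-t)\,x_0 + t\,x_1, \qquad \min_\theta\ \mathbb{E}\,\bigl\|v_\theta(x_t, t) - (x_1 - x_0)\bigr\|^2 \;\;\Longrightarrow\;\; v_\theta^{*}(x_t, t) = \mathbb{E}\bigl[x_1 - x_0 \mid x_t\bigr],
$$

so the MSE-optimal field is the mean over all "ground-truth" directions passing through $x_t$; variational rectified flow matching models this conditional distribution rather than only its mean.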
Poster
Elias Chaibub Neto
[ East Exhibition Hall A-B ]
Abstract
The development of deep generative models for tabular data is currently a very active research area in machine learning. These models, however, tend to be computationally heavy and require careful tuning of multiple model parameters. In this paper, we propose TabSDS - a lightweight, non-parametric, and model-free alternative to tabular deep generative models which leverages rank and data-shuffling transformations for generating synthetic data that closely approximates the joint probability distribution of the real data. We evaluate TabSDS against multiple baselines implemented in the Synthcity Python library across several datasets. TabSDS showed very competitive performance against all baselines (including TabDDPM - a strong baseline model for tabular data generation). Importantly, the execution time of TabSDS is orders of magnitude faster than the deep generative baselines, and also considerably faster than other computationally efficient baselines such as adversarial random forests.
Poster
Haoye Lu · Qifan Wu · Yaoliang Yu
[ East Exhibition Hall A-B ]
Abstract
Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward--Backward Deconvolution (SFBD) method, we attain FID 6.31 on CIFAR-10 with just 4% clean images (and 3.58 with 10%). We also provide theoretical guarantees that SFBD learns the true data distribution. These results underscore the value of limited clean pretraining, or pretraining on similar datasets. Empirical studies further validate and enrich our findings.
Poster
Floor Eijkelboom · Heiko Zimmermann · Sharvaree Vadgama · Erik Bekkers · Max Welling · Christian Andersson Naesseth · Jan-Willem van de Meent
[ East Exhibition Hall A-B ]
Abstract
We derive a controlled generation objective within the framework of Variational Flow Matching (VFM), which casts flow matching as a variational inference problem. We demonstrate that controlled generation can be implemented in two ways: (1) by end-to-end training of conditional generative models, or (2) as a Bayesian inference problem, enabling post hoc control of unconditional models without retraining. Furthermore, we establish the conditions required for equivariant generation and provide an equivariant formulation of VFM tailored for molecular generation, ensuring invariance to rotations, translations, and permutations. We evaluate our approach on both uncontrolled and controlled molecular generation, achieving state-of-the-art performance on uncontrolled generation and outperforming state-of-the-art models in controlled generation, both with end-to-end training and in the Bayesian inference setting. This work strengthens the connection between flow-based generative modeling and Bayesian inference, offering a scalable and principled framework for constraint-driven and symmetry-aware generation.
Poster
Ruixiang Zhang · Shuangfei Zhai · Yizhe Zhang · James Thornton · Zijing Ou · Joshua M Susskind · Navdeep Jaitly
[ East Exhibition Hall A-B ]
Abstract
Discrete diffusion is a promising framework for modeling and generating discrete data. In this work, we present Target Concrete Score Matching (TCSM), a novel and versatile objective for training and fine-tuning discrete diffusion models. TCSM provides a general framework with broad applicability. It supports pre-training discrete diffusion models directly from data samples, and many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. Furthermore, the same TCSM objective extends to post-training of discrete diffusion models, including fine-tuning using reward functions or preference data, and distillation of knowledge from pre-trained autoregressive models. These new capabilities stem from the core idea of TCSM, estimating the concrete score of the target distribution, which resides in the original (clean) data space. This allows seamless integration with reward functions and pre-trained models, which inherently only operate in the clean data space rather than the noisy intermediate spaces of diffusion processes. Our experiments on language modeling tasks demonstrate that TCSM matches or surpasses current methods. Additionally, TCSM is versatile, applicable to both pre-training and post-training scenarios, offering greater flexibility and sample efficiency.
Poster
Kuheli Pratihar · Debdeep Mukhopadhyay
[ East Exhibition Hall A-B ]
Abstract
Variational Autoencoders (VAEs), a class of latent-variable generative models, have seen extensive use in high-fidelity synthesis tasks, yet their loss landscape remains poorly understood. Prior theoretical works on VAE loss analysis have focused on their latent-space representational capabilities, both in the optimal and limiting cases. Although these insights have guided better VAE designs, they also often restrict VAEs to problem settings where classical algorithms, such as Principal Component Analysis (PCA), can trivially guarantee globally optimal solutions. In this work, we push the boundaries of our understanding of VAEs beyond these traditional regimes to tackle NP-hard sparse inverse problems, for which no classical algorithms exist. Specifically, we examine the nontrivial Sparse Linear Regression (SLR) problem of recovering optimal sparse inputs in the presence of an ill-conditioned design matrix having correlated features. We provably show that, under a linear encoder-decoder architecture incorporating the product of the SLR design matrix with a trainable, sparsity-promoting diagonal matrix, any minimum of VAE loss is guaranteed to be an optimal solution. This property is especially useful for identifying (a) a preconditioning factor that reduces the eigenvalue spread, and (b) the corresponding optimal sparse representation. Lastly, our empirical analysis with different types of design matrices validates these …
Poster
Le Tuyet Nhi PHAM · Dario Shariatian · Antonio Ocello · Giovanni Conforti · Alain Oliviero Durmus
[ East Exhibition Hall A-B ]
Abstract
This paper introduces the Discrete Markov Probabilistic Model (DMPM), a novel algorithm for discrete data generation. The algorithm operates in discrete space, where the noising process is a continuous-time Markov chain that can be sampled exactly via a Poissonian clock that flips labels uniformly at random. The time-reversal process, like the forward noise process, is a jump process, with its intensity governed by a discrete analogue of the classical score function. Crucially, this intensity is proven to be the conditional expectation of a function of the forward process, strengthening its theoretical alignment with score-based generative models while ensuring robustness and efficiency. We further establish convergence bounds for the algorithm under minimal assumptions and demonstrate its effectiveness through experiments on low-dimensional Bernoulli-distributed datasets and high-dimensional binary MNIST data. The results highlight its strong performance in generating discrete structures. This work bridges theoretical foundations and practical applications, advancing the development of effective and theoretically grounded discrete generative modeling.
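A minimal sketch of the forward noising process described above, on binary data; the rate constant and helper names are illustrative:

```python
import numpy as np

def forward_noise(x0, t, lam=1.0, seed=None):
    """Sample X_t of the forward CTMC: each coordinate carries an independent
    rate-`lam` Poisson clock and is re-drawn uniformly from {0, 1} when it rings."""
    rng = np.random.default_rng(seed)
    n_rings = rng.poisson(lam * t, size=x0.shape)  # clock rings per coordinate up to time t
    x_t = x0.copy()
    touched = n_rings > 0                          # >=1 ring => coordinate is uniform on {0, 1}
    x_t[touched] = rng.integers(0, 2, size=int(touched.sum()))
    return x_t
```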
Poster
Dongwoo Lee · Dong Bok Lee · Steven Adriaensen · Juho Lee · Sung Ju Hwang · Frank Hutter · Seon Joo Kim · Hae Beom Lee
[ East Exhibition Hall A-B ]
Abstract
Scaling has been a major driver of recent advancements in deep learning. Numerous empirical studies have found that scaling laws often follow the power-law and proposed several variants of power-law functions to predict the scaling behavior at larger scales. However, existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications involving decision-making problems such as determining the expected performance improvements achievable by investing additional computational resources. In this work, we explore a Bayesian framework based on Prior-data Fitted Networks (PFNs) for neural scaling law extrapolation. Specifically, we design a prior distribution that enables the sampling of infinitely many synthetic functions resembling real-world neural scaling laws, allowing our PFN to meta-learn the extrapolation. We validate the effectiveness of our approach on real-world neural scaling laws, comparing it against both the existing point estimation methods and Bayesian approaches. Our method demonstrates superior performance, particularly in data-limited scenarios such as Bayesian active learning, underscoring its potential for reliable, uncertainty-aware extrapolation in practical applications.
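For context, a typical member of the power-law family referenced above is the saturating form (an illustrative parameterization, not necessarily the one used in the paper):

$$
\hat L(C) = a\,C^{-b} + c, \qquad a, b, c > 0,
$$

where $C$ is the scale variable (data, parameters, or compute) and $c$ the irreducible loss. A point-estimation method fits $(a, b, c)$ once and extrapolates deterministically, whereas the PFN-based approach outputs a predictive distribution over $L(C)$ at larger scales, which is what enables uncertainty-aware decisions.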
Poster
Chenghao Fan · zhenyi lu · Sichen Liu · Chengfeng Gu · Xiaoye Qu · Wei Wei · Yu Cheng
[ East Exhibition Hall A-B ]
Abstract
While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt the SVD prior in the LoRA MoE architecture. To mitigate these issues, we propose \underline{G}reat L\underline{o}R\underline{A} Mixture-of-Exper\underline{t} (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with the full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE’s efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT’s state-of-the-art performance, closing the gap with Full FT. Our code is available at: https://github.com/Facico/GOAT-PEFT
Poster
Chen Jin · Ryutaro Tanno · Amrutha Saseendran · Tom Diethe · Philip Teare
[ East Exhibition Hall A-B ]
Abstract
We introduce *Lavender*, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model’s visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples---2.5\% of typical large-scale SFT datasets---and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30\% gains and a 68\% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. Code, training data, and models are available on the [project page](https://astrazeneca.github.io/vlm/).
Poster
Weizhi Gao · Zhichao Hou · Junqi Yin · Feiyi Wang · Linyu Peng · Xiaorui Liu
[ East Exhibition Hall A-B ]
Abstract
Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at https://github.com/WeizhiGao/MoDiff.
Poster
Longlin Yu · Jiajun Zha · Tong Yang · Tianyu Xie · Xiangyu Zhang · Gary Chan · Cheng Zhang
[ East Exhibition Hall A-B ]
Abstract
Semi-implicit distributions have shown great promise in variational inference and generative modeling. Hierarchical semi-implicit models, which stack multiple semi-implicit layers, enhance the expressiveness of semi-implicit distributions and can be used to accelerate diffusion models given pretrained score networks. However, their sequential training often suffers from slow convergence. In this paper, we introduce CoSIM, a continuous semi-implicit model that extends hierarchical semi-implicit models into a continuous framework. By incorporating a continuous transition kernel, CoSIM enables efficient, simulation-free training. Furthermore, we show that CoSIM achieves consistency with a carefully designed transition kernel, offering a novel approach for multistep distillation of generative models at the distributional level. Extensive experiments on image generation demonstrate that CoSIM performs on par or better than existing diffusion model acceleration methods, achieving superior performance on FD-DINOv2.
Poster
Hai Ci · Yiren Song · Pei Yang · Jinheng Xie · Mike Zheng Shou
[ East Exhibition Hall A-B ]
Abstract
Watermarking is essential for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that embeds user-specified watermark information seamlessly during the diffusion generation process. Unlike previous methods that modify diffusion modules to incorporate watermarks, WMAdapter is designed to keep all diffusion components intact, resulting in sharp, artifact-free images. To achieve this, we introduce two key innovations: (1) We develop a contextual adapter that conditions on the content of the cover image to generate adaptive watermark embeddings. (2) We implement an additional finetuning step and a hybrid finetuning strategy that suppresses noticeable artifacts while preserving the integrity of the diffusion components. Empirical results show that WMAdapter provides strong flexibility, superior image quality, and competitive watermark robustness.
Poster
Yuyang Wang · Anurag Ranjan · Joshua M Susskind · Miguel Angel Bautista Martin
[ East Exhibition Hall A-B ]
Abstract
Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on irregular or unstructured data like 3D point clouds or protein structures. These models are commonly trained in two stages: first, a data compressor is trained, and in a subsequent stage a flow matching generative model is trained in the latent space of the data compressor. This two-stage paradigm creates obstacles to unifying models across data domains, as hand-crafted compressor architectures are used for different data modalities. To this end, we introduce INRFlow, a domain-agnostic approach to learning flow matching transformers directly in ambient space. Drawing inspiration from INRs, we introduce a conditionally independent point-wise training objective that enables INRFlow to make predictions continuously in coordinate space. Our empirical results demonstrate that INRFlow effectively handles different data modalities such as images, 3D point clouds, and protein structure data, achieving strong performance in different domains and outperforming comparable approaches. INRFlow is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.
Poster
Tingyu Zhu · Haoyu Liu · Ziyu Wang · Zhimin Jiang · Zeyu Zheng
[ East Exhibition Hall A-B ]
Abstract
Developing generative models to create or conditionally create symbolic music presents unique challenges due to the combination of limited data availability and the need for high precision in note pitch. To address these challenges, we introduce an efficient Fine-Grained Guidance (FGG) approach within diffusion models. FGG guides the diffusion models to generate music that aligns more closely with the control and intent of expert composers, which is critical to improve the accuracy, listenability, and quality of generated music. This approach empowers diffusion models to excel in advanced applications such as improvisation, and interactive music creation. We derive theoretical characterizations for both the challenges in symbolic music generation and the effects of the FGG approach. We provide numerical experiments and subjective evaluation to demonstrate the effectiveness of our approach. We have published a demo page to showcase performances, which enables real-time interactive generation.
Poster
Ziyue Zhang · Mingbao Lin · Shuicheng YAN · Rongrong Ji
[ East Exhibition Hall A-B ]
Abstract
This paper introduces EasyInv, an easy yet novel approach that significantly advances the field of DDIM Inversion by addressing the inherent inefficiencies and performance limitations of traditional iterative optimization methods. At the core of EasyInv is a refined strategy for approximating inversion noise, which is pivotal for enhancing the accuracy and reliability of the inversion process. By prioritizing the initial latent state, which encapsulates rich information about the original image, EasyInv steers clear of iteratively refining noise terms. Instead, we introduce a methodical aggregation of the latent state from the preceding time step with the current state, effectively increasing the influence of the initial latent state and mitigating the impact of noise. We illustrate that EasyInv is capable of delivering results that are on par with or exceed those of the conventional DDIM Inversion approach, especially under conditions where the model's precision is limited or computational resources are scarce. Concurrently, EasyInv offers an approximately threefold improvement in inference efficiency over off-the-shelf iterative optimization techniques. It can be easily combined with most existing inversion methods by adding only four lines of code. See code at https://github.com/potato-kitty/EasyInv.
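A minimal sketch of the latent-aggregation idea as we read it from the abstract; the blend weight `eta` and the exact update form are illustrative assumptions, not EasyInv's published code:

```python
def easyinv_step(z_prev, z_curr, eta=0.3):
    """Blend the preceding-step latent into the current one, boosting the
    influence of the (information-rich) initial latent and damping noise."""
    return eta * z_prev + (1.0 - eta) * z_curr
```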
Poster
Idan Achituve · Hai Victor Habi · Amir Rosenfeld · Arnon Netzer · Idit Diamant · Ethan Fetaya
[ East Exhibition Hall A-B ]
Abstract
In image processing, solving inverse problems is the task of finding plausible reconstructions of an image that was corrupted by some (usually known) degradation operator. Commonly, this process is done using a generative image model that can guide the reconstruction towards solutions that appear natural. The success of diffusion models over the last few years has made them a leading candidate for this task. However, the sequential nature of diffusion models makes this conditional sampling process challenging. Furthermore, since diffusion models are often defined in the latent space of an autoencoder, the encoder-decoder transformations introduce additional difficulties. To address these challenges, we suggest a novel sampling method based on sequential Monte Carlo (SMC) in the latent space of diffusion models. We name our method LD-SMC. We define a generative model for the data using additional auxiliary observations and perform posterior inference with SMC sampling based on a backward diffusion process. Empirical evaluations on ImageNet and FFHQ show the benefits of LD-SMC over competing methods in various inverse problem tasks and especially in challenging inpainting tasks.
Poster
Huayu Chen · Kai Jiang · Kaiwen Zheng · Jianfei Chen · Hang Su · Jun Zhu
[ East Exhibition Hall A-B ]
Abstract
Classifier-Free Guidance (CFG) has been a default technique in various visual generative models, yet it requires inference from both conditional and unconditional models during sampling. We propose to build visual models that are free from guided sampling. The resulting algorithm, Guidance-Free Training (GFT), matches the performance of CFG while reducing sampling to a single model, halving the computational cost. Unlike previous distillation-based approaches that rely on pretrained CFG networks, GFT enables training directly from scratch. GFT is simple to implement. It retains the same maximum likelihood objective as CFG and differs mainly in the parameterization of conditional models. Implementing GFT requires only minimal modifications to existing codebases, as most design choices and hyperparameters are directly inherited from CFG. Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free.
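For reference, the CFG rule that incurs the two model evaluations per sampling step (standard formulation, included for context):

$$
\tilde\epsilon(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr),
$$

with guidance scale $w$. GFT instead trains a single conditional model whose one evaluation per step directly matches the guided distribution, removing the unconditional pass.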
Poster
Christopher Wewer · Bartlomiej Pogodzinski · Bernt Schiele · Jan Eric Lenssen
[ East Exhibition Hall A-B ]
Abstract
We introduce Spatial Reasoning Models (SRMs), a framework to perform reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations of a set of unobserved variables, given observations of observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination in the case of complex distributions. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework allows us to report key findings about the importance of sequentialization in generation, the associated order, and the sampling strategies during training. It demonstrates, for the first time, that the order of generation can successfully be predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from <1% to >50%. Our [project website](https://geometric-rl.mpi-inf.mpg.de/srm/) provides additional videos, code, and the benchmark datasets.
Poster
Jeonghoon Kim · Byeongchan Lee · Cheonbok Park · Yeontaek Oh · Beomjun Kim · Taehwan Yoo · Seongjin Shin · Dongyoon Han · Jinwoo Shin · Kang Min Yoo
[ East Exhibition Hall A-B ]
Abstract
Selecting a layer normalization (LN) strategy that stabilizes training and speeds convergence in Transformers remains difficult, even for today's large language models (LLMs). We present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformers. Pre-LN and Post-LN have long dominated practice despite their limitations in large-scale training. However, several open-source models have recently begun silently adopting a third strategy without much explanation. This strategy places the normalization layers **peripherally** around sublayers, a design we term **Peri-LN**. While Peri-LN has demonstrated promising performance, its precise mechanisms and benefits remain largely unexplored. Our in-depth analysis delineates the distinct behaviors of LN strategies, showing how each placement shapes activation variance and gradient propagation. To validate our theoretical insights, we conduct extensive experiments on Transformers of up to $3.2$B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and greater convergence stability. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insight into the optimal placement of LN.
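A minimal PyTorch sketch of the three placements for a single residual sublayer `f` (attention or MLP). The Peri-LN variant shown here, which normalizes both the sublayer's input and its output before the residual addition, follows the verbal description above; all module details are illustrative.

```python
# Three LN placements around one residual sublayer f.
import torch
import torch.nn as nn

class PostLN(nn.Module):
    def __init__(self, f, d):
        super().__init__(); self.f, self.ln = f, nn.LayerNorm(d)
    def forward(self, x):
        return self.ln(x + self.f(x))           # normalize after the residual

class PreLN(nn.Module):
    def __init__(self, f, d):
        super().__init__(); self.f, self.ln = f, nn.LayerNorm(d)
    def forward(self, x):
        return x + self.f(self.ln(x))           # normalize the sublayer input

class PeriLN(nn.Module):
    def __init__(self, f, d):
        super().__init__()
        self.f = f
        self.ln_in, self.ln_out = nn.LayerNorm(d), nn.LayerNorm(d)
    def forward(self, x):
        # Normalize both input and output of the sublayer ("peripherally").
        return x + self.ln_out(self.f(self.ln_in(x)))

block = PeriLN(nn.Linear(64, 64), 64)
print(block(torch.randn(2, 8, 64)).shape)
```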
Poster
Ajay Jaiswal · Yifan Wang · Lu Yin · Shiwei Liu · Runjin Chen · Jiawei Zhao · Ananth Grama · Yuandong Tian · Zhangyang “Atlas” Wang
[ East Exhibition Hall A-B ]
Abstract
Large Language Model (LLM) weight matrices can often be expressed in low-rank format, with the potential to relax memory and compute resource requirements. Unlike previous works that pivot around developing novel matrix decomposition algorithms, in this work we focus on studying the emerging non-uniform low-rank properties across weight matrices in LLMs through the lens of stabilizing gradient subspaces. \textit{Firstly,} we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. \textit{Secondly,} we empirically establish a consequential relationship between the gradient dynamics and the low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize the performance drop due to compression. In view of that, we present \textit{Weight Low-Rank Projection} \textbf{(WeLore)}, which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. Going beyond a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient dynamics perspective illustrates that \textit{LRCs tend to have better finetuning capabilities} and their standalone finetuning can closely mimic (sometimes outperform) the training loss trajectory and performance of full-finetuning …
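As a simplified illustration of non-uniform low-rank compression, the sketch below picks the smallest rank per matrix that retains a target fraction of spectral energy, keeping a matrix dense whenever factorization would not save memory. This energy criterion is a stand-in for WeLore's gradient-subspace analysis, not the paper's actual rule.

```python
# Per-matrix rank selection by spectral energy; a stand-in criterion.
import torch

def low_rank_factors(W, energy=0.95):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    cum = torch.cumsum(S ** 2, 0) / torch.sum(S ** 2)
    r = int(torch.searchsorted(cum, torch.tensor(energy))) + 1
    if r * (W.shape[0] + W.shape[1]) >= W.numel():
        return None                      # keep dense: no memory saving (N-LRC)
    A = U[:, :r] * S[:r]                 # (m, r) factor
    B = Vh[:r]                           # (r, n) factor
    return A, B

W = torch.randn(512, 64) @ torch.randn(64, 512)   # synthetic low-rank-ish matrix
A, B = low_rank_factors(W)
print(A.shape, B.shape, (torch.norm(W - A @ B) / torch.norm(W)).item())
```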
Poster
Shan Zhang · Aotian Chen · Yanpeng Sun · Jindong Gu · Yi-Yu Zheng · Piotr Koniusz · Kai Zou · Anton Hengel · Yuan Xue
[ East Exhibition Hall A-B ]
Abstract
Mathematical diagrams have a distinctive structure. Standard feature transforms designed for natural images (e.g., CLIP) fail to process them effectively, limiting their utility in multimodal large language models (MLLMs). Current efforts to improve MLLMs have primarily focused on scaling mathematical visual instruction datasets and strengthening LLM backbones, yet fine‐grained visual recognition errors remain unaddressed. Our systematic evaluation of the visual grounding capabilities of state‐of‐the‐art MLLMs highlights that fine‐grained visual understanding remains a crucial bottleneck in visual mathematical reasoning (GPT-4o exhibits a 70% grounding error rate, and correcting these errors improves reasoning accuracy by 12%). We thus propose a novel approach featuring a geometrically‐grounded vision encoder and a feature router that dynamically selects between hierarchical visual feature maps. Our model accurately recognizes visual primitives and generates precise visual prompts aligned with the language model's reasoning needs. In experiments, PRIMITIVE-Qwen2.5-7B outperforms other 7B models by 12% on MathVerse and is on par with GPT-4V on MathVista. Our findings highlight the need for better fine‐grained visual integration in MLLMs. Code is available at github.com/AI4Math-ShanZhang/SVE-Math.
Poster
Jack Cai · Ammar Vora · Randolph Zhang · Mark O'Connor · Mark Jeffrey
[ East Exhibition Hall A-B ]
Abstract
As Large Language Models (LLMs) grow in size and context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, conventional tensor and data parallelism face diminishing returns when scaling across multiple devices. We propose a novel form of parallelism, attention-level speculative parallelism (ALSpec), which predicts self-attention outputs so that subsequent operations can be executed early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5x and improving end-to-end decode latency by up to 1.65x, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks but fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent's NPU devices, we scale LLM inference beyond current techniques, paving the way for faster inference in transformer models.
Spotlight Poster
Greyson Brothers
[ East Exhibition Hall A-B ]
Abstract
We investigate the design of pooling methods used to summarize the outputs of transformer embedding models, primarily motivated by reinforcement learning and vision applications. This work considers problems where a subset of the input vectors contains requisite information for a downstream task (signal) while the rest are distractors (noise). By framing pooling as vector quantization with the goal of minimizing signal loss, we demonstrate that the standard methods used to aggregate transformer outputs, AvgPool, MaxPool, and ClsToken, are vulnerable to performance collapse as the signal-to-noise ratio (SNR) of inputs fluctuates. We then show that an attention-based *adaptive pooling* method can approximate the signal-optimal vector quantizer within derived error bounds for any SNR. Our theoretical results are first validated by supervised experiments on a synthetic dataset designed to isolate the SNR problem, then generalized to standard relational reasoning, multi-agent reinforcement learning, and vision benchmarks with noisy observations, where transformers with adaptive pooling display superior robustness across tasks.
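A minimal sketch of attention-based adaptive pooling next to the AvgPool baseline: a learned query scores each token, so distractor (noise) tokens can be down-weighted instead of diluting the summary. This is a generic attention pool written under our own assumptions, not necessarily the paper's exact module.

```python
# Attention pooling with a learned query vs. uniform average pooling.
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q = nn.Parameter(torch.randn(d))    # learned pooling query
        self.k = nn.Linear(d, d)                 # key projection
    def forward(self, x):                        # x: (batch, tokens, d)
        scores = self.k(x) @ self.q / x.shape[-1] ** 0.5
        w = scores.softmax(dim=1)                # per-token weights
        return (w.unsqueeze(-1) * x).sum(dim=1)  # weighted summary vector

x = torch.randn(2, 16, 32)
print(AttnPool(32)(x).shape, x.mean(dim=1).shape)  # adaptive vs. AvgPool
```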
Poster
Mingzhen He · Ruikai Yang · Hanling Tian · Youmei Qiu · Xiaolin Huang
[ East Exhibition Hall A-B ]
Abstract
Graph Transformers (GTs) have emerged as a promising approach for graph representation learning. Despite their successes, the quadratic complexity of GTs limits scalability on large graphs due to their pair-wise computations. To fundamentally reduce the computational burden of GTs, we propose a primal-dual framework that interprets the self-attention mechanism on graphs as a dual representation. Based on this framework, we develop Primphormer, an efficient GT that leverages a primal representation with linear complexity. Theoretical analysis reveals that Primphormer serves as a universal approximator for functions on both sequences and graphs, while also retaining its expressive power for distinguishing non-isomorphic graphs. Extensive experiments on various graph benchmarks demonstrate that Primphormer achieves competitive empirical results while maintaining more favorable memory and computational costs.
Poster
Mike Heddes · Adel Javanmard · Kyriakos Axiotis · Thomas Fu · MohammadHossein Bateni · Vahab Mirrokni
[ East Exhibition Hall A-B ]
Abstract
Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible number of parameters (e.g., 0.2\%). Theoretical analysis confirms that DCA provides an improved trade-off between accuracy and model size when the ratio of collective layer ranks to the ambient dimension falls below a critical threshold.
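The sketch below illustrates the core residual-combination idea: rather than an implicit equal-weight sum over the residual stream, previous layer outputs are mixed with learnable, input-dependent weights. The gating design here is hypothetical, and the depth-wise cross-attention component of DCA is omitted.

```python
# Input-dependent combination of previous layer outputs; illustrative gating.
import torch
import torch.nn as nn

class CrossLayerCombine(nn.Module):
    def __init__(self, d, n_layers):
        super().__init__()
        self.gate = nn.Linear(d, n_layers)       # input-dependent weights
    def forward(self, history):                  # list of (batch, d) outputs
        h = torch.stack(history, dim=1)          # (batch, layers, d)
        w = self.gate(h[:, -1]).softmax(dim=-1)  # weights from latest state
        return (w.unsqueeze(-1) * h).sum(dim=1)  # weighted residual stream

hist = [torch.randn(4, 32) for _ in range(3)]
print(CrossLayerCombine(32, 3)(hist).shape)      # torch.Size([4, 32])
```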
Poster
Yunfeng Liao · Jiawen Guan · Xiucheng Li
[ East Exhibition Hall A-B ]
Abstract
Deep models have recently achieved remarkable performance in solving partial differential equations (PDEs). Previous methods have mostly focused on PDEs arising in Euclidean spaces, with less emphasis on general manifolds with rich geometry. Several proposals attempt to account for the geometry by exploiting the spatial coordinates, but they overlook the underlying intrinsic geometry of manifolds. In this paper, we propose a Curvature-aware Graph Attention for PDEs on manifolds that exploits important intrinsic geometric quantities such as curvature and the discrete gradient operator, realized via parallel transport and tensor fields on manifolds. To accelerate computation, we present three curvature-oriented graph embedding approaches, derive closed-form parallel transport equations, and develop a subtree partition method to promote parameter sharing. Our proposed curvature-aware attention can be used as a drop-in replacement for vanilla attention, and experiments show that it significantly improves the performance of existing methods for solving PDEs on manifolds. Our code is available at https://github.com/Supradax/CurvGT.
Poster
Mingi Jung · Saehyung Lee · Eunji Kim · Sungroh Yoon
[ East Exhibition Hall A-B ]
Abstract
Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.
Spotlight Poster
Dongyeop Lee · Kwanhee Lee · Jinseok Chung · Namhoon Lee
[ East Exhibition Hall A-B ]
Abstract
Sparsifying neural networks often suffers from seemingly inevitable performance degradation, and it remains challenging to restore the original performance despite much recent progress. Motivated by recent studies in robust optimization, we aim to tackle this problem by finding subnetworks that are both sparse and flat at the same time. Specifically, we formulate pruning as a sparsity-constrained optimization problem in which flatness is encouraged as an objective. We solve it explicitly via an augmented Lagrange dual approach and extend it further by proposing a generalized projection operation, resulting in novel pruning methods called SAFE and its extension, SAFE$^+$. Extensive evaluations on standard image classification and language modeling tasks reveal that SAFE consistently yields sparse networks with improved generalization performance, comparing competitively to well-established baselines. In addition, SAFE demonstrates resilience to noisy data, making it well-suited for real-world conditions.
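A toy sketch of the two ingredients SAFE couples: a sharpness-aware (SAM-style) perturbed gradient step to encourage flatness, followed by projection onto the sparsity constraint via top-k magnitude selection. The augmented Lagrange dual machinery and the generalized projection of SAFE$^+$ are omitted; this is a conceptual illustration only.

```python
# Flatness-encouraging step + projection onto a sparsity constraint.
import torch

def sam_sparse_step(w, loss_fn, k, lr=0.1, rho=0.05):
    g = torch.autograd.grad(loss_fn(w), w)[0]
    w_adv = w + rho * g / (g.norm() + 1e-12)          # ascend to nearby worst case
    g_sam = torch.autograd.grad(loss_fn(w_adv), w_adv)[0]
    w_new = w.detach() - lr * g_sam                   # flatness-aware descent
    keep = w_new.abs().topk(k).indices                # project: keep top-k weights
    mask = torch.zeros_like(w_new)
    mask[keep] = 1.0
    return (w_new * mask).requires_grad_()

w = torch.randn(20, requires_grad=True)
loss_fn = lambda v: ((v - 1.0) ** 2).sum()            # toy quadratic loss
for _ in range(5):
    w = sam_sparse_step(w, loss_fn, k=5)
print((w != 0).sum().item())                          # 5 nonzeros remain
```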
Poster
HUAKUN LUO · Haixu Wu · Hang Zhou · Lanxiang Xing · Yichen Di · Jianmin Wang · Mingsheng Long
[ East Exhibition Hall A-B ]
Abstract
Although deep models have been widely explored in solving partial differential equations (PDEs), previous works are primarily limited to data with only up to tens of thousands of mesh points, far from the million-point scale required by industrial simulations that involve complex geometries. In the spirit of advancing neural PDE solvers to real industrial applications, we present Transolver++, a highly parallel and efficient neural solver that can accurately solve PDEs on million-scale geometries. Building upon previous advancements in solving PDEs by learning physical states via Transolver, Transolver++ is further equipped with an extremely optimized parallelism framework and a local adaptive mechanism to efficiently capture eidetic physical states from massive mesh points, successfully tackling the thorny challenges in computation and physics learning when scaling up the input mesh size. Transolver++ increases the single-GPU input capacity to million-scale points for the first time and is capable of continuously scaling the input size with linear complexity by adding GPUs. Experimentally, Transolver++ yields a 13\% relative improvement across six standard PDE benchmarks and achieves over 20\% performance gains in million-scale high-fidelity industrial simulations, whose sizes are 100$\times$ larger than previous benchmarks, covering car and 3D aircraft designs.
Poster
Chao Ma · Wenbo Gong · Meyer Scetbon · Edward Meeds
[ East Exhibition Hall A-B ]
Abstract
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD gradients using normalization and whitening in a stateless manner can achieve performance similar to Adam for LLM training, while maintaining the same memory footprint as SGD. Specifically, we show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN achieves a ~50% reduction in total end-to-end memory compared to Adam. Under the memory-efficient LLaMA training benchmark of Zhao et al. (2024a), SWAN reaches the same evaluation perplexity using half as many tokens for the 350M and 1.3B models.
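A stateless gradient-preprocessing sketch in the spirit described above: normalize the gradient's rows, then whiten it using only the current gradient (no running moments) before an SGD update. The eigendecomposition-based whitening is chosen for clarity and is not claimed to be SWAN's exact operator.

```python
# Stateless preprocess-then-SGD step: normalization followed by whitening.
import torch

def preprocess(G, eps=1e-8):
    G = G / (G.norm(dim=1, keepdim=True) + eps)       # normalization step
    # Whitening: multiply by (G G^T)^{-1/2} to decorrelate the rows.
    C = G @ G.T + eps * torch.eye(G.shape[0])
    evals, evecs = torch.linalg.eigh(C)
    C_inv_sqrt = evecs @ torch.diag(evals.clamp_min(eps).rsqrt()) @ evecs.T
    return C_inv_sqrt @ G

W = torch.randn(8, 16, requires_grad=True)
loss = (W ** 2).sum()
loss.backward()
W.data -= 0.1 * preprocess(W.grad)                    # no optimizer state kept
```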
Poster
Vladimir Braverman · Jon C. Ergun · Chen Wang · Samson Zhou
[ East Exhibition Hall A-B ]
Abstract
Hierarchical clustering (HC) is an important data analysis technique in which the goal is to recursively partition a dataset into a tree-like structure while grouping together similar data points at each level of granularity. Unfortunately, for many of the proposed HC objectives, there exist strong hardness-of-approximation barriers for efficient approximation algorithms. We consider the problem of hierarchical clustering given auxiliary information from natural oracles in the learning-augmented framework. Our main results are algorithms that, given learning-augmented oracles, compute efficient approximate HC trees for the celebrated Dasgupta's and Moseley-Wang objectives that overcome known hardness barriers.
Poster
Ilias Diakonikolas · Daniel Kane · Jasper Lee · Thanasis Pittas · David Woodruff · Samson Zhou
[ East Exhibition Hall A-B ]
Abstract
We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $O\left(\alpha\log n\log\log n+\frac{\sqrt{\beta}}{\varepsilon^2} \log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
Poster
Haoming Jing · Yorie Nakahira
[ East Exhibition Hall A-B ]
Abstract
Many systems contain latent variables that make their dynamics partially unidentifiable or cause distribution shifts in the observed statistics between offline and online data. However, existing control techniques often assume access to complete dynamics or perfect simulators with fully observable states, which are necessary to verify whether the system remains within a safe set (forward invariance) or safe actions are consistently feasible at all times. To address this limitation, we propose a technique for designing probabilistic safety certificates for systems with latent variables. A key technical enabler is the formulation of invariance conditions in probability space, which can be constructed using observed statistics in the presence of distribution shifts due to latent variables. We use this invariance condition to construct a safety certificate that can be implemented efficiently in real-time control. The proposed safety certificate can continuously find feasible actions that control long-term risk to stay within tolerance. Stochastic safe control and (causal) reinforcement learning have been studied in isolation until now. To the best of our knowledge, the proposed work is the first to use causal reinforcement learning to quantify long-term risk for the design of safety certificates. This integration enables safety certificates to efficiently ensure long-term safety in …
Poster
Francesco Cagnetta · Hyunmo Kang · Matthieu Wyart
[ East Exhibition Hall A-B ]
Abstract
Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into units that are power-law distributed. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars—probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules’ distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the fine details of the learning curve, but not the exponent describing the large-scale behaviour.
Poster
Mahendra Singh Thapa · Rui Li
[ East Exhibition Hall A-B ]
Abstract
Personalized federated learning (PFL) based on the Bayesian approach tackles the challenges arising from the statistical heterogeneity of client data by computing a personalized posterior distribution over the parameters of each client's local model and constructing a global distribution by aggregating the parameters of these personalized posteriors. However, heuristic aggregation methods introduce strong biases and result in global models with poor generalization. We thus propose a novel hierarchical Bayesian inference framework for PFL that specifies a conjugate hyper-prior over the parameters of the personalized posteriors. This allows us to jointly compute a global posterior distribution for aggregation and the personalized ones at the local level. This hierarchical Bayesian inference framework achieves an elegant balance between local personalization and global model robustness. An extensive empirical study shows that by effectively sharing heterogeneous statistical strength across the local models while retaining their distinctive characteristics, our framework yields state-of-the-art performance. We also show that existing Bayesian PFL methods are special cases of our framework.
Poster
Jinpeng Chen · Runmin Cong · Yuzhi Zhao · Hongzheng Yang · Guangneng Hu · Horace Ip · Sam Kwong
[ East Exhibition Hall A-B ]
Abstract
Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to incrementally learn new tasks without catastrophic forgetting, thus adapting to evolving requirements. In this paper, we explore the forgetting caused by such incremental training, categorizing it into superficial forgetting and essential forgetting. Superficial forgetting refers to cases where the model’s knowledge may not be genuinely lost, but its responses to previous tasks deviate from expected formats due to the influence of subsequent tasks’ answer styles, making the results unusable. On the other hand, essential forgetting refers to situations where the model provides correctly formatted but factually inaccurate answers, indicating a true loss of knowledge. Assessing essential forgetting necessitates addressing superficial forgetting first, as severe superficial forgetting can conceal the model’s knowledge state. Hence, we first introduce the Answer Style Diversification (ASD) paradigm, which defines a standardized process for data style transformations across different tasks, unifying their training sets into similarly diversified styles to prevent superficial forgetting caused by style shifts. Building on this, we propose RegLoRA to mitigate essential forgetting. RegLoRA stabilizes key parameters where prior knowledge is primarily stored by applying regularization to LoRA’s weight update matrices, enabling the model to retain existing competencies while …
Poster
Pranav Nair · Puranjay Datta · Jeff Dean · Prateek Jain · Aditya Kusupati
[ East Exhibition Hall A-B ]
Abstract
Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models – especially to low precisions like int4 or int2 – requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, leveraging MatQuant's co-training and co-distillation, int2 precision models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms respectively. Finally, we demonstrate that by using an extra bit to represent outliers, a model with an effective precision of 2.05-bit improves further by 6% with OmniQuant as the base algorithm.
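The nesting property is easy to see in code: the most significant bits of an int8 code are themselves a valid lower-precision code, so int4 and int2 weights can be sliced out of a single int8 model. The co-training that makes all slices accurate is MatQuant's contribution and is not shown here.

```python
# Slicing nested int4/int2 codes out of int8 codes via arithmetic shifts.
import numpy as np

w8 = np.array([-83, -5, 17, 102], dtype=np.int8)   # int8 weight codes

def slice_bits(w8, bits):
    # Keep the `bits` most significant bits; arithmetic shift preserves sign.
    return (w8.astype(np.int32) >> (8 - bits)).astype(np.int8)

w4, w2 = slice_bits(w8, 4), slice_bits(w8, 2)
print(w4, w2)   # nested int4 / int2 codes from the same int8 model
```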
Poster
Juyun Wee · Minjae Park · Jaeho Lee
[ East Exhibition Hall A-B ]
Abstract
Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent---a block that is crucial for a task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (**P**rompt-ro**u**ted **D**ynamic **D**epth Prun**ing**), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models, and achieves better on-task performance than static depth pruning baselines.
Poster
Lan Li · Da-Wei Zhou · Han-Jia Ye · De-Chuan Zhan
[ East Exhibition Hall A-B ]
Abstract
Domain-Incremental Learning (DIL) focuses on continual learning in non-stationary environments, requiring models to adjust to evolving domains while preserving historical knowledge. DIL faces two critical challenges in the context of imbalanced data: intra-domain class imbalance and cross-domain class distribution shifts. These challenges significantly hinder model performance, as intra-domain imbalance leads to underfitting of few-shot classes, while cross-domain shifts require maintaining well-learned many-shot classes and transferring knowledge to improve few-shot class performance in old domains. To overcome these challenges, we introduce the Dual-Balance Collaborative Experts (DCE) framework. DCE employs a frequency-aware expert group, where each expert is guided by specialized loss functions to learn features for specific frequency groups, effectively addressing intra-domain class imbalance. Subsequently, a dynamic expert selector is learned by synthesizing pseudo-features through balanced Gaussian sampling from historical class statistics. This mechanism navigates the trade-off between preserving many-shot knowledge of previous domains and leveraging new data to improve few-shot class performance in earlier tasks. Extensive experimental results on four benchmark datasets demonstrate DCE’s state-of-the-art performance.
Poster
Juncheol Shin · Minsang Seok · Seonggon Kim · Eunhyeok Park
[ East Exhibition Hall A-B ]
Abstract
Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, new challenges arise when it is applied to practical scenarios such as quantized models. In practice, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization method, HDRQ - Hessian and distant regularizing quantization - designed with model merging for multi-target domain adaptation in mind. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study of this challenge, and extensive experiments confirm its effectiveness.
Poster
Lorenzo Magnino · Yuchen Zhu · Mathieu Lauriere
[ East Exhibition Hall A-B ]
Abstract
Optimal stopping is a fundamental problem in optimization with applications in risk management, finance, robotics, and machine learning. We extend the standard framework to a multi-agent setting, named multi-agent optimal stopping (MAOS), where agents cooperate to make optimal stopping decisions in a finite-space, discrete-time environment. Since solving MAOS becomes computationally prohibitive when the number of agents is very large, we study the mean-field optimal stopping (MFOS) problem, obtained as the number of agents tends to infinity. We establish that MFOS provides a good approximation to MAOS and prove a dynamic programming principle (DPP) based on mean-field control theory. We then propose two deep learning approaches: one that learns optimal stopping decisions by simulating full trajectories and another that leverages the DPP to compute the value function and learn the optimal stopping rule via backward induction. Both methods train neural networks to approximate optimal stopping policies. We demonstrate the effectiveness and scalability of our work through numerical experiments on 6 different problems in spatial dimension up to 300. To the best of our knowledge, this is the first work to formalize and computationally solve MFOS in discrete time and finite space, opening new directions for scalable MAOS methods.
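For the single-agent, finite-space special case, the DPP reduces to a short backward recursion: the value at each step is the maximum of the stopping reward and the expected continuation value. The toy numbers below are illustrative; the paper's methods replace this tabular recursion with neural approximations in the mean-field setting.

```python
# Backward induction for tabular optimal stopping (single-agent special case).
import numpy as np

n_states, horizon = 5, 10
P = np.full((n_states, n_states), 1.0 / n_states)   # transition matrix
stop_reward = np.linspace(0.0, 1.0, n_states)       # reward if we stop now

V = stop_reward.copy()                              # value at the final time
for t in reversed(range(horizon)):
    continuation = P @ V                            # expected value of waiting
    V = np.maximum(stop_reward, continuation)       # DPP backward step

print(V)   # optimal value; stop wherever stop_reward >= continuation
```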
Poster
Lan-Cuong Nguyen · Quan Nguyen-Tri · Bang Khanh · Dung D. Le · Long Tran-Thanh · Khoat Than
[ East Exhibition Hall A-B ]
Abstract
Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often face performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, *our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability*. Building upon this framework, we propose a novel theoretically grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.
Poster
Andrei Panferov · Jiale Chen · Rush Tabesh · Mahdi Nikdan · Dan Alistarh
[ East Exhibition Hall A-B ]
Abstract
One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by *directly training* over such representations, i.e., *Quantization-Aware Training (QAT)*, is still open: for example, a recent study put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state-of-the-art via a new method called QuEST, for which we demonstrate optimality at 4-bits and stable convergence as low as 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new *trust gradient estimator* based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that …
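A simplified sketch of the first ingredient: rotate a weight vector with an orthonormal Hadamard transform (which tends to Gaussianize the distribution), quantize with a scale chosen by a small grid search to minimize MSE, then rotate back. The trust gradient estimator and QuEST's exact fitting procedure are not reproduced here.

```python
# Hadamard rotation + MSE-driven scale search for low-bit quantization.
import numpy as np
from scipy.linalg import hadamard

def hadamard_rotate(w):                 # len(w) must be a power of two
    H = hadamard(len(w)) / np.sqrt(len(w))   # orthonormal Hadamard matrix
    return H @ w, H

def mse_optimal_quantize(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    best = None
    for s in np.linspace(0.5, 1.5, 21) * (np.abs(w).max() / qmax):
        q = np.clip(np.round(w / s), -qmax - 1, qmax) * s   # quantize-dequantize
        err = np.mean((w - q) ** 2)
        if best is None or err < best[0]:
            best = (err, q)
    return best[1]

w = np.random.randn(64)
w_rot, H = hadamard_rotate(w)
w_hat = H.T @ mse_optimal_quantize(w_rot)   # rotate back after quantizing
print(np.mean((w - w_hat) ** 2))
```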
Poster
Jinbo Wang · Mingze Wang · Zhanpeng Zhou · Junchi Yan · Weinan E · Lei Wu
[ East Exhibition Hall A-B ]
Abstract
Transformers have become the cornerstone of modern AI. Unlike traditional architectures, transformers exhibit a distinctive characteristic: diverse types of building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feed-forward networks, work collaboratively. Understanding the disparities and interactions among these blocks is therefore important. In this paper, we uncover a clear **sharpness disparity** across these blocks, which intriguingly emerges early in training and persists throughout the training process. Building on this insight, we propose a novel **Blockwise Learning Rate (LR)** strategy to accelerate large language model (LLM) pre-training. Specifically, by integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. This improvement is demonstrated across GPT-2 and LLaMA models, with model sizes ranging from 0.12B to 1.1B and datasets including OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory savings. These results underscore the potential of leveraging the sharpness disparity principle to improve LLM training.
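Implementing a blockwise LR on top of AdamW needs only parameter groups, as in the sketch below; the per-block ratios here are hypothetical placeholders, not the tuned values from the paper.

```python
# Blockwise learning rates via AdamW parameter groups.
import torch
import torch.nn as nn

model = nn.ModuleDict({
    "embed": nn.Embedding(1000, 64),
    "attn": nn.Linear(64, 64),
    "mlp": nn.Linear(64, 256),
})
base_lr = 3e-4
block_scale = {"embed": 1.0, "attn": 2.0, "mlp": 2.0}  # hypothetical ratios

groups = [
    {"params": module.parameters(), "lr": base_lr * block_scale[name]}
    for name, module in model.items()
]
opt = torch.optim.AdamW(groups, weight_decay=0.1)
print([g["lr"] for g in opt.param_groups])             # one LR per block type
```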
Poster
Judy Hanwen Shen
[ East Exhibition Hall A-B ]
Abstract
Foundation models that are capable of automating cognitive tasks represent a pivotal technological shift, yet their societal implications remain unclear. These systems promise exciting advances, yet they also risk flooding our information ecosystem with formulaic, homogeneous, and potentially misleading synthetic content. Developing benchmarks grounded in real use cases where these risks are most significant is therefore critical. Through a thematic analysis of 2 million language model user prompts, we identify *creative composition tasks* as a prevalent usage category where users seek help with personal tasks that require everyday creativity. Our fine-grained analysis identifies mismatches between current benchmarks and usage patterns among these tasks. Crucially, we argue that the same use cases that currently lack thorough evaluations can lead to negative downstream impacts. This position paper argues that benchmarks focused on creative composition tasks are a necessary step towards understanding the societal harms of AI-generated content. We call for greater transparency in usage patterns to inform the development of new benchmarks that can effectively measure both the progress and the impacts of models with creative capabilities.
Poster
Lionel Wong · Ayman Ali · Raymond M Xiong · Shannon Shen · Yoon Kim · Monica Agrawal
[ East Exhibition Hall A-B ]
Abstract
Patients have long sought health information online, and increasingly, they are turning to generative AI to answer their health-related queries. Given the high stakes of the medical domain, techniques like retrieval-augmented generation and citation grounding have been widely promoted as methods to reduce hallucinations and improve the accuracy of AI-generated responses and have been widely adopted into search engines. However, we argue that even when these methods produce literally accurate content drawn from source documents sans hallucinations, they can still be highly misleading. Patients may derive significantly different interpretations from AI-generated outputs than they would from reading the original source material, let alone consulting a knowledgeable clinician. Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems. In particular, we highlight how these models tend to decontextualize facts, omit critical relevant sources, and reinforce patient misconceptions or biases. We propose a series of recommendations---such as the incorporation of communication pragmatics and enhanced comprehension of source documents---that could help mitigate these issues and extend beyond the medical domain.
Poster
Yan Shvartzshnaider · Vasisht Duddu
[ East Exhibition Hall A-B ]
Abstract
The machine learning community is discovering Contextual Integrity (CI) as a useful framework to assess the privacy implications of large language models (LLMs). This is an encouraging development. The CI theory emphasizes sharing information in accordance with *privacy norms* and can bridge the social, legal, political, and technical aspects essential for evaluating privacy in LLMs. However, this is also a good point to reflect on the use of CI for LLMs. *This position paper argues that existing literature inadequately applies CI to LLMs without embracing the theory's fundamental tenets.* Inadequate applications of CI could lead to incorrect conclusions and flawed privacy-preserving designs. We clarify the four fundamental tenets of CI theory, systematize prior work according to whether it deviates from these tenets, and highlight overlooked issues in experimental hygiene for LLMs (e.g., prompt sensitivity, positional bias).
Poster
Hanna Wallach · Meera Desai · A. Feder Cooper · Angelina Wang · Chad Atalla · Solon Barocas · Su Lin Blodgett · Alexandra Chouldechova · Emily Corvi · P. Alex Dow · Jean Garcia-Gathright · Alexandra Olteanu · Nicholas Pangakis · Stefanie Reed · Emily Sheng · Dan Vann · Jennifer Wortman Vaughan · Matthew Vogel · Hannah Washington · Abigail Z. Jacobs
[ East Exhibition Hall A-B ]
Abstract
The measurement tasks involved in evaluating generative AI (GenAI) systems lack sufficient scientific rigor, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges comparisons" (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI systems. This framework has two important implications: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating validity.
Poster
Frithjof Gressmann · Ashley Chen · Lily Xie · Nancy Amato · Lawrence Rauchwerger
[ East Exhibition Hall A-B ]
Abstract
Recent advances in bioengineering have enabled the creation of biological neural networks in vitro, significantly reducing the cost, ethical hurdles, and complexity of experimentation with genuine biological neural computation. In this position paper, we argue that this trend offers a unique and timely opportunity to put our understanding of neural computation to the test. By designing artificial neural networks that can interact and control living neural systems, it is becoming possible to validate computational models beyond simulation and gain empirical insights to help unlock more robust and energy-efficient next-generation AI systems. We provide an overview of key technologies, challenges, and principles behind this development and describe strategies and opportunities for novel machine learning research in this emerging field. We also discuss implications and fundamental questions that could be answered as this technology advances, exemplifying the longer-term impact of increasingly sophisticated in vitro neural networks.
Poster
Michael Kirchhof · Gjergji Kasneci · Enkelejda Kasneci
[ East Exhibition Hall A-B ]
Abstract
Large-language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: Underspecification uncertainties, for when users do not provide all information or define the exact task at the first go, interactive learning, to ask follow-up questions and reduce the uncertainty about the current context, and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and …
Poster
Yan Zhuang · Qi Liu · Zachary Pardos · Patrick Kyllonen · Jiyun Zu · Zhenya Huang · Shijin Wang · Enhong Chen
[ East Exhibition Hall A-B ]
Abstract
As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically testing against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Position, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark and tailoring each model's evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that *psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations*.
Poster
Taesoo Kim · Jinju Kim · Dongchan Kim · Jong Hwan Ko · Gyeong-Moon Park
[ East Exhibition Hall A-B ]
Abstract
The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite the threats to voice privacy, research on selectively removing the knowledge needed to replicate unwanted individual voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, in particular Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers' voices, assuring unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF), which assesses the model's ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. The experiments conducted on the state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers' voices while maintaining high quality for other speakers. The demo is available at https://speechunlearn.github.io/ .
Poster
Ethan Rathbun · Alina Oprea · Christopher Amato
[ East Exhibition Hall A-B ]
Abstract
Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks. The objectives of these attacks are twofold: induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks assume arbitrary control over the agent's rewards, inducing values far outside the environment's natural constraints. This results in brittle attacks that fail once the proper reward constraints are enforced. Thus, in this work we propose a new class of backdoor attacks against DRL which are the first to achieve state-of-the-art performance under strict reward constraints. These "inception" attacks manipulate the agent's training data -- inserting the trigger into prior observations and replacing high-return actions with those of the targeted adversarial behavior. We formally define these attacks and prove they achieve both adversarial objectives against arbitrary Markov Decision Processes (MDPs). Using this framework we devise an online inception attack which achieves a 100% attack success rate on multiple environments under constrained rewards while minimally impacting the agent's task performance.
Spotlight Poster
Charita Dellaporta · Patrick O'Hara · Theodoros Damoulas
[ East Exhibition Hall A-B ]
Abstract
Decision making under uncertainty is challenging as the data-generating process (DGP) is often unknown. Bayesian inference proceeds by estimating the DGP through posterior beliefs on the model’s parameters. However, minimising the expected risk under these beliefs can lead to suboptimal decisions due to model uncertainty or limited, noisy observations. To address this, we introduce Distributionally Robust Optimisation with Bayesian Ambiguity Sets (DRO-BAS) which hedges against model uncertainty by optimising the worst-case risk over a posterior-informed ambiguity set. We provide two such sets, based on the posterior expectation (DRO-BAS(PE)) or the posterior predictive (DRO-BAS(PP)) and prove that both admit, under conditions, strong dual formulations leading to efficient single-stage stochastic programs which are solved with a sample average approximation. For DRO-BAS(PE), this covers all conjugate exponential family members while for DRO-BAS(PP) this is shown under conditions on the predictive's moment generating function. Our DRO-BAS formulations outperform existing Bayesian DRO on the Newsvendor problem and achieve faster solve times with comparable robustness on the Portfolio problem.
Poster
Yu-Wei Chen · Raghu Pasupathy · Jordan A Awan
[ East Exhibition Hall A-B ]
Abstract
This work identifies the first privacy-aware stratified sampling scheme that minimizes the variance for general private mean estimation under the Laplace, Discrete Laplace (DLap) and Truncated-Uniform-Laplace (TuLap) mechanisms within the framework of differential privacy (DP). We view stratified sampling as a subsampling operation, which amplifies the privacy guarantee; however, to have the same final privacy guarantee for each group, different nominal privacy budgets need to be used depending on the subsampling rate. Ignoring the effect of DP, traditional stratified sampling strategies risk significant variance inflation. We phrase our optimal survey design as an optimization problem, where we determine the optimal subsampling sizes for each group with the goal of minimizing the variance of the resulting estimator. We establish strong convexity of the variance objective, propose an efficient algorithm to identify the integer-optimal design, and offer insights on the structure of the optimal design.
Poster
Sebastin Santy · Prasanta Bhattacharya · Manoel Ribeiro · Kelsey Allen · Sewoong Oh
[ East Exhibition Hall A-B ]
Abstract
Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content -- it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors' intrinsic motivations -- rather than relying solely on external incentives -- can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.
Spotlight Poster
Kirsten Morehouse · Siddharth Swaroop · Weiwei Pan
[ East Exhibition Hall A-B ]
Abstract
The proliferation of LLM bias probes introduces three challenges: we lack (1) principled criteria for selecting appropriate probes, (2) a system for reconciling conflicting results across probes, and (3) formal frameworks for reasoning about when and why experimental findings will generalize to real user behavior. In response, we propose a systematic approach to LLM social bias probing, drawing on insights from the social sciences. Central to this approach is EcoLevels---a novel framework that helps (a) identify appropriate bias probes, (b) reconcile conflicting results, and (c) generate predictions about bias generalization. We ground our framework in the social sciences, as many LLM probes are adapted from human studies, and these fields have faced similar challenges when studying bias in humans. Finally, we outline five lessons that demonstrate how LLM bias probing can (and should) benefit from decades of social science research.
Spotlight Poster
Seungwook Han · Jyothish Pari · Samuel Gershman · Pulkit Agrawal
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) have demonstrated impressive real-world utility, exemplifying artificial useful intelligence (AUI). However, their ability to reason adaptively and robustly -- the hallmarks of artificial general intelligence (AGI) -- remains fragile. While LLMs seemingly succeed in commonsense reasoning, programming, and mathematics, they struggle to generalize algorithmic understanding across novel contexts. Our experiments with algorithmic tasks in esoteric programming languages reveal that the LLM's reasoning overfits to the training data and is limited in its transferability. We hypothesize that the core issue underlying such limited transferability is the coupling of reasoning and knowledge in LLMs. To transition from AUI to AGI, we propose disentangling knowledge and reasoning through three key directions: (1) pretraining to reason using RL from scratch as an alternative to the widely used next-token prediction pretraining, (2) using a curriculum of synthetic tasks to ease the learning of a \textit{reasoning prior} for RL that can then be transferred to natural language tasks, and (3) learning more generalizable reasoning functions using a small context window to reduce exploiting spurious correlations between tokens. Such a reasoning system coupled with a trained retrieval system and a large external memory bank as a knowledge store can overcome several limitations of existing …
Poster
Xuandong Zhao · Xianjun Yang · Tianyu Pang · Chao Du · Lei Li · Yu-Xiang Wang · William Wang
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) are vulnerable to jailbreak attacks -- resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the **weak-to-strong** jailbreaking attack, an efficient inference time attack for aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models only differ in their initial decoding distributions. The weak-to-strong attack's key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model's decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99\% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong.
Poster
Ada Tur · Nicholas Meade · Xing Han Lù · Alejandra Zambrano · Arkil Patel · Esin Durmus · Spandana Gella · Karolina Stanczak · Siva Reddy
[ East Exhibition Hall A-B ]
Abstract
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, a benchmark focused on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories---misinformation, illegal activity, harassment, cybercrime, and social bias---designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework, which categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents.
Poster
Hanxun Huang · Sarah Erfani · Yige Li · Xingjun Ma · James Bailey
[ East Exhibition Hall A-B ]
Abstract
As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce **X-Transfer**, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as **super transferability**—a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through **surrogate scaling**, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models.
Poster
Zhipeng Wei · Yuqi Liu · N. Benjamin Erichson
[ East Exhibition Hall A-B ]
Abstract
Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted output, posing a potential threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller sub-tokens. This alters the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters, emojis also introduce semantic ambiguity, making them particularly effective in this attack. Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the unsafe prediction rate, bypassing existing safeguards.
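A toy version of the perturbation itself, assuming a fixed mid-word insertion rule (the paper selects insertions via in-context learning rather than a fixed rule):

```python
def insert_emojis(text: str, emoji: str = "\U0001F600") -> str:
    # Split each word with an emoji so the judge's tokenizer produces
    # unfamiliar sub-tokens, shifting the whole sequence embedding.
    perturbed = []
    for word in text.split():
        mid = len(word) // 2
        perturbed.append(word[:mid] + emoji + word[mid:])
    return " ".join(perturbed)
```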
Poster
Kazuki Egashira · Robin Staab · Mark Vero · Jingxuan He · Martin Vechev
[ East Exhibition Hall A-B ]
Abstract
With the increasing size of frontier LLMs, post-training quantization has become the standard for memory-efficient deployment. Recent work has shown that basic rounding-based quantization schemes pose security risks, as they can be exploited to inject malicious behaviors into quantized models that remain hidden in full precision. However, existing attacks cannot be applied to more complex quantization methods, such as the GGUF family used in the popular ollama and llama.cpp frameworks. In this work, we address this gap by introducing the first attack on GGUF. Our key insight is that the quantization error -- the difference between the full-precision weights and their (de-)quantized version -- provides sufficient flexibility to construct malicious quantized models that appear benign in full precision. Leveraging this, we develop an attack that trains the target malicious LLM while constraining its weights based on quantization errors. We demonstrate the effectiveness of our attack on three popular LLMs across nine GGUF quantization data types and three diverse attack scenarios: insecure code generation ($\Delta$=$88.7\%$), targeted content injection ($\Delta$=$85.0\%$), and benign instruction refusal ($\Delta$=$30.1\%$). Our attack highlights that (1) the most widely used post-training quantization method is susceptible to adversarial interference, and (2) the complexity of quantization schemes alone is insufficient …
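A hedged sketch of the interval-constraint idea: if each weight may move within some half-width of its dequantized value without changing the quantized model, alternating training updates with this projection keeps the quantized (malicious) behavior fixed while the full-precision weights are steered elsewhere. How the half-widths are derived from GGUF quantization errors is an assumption here:

```python
import torch

@torch.no_grad()
def project_into_quantization_interval(param: torch.Tensor,
                                       dequantized: torch.Tensor,
                                       half_width: torch.Tensor) -> None:
    # Clamp each full-precision weight into the interval around its
    # dequantized value within which the quantized model is unchanged;
    # half_width plays the role of the per-weight quantization error.
    param.clamp_(min=dequantized - half_width,
                 max=dequantized + half_width)
```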
Poster
Mahavir Dabas · Si Chen · Charles Fleming · Ming Jin · Ruoxi Jia
[ East Exhibition Hall A-B ]
Abstract
Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To address this, we introduce **ACTOR** (Activation-Based Training for Over-Refusal Reduction), a robust, compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model’s ability to handle harmful queries and preserving overall utility.
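An illustrative first step for activation-based refusal control, assuming a difference-of-means estimate of a refusal direction (ACTOR's precise identification and adjustment procedure differs):

```python
import torch

def refusal_direction(acts_overrefused: torch.Tensor,
                      acts_complied: torch.Tensor) -> torch.Tensor:
    # Difference of mean activations between benign prompts the model
    # wrongly refuses and benign prompts it answers, as a unit vector.
    d = acts_overrefused.mean(dim=0) - acts_complied.mean(dim=0)
    return d / d.norm()
```

Fine-tuning a single layer would then target the activation components with the largest projection onto such a direction.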
Poster
Jing-Jing Li · Valentina Pyatkin · Max Kleiman-Weiner · Liwei Jiang · Nouha Dziri · Anne Collins · Jana Schaich Borg · Maarten Sap · Yejin Choi · Sydney Levine
[ East Exhibition Hall A-B ]
Abstract
The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial *actions* and *effects* the AI behavior may lead to, along with *likelihood*, *severity*, and *immediacy* labels that describe potential impacts on *stakeholders*. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On comprehensive benchmarks, we show that SafetyAnalyst (average F1=0.81) outperforms existing moderation systems (average F1$<$0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
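A toy aggregation consistent with the description, assuming multiplicative likelihood-severity scoring and per-category weights; the paper's 28-parameter scheme with immediacy and stakeholder labels is richer than this sketch:

```python
def harmfulness_score(effects: list[dict], weights: dict[str, float]) -> float:
    # Each leaf of the harm-benefit tree contributes likelihood x severity,
    # signed by whether it is a harm or a benefit and scaled by an
    # interpretable per-category weight that users can re-tune.
    score = 0.0
    for e in effects:  # e.g. {"kind": "harm", "category": "physical",
                       #       "likelihood": 0.3, "severity": 0.8}
        sign = 1.0 if e["kind"] == "harm" else -1.0
        score += sign * weights[e["category"]] * e["likelihood"] * e["severity"]
    return score
```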
Poster
Haoran Zhang · Aparna Balagopalan · Nassim Oufattole · Hyewon Jeong · Yan Wu · Jiacheng Zhu · Marzyeh Ghassemi
[ East Exhibition Hall A-B ]
Abstract
Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web, and contain many mislabeled instances. In order to improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. However, beyond filtering based on image-caption embedding similarity, no prior works have proposed other methods to filter noisy multimodal data, or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose, theoretically justify, and empirically validate LEMoN, a method to identify label errors in image-caption datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models to automatically identify label errors. Through empirical evaluations across eight datasets and twelve baselines, we find that LEMoN outperforms the baselines by over 3% in label error detection, and that training on datasets filtered using our method improves downstream captioning performance by more than 2 BLEU points over noisy training.
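A simplified reading of the multimodal-neighborhood idea, assuming unit-normalized CLIP-style embeddings; the exact score LEMoN uses is defined in the paper:

```python
import numpy as np

def label_error_score(i: int, img_embs: np.ndarray, txt_embs: np.ndarray,
                      k: int = 10) -> float:
    # Combine direct image-caption similarity with how well caption i
    # agrees with the captions of image i's nearest image neighbors.
    direct = float(img_embs[i] @ txt_embs[i])
    sims = img_embs @ img_embs[i]
    sims[i] = -np.inf                       # exclude the pair itself
    nn = np.argsort(-sims)[:k]
    neighbor_agreement = float((txt_embs[nn] @ txt_embs[i]).mean())
    return -(direct + neighbor_agreement)   # higher = more likely mislabeled
```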
Poster
Daiheng Gao · Shilin Lu · Wenbo Zhou · Jiaming Chu · Jie Zhang · Mengxi Jia · Bang Zhang · Zhaoxin Fan · Weiming Zhang
[ East Exhibition Hall A-B ]
Abstract
Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-erasure techniques that were originally designed for the previous T2I paradigm (e.g., SD v1.4). In this work, we introduce EraseAnything, the first method specifically developed to address concept erasure within the latest flow-based T2I framework. We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer to selectively suppress undesirable activations. Furthermore, we propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones. Experimental results demonstrate that EraseAnything successfully fills the research gap left by earlier methods in this new T2I paradigm, achieving state-of-the-art performance across a wide range of concept erasure tasks.
Poster
Batuhan K. Karaman · ishmam zabir · Alon Benhaim · Vishrav Chaudhary · Mert Sabuncu · Xia Song
[ East Exhibition Hall A-B ]
Abstract
Achieving both high safety and high usefulness simultaneously in large language models has become a critical challenge in recent years. Models often exhibit unsafe behavior or adopt an overly cautious approach, leading to frequent overrefusal of benign prompts, which reduces their usefulness. A major factor underlying these behaviors is how the models are finetuned and aligned, particularly the nature and extent of the data used. In this work, we examine how overgenerating finetuning data with advanced teacher models (e.g., GPT-4o)—covering both general-purpose and toxic prompts—affects safety and usefulness in instruction-following language models. Additionally, we present POROver, an alignment strategy designed for models that are highly safe but prone to overrefusal. POROver employs preference optimization algorithms and leverages completions from an advanced teacher model to reduce overrefusals while maintaining safety. Our results show that overgenerating completions for general-purpose prompts significantly boosts safety with only a minimal impact on usefulness. Specifically, the F1 score calculated between safety and usefulness increases from 74.4% to 91.8% because of a substantial rise in safety. Moreover, overgeneration for toxic prompts raises usefulness from 11.1% to 57.6% while preserving safety. Finally, applying POROver increases usefulness further—from 57.6% to 82.1%—while keeping safety at comparable levels.
Poster
Zihan Tan · Suyuan Huang · Guancheng Wan · Wenke Huang · He Li · Mang Ye
[ East Exhibition Hall A-B ]
Abstract
Federated Graph Learning (FGL) combines the privacy-preserving capabilities of federated learning (FL) with the strong graph modeling capability of Graph Neural Networks (GNNs). Current research addresses subgraph-FL only from the structural perspective, neglecting how graph signals propagate over the spatial and spectral domains of that structure. From a spatial perspective, subgraph-FL introduces edge disconnections between clients, leading to disruptions in label signals and a degradation in the class knowledge of the global GNN. From a spectral perspective, spectral heterogeneity causes inconsistencies in signal frequencies across subgraphs, which makes local GNNs overfit the local signal propagation schemes. As a result, spectral client drift occurs, undermining global generalizability. To tackle these challenges, we propose a global knowledge repository to mitigate label signal disruption and a frequency alignment to address spectral client drift. The combination of $\textbf{S}$patial and $\textbf{S}$pectral strategies forms our framework $S^2$FGL. Extensive experiments on multiple datasets demonstrate the superiority of $S^2$FGL. The code is available at https://github.com/Wonder7racer/S2FGL.git.
Poster
Yang Zhang · Er Jin · Yanfei Dong · Yixuan Wu · Phil Torr · Ashkan Khakzar · Johannes Stegmaier · Kenji Kawaguchi
[ East Exhibition Hall A-B ]
Abstract
Recent advances in generative models have demonstrated remarkable capabilities in producing high-quality images, but their reliance on large-scale unlabeled data has raised significant safety and copyright concerns. Efforts to address these issues by erasing unwanted concepts have shown promise. However, many existing erasure methods involve excessive modifications that compromise the overall utility of the model. In this work, we address these issues by formulating a novel minimalist concept erasure objective based *only* on the distributional distance of final generation outputs. Building on our formulation, we derive a tractable loss for differentiable optimization that leverages backpropagation through all generation steps in an end-to-end manner. We also conduct extensive analysis to show theoretical connections with other models and methods. To improve the robustness of the erasure, we incorporate neuron masking as an alternative to model fine-tuning. Empirical evaluations on state-of-the-art flow-matching models demonstrate that our method robustly erases concepts without degrading overall model performance, paving the way for safer and more responsible generative models.
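A minimal sketch of neuron masking as an alternative to fine-tuning, assuming attribution scores for the unwanted concept are already available (how those scores are computed is the paper's contribution, not shown here):

```python
import torch

def mask_top_neurons(weight: torch.Tensor, scores: torch.Tensor,
                     k: int) -> torch.Tensor:
    # Zero the k output neurons most associated with the unwanted concept
    # instead of updating weights; `scores` would come from attribution.
    masked = weight.clone()
    masked[torch.topk(scores, k).indices] = 0.0
    return masked
```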
Poster
Mahdi Sabbaghi · Paul Kassianik · George Pappas · Amin Karbasi · Hamed Hassani
[ East Exhibition Hall A-B ]
Abstract
As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking that leverages a loss signal to guide the test-time compute, achieving SOTA attack success rates against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
Poster
Lukas Helff · Felix Friedrich · Manuel Brack · Kristian Kersting · Patrick Schramowski
[ East Exhibition Hall A-B ]
Abstract
This paper introduces Llavaguard, a suite of VLM-based vision safeguards that address the critical need for reliable tools in the era of large-scale data and models. To this end, we establish a novel open framework, describing a customizable safety taxonomy, data preprocessing, augmentation, and training setup. To teach a VLM to act as a safety safeguard, we further create a multimodal safety dataset with high-quality human expert annotations, where each image is labeled with a safety rating, category, and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting Llavaguard models, ranging from 0.5B to 7B parameters, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, Llavaguard outperforms both state-of-the-art safeguards and VLMs in accuracy and in flexibly handling different policies. Additionally, we demonstrate Llavaguard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework, including the dataset, model weights, and training code, publicly available at https://ml-research.github.io/human-centered-genai/projects/llavaguard.
Poster
Ce Zhang · Yixin Han · Yafei Wang · Xiaodong Yan · Linglong Kong · Ting Li · Bei Jiang
[ East Exhibition Hall A-B ]
Abstract
Randomized response (RR) mechanisms constitute a fundamental and effective technique for ensuring label differential privacy (LabelDP). However, existing RR methods primarily focus on the response labels while overlooking the influence of covariates and often do not fully address optimality. To address these challenges, this paper explores optimal LabelDP procedures using RR mechanisms, focusing on achieving optimal estimation and inference in binary response models. We first analyze the asymptotic behaviors of RR binary response models and then optimize the procedure by maximizing the trace of the Fisher Information Matrix within the $\varepsilon$- and $(\varepsilon,\delta)$-LabelDP constraints. Our theoretical results indicate that the proposed methods achieve optimal LabelDP guarantees while maintaining statistical accuracy in binary response models under mild conditions. Furthermore, we develop private confidence intervals with nominal coverage for statistical inference. Extensive simulation studies and real-world applications confirm that our methods outperform existing approaches in terms of precise estimation, privacy protection, and reliable inference.
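For context, the classical $\varepsilon$-LabelDP randomized response baseline that this work generalizes (by letting the mechanism account for covariates and optimizing the Fisher information trace) flips binary labels as

$$\Pr[\tilde{y}=y]=\frac{e^{\varepsilon}}{1+e^{\varepsilon}},\qquad \Pr[\tilde{y}=1-y]=\frac{1}{1+e^{\varepsilon}}.$$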
Poster
Alexander Bienstock · Ujjwal Kumar · Antigoni Polychroniadou
[ East Exhibition Hall A-B ]
Abstract
Federated Learning (FL) solutions with central Differential Privacy (DP) have seen large improvements in their utility in recent years arising from the matrix mechanism, while FL solutions with distributed (more private) DP have lagged behind. In this work, we introduce the distributed matrix mechanism to achieve the best of both worlds: the better privacy of distributed DP and the better utility of the matrix mechanism. We accomplish this using a novel cryptographic protocol that securely transfers sensitive values across the client committees of different training iterations with constant communication overhead. This protocol accommodates the dynamic participation of users required by FL, including those that may drop out of the computation. We provide experiments which show that our mechanism indeed significantly improves the utility of FL models compared to previous distributed DP mechanisms, with little added overhead.
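For context, the central matrix mechanism factorizes the workload matrix as $A=BC$ and releases $\widehat{Ax}=B(Cx+z)$ with $z\sim\mathcal{N}(0,\sigma^{2}I)$; a distributed variant, as we read this abstract, must realize the correlated noisy values $Cx+z$ across client committees without a trusted aggregator, which is the role of the cryptographic transfer protocol.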
Spotlight Poster
Xingyu Zhou · Yulian Wu · Francesco Orabona
[ East Exhibition Hall A-B ]
Abstract
In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.
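A sketch of the standard linear-reward preference model underlying such reductions: with feature map $\phi$ and parameter $\theta^{*}$, a preference label is drawn as

$$\Pr[y=1\mid x,a_{0},a_{1}]=\sigma\!\big(\langle\theta^{*},\phi(x,a_{1})-\phi(x,a_{0})\rangle\big),\qquad \sigma(t)=\frac{1}{1+e^{-t}},$$

so estimating $\theta^{*}$ from privatized and/or corrupted labels is exactly parameter estimation in logistic regression.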
Spotlight Poster
Yizhuo Chen · Chun-Fu (Richard) Chen · Hsiang Hsu · Shaohan Hu · Tarek Abdelzaher
[ East Exhibition Hall A-B ]
Abstract
Growing Machine Learning (ML) services require extensive collections of user data, which may inadvertently include people's private information irrelevant to the services. Various studies have proposed protecting private attributes by removing them from the data while maintaining the utility of the data for downstream tasks. Nevertheless, as we show theoretically and empirically in this paper, these methods exhibit a severe vulnerability rooted in a common weakness of their adversarial-training-based strategies. To overcome this limitation, we propose a novel approach, PASS, designed to stochastically substitute the original sample with another one according to certain probabilities. PASS is trained with a novel loss function soundly derived from an information-theoretic objective for utility-preserving private-attribute protection. A comprehensive evaluation of PASS on datasets of different modalities, including facial images, human activity sensory signals, and voice recordings, substantiates PASS's effectiveness and generalizability.
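A minimal sketch of the substitution step itself, assuming a learned row-stochastic substitution matrix; training those probabilities from the information-theoretic objective is the paper's substantive contribution and is not shown:

```python
import numpy as np

def pass_substitute(i: int, sub_probs: np.ndarray,
                    rng: np.random.Generator) -> int:
    # Draw a replacement index j ~ P(j | i); the dataset's j-th record is
    # then released in place of record i.
    return int(rng.choice(sub_probs.shape[1], p=sub_probs[i]))
```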
Poster
Yusuke Yamasaki · Kenta Niwa · Daiki Chijiwa · Takumi Fukami · Takayuki Miura
[ East Exhibition Hall A-B ]
Abstract
We propose Plausible Token Amplification (PTA) to improve the accuracy of Differentially Private In-Context Learning (DP-ICL) using DP synthetic demonstrations. While Tang et al. empirically improved the accuracy of DP-ICL by limiting vocabulary space during DP synthetic demonstration generation, its theoretical basis remains unexplored. By interpreting ICL as implicit Bayesian inference on a concept underlying demonstrations, we not only provide theoretical evidence supporting Tang et al.'s empirical method but also introduce PTA, a refined method for modifying next-token probability distribution. Through the modification, PTA highlights tokens that distinctly represent the ground-truth concept underlying the original demonstrations. As a result, generated DP synthetic demonstrations guide the Large Language Model to successfully infer the ground-truth concept, which improves the accuracy of DP-ICL. Experimental evaluations on both synthetic and real-world text-classification datasets validated the effectiveness of PTA.
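An illustrative modification of the next-token distribution, assuming an additive boost `tau` on a set of concept-representative token ids; PTA's actual modification rule is derived in the paper:

```python
import torch

def amplify_plausible_tokens(logits: torch.Tensor,
                             plausible_ids: torch.Tensor,
                             tau: float = 2.0) -> torch.Tensor:
    # Boost tokens that distinctly represent the ground-truth concept
    # before sampling the next token of a DP synthetic demonstration.
    boosted = logits.clone()
    boosted[plausible_ids] += tau
    return torch.softmax(boosted, dim=-1)
```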
Poster
Yinong O Wang · Nivedha Sivakumar · Falaah Arif Khan · Katherine Metcalf · Adam Golinski · Natalie Mackraz · Barry-John Theobald · Luca Zappella · Nicholas Apostoloff
[ East Exhibition Hall A-B ]
Abstract
The recent rapid adoption of large language models (LLMs) highlights the critical need for benchmarking their fairness. Conventional fairness metrics, which focus on discrete accuracy-based evaluations (i.e., prediction correctness), fail to capture the implicit impact of model uncertainty (e.g., higher model confidence about one group over another despite similar accuracy). To address this limitation, we propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness that is more reflective of the internal bias in model decisions. Furthermore, observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset with 31,756 samples for co-reference resolution, offering a more diverse and suitable benchmark for modern LLMs. Combining our metric and dataset, we provide insightful comparisons of eight open-source LLMs. For example, Mistral-8B exhibits suboptimal fairness due to high confidence in incorrect predictions, a detail overlooked by Equalized Odds but captured by UCerF. Overall, this work provides a holistic framework for LLM evaluation by jointly assessing fairness and uncertainty, enabling the development of more transparent and accountable AI systems.
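A toy uncertainty-aware group comparison, included only to make the motivation concrete; this is an assumption-laden stand-in, not UCerF's definition, which is given in the paper:

```python
import numpy as np

def confidence_weighted_gap(conf_a, correct_a, conf_b, correct_b):
    # Score each prediction by confidence, signed by correctness, so that
    # confidently wrong predictions are penalized; compare group means.
    score_a = np.mean(np.asarray(conf_a) * (2 * np.asarray(correct_a) - 1))
    score_b = np.mean(np.asarray(conf_b) * (2 * np.asarray(correct_b) - 1))
    return abs(score_a - score_b)
```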
Poster
Junyi Chai · Taeuk Jang · Jing Gao · Xiaoqian Wang
[ East Exhibition Hall A-B ]
Abstract
While numerous methods have been proposed to address fairness in machine learning, existing methods do not guarantee fair predictions under imperceptible feature perturbations, and a seemingly fair model can suffer from large group-wise disparities under such perturbations. Moreover, while adversarial training has been shown to be reliable in improving a model's robustness against adversarial feature perturbations that deteriorate accuracy, it has not been properly studied in the context of adversarial perturbations against fairness. To tackle these challenges, in this paper, we study the problem of adversarial attack and adversarial robustness w.r.t. two terms: fairness and accuracy. From the adversarial attack perspective, we propose a unified structure for adversarial attacks against fairness which brings together common notions in group fairness, and we theoretically prove the equivalence of adversarial attacks against different fairness notions. Further, we derive the connections between adversarial attacks against fairness and those against accuracy. From the adversarial robustness perspective, we theoretically align robustness to adversarial attacks against fairness and accuracy, where robustness w.r.t. one term enhances robustness w.r.t. the other. Our study suggests a novel way to unify adversarial training w.r.t. fairness and accuracy, and experiments show our proposed method achieves better robustness w.r.t. both …