Poster
Ning LU · Shengcai Liu · Jiahao Wu · Weiyu CHEN · Zhirui Zhang · Yew Soon ONG · Qi Wang · Ke Tang
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) have shown great potential as general-purpose AI assistants across various domains. To fully leverage this potential in specific applications, many companies provide fine-tuning API services, enabling users to upload their own data for LLM customization. However, fine-tuning services introduce a new safety threat: user-uploaded data, whether harmful or benign, can break the model's alignment, leading to unsafe outputs. Moreover, existing defense methods struggle to address the diversity of fine-tuning datasets (e.g., varying sizes, tasks), often sacrificing utility for safety or vice versa. To address this issue, we propose Safe Delta, a safety-aware post-training defense method that adjusts the delta parameters (i.e., the difference between model parameters before and after fine-tuning). Specifically, Safe Delta estimates the safety degradation, selects delta parameters to maximize utility while limiting overall safety loss, and applies a safety compensation vector to mitigate residual safety loss. Extensive experiments on four diverse datasets with varying settings show that our approach consistently preserves safety while leaving the utility gain from benign datasets unaffected.
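A minimal sketch of the selection step described above, assuming hypothetical per-parameter utility and safety-degradation estimates (stand-ins for Safe Delta's actual estimators): keep the delta parameters with the best utility-per-safety ratio under a total safety budget.

```python
import numpy as np

def select_delta(delta, utility, safety_cost, budget):
    # Greedily keep delta entries with the best utility-per-safety-cost
    # ratio until the cumulative estimated safety loss reaches `budget`.
    order = np.argsort(-utility / (safety_cost + 1e-12))
    keep = np.zeros(delta.shape, dtype=bool)
    spent = 0.0
    for i in order:
        if spent + safety_cost[i] <= budget:
            spent += safety_cost[i]
            keep[i] = True
    return np.where(keep, delta, 0.0)

rng = np.random.default_rng(0)
delta = rng.normal(size=1000)      # theta_finetuned - theta_base
utility = rng.random(1000)         # hypothetical per-parameter utility estimate
safety = rng.random(1000)          # hypothetical per-parameter safety degradation
sparse_delta = select_delta(delta, utility, safety, budget=50.0)
```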
Poster
Rei Higuchi · Taiji Suzuki
[ East Exhibition Hall A-B ]
Abstract
Aligning large language models (LLMs) with human preferences is crucial for safe deployment, yet existing methods assume specific preference models like the Bradley-Terry model. This assumption leads to statistical inconsistency, where more data doesn't guarantee convergence to true human preferences. To address this critical gap, we introduce a novel alignment method, Direct Density Ratio Optimization (DDRO). DDRO directly estimates the density ratio between preferred and unpreferred output distributions, circumventing the need for explicit human preference modeling. We theoretically prove that DDRO is statistically consistent, ensuring convergence to the true preferred distribution as the data size grows, regardless of the underlying preference structure. Experiments demonstrate that DDRO achieves superior performance compared to existing methods, showcasing its effectiveness and potential for significant improvement. DDRO unlocks the potential for truly data-driven alignment, paving the way for more reliable and human-aligned LLMs.
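For readers unfamiliar with the object being estimated: in general, the density ratio between preferred and unpreferred outputs can be recovered from a probabilistic classifier via Bayes' rule (a standard identity shown for illustration; the abstract does not specify DDRO's estimator):

```latex
\frac{p_{\text{pref}}(y)}{p_{\text{unpref}}(y)}
  \;=\; \frac{P(c = 1 \mid y)}{P(c = 0 \mid y)} \cdot \frac{P(c = 0)}{P(c = 1)},
```

where $c$ indicates whether $y$ was drawn from the preferred output distribution.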
Poster
Zhuocheng Gong · Jian Guan · Wei Wu · Huishuai Zhang · Dongyan Zhao
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To address this limitation, we introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences using discrete latent codes. LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data without relying on pre-defined reward functions and hand-crafted combination weights. Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-Instruct-8B). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment algorithms against noise in data. By providing a unified representation for the multifarious preference factors, LPC paves the way towards developing more robust and versatile alignment techniques for responsible deployment of powerful LLMs.
Poster
Lee Cohen · Connie Hong · Jack Hsieh · Judy Hanwen Shen
[ East Exhibition Hall A-B ]
Abstract
In an era of increasingly capable foundation models, job seekers are turning to generative AI tools to enhance their application materials. However, unequal access to and knowledge about generative AI tools can harm both employers and candidates by reducing the accuracy of hiring decisions and giving some candidates an unfair advantage. To address these challenges, we introduce a new variant of the strategic classification framework tailored to manipulations performed using large language models, accommodating varying levels of manipulations and stochastic outcomes. We propose a "two-ticket" scheme, where the hiring algorithm applies an additional manipulation to each submitted resume and considers this manipulated version together with the original submitted resume. We establish theoretical guarantees for this scheme, showing improvements for both the fairness and accuracy of hiring decisions when the true positive rate is maximized subject to a no false positives constraint. We further generalize this approach to an $n$-ticket scheme and prove that hiring outcomes converge to a fixed, group-independent decision, eliminating disparities arising from differential LLM access. Finally, we empirically validate our framework and the performance of our two-ticket scheme on real resumes using an open-source resume screening tool.
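A sketch of the two-ticket decision rule under one natural reading: since the screener is constrained to have no false positives, accepting a candidate if either the submitted resume or its platform-manipulated version passes preserves the no-false-positive property while raising the true positive rate. The combining rule and both helper functions are illustrative assumptions, not the paper's exact construction.

```python
def two_ticket_decision(resume, screen, manipulate):
    # `screen`: a no-false-positive hiring classifier (assumed).
    # `manipulate`: the platform's own LLM polish applied to every
    # submission, leveling access to generative AI tools (assumed).
    ticket_1 = screen(resume)              # original submission
    ticket_2 = screen(manipulate(resume))  # platform-manipulated version
    return ticket_1 or ticket_2
```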
Poster
Francesco Tonin · Alex Lambert · Johan Suykens · Volkan Cevher
[ East Exhibition Hall A-B ]
Abstract
Fairness of decision-making algorithms is an increasingly important issue. In this paper, we focus on spectral clustering with group fairness constraints, where every demographic group is represented in each cluster proportionally as in the general population. We present a new efficient method for fair spectral clustering (Fair SC) by casting the Fair SC problem within the difference of convex functions (DC) framework. To this end, we introduce a novel variable augmentation strategy and employ an alternating direction method of multipliers type of algorithm adapted to DC problems. We show that each associated subproblem can be solved efficiently, resulting in higher computational efficiency compared to prior work, which required a computationally expensive eigendecomposition. Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world benchmarks, showing significant speedups in computation time over prior art, especially as the problem size grows. This work thus represents a considerable step forward towards the adoption of fair clustering in real-world applications.
Poster
Yixin Liu · Lie Lu · Jihui Jin · Lichao Sun · Andrea Fanelli
[ East Exhibition Hall A-B ]
Abstract
The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces the Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. This work represents a significant step forward in protecting intellectual property and ensuring the authenticity of audio content in the era of generative AI.
Poster
Atefeh Gilani · Felipe Gomez · Shahab Asoodeh · Flavio Calmon · Oliver Kosut · Lalitha Sankar
[ East Exhibition Hall A-B ]
Abstract
We propose a unified optimization framework for designing continuous and discrete noise distributions that ensure differential privacy (DP) by minimizing Rényi DP, a variant of DP, under a cost constraint. Rényi DP has the advantage that by considering different values of the Rényi parameter $\alpha$, we can tailor our optimization for any number of compositions. To solve the optimization problem, we reduce it to a finite-dimensional convex formulation and perform preconditioned gradient descent. The resulting noise distributions are then compared to their Gaussian and Laplace counterparts. Numerical results demonstrate that our optimized distributions are consistently better, with significant improvements in $(\varepsilon, \delta)$-DP guarantees in the moderate composition regimes, compared to Gaussian and Laplace distributions with the same variance.
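For reference, the quantity driving this optimization is the Rényi divergence of order $\alpha$ between the mechanism's output distributions on adjacent inputs; for discrete distributions $P, Q$,

```latex
D_\alpha(P \,\|\, Q) \;=\; \frac{1}{\alpha - 1} \log \sum_i p_i^{\alpha}\, q_i^{\,1-\alpha},
```

and a mechanism satisfies $(\alpha, \varepsilon)$-Rényi DP when this divergence is at most $\varepsilon$ over all adjacent input pairs. Rényi DP budgets add under composition, which is why tailoring $\alpha$ to the expected number of compositions is useful.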
Poster
Matthew Wicker · Philip Sosnin · Igor Shilov · Adrianna Janik · Mark Müller · Yves-Alexandre de Montjoye · Adrian Weller · Calvin Tsay
[ East Exhibition Hall A-B ]
Abstract
We study private prediction where differential privacy is achieved by adding noise to the outputs of a non-private model. Existing methods rely on noise proportional to the global sensitivity of the model, often resulting in sub-optimal privacy-utility trade-offs compared to private training. We introduce a novel approach for computing dataset-specific upper bounds on prediction sensitivity by leveraging convex relaxation and bound propagation techniques. By combining these bounds with the smooth sensitivity mechanism, we significantly improve the privacy analysis of private prediction compared to global sensitivity-based approaches. Experimental results across real-world datasets in medical image classification and natural language processing demonstrate that our sensitivity bounds can be orders of magnitude tighter than global sensitivity. Our approach provides a strong basis for the development of novel privacy-preserving technologies.
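For context, the smooth sensitivity mechanism referenced above calibrates noise to the $\beta$-smooth sensitivity of a function $f$ at the dataset $x$ (the standard definition from the smooth sensitivity literature):

```latex
S_{f,\beta}(x) \;=\; \max_{y} \; \mathrm{LS}_f(y)\, e^{-\beta\, d(x, y)},
```

where $\mathrm{LS}_f(y)$ is the local sensitivity at dataset $y$ and $d$ is the Hamming distance between datasets; the dataset-specific bounds described in the abstract make such local quantities tractable for neural networks.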
Poster
Tianjie Ju · Yi Hua · Hao Fei · Zhenyu Shao · Yubin Zheng · Haodong Zhao · Mong-Li Lee · Wynne Hsu · Zhuosheng Zhang · Gongshen Liu
[ East Exhibition Hall A-B ]
Abstract
Multi-Modal Large Language Models (MLLMs) have exhibited remarkable performance on various vision-language tasks such as Visual Question Answering (VQA). Despite accumulating evidence of privacy concerns associated with task-relevant content, it remains unclear whether MLLMs inadvertently memorize private content that is entirely irrelevant to the training tasks. In this paper, we investigate how randomly generated task-irrelevant private content can become spuriously correlated with downstream objectives due to partial mini-batch training dynamics, thus causing inadvertent memorization. Concretely, we randomly embed task-irrelevant watermarks into VQA fine-tuning images at varying probabilities and propose a novel probing framework to determine whether MLLMs have inadvertently encoded such content. Our experiments reveal that MLLMs exhibit notably different training behaviors in partial mini-batch settings with task-irrelevant watermarks embedded. Furthermore, through layer-wise probing, we demonstrate that MLLMs trigger distinct representational patterns when encountering previously seen task-irrelevant knowledge, even if this knowledge does not influence their output during prompting. Our code is available at https://github.com/illusionhi/ProbingPrivacy.
Spotlight Poster
Xiuyuan Wang · Chaochao Chen · Weiming Liu · Xinting Liao · Fan Wang · Xiaolin Zheng
[ East Exhibition Hall A-B ]
Abstract
With growing privacy concerns and the enforcement of data protection regulations, machine unlearning has emerged as a promising approach for removing the influence of forget data while maintaining model performance on retain data. However, most existing unlearning methods require access to the original training data, which is often impractical due to privacy policies, storage constraints, and other limitations. This gives rise to the challenging task of source-free unlearning, where unlearning must be accomplished without accessing the original training data. The few existing source-free unlearning methods rely on knowledge distillation and model retraining, which impose substantial computational costs. In this work, we propose the Data Synthesis-based Discrimination-Aware (DSDA) unlearning framework, which enables efficient source-free unlearning in two stages: (1) Accelerated Energy-Guided Data Synthesis (AEGDS), which employs Langevin dynamics to model the training data distribution while integrating Runge–Kutta methods and momentum to enhance efficiency. (2) Discrimination-Aware Multitask Optimization (DAMO), which refines the feature distribution of retain data and mitigates the gradient conflicts among multiple unlearning objectives. Extensive experiments on three benchmark datasets demonstrate that DSDA outperforms existing unlearning methods, validating its effectiveness and efficiency in source-free unlearning.
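A minimal sketch of the energy-guided synthesis idea: plain unadjusted Langevin dynamics on an energy model. AEGDS additionally incorporates Runge-Kutta integration and momentum for efficiency, which are omitted here; `energy` is a stand-in for a model returning per-sample energies.

```python
import torch

def langevin_synthesize(energy, x, steps=200, step_size=1e-2):
    # Unadjusted Langevin dynamics: gradient descent on the energy plus
    # Gaussian noise, drawing x toward the modeled data distribution.
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = x - step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(x)
    return x.detach()
```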
Poster
Yunzhen Yao · Lie He · Michael Gastpar
[ East Exhibition Hall A-B ]
Abstract
This paper considers the sample-efficiency of preference learning, which models and predicts human choices based on comparative judgments. The minimax optimal estimation error rate $\Theta(d/n)$ in classical estimation theory requires that the number of samples $n$ scales linearly with the dimensionality of the feature space $d$. However, the high dimensionality of the feature space and the high cost of collecting human-annotated data challenge the efficiency of traditional estimation methods. To remedy this, we leverage sparsity in the preference model and establish sharp error rates. We show that under the sparse random utility model, where the parameter of the reward function is $k$-sparse, the minimax optimal rate can be reduced to $\Theta(k \log(d/k)/n)$. Furthermore, we analyze the $\ell_{1}$-regularized estimator and show that it achieves near-optimal rate under mild assumptions on the Gram matrix. Experiments on synthetic data and LLM alignment data validate our theoretical findings, showing that sparsity-aware methods significantly reduce sample complexity and improve prediction accuracy.
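A small illustration of the estimator analyzed above: under the Bradley-Terry model, $P(i \succ j) = \sigma(\theta^\top(x_i - x_j))$, so an $\ell_1$-regularized estimator amounts to sparse logistic regression on feature differences (synthetic data; a sketch under these assumptions, not the paper's code).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, k = 500, 200, 5
theta = np.zeros(d); theta[:k] = 1.0                # k-sparse reward parameter
x_i, x_j = rng.normal(size=(n, d)), rng.normal(size=(n, d))
p_win = 1.0 / (1.0 + np.exp(-(x_i - x_j) @ theta))  # Bradley-Terry choice prob.
y = (rng.random(n) < p_win).astype(int)             # observed comparisons

l1_bt = LogisticRegression(penalty="l1", C=0.5, solver="liblinear",
                           fit_intercept=False).fit(x_i - x_j, y)
```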
Poster
Xingyi Yang · Constantin Venhoff · Ashkan Khakzar · Christian Schroeder de Witt · Puneet Dokania · Adel Bibi · Phil Torr
[ East Exhibition Hall A-B ]
Abstract
Neurons in large language models often exhibit \emph{polysemanticity}, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present \textbf{MoE-X}, a mixture-of-experts (MoE) language model designed to be \emph{intrinsically} interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
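A naive conceptual version of the sparsity-aware routing described above, assuming all experts share an output width; this sketch evaluates every expert to score its activation sparsity, which MoE-X's actual design avoids for efficiency.

```python
import torch
import torch.nn as nn

class SparsityRouter(nn.Module):
    def __init__(self, experts, k=2, eps=1e-3):
        super().__init__()
        self.experts, self.k, self.eps = nn.ModuleList(experts), k, eps

    def forward(self, x):
        acts = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, H)
        sparsity = (acts.abs() < self.eps).float().mean(-1)      # (B, E)
        top = sparsity.topk(self.k, dim=-1).indices              # (B, k)
        mask = torch.zeros_like(sparsity).scatter_(1, top, 1.0)
        return (mask.unsqueeze(-1) * acts).sum(1) / self.k       # avg of k sparsest
```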
Poster
Mateo Espinosa Zarlenga · Gabriele Dominici · Pietro Barbiero · Zohreh Shams · Mateja Jamnik
[ East Exhibition Hall A-B ]
Abstract
In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level *concepts* (e.g., "stripes", "black") and then predict a task label from those concepts. In particular, we study the impact of *concept interventions* (i.e., operations where a human expert corrects a CM’s mispredicted concepts at test time) on CMs' task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term *leakage poisoning*, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce *MixCEM*, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions.
Poster
Gonçalo Paulo · Alex Mallen · Caden Juang · Nora Belrose
[ East Exhibition Hall A-B ]
Abstract
While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which can be more easily interpretable. However, SAEs can have millions of distinct latents, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language interpretations for SAE latents using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of interpretations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a latent, which we find explains latents that are not recalled by existing methods. We propose guidelines for generating better interpretations that remain valid for a broader set of activating contexts, and discuss pitfalls with existing scoring techniques. Our code is available at https://github.com/EleutherAI/delphi.
Poster
Junwei Deng · Weijing Tang · Jiaqi Ma
[ East Exhibition Hall A-B ]
Abstract
The influence function, a technique rooted in robust statistics, has been adapted in modern machine learning for a novel application: data attribution---quantifying how individual training data points affect a model's predictions. However, the common derivation of influence functions in the data attribution literature is limited to loss functions that decompose into a sum of individual data point losses, with the most prominent examples known as M-estimators. This restricts the application of influence functions to more complex learning objectives, which we refer to as non-decomposable losses, such as contrastive or ranking losses, where a unit loss term depends on multiple data points and cannot be decomposed further. In this work, we bridge this gap by revisiting the general formulation of the influence function from robust statistics, which extends beyond M-estimators. Based on this formulation, we propose a novel method, the Versatile Influence Function (VIF), that can be straightforwardly applied to machine learning models trained with any non-decomposable loss. In comparison to the classical approach in statistics, the proposed VIF is designed to fully leverage the power of auto-differentiation, thereby eliminating the need for case-specific derivations of each loss function. We demonstrate the effectiveness of VIF across three examples: Cox regression for survival analysis, …
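A compact autodiff sketch of the underlying idea: weight each unit loss term and read influence off as $-H^{-1}\,\partial_{w}\nabla_\theta \mathcal{L}$, which recovers the classical formula when the loss decomposes. Dense Hessians and all names here are illustrative assumptions, not VIF's implementation.

```python
import torch

def versatile_influence(loss_fn, theta, n_units):
    # loss_fn(theta, w) must apply per-unit weights w inside the (possibly
    # non-decomposable) loss; influence of unit i is -H^{-1} d(grad L)/dw_i.
    w = torch.ones(n_units, requires_grad=True)
    g = torch.autograd.grad(loss_fn(theta, w), theta, create_graph=True)[0]
    H = torch.stack([torch.autograd.grad(g[j], theta, retain_graph=True)[0]
                     for j in range(theta.numel())])
    M = torch.stack([torch.autograd.grad(g[j], w, retain_graph=True)[0]
                     for j in range(theta.numel())])
    return -torch.linalg.solve(H, M)          # (params, units)

torch.manual_seed(0)
theta = torch.tensor([0.5, -0.3], requires_grad=True)
X = torch.randn(10, 2)

def listwise_loss(t, w):
    # Toy non-decomposable loss: a weighted softmax (listwise) term where
    # every unit interacts with every other through the partition sum.
    s = X @ t
    return torch.logsumexp(torch.log(w) + s, dim=0) - (w * s).sum() / w.sum()

IF = versatile_influence(listwise_loss, theta, n_units=10)
```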
Poster
Shreyas Kadekodi · Hayden McTavish · Berk Ustun
[ East Exhibition Hall A-B ]
Abstract
Many applications in machine learning and decision-making rely on procedures to aggregate human preferences. In such tasks, individuals express ordinal preferences over a set of items through votes, ratings, or pairwise comparisons. We then summarize their collective preferences as a ranking. Standard methods for preference aggregation are designed to return rankings that arbitrate individual disagreements in ways that are faithful and fair. In this work, we introduce a paradigm for *selective aggregation*, where we can avoid the need to arbitrate dissent by abstaining from comparison. We summarize collective preferences as a *selective ranking* -- i.e., a partial order in which we only compare items where at least $100\cdot(1 - \tau)\%$ of individuals agree. We develop algorithms to build selective rankings that achieve all possible trade-offs between comparability and disagreement, and derive formal guarantees on their safety and stability. We conduct an extensive set of experiments on real-world datasets to benchmark our approach and demonstrate its functionality. Our results show selective aggregation can promote transparency and robustness by revealing disagreement and abstaining from arbitration.
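A minimal sketch of the comparability test of a selective ranking, given complete individual rankings; the resulting relation is generally partial, and the paper's algorithms additionally handle the full trade-off curve and guarantees.

```python
import numpy as np

def selective_comparisons(prefs, tau):
    # prefs[v, i, j] = 1 if voter v ranks item i above item j, else 0.
    # Keep the comparison "i above j" only when at least a (1 - tau)
    # fraction of voters agree; other pairs stay incomparable.
    agreement = prefs.mean(axis=0)        # fraction preferring i over j
    return agreement >= 1.0 - tau         # boolean comparability matrix
```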
Poster
Lucy Farnik · Tim Lawson · Conor Houghton · Laurence Aitchison
[ East Exhibition Hall A-B ]
Abstract
Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of large language models (LLMs). However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to “sparsify” computations in any sense, only latent activations. To solve this, we propose Jacobian sparse autoencoders (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. Our key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might …
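A toy illustration of the object JSAEs sparsify: the Jacobian of the map from input-side SAE latents, through an MLP block, to output-side SAE latents. It is formed densely here, which is exactly what becomes intractable at LLM scale; all modules are stand-ins.

```python
import torch
from torch.autograd.functional import jacobian

mlp = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.GELU(),
                          torch.nn.Linear(64, 16))   # model component
enc_in = torch.nn.Linear(16, 128)                    # stand-in input SAE encoder
dec_in = torch.nn.Linear(128, 16)                    # stand-in input SAE decoder
enc_out = torch.nn.Linear(16, 128)                   # stand-in output SAE encoder

def latent_map(z):
    return enc_out(mlp(dec_in(z)))                   # latents -> latents

z0 = torch.relu(enc_in(torch.randn(16)))
J = jacobian(latent_map, z0)                         # (128, 128) Jacobian
frac_near_zero = (J.abs() < 1e-4).float().mean()     # the sparsity JSAEs promote
```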
Poster
Meera Hahn · Wenjun Zeng · Nithish Kannen · Rich Galt · Kartikeya Badola · Been Kim · Zi Wang
[ East Exhibition Hall A-B ]
Abstract
User prompts for generative AI models are often underspecified, leading to a misalignment between the user intent and models' understanding. As a result, users commonly have to painstakingly refine their prompts. We study this alignment problem in text-to-image (T2I) generation and propose a prototype for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their uncertainty about user intent as an understandable and editable belief graph. We build simple prototypes for such agents and propose a new scalable and automated evaluation approach using two agents, one with a ground truth intent (an image) while the other tries to ask as few questions as possible to align with the ground truth. We experiment over three image-text datasets: ImageInWords (Garg et al., 2024), COCO (Lin et al., 2014) and DesignBench, a benchmark we curated with strong artistic and design elements. Experiments over the three datasets demonstrate the proposed T2I agents' ability to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard T2I generation. Moreover, we conducted human studies and observed that at least 90\% of human …
Poster
Thibaud Gloaguen · Nikola Jovanović · Robin Staab · Martin Vechev
[ East Exhibition Hall A-B ]
Abstract
LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we propose, for the first time, a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat.
Poster
Jiajun Chen · Jin Tian · Chris Quinn
[ East Exhibition Hall A-B ]
Abstract
Artificial intelligence will play a significant role in decision making in numerous aspects of society. Numerous fairness criteria have been proposed in the machine learning community, but there remains limited investigation into fairness as defined through specified attributes in a sequential decision-making framework. In this paper, we focus on causal logistic bandit problems where the learner seeks to make fair decisions, under a notion of fairness that accounts for counterfactual reasoning. We propose and analyze an algorithm by leveraging primal-dual optimization for constrained causal logistic bandits where the non-linear constraints are a priori unknown and must be learned over time. We obtain sub-linear regret guarantees with leading term similar to that for unconstrained logistic bandits (Lee et al., 2024) while guaranteeing sub-linear constraint violations. We show how to achieve zero cumulative constraint violations with a small increase in the regret bound.
Poster
Tim Vieira · Tianyu Liu · Clemente Pasti · Yahya Emara · Brian DuSell · Benjamin LeBrun · Mario Giulianelli · Juan Luis Gastaldi · Timothy O'Donnell · Ryan Cotterell
[ East Exhibition Hall A-B ]
Abstract
Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of *noncanonical* token encodings of each character string—these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
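A minimal sketch of the canonicality test underlying "canonicality by conditioning", using a Hugging Face tokenizer as an example choice; filtering candidates token-by-token this way is a simplification of the paper's inference-time strategies.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # a BPE tokenizer (example choice)

def is_canonical(ids):
    # A token string is canonical iff re-encoding its decoded text
    # reproduces exactly the same token ids.
    return tok.encode(tok.decode(ids)) == list(ids)

def filter_canonical(prefix_ids, candidates):
    # Keep only next tokens whose extension of the prefix stays canonical,
    # so noncanonical encodings receive zero probability mass.
    return [t for t in candidates if is_canonical(list(prefix_ids) + [t])]
```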
Poster
Markelle Kelly · Alex Boyd · Samuel Showalter · Mark Steyvers · Padhraic Smyth
[ East Exhibition Hall A-B ]
Abstract
Applications of machine learning often involve making predictions based on both model outputs and the opinions of human experts. In this context, we investigate the problem of querying experts for class label predictions, using as few human queries as possible, and leveraging the class probability estimates of pre-trained classifiers. We develop a general Bayesian framework for this problem, modeling expert correlation via a joint latent representation, enabling simulation-based inference about the utility of additional expert queries, as well as inference of posterior distributions over unobserved expert labels. We apply our approach to two real-world medical classification problems, as well as to CIFAR-10H and ImageNet-16H, demonstrating substantial reductions relative to baselines in the cost of querying human experts while maintaining high prediction accuracy.
Poster
Zhenting Wang · Chen Chen · Vikash Sehwag · Minzhou Pan · Lingjuan Lyu
[ East Exhibition Hall A-B ]
Abstract
The popularity of visual generative AI models like DALL-E 3, Stable Diffusion XL, Stable Video Diffusion, and Sora has been increasing. Through extensive evaluation, we discovered that the state-of-the-art visual generative models can generate content that bears a striking resemblance to characters protected by intellectual property rights held by major entertainment companies (such as Sony, Marvel, and Nintendo), which raises potential legal concerns. This happens when the input prompt contains the character's name or even just descriptive details about their characteristics. To mitigate such IP infringement problems, we also propose a defense method. In detail, we develop a revised generation paradigm that identifies potentially infringing generated content and prevents the infringement by applying guidance techniques during the diffusion process, without retraining or fine-tuning the pretrained models. Experiments on well-known character IPs like Spider-Man, Iron Man, and Superman demonstrate the effectiveness of the proposed defense method.
Poster
Manon Revel · Smitha Milli · Tyler Lu · Jamelle Watson-Daniels · Maximilian Nickel
[ East Exhibition Hall A-B ]
Abstract
Online comment sections, such as those on news sites or social media, have the potential to foster informal public deliberation. However, this potential is often undermined by the frequency of toxic or low-quality exchanges that occur in these settings. To combat this, platforms increasingly leverage algorithmic ranking to facilitate higher-quality discussions, e.g., by using civility classifiers or forms of prosocial ranking. Yet, these interventions may also inadvertently reduce the visibility of legitimate viewpoints, undermining another key aspect of deliberation: representation of diverse views. We seek to remedy this problem by introducing guarantees of representation into these methods. In particular, we adopt the notion of *justified representation* (JR) from the social choice literature and incorporate a JR constraint into the comment ranking setting. We find that enforcing JR leads to greater inclusion of diverse viewpoints while still being compatible with optimizing for user engagement or other measures of conversational quality.
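For reference, the approval-based form of justified representation that the abstract adapts to comment ranking: a slate $W$ of $k$ comments satisfies JR if no sufficiently large, cohesive group of users is left entirely unrepresented, i.e., there is no user group $S$ with

```latex
|S| \ge \frac{n}{k}, \qquad \bigcap_{i \in S} A_i \neq \emptyset,
\qquad \text{and} \qquad A_i \cap W = \emptyset \;\; \forall i \in S,
```

where $A_i$ is the set of comments user $i$ approves (e.g., would upvote) and $n$ is the number of users.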
Poster
Hanshen Xiao · Zhen Yang · Edward Suh
[ East Exhibition Hall A-B ]
Abstract
This paper studies a range of AI/ML trust concepts, including memorization, data poisoning, and copyright, which can be modeled as constraints on the influence of data on a (trained) model, characterized by the outcome difference from a processing function (training algorithm). In this realm, we show that provable trust guarantees can be efficiently provided through a new framework termed Data-Specific Indistinguishability (DSI) to select trust-preserving randomization tightly aligning with targeted outcome differences, as a relaxation of the classic Input-Independent Indistinguishability (III). We establish both the theoretical and algorithmic foundations of DSI with the optimal multivariate Gaussian mechanism. We further show its applications to develop trustworthy deep learning with black-box optimizers. The experimental results on memorization mitigation, backdoor defense, and copyright protection show both the efficiency and effectiveness of the DSI noise mechanism.
Poster
Aaron Mueller · Atticus Geiger · Sarah Wiegreffe · Dana Arad · Iván Arcuschin · Adam Belfki · Yik Siu Chan · Jaden Fiotto-Kaufman · Tal Haklay · Michael Hanna · Jing Huang · Rohan Gupta · Yaniv Nikankin · Hadas Orgad · Nikhil Prakash · Anja Reusch · Aruna Sankaranarayanan · Shun Shao · Alessandro Stolfo · Martin Tutek · Amir Zur · David Bau · Yonatan Belinkov
[ East Exhibition Hall A-B ]
Abstract
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components---and connections between them---most important for performing a task (e.g., attribution patching or information flow routes). The causal variable track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAE) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons, and increases our confidence that there has been real progress in the field.
Poster
Emiliano Penaloza · Tianyue Zhang · Laurent Charlin · Mateo Espinosa Zarlenga
[ East Exhibition Hall A-B ]
Abstract
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically rely on datasets with assumedly accurate concept labels, an assumption often violated in practice, which we show can significantly degrade performance. To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective, showing that it directly optimizes for the concept's posterior distribution, and contrast it against Binary Cross-Entropy (BCE), where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets with and without added label noise.
Poster
Subhash Kantamneni · Josh Engels · Senthooran Rajamanoharan · Max Tegmark · Neel Nanda
[ East Exhibition Hall A-B ]
Abstract
Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a ground truth for the concepts used by an LLM, and a growing number of works have presented problems with current SAEs. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. Due to the difficulty of detecting concepts in these challenging settings, we hypothesize that SAEs’ basis of interpretable, concept-level latents should provide a useful inductive bias. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to design ensemble methods combining SAEs with baselines that consistently outperform ensemble methods solely using baselines. Additionally, although SAEs initially appear promising for identifying spurious correlations, detecting poor dataset quality, and training multi-token probes, we are able to achieve similar results with simple non-SAE baselines as well. Though we cannot discount SAEs’ utility on other tasks, our findings highlight …
Poster
Weiqiu You · Helen Qu · Marco Gatti · Bhuvnesh Jain · Eric Wong
[ East Exhibition Hall A-B ]
Abstract
Self-attributing neural networks (SANNs) present a potential path towards interpretable models for high-dimensional problems, but often face significant trade-offs in performance. In this work, we formally prove a lower bound on errors of per-feature SANNs, whereas group-based SANNs can achieve zero error and thus high performance. Motivated by these insights, we propose Sum-of-Parts (SOP), a framework that transforms any differentiable model into a group-based SANN, where feature groups are learned end-to-end without group supervision. SOP achieves state-of-the-art performance for SANNs on vision and language tasks, and we validate that the groups are interpretable on a range of quantitative and semantic metrics. We further validate the utility of SOP explanations in model debugging and cosmological scientific discovery.
Poster
Nils Palumbo · Ravi Mangal · Zifan Wang · Saranya Vijayakumar · Corina Pasareanu · Somesh Jha
[ East Exhibition Hall A-B ]
Abstract
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a *mechanistic interpretation* itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem.
Poster
Gaurav Ghosal · Pratyush Maini · Aditi Raghunathan
[ East Exhibition Hall A-B ]
Abstract
Large language models are susceptible to memorizing repeated sequences, posing privacy and copyright concerns. A popular mitigation strategy is to remove memorized information from specific neurons post-hoc. However, such approaches have shown limited success so far. In a controlled setting, we show that the memorization of *natural* sequences (those that resemble linguistically plausible text) becomes *mechanistically entangled* with general language abilities, thereby becoming challenging to remove post-hoc. In this work, we put forward a new paradigm of MemSinks that promotes isolation of memorization by design. We leverage a sequence identifier to activate a unique set of memorization neurons for each sequence across repetitions. By analyzing the dynamics of learning and forgetting, we argue that MemSinks facilitates clean isolation of memorized content, making it easier to remove without compromising general language capabilities. We implement MemSinks at the billion-parameter and billion-token scale, and observe both effective isolation and strong generalization. To our knowledge, this is the first proof-of-concept on real data demonstrating that simultaneous generalization and isolation is achievable. We open-source our code at http://github.com/grghosal/MemSinks.
Poster
Evan Sidrow · Alexandre Bouchard-Côté · Lloyd Elliott
[ East Exhibition Hall A-B ]
Abstract
Bayesian phylogenetics is vital for understanding evolutionary dynamics, and requires accurate and efficient approximation of posterior distributions over trees. In this work, we develop a variational Bayesian approach for ultrametric phylogenetic trees. We present a novel variational family based on coalescent times of a single-linkage clustering and derive a closed-form density for the resulting distribution over trees. Unlike existing methods for ultrametric trees, our method performs inference over all of tree space, does not require any Markov chain Monte Carlo subroutines, and has a differentiable variational family. Through experiments on benchmark genomic datasets and an application to the viral RNA of SARS-CoV-2, we demonstrate that our method achieves competitive accuracy while requiring significantly fewer gradient evaluations than existing state-of-the-art techniques.
Poster
Harry Amad · Nicolás Astorga · Mihaela van der Schaar
[ East Exhibition Hall A-B ]
Abstract
Digital twins are models of real-world systems that can simulate their dynamics in response to potential actions. In complex settings, the state and action variables, and available data and knowledge relevant to a system can constantly change, requiring digital twins to continuously update with these changes to remain relevant. Current approaches struggle in this regard, as they require fixed, well-defined modelling environments, and they cannot adapt to novel variables without re-designs, or incorporate new information without re-training. To address this, we frame digital twinning as an in-context learning problem using large language models, enabling seamless updates to the twin at inference time. We develop CALM-DT, a Context-Adaptive Language Model-based Digital Twin that can accurately simulate across diverse state-action spaces using in-context learning alone by utilising fine-tuned encoders for sample retrieval. We empirically demonstrate CALM-DT's competitive performance with existing digital twin approaches, and its unique ability to adapt to changes in its modelling environment without parameter updates.
Poster
Carles Balsells-Rodas · Xavier Sumba · Tanmayee Narendra · Ruibo Tu · Gabriele Schweikert · Hedvig Kjellström · Yingzhen Li
[ East Exhibition Hall A-B ]
Abstract
Causal discovery, i.e., inferring underlying causal relationships from observational data, is highly challenging for AI systems. In a time series modeling context, traditional causal discovery methods mainly consider constrained scenarios with fully observed variables and/or data from stationary time-series. We develop a causal discovery approach to handle a wide class of nonstationary time series that are _conditionally stationary_, where the nonstationary behaviour is modeled as stationarity conditioned on a set of latent state variables. Named State-Dependent Causal Inference (SDCI), our approach is able to recover the underlying causal dependencies, with provable identifiability for the state-dependent causal structures. Empirical experiments on nonlinear particle interaction data and gene regulatory networks demonstrate SDCI's superior performance over baseline causal discovery methods. Improved results over non-causal RNNs on modeling NBA player movements demonstrate the potential of our method and motivate the use of causality-driven methods for forecasting.
Spotlight Poster
Jie Hu · Yi-Ting Ma · Do-Young Eun
[ East Exhibition Hall A-B ]
Abstract
We propose a *history-driven target (HDT)* framework in Markov Chain Monte Carlo (MCMC) to improve any random walk algorithm on discrete state spaces, such as general undirected graphs, for efficient sampling from target distribution $\boldsymbol{\mu}$. With broad applications in network science and distributed optimization, recent innovations like the self-repellent random walk (SRRW) achieve near-zero variance by prioritizing under-sampled states through transition kernel modifications based on past visit frequencies. However, SRRW's reliance on explicit computation of transition probabilities for all neighbors at each step introduces substantial computational overhead, while its strict dependence on time-reversible Markov chains excludes advanced non-reversible MCMC methods. To overcome these limitations, instead of directly modifying the transition kernel, HDT introduces a history-dependent target distribution $\boldsymbol{\pi}[\mathbf{x}]$ to replace the original target $\boldsymbol{\mu}$ in any graph sampler, where $\mathbf{x}$ represents the empirical measure of past visits. This design preserves lightweight implementation by requiring only local information between the current and proposed states, and achieves compatibility with both reversible and non-reversible MCMC samplers, while retaining unbiased samples with target distribution $\boldsymbol{\mu}$ and near-zero variance performance. Extensive experiments in graph sampling demonstrate consistent performance gains, and a memory-efficient Least Recently Used (LRU) cache ensures scalability to large general graphs.
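A toy Metropolis-Hastings sketch of the HDT idea on a finite state space: the sampler targets a history-dependent distribution $\pi[\mathbf{x}]$ built from the empirical visit measure, needing only information about the current and proposed states. The SRRW-style polynomial form of $\pi[\mathbf{x}]$ below is an illustrative assumption, not the paper's exact construction.

```python
import numpy as np

def hdt_metropolis(mu, P, steps=10000, alpha=1.0, seed=0):
    # mu: strictly positive target distribution (numpy array).
    # P: symmetric proposal matrix over states (rows sum to 1).
    rng = np.random.default_rng(seed)
    counts = np.ones(len(mu))                 # smoothed visit counts
    state, samples = 0, []
    for _ in range(steps):
        x = counts / counts.sum()             # empirical measure of past visits
        pi = mu * (x / mu) ** (-alpha)        # history-driven target (unnormalized)
        prop = rng.choice(len(mu), p=P[state])
        if rng.random() < min(1.0, pi[prop] / pi[state]):
            state = prop                      # under-visited states are favored
        counts[state] += 1
        samples.append(state)
    return np.array(samples)
```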
Poster
Aaron Havens · Benjamin Kurt Miller · Bing Yan · Carles Domingo i Enrich · Anuroop Sriram · Daniel S. Levine · Brandon Wood · Bin Hu · Brandon Amos · Brian Karrer · Xiang Fu · Guan-Horng Liu · Ricky T. Q. Chen
[ East Exhibition Hall A-B ]
Abstract
We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first on-policy approach that allows significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and shares the same theoretical guarantees as Adjoint Matching, being able to train without the need for corrective measures that push samples towards the target distribution. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both Cartesian and torsional coordinates. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage further research in developing highly scalable sampling methods, we plan to open-source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry. Code and benchmarks are provided at https://github.com/facebookresearch/adjoint_sampling.
Poster
Tiange Liu · Nikola Surjanovic · Miguel Biron-Lattes · Alexandre Bouchard-Côté · Trevor Campbell
[ East Exhibition Hall A-B ]
Abstract
Many common Markov chain Monte Carlo (MCMC) kernels can be formulated using a deterministic involutive proposal with a step size parameter. Selecting an appropriate step size is often a challenging task in practice, and for complex multiscale targets there may not be one choice of step size that works well globally. In this work, we address this problem with a novel class of involutive MCMC methods---AutoStep MCMC---that selects an appropriate step size at each iteration adapted to the local geometry of the target distribution. We prove that under mild conditions AutoStep MCMC is $\pi$-invariant, irreducible, and aperiodic, and obtain bounds on expected energy jump distance and cost per iteration. Empirical results examine the robustness and efficacy of our proposed step size selection procedure, and show that AutoStep MCMC is competitive with state-of-the-art methods in terms of effective sample size per unit cost on a range of challenging target distributions.
Poster
Honghua Zhang · Meihua Dang · Benjie Wang · Stefano Ermon · Nanyun Peng · Guy Van den Broeck
[ East Exhibition Hall A-B ]
Abstract
Probabilistic Circuits (PCs) are tractable representations of probability distributions allowing for exact and efficient computation of likelihoods and marginals. Recent advancements have improved the scalability of PCs either by leveraging their sparse properties or through the use of tensorized operations for better hardware utilization. However, no existing method fully exploits both aspects simultaneously. In this paper, we propose a novel sparse and structured parameterization for the sum blocks in PCs. By replacing dense matrices with sparse Monarch matrices, we significantly reduce the memory and computation costs, enabling unprecedented scaling of PCs. From a theory perspective, our construction arises naturally from circuit multiplication; from a practical perspective, compared to previous efforts on scaling up tractable probabilistic models, our approach not only achieves state-of-the-art generative modeling performance on challenging benchmarks like Text8, LM1B and ImageNet, but also demonstrates superior scaling behavior, achieving the same performance with substantially less compute as measured by the number of floating-point operations (FLOPs) during training.
Poster
Jihao Andreas Lin · Sebastian Ament · Maximilian Balandat · David Eriksson · Jose Miguel Hernandez-Lobato · Eytan Bakshy
[ East Exhibition Hall A-B ]
Abstract
Applying Gaussian processes (GPs) to very large datasets remains a challenge due to limited computational scalability. Matrix structures, such as the Kronecker product, can accelerate operations significantly, but their application commonly entails approximations or unrealistic assumptions. In particular, the most common path to creating a Kronecker-structured kernel matrix is by evaluating a product kernel on gridded inputs that can be expressed as a Cartesian product. However, this structure is lost if any observation is missing, breaking the Cartesian product structure, which frequently occurs in real-world data such as time series. To address this limitation, we propose leveraging latent Kronecker structure, by expressing the kernel matrix of observed values as the projection of a latent Kronecker product. In combination with iterative linear system solvers and pathwise conditioning, our method facilitates inference of exact GPs while requiring substantially fewer computational resources than standard iterative methods. We demonstrate that our method outperforms state-of-the-art sparse and variational GPs on real-world datasets with up to five million examples, including robotics, automated machine learning, and climate applications.
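The computational core of this construction can be sketched in a few lines: a matrix-vector product with $K_{\mathrm{obs}} = P\,(K_t \otimes K_s)\,P^\top$, where $P$ selects the observed entries of a full time-by-series grid, costs only two small dense multiplications via the vec trick. Names and the two-factor setup are illustrative assumptions.

```python
import numpy as np

def latent_kronecker_matvec(K_t, K_s, mask, v):
    # Multiplies v by P (K_t ⊗ K_s) P^T without forming the Kronecker
    # product. `mask` (n_t x n_s, boolean) marks observed grid entries.
    X = np.zeros(mask.shape)
    X[mask] = v                   # P^T v: scatter onto the latent grid
    Y = K_t @ X @ K_s.T           # (K_t ⊗ K_s) vec(X) = vec(K_t X K_s^T)
    return Y[mask]                # P: gather observed entries back
```

Such a matvec is the only access an iterative linear system solver (e.g., conjugate gradients) needs to the kernel matrix.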
Poster
Sidhanth Holalkere · David S Bindel · Silvia Sellán · Alexander Terenin
[ East Exhibition Hall A-B ]
Abstract
Poisson Surface Reconstruction is a widely-used algorithm for reconstructing a surface from an oriented point cloud. To facilitate applications where only partial surface information is available, or scanning is performed sequentially, a recent line of work proposes to incorporate uncertainty into the reconstructed surface via Gaussian process models. The resulting algorithms first perform Gaussian process interpolation, then solve a set of volumetric partial differential equations globally in space, resulting in a computationally expensive two-stage procedure. In this work, we apply recently-developed techniques from geometric Gaussian processes to combine interpolation and surface reconstruction into a single stage, requiring only one linear solve per sample. The resulting reconstructed surface samples can be queried locally in space, without the use of problem-dependent volumetric meshes or grids. These capabilities enable one to (a) perform probabilistic collision detection locally around the region of interest, (b) perform ray casting without evaluating points not on the ray's trajectory, and (c) perform next-view planning on a per-ray basis. They also do not require one to approximate kernel matrix inverses with diagonal matrices as part of intermediate computations, unlike prior methods. Results show that our approach provides a cleaner, more-principled, and more-flexible stochastic surface reconstruction pipeline.
Spotlight Poster
Xingyu Wu · Jibin Wu · Yu Zhou · Liang Feng · KC Tan
[ East Exhibition Hall A-B ]
Abstract
Algorithm selection aims to identify the optimal performing algorithm before execution. Existing techniques typically focus on the observed correlations between algorithm performance and meta-features. However, little research has explored the underlying mechanisms of algorithm selection, specifically what characteristics an algorithm must possess to effectively tackle problems with certain feature values. This gap not only limits explainability but also makes existing models vulnerable to data bias and distribution shift. This paper introduces a directed acyclic graph (DAG) to describe this mechanism, proposing a novel modeling paradigm that aligns more closely with the fundamental logic of algorithm selection. By leveraging the DAG to characterize the algorithm feature distribution conditioned on problem features, our approach enhances robustness against marginal distribution changes and allows for finer-grained predictions through the reconstruction of optimal algorithm features, with the final decision relying on differences between reconstructed and rejected algorithm features. Furthermore, we demonstrate that the learned DAG and the proposed counterfactual calculations endow our approach with both model-level and instance-level explainability.
Poster
Zequn Yang · Hongfa Wang · Di Hu
[ East Exhibition Hall A-B ]
Abstract
Interactions between modalities—redundancy, uniqueness, and synergy—collectively determine the composition of multimodal information. Understanding these interactions is crucial for analyzing information dynamics in multimodal systems, yet their accurate sample-level quantification presents significant theoretical and computational challenges. To address this, we introduce the Lightweight Sample-wise Multimodal Interaction (LSMI) estimator, rigorously grounded in pointwise information theory. We first develop a redundancy estimation framework, employing an appropriate pointwise information measure to quantify this most decomposable and measurable interaction. Building upon this, we propose a general interaction estimation method that employs efficient entropy estimation, specifically tailored for sample-wise estimation in continuous distributions. Extensive experiments on synthetic and real-world datasets validate LSMI's precision and efficiency. Crucially, our sample-wise approach reveals fine-grained sample- and category-level dynamics within multimodal data, enabling practical applications such as redundancy-informed sample partitioning, targeted knowledge distillation, and interaction-aware model ensembling. The code is available at https://github.com/GeWu-Lab/LSMI_Estimator.
Poster
Lijie Hu · Chenyang Ren · Zhengyu Hu · Hongbin Lin · Chenglong Wang · Zhen Tan · Weimin Lyu · Jingfeng ZHANG · Hui Xiong · Di Wang
[ East Exhibition Hall A-B ]
Abstract
Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on cases where the data, including concepts, are clean. In many scenarios, we often need to remove or insert some training data or new concepts from trained CBMs for different reasons, such as privacy concerns, data mislabelling, spurious concepts, and concept annotation errors. Thus, the challenge of deriving efficient editable CBMs without retraining from scratch persists, particularly in large-scale applications. To address these challenges, we propose Editable Concept Bottleneck Models (ECBMs). Specifically, ECBMs support three different levels of data removal: concept-label-level, concept-level, and data-level. ECBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for re-training. Experimental results demonstrate the efficiency and effectiveness of our ECBMs, affirming their adaptability within the realm of CBMs.
Poster
Hanlin Yu · Arto Klami · Aapo Hyvarinen · Anna Korba · Lemir Omar Chehab
[ East Exhibition Hall A-B ]
Abstract
Density ratio estimation in high dimensions can be reframed as integrating a certain quantity, the time score, over probability paths which interpolate between the two densities. In practice, the time score has to be estimated based on samples from the two densities. However, existing methods for this problem remain computationally expensive and can yield inaccurate estimates. Inspired by recent advances in generative modeling, we introduce a novel framework for time score estimation, based on a conditioning variable. Choosing the conditioning variable judiciously enables a closed-form objective function. We demonstrate that, compared to previous approaches, our approach results in faster learning of the time score and competitive or better estimation accuracies of the density ratio on challenging tasks. Furthermore, we establish theoretical guarantees on the error of the estimated density ratio.
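As a worked illustration of the core identity: along any path $p_t$ interpolating between the densities $p_0$ and $p_1$, the log density ratio is the integral of the time score, $\log(p_1(x)/p_0(x)) = \int_0^1 \partial_t \log p_t(x)\,dt$. The sketch below checks this numerically for two 1-D Gaussians, where a linear Gaussian path has an analytic time score; the path and all names are illustrative and do not reproduce the paper's estimator.

```python
import numpy as np

# Two 1-D Gaussians and a linear Gaussian path p_t = N(mu_t, sig_t^2),
# with mu_t = (1-t)*m0 + t*m1 and sig_t = (1-t)*s0 + t*s1.
m0, s0, m1, s1 = 0.0, 1.0, 2.0, 0.5

def log_p(x, m, s):
    return -0.5 * np.log(2 * np.pi) - np.log(s) - (x - m) ** 2 / (2 * s ** 2)

def time_score(x, t):
    # d/dt log p_t(x), computed analytically for the Gaussian path
    mu, dmu = (1 - t) * m0 + t * m1, m1 - m0
    sig, dsig = (1 - t) * s0 + t * s1, s1 - s0
    return -dsig / sig + (x - mu) * dmu / sig**2 + (x - mu) ** 2 * dsig / sig**3

x = 1.3
ts = np.linspace(0.0, 1.0, 1001)
integral = np.trapz([time_score(x, t) for t in ts], ts)
direct = log_p(x, m1, s1) - log_p(x, m0, s0)
print(integral, direct)  # the two values agree up to quadrature error
```

In practice the time score is unknown and must be learned from samples, which is where the proposed conditioning-variable objective enters.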
Poster
Kyle Heuton · Frederick Muench · Shikhar Shrestha · Thomas J Stopka · Michael Hughes
[ East Exhibition Hall A-B ]
Abstract
Optimal allocation of scarce resources is a common problem for decision makers faced with choosing a limited number of locations for intervention. Spatiotemporal prediction models could make such decisions data-driven. A recent performance metric called fraction of best possible reach (BPR) measures the impact of using a model's recommended size-K subset of sites compared to the best possible top-K in hindsight. We tackle two open problems related to BPR. First, we explore *how to rank* all sites numerically given a probabilistic model that predicts event counts jointly across sites. Ranking via the per-site mean is suboptimal for BPR. Instead, we offer a better ranking for BPR backed by decision theory. Second, we explore *how to train* a probabilistic model's parameters to maximize BPR. Discrete selection of K sites implies all-zero parameter gradients, which prevents standard gradient training. We overcome this barrier via advances in perturbed optimizers. We further suggest a training objective that combines likelihood with a BPR constraint to deliver high-quality top-K rankings as well as good forecasts for all sites. We demonstrate our approach on two where-to-intervene applications: mitigating opioid-related fatal overdoses for public health and monitoring endangered wildlife.
Poster
Spencer Young · Porter Jenkins · Longchao Da · Jeffrey Dotson · Hua Wei
[ East Exhibition Hall A-B ]
Abstract
Neural networks capable of accurate, input-conditional uncertainty representation are essential for real-world AI systems. Deep ensembles of Gaussian networks have proven highly effective for continuous regression due to their ability to flexibly represent aleatoric uncertainty via unrestricted heteroscedastic variance, which in turn enables accurate epistemic uncertainty estimation. However, no analogous approach exists for $\textit{count}$ regression, despite many important applications. To address this gap, we propose the Deep Double Poisson Network (DDPN), a novel neural discrete count regression model that outputs the parameters of the Double Poisson distribution, enabling arbitrarily high or low predictive aleatoric uncertainty for count data and improving epistemic uncertainty estimation when ensembled. We formalize and prove that DDPN exhibits robust regression properties similar to heteroscedastic Gaussian models via learnable loss attenuation, and introduce a simple loss modification to control this behavior. Experiments on diverse datasets demonstrate that DDPN outperforms current baselines in accuracy, calibration, and out-of-distribution detection, establishing a new state-of-the-art in deep count regression.
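For concreteness, here is a minimal sketch of the Double Poisson negative log-likelihood (Efron, 1986) that such a count-regression head would minimize, written up to the distribution's normalizing constant $c(\mu, \varphi) \approx 1$; the parameter names and numerical guards are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np
from scipy.special import gammaln

def double_poisson_nll(y, mu, phi, eps=1e-8):
    # Efron's Double Poisson: f(y) ~ phi^(1/2) e^(-phi*mu) (e^-y y^y / y!) (e*mu/y)^(phi*y),
    # up to a normalizing constant close to 1. mu > 0 is the mean-like parameter;
    # phi < 1 gives over-dispersion, phi > 1 under-dispersion.
    y = np.asarray(y, dtype=float)
    logy = np.where(y > 0, np.log(y + eps), 0.0)   # handles log(0) terms that vanish
    log_f = (0.5 * np.log(phi) - phi * mu
             - y + y * logy - gammaln(y + 1)
             + phi * y * (1.0 + np.log(mu + eps) - logy))
    return -log_f

# e.g., an over-dispersed count observation
print(double_poisson_nll(y=7, mu=4.0, phi=0.5))
```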
Poster
Jacopo Talpini · Marco Savi · Giovanni Neglia
[ East Exhibition Hall A-B ]
Abstract
One-Shot Federated Learning (FL) is a recent paradigm that enables multiple clients to cooperatively learn a global model in a single round of communication with a central server. In this paper, we analyze the One-Shot FL problem through the lens of Bayesian inference and propose FedBEns, an algorithm that leverages the inherent multimodality of local loss functions to find better global models. Our algorithm employs a mixture of Laplace approximations for the clients' local posteriors, which the server then aggregates to infer the global model. We conduct extensive experiments on various datasets, demonstrating that the proposed method outperforms competing baselines that typically rely on unimodal approximations of the local losses.
Poster
Alan Amin · Andres Potapczynski · Andrew Wilson
[ East Exhibition Hall A-B ]
Abstract
To understand how genetic variants in human genomes manifest in phenotypes (traits like height or diseases like asthma), geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems: because variants in the genome are correlated with nearby variants, training requires inverting large matrices. Previous methods have therefore been restricted to fitting small models, and to fitting simplified summary statistics rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Surprisingly, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make …
Spotlight Poster
Nuojin Cheng · Leonard Papenmeier · Stephen Becker · Luigi Nardi
[ East Exhibition Hall A-B ]
Abstract
Bayesian optimization is a widely used method for optimizing expensive black-box functions, with Expected Improvement (EI) being one of the most commonly used acquisition functions. In contrast, information-theoretic acquisition functions aim to reduce uncertainty about the function's optimum and are often considered fundamentally distinct from EI. In this work, we challenge this prevailing perspective by introducing a unified theoretical framework, Variational Entropy Search, which reveals that EI and information-theoretic acquisition functions are more closely related than previously recognized. We demonstrate that EI can be interpreted as a variational inference approximation of the popular information-theoretic acquisition function Max-value Entropy Search (MES). Building on this insight, we propose VES-Gamma, a novel acquisition function that balances the strengths of EI and MES. Extensive empirical evaluations across both low- and high-dimensional synthetic and real-world benchmarks demonstrate that VES-Gamma is competitive with state-of-the-art acquisition functions and in many cases outperforms EI and MES.
Poster
Nick Bishop · Daniel Jarne Ornia · Joel Dyer · Anisoara Calinescu · Michael Wooldridge
[ East Exhibition Hall A-B ]
Abstract
Simulation modeling offers a flexible approach to constructing high-fidelity synthetic representations of complex real-world systems. However, the increased complexity of such models introduces additional complications, for example when carrying out statistical inference procedures. This has motivated a large and growing literature on *likelihood-free* or *simulation-based* inference methods, which approximate (e.g., Bayesian) inference without assuming access to the simulator's intractable likelihood function. A hitherto neglected problem in the simulation-based Bayesian inference literature is the challenge of constructing minimally informative *reference priors* for complex simulation models. Such priors maximise an expected Kullback-Leibler divergence from the prior to the posterior, thereby influencing posterior inferences minimally and enabling an ``objective'' approach to Bayesian inference that does not necessitate the incorporation of strong subjective prior beliefs. In this paper, we propose and test a selection of likelihood-free methods for learning reference priors for simulation models, using variational approximations to these priors and a variety of mutual information estimators. Our experiments demonstrate that good approximations to reference priors for simulation models are in this way attainable, providing a first step towards the development of likelihood-free objective Bayesian inference procedures.
Poster
Tam Le · Truyen Nguyen · Hideitsu Hino · Kenji Fukumizu
[ East Exhibition Hall A-B ]
Abstract
We investigate the Sobolev IPM problem for probability measures supported on a graph metric space. Sobolev IPM is an important instance of integral probability metrics (IPM), obtained by constraining the critic function within a unit ball defined by the Sobolev norm. In particular, it has been used to compare probability measures and is crucial for several theoretical works in machine learning. However, to our knowledge, there are no efficient algorithmic approaches to compute Sobolev IPM effectively, which hinders its practical applications. In this work, we establish a relation between the Sobolev norm and a weighted $L^p$-norm, and leverage it to propose a *novel regularization* for Sobolev IPM. By exploiting the graph structure, we demonstrate that the regularized Sobolev IPM provides a *closed-form* expression for fast computation. This advancement addresses long-standing computational challenges, and paves the way to apply Sobolev IPM in practical applications, even in large-scale settings. Additionally, the regularized Sobolev IPM is negative definite. Utilizing this property, we design positive-definite kernels upon the regularized Sobolev IPM, and provide preliminary evidence of their advantages for comparing probability measures on a given graph for document classification and topological data analysis.
Poster
Batiste Le Bars · Pierre Humbert
[ East Exhibition Hall A-B ]
Abstract
We study the question of volume optimality in split conformal regression, a topic still poorly understood in comparison to coverage control. Using the fact that the calibration step can be seen as an empirical volume minimization problem, we first derive a finite-sample upper-bound on the excess volume loss of the interval returned by the classical split method. This important quantity measures the difference in length between the interval obtained with the split method and the shortest oracle prediction interval. Then, we introduce *EffOrt*, a methodology that modifies the learning step so that the base prediction function is selected in order to minimize the length of the returned intervals. In particular, our theoretical analysis of the excess volume loss of the prediction sets produced by *EffOrt* reveals the links between the learning and calibration steps, and notably the impact of the choice of the function class of the base predictor. We also introduce *Ad-EffOrt*, an extension of the previous method, which produces intervals whose size adapts to the value of the covariate. Finally, we evaluate the empirical performance and the robustness of our methodologies.
Poster
Qilin Liao · Shuo Yang · Bo Zhao · Ping Luo · Hengshuang Zhao
[ East Exhibition Hall A-B ]
Abstract
Harnessing the power of diffusion models to synthesize auxiliary training data based on latent space features has proven effective in enhancing out-of-distribution (OOD) detection performance. However, extracting effective features outside the in-distribution (ID) boundary in latent space remains challenging due to the difficulty of identifying decision boundaries between classes. This paper proposes a novel framework called Boundary-based Out-Of-Distribution data generation (BOOD), which synthesizes high-quality OOD features and generates human-compatible outlier images using diffusion models. BOOD first learns a text-conditioned latent feature space from the ID dataset, selects ID features closest to the decision boundary, and perturbs them to cross the decision boundary to form OOD features. These synthetic OOD features are then decoded into images in pixel space by a diffusion model. Compared to previous works, BOOD provides a more training-efficient strategy for synthesizing informative OOD features, facilitating clearer distinctions between ID and OOD data. Extensive experimental results on common benchmarks demonstrate that BOOD surpasses the state-of-the-art method significantly, achieving a 29.64\% decrease in average FPR95 (40.31\% vs. 10.67\%) and a 7.27\% improvement in average AUROC (90.15\% vs. 97.42\%) on the CIFAR-100 dataset.
Poster
Jaehyun Kwak · Izaaz Inhar · Se-Young Yun · Sung-Ju Lee
[ East Exhibition Hall A-B ]
Abstract
Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods focus only on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employ contrastive learning, which treats the target image as positive and all other images in the batch as negatives, and can therefore inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, to effectively filter false negatives. To evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset. The source code is available at https://github.com/jackwaky/QuRe.
Poster
Adam Breuer
[ East Exhibition Hall A-B ]
Abstract
In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result by showing a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity)---exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.
Poster
Lin-Han Jia · Wen-Chao Hu · Jie-Jing Shao · Lan-Zhe Guo · Yu-Feng Li
[ East Exhibition Hall A-B ]
Abstract
The current Neuro-Symbolic (NeSy) learning paradigm suffers from an over-reliance on labeled data; completely disregarding labels leads to less symbol information, a larger solution space, and more shortcuts, issues that current NeSy systems cannot resolve. This paper introduces a novel learning paradigm, Verification Learning (VL), which addresses this challenge by transforming the label-based reasoning process in NeSy into a label-free verification process. VL achieves strong learning results relying solely on unlabeled data and a function that verifies whether the current predictions conform to the rules. We formalize this problem as a Constraint Optimization Problem (COP) and propose a Dynamic Combinatorial Sorting (DCS) algorithm that accelerates the solution by reducing verification attempts, effectively lowering computational costs, and we introduce a prior alignment method to address potential shortcuts. Our theoretical analysis identifies which tasks in NeSy systems can be completed without labels and explains why rules can replace infinite labels for some tasks, while for others the rules have no effect. We validate the proposed framework on several fully unsupervised tasks, including addition, sort, match, and chess, each showing significant performance and efficiency improvements.
Poster
Yaxin Hou · Yuheng Jia
[ East Exhibition Hall A-B ]
Abstract
This paper studies long-tailed semi-supervised learning (LTSSL) with distribution mismatch, where the class distribution of the labeled training data follows a long-tailed distribution and mismatches with that of the unlabeled training data. Most existing methods introduce auxiliary classifiers (experts) to model various unlabeled data distributions and produce pseudo-labels, but the expertise of the various experts is not fully utilized. We observe that different experts are good at predicting different intervals of samples; e.g., the long-tailed expert is skilled at samples located in the head interval, and the uniform expert excels at samples located in the medium interval. Therefore, we propose a dynamic expert assignment module that can estimate the class membership (i.e., head, medium, or tail class) of samples, and dynamically assigns a suitable expert to each sample based on the estimated membership to produce high-quality pseudo-labels in the training phase and predictions in the testing phase. We also theoretically reveal that integrating different experts' strengths leads to a smaller generalization error bound. Moreover, we find that deeper features are more biased toward the head class but more discriminative, while shallower features are less biased but also less discriminative. We, therefore, propose a multi-depth feature …
Spotlight Poster
Josh Givens · Song Liu · Henry Reeve
[ East Exhibition Hall A-B ]
Abstract
Score matching is a vital tool for learning the distribution of data with applications across many areas including diffusion processes, energy-based modelling, and graphical model estimation. Despite all these applications, little work explores its use when data is incomplete. We address this by adapting score matching (and its major extensions) to work with missing data in a flexible setting where data can be partially missing over any subset of the coordinates. We provide two score matching variations for general use: an importance weighting (IW) approach and a variational approach. We provide finite-sample bounds for our IW approach in finite domain settings and show it to have especially strong performance in small-sample, lower-dimensional cases. Complementing this, we show our variational approach to be strongest in more complex high-dimensional settings, which we demonstrate on graphical model estimation tasks on both real and simulated data.
Poster
Weihua Du · Yiming Yang · Sean Welleck
[ East Exhibition Hall A-B ]
Abstract
Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near-)optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance.
Poster
Yunbei Zhang · Akshay Mehra · Shuaicheng Niu · Jihun Hamm
[ East Exhibition Hall A-B ]
Abstract
Continual Test-Time Adaptation (CTTA) seeks to adapt source pre-trained models to continually changing, unseen target domains. While existing CTTA methods assume structured domain changes with uniform durations, real-world environments often exhibit dynamic patterns where domains recur with varying frequencies and durations. Current approaches, which adapt the same parameters across different domains, struggle in such dynamic conditions—they face convergence issues with brief domain exposures, risk forgetting previously learned knowledge, or misapplying it to irrelevant domains. To remedy this, we propose **DPCore**, a method designed for robust performance across diverse domain change patterns while ensuring computational efficiency. DPCore integrates three key components: Visual Prompt Adaptation for efficient domain alignment, a Prompt Coreset for knowledge preservation, and a Dynamic Update mechanism that intelligently adjusts existing prompts for similar domains while creating new ones for substantially different domains. Extensive experiments on four benchmarks demonstrate that DPCore consistently outperforms various CTTA methods, achieving state-of-the-art performance in both structured and dynamic settings while reducing trainable parameters by 99% and computation time by 64% compared to previous approaches.
Poster
Jieting Wang · ZhangZelong Zhang · Feijiang Li · Yuhua Qian · Xinyan Liang
[ East Exhibition Hall A-B ]
Abstract
Deep learning excels at capturing complex data representations, yet quantifying the discriminative quality of these representations remains challenging. While unsupervised metrics often assess pairwise sample similarity, classification tasks fundamentally require class-level discrimination. To bridge this gap, we propose a novel loss function that evaluates representation discriminability via the Euclidean distance between the learned similarity matrix and the true class adjacency matrix. We identify random consistency, an inherent bias in Euclidean distance metrics, as a key obstacle to reliable evaluation, affecting both fairness and discrimination. To address this, we derive the expected Euclidean distance under uniformly distributed label permutations and introduce its closed-form solution, the Pure Square Euclidean Distance (PSED), which provably eliminates random consistency. Theoretically, we demonstrate that PSED satisfies heterogeneity and unbiasedness guarantees, and establish its generalization bound via the exponential Orlicz norm, confirming its statistical learnability. Empirically, our method surpasses conventional loss functions across multiple benchmarks, achieving significant improvements in accuracy, $F_1$ score, and class-structure differentiation. (Code is published at https://github.com/FeijiangLi/ICML2025-PSED)
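For intuition about the correction, the sketch below debiases the raw squared Euclidean distance between a learned similarity matrix S and the class adjacency matrix A by a Monte Carlo estimate of its expectation under uniform label permutations; the paper's PSED replaces this sampling with a closed-form expectation, so treat this only as an illustrative check.

```python
import numpy as np

rng = np.random.default_rng(0)

def adjacency(labels):
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def psed_mc(S, labels, n_perm=2000):
    # raw distance minus its mean under random label permutations
    # (Monte Carlo stand-in for the paper's closed-form expectation)
    A = adjacency(labels)
    raw = np.sum((S - A) ** 2)
    perms = [np.sum((S - adjacency(np.asarray(labels)[rng.permutation(len(labels))])) ** 2)
             for _ in range(n_perm)]
    return raw - np.mean(perms)

labels = [0, 0, 1, 1, 2]
S = adjacency(labels) + 0.1 * rng.standard_normal((5, 5))
print(psed_mc(S, labels))  # negative: S matches A far better than chance
```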
Spotlight Poster
Xinyan Liang · Ruijie Sang · Yuhua Qian · Qian Guo · Feijiang Li · Liang Du
[ East Exhibition Hall A-B ]
Abstract
Automatic Modulation Classification (AMC) serves as a foundational pillar for cognitive radio systems, enabling critical functionalities including dynamic spectrum allocation, non-cooperative signal surveillance, and adaptive waveform optimization. However, practical deployment of AMC faces a fundamental challenge: prediction ambiguity arising from intrinsic similarity among modulation schemes, exacerbated under low signal-to-noise ratio (SNR) conditions. This phenomenon manifests as near-identical probability distributions across confusable modulation types, significantly degrading classification reliability. To address this, we propose Fuzzy Regularization-enhanced AMC (FR-AMC), a novel framework that integrates uncertainty quantification into the classification pipeline. The proposed FR has three features: (1) it explicitly models prediction ambiguity during backpropagation; (2) it dynamically reweights samples through adaptive loss scaling; and (3) it encourages margin maximization between confusable modulation clusters. Experimental results on benchmark datasets demonstrate that FR achieves superior classification accuracy and robustness compared to competing methods, making it a promising solution for real-world spectrum management and communication applications.
Spotlight Poster
David Fleischer · David A Stephens · Archer Yang
[ East Exhibition Hall A-B ]
Abstract
We propose a computationally efficient alternative to generalized random forests (GRFs) for estimating heterogeneous effects in large dimensions. While GRFs rely on a gradient-based splitting criterion, which in large dimensions is computationally expensive and unstable, our method introduces a fixed-point approximation that eliminates the need for Jacobian estimation. This gradient-free approach preserves GRF's theoretical guarantees of consistency and asymptotic normality while significantly improving computational efficiency. We demonstrate that our method runs several times faster than standard GRFs without compromising statistical accuracy. Experiments on both simulated and real-world data validate our approach. Our findings suggest that the proposed method is a scalable alternative for localized effect estimation in machine learning and causal inference applications.
Poster
Liangchen Liu · Nannan Wang · Xi Yang · Xinbo Gao · Tongliang Liu
[ East Exhibition Hall A-B ]
Abstract
Prompt learning is a cutting-edge parameter-efficient fine-tuning technique for pre-trained vision-language models (VLMs). Instead of learning a single text prompt, recent works have revealed that learning diverse text prompts can effectively boost performance on downstream tasks, as the diverse prompted text features can comprehensively depict the visual concepts from different perspectives. However, diverse prompt learning demands enormous computational resources, an efficiency issue that remains unexplored. To achieve efficient and diverse prompt learning, this paper proposes a novel \textbf{Surrogate Prompt Learning (SurPL)} framework. Instead of learning diverse text prompts, SurPL directly generates the desired prompted text features via a lightweight \textbf{Surrogate Feature Generator (SFG)}, thereby avoiding the complex gradient computation procedure of conventional diverse prompt learning. Concretely, based on a basic prompted text feature, the SFG can directly and efficiently generate diverse prompted features according to different pre-defined conditional signals. Extensive experiments indicate the effectiveness of the surrogate prompted text features, and demonstrate the compelling performance and efficiency of SurPL on various benchmarks.
Spotlight Poster
Amber Yijia Zheng · Cedar Site Bai · Brian Bullins · Raymond A. Yeh
[ East Exhibition Hall A-B ]
Abstract
Model immunization aims to pre-train models that are difficult to fine-tune on harmful tasks while retaining their utility on other non-harmful tasks. Though prior work has shown empirical evidence for immunizing text-to-image models, the key understanding of when immunization is possible and a precise definition of an immunized model remain unclear. In this work, we propose a framework, based on the condition number of a Hessian matrix, to analyze model immunization for linear models. Building on this framework, we design an algorithm with regularization terms to control the resulting condition numbers after pre-training. Empirical results on linear models and non-linear deep-nets demonstrate the effectiveness of the proposed algorithm on model immunization. The code is available at https://github.com/amberyzheng/model-immunization-cond-num.
Poster
Guy Hacohen · Tinne Tuytelaars
[ East Exhibition Hall A-B ]
Abstract
Catastrophic forgetting -- the tendency of neural networks to forget previously learned data when learning new information -- remains a central challenge in continual learning. In this work, we adopt a behavioral approach, observing a connection between learning speed and forgetting: examples learned more quickly are less prone to forgetting. Focusing on replay-based continual learning, we show that the composition of the replay buffer -- specifically, whether it contains quickly or slowly learned examples -- has a significant effect on forgetting. Motivated by this insight, we introduce Speed-Based Sampling (SBS), a simple yet general strategy that selects replay examples based on their learning speed. SBS integrates easily into existing buffer-based methods and improves performance across a wide range of competitive continual learning benchmarks, advancing state-of-the-art results. Our findings underscore the value of accounting for the forgetting dynamics when designing continual learning algorithms.
Poster
Hyo Seo Kim · Dongyoon Han · Junsuk Choe
[ East Exhibition Hall A-B ]
Abstract
Machine unlearning aims to selectively remove specific knowledge from a trained model. Existing approaches, such as Task Arithmetic, fine-tune the model on the forget set to create a task vector (i.e., a direction in weight space) to subtract from the original model's weights. However, their effectiveness is highly sensitive to hyperparameter selection, requiring extensive validation to identify the optimal vector from many fine-tuned candidates. In this paper, we propose a novel method that utilizes all fine-tuned models trained with varying hyperparameters instead of a single selection. Specifically, we aggregate the computed task vectors by retaining only the elements with consistent shared signs. The merged task vector is then negated to induce unlearning on the original model. Evaluations on zero-shot and standard image recognition tasks across twelve datasets and four backbone architectures show that our approach outperforms state-of-the-art methods while requiring similar or fewer computational resources. Code is available at https://github.com/naver-ai/negmerge.
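A minimal sketch of the sign-consistent merging step described above, assuming elementwise mean aggregation over the surviving entries (the aggregation rule and flattened-parameter view are simplifications; see the released code for the actual procedure).

```python
import numpy as np

def negmerge_style_unlearn(task_vectors, weights_orig):
    # keep only coordinates whose sign agrees across all task vectors,
    # merge them, then apply the negated merged vector to the weights
    T = np.stack(task_vectors)                       # (n_models, n_params)
    consistent = np.all(T > 0, axis=0) | np.all(T < 0, axis=0)
    merged = np.where(consistent, T.mean(axis=0), 0.0)
    return weights_orig - merged

# e.g., three task vectors from fine-tuning runs with different hyperparameters
tvs = [np.array([0.3, -0.2, 0.1]),
       np.array([0.5, -0.1, -0.2]),
       np.array([0.2, -0.4, 0.3])]
print(negmerge_style_unlearn(tvs, weights_orig=np.zeros(3)))
# only coordinates 0 and 1 have consistent signs; coordinate 2 is zeroed out
```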
Poster
Shanda Li · Shinjae Yoo · Yiming Yang
[ East Exhibition Hall A-B ]
Abstract
Fourier Neural Operators (FNOs) offer a principled approach for solving complex partial differential equations (PDEs). However, scaling them to handle more complex PDEs requires increasing the number of Fourier modes, which significantly expands the number of model parameters and makes hyperparameter tuning computationally impractical. To address this, we introduce $\mu$**Transfer-FNO**, a zero-shot hyperparameter transfer technique that enables optimal configurations, tuned on smaller FNOs, to be directly applied to billion-parameter FNOs _without_ additional tuning. Building on the Maximum Update Parametrization ($\mu$P) framework, we mathematically derive a parametrization scheme that facilitates the transfer of optimal hyperparameters across models with different numbers of Fourier modes in FNOs, which is validated through extensive experiments on various PDEs. Our empirical study shows that $\mu$Transfer-FNO reduces computational cost for tuning hyperparameters on large FNOs while maintaining or improving accuracy.
Poster
Ron Tsibulsky · Daniel Nevo · Uri Shalit
[ East Exhibition Hall A-B ]
Abstract
Despite the impressive advancements in modern machine learning, achieving robustness in Domain Generalization (DG) tasks remains a significant challenge. In DG, models are expected to perform well on samples from unseen test distributions (also called domains), by learning from multiple related training distributions. Most existing approaches to this problem rely on single-valued predictions, which inherently limit their robustness. We argue that set-valued predictors could be leveraged to enhance robustness across unseen domains, while also taking into account that these sets should be as small as possible. We introduce a theoretical framework defining successful set prediction in the DG setting, focusing on meeting a predefined performance criterion across as many domains as possible, and provide theoretical insights into the conditions under which such domain generalization is achievable. We further propose a practical optimization method compatible with modern learning architectures, that balances robust performance on unseen domains with small prediction set sizes. We evaluate our approach on several real-world datasets from the WILDS benchmark, demonstrating its potential as a promising direction for robust domain generalization.
Poster
Benjamin Leblanc · Mathieu Bazinet · Nathaniel D'Amours · Alexandre Drouin · Pascal Germain
[ East Exhibition Hall A-B ]
Abstract
Both the PAC-Bayesian and Sample Compress learning frameworks have proven instrumental for deriving tight (non-vacuous) generalization bounds for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid of both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.
Poster
Junbiao Cui · Qin Yue · Jianqing Liang · Jiye Liang
[ East Exhibition Hall A-B ]
Abstract
Classification is a cornerstone of machine learning research. Most existing classifiers assume that the concepts corresponding to classes can be precisely defined. This notion diverges from the widely accepted understanding in cognitive science, which posits that real-world concepts are often inherently ambiguous. To bridge this gap, we propose a Human Cognition-Inspired Hierarchical Fuzzy Learning Machine (HC-HFLM), which leverages a novel hierarchical alignment loss to integrate rich class knowledge from the human knowledge system into the learning process. We further theoretically prove that minimizing this loss aligns the hierarchical structure derived from data with that contained in the class knowledge, resulting in clear semantics and high interpretability. Systematic experiments verify that the proposed method achieves significant gains in interpretability and generalization performance.
Spotlight Poster
Xinyan Liang · Shijie Wang · Yuhua Qian · Qian Guo · Liang Du · Bingbing Jiang · Tingjin Luo · Feijiang Li
[ East Exhibition Hall A-B ]
Abstract
Multi-view classification (MVC) based on the Dempster-Shafer theory has gained significant recognition for its reliability in safety-critical applications. However, existing methods predominantly focus on providing confidence levels for decision outcomes without explaining the reasoning behind these decisions. Moreover, the reliance on first-order statistical magnitudes of belief masses often inadequately captures the intrinsic uncertainty within the evidence. To address these limitations, we propose a novel framework termed Trusted Multi-view Classification Constrained with Expert Knowledge (TMCEK). TMCEK integrates expert knowledge to enhance feature-level interpretability and introduces a distribution-aware subjective opinion mechanism to derive more reliable and realistic confidence estimates. The theoretical superiority of the proposed uncertainty measure over conventional approaches is rigorously established. Extensive experiments conducted on three multi-view datasets for sleep stage classification demonstrate that TMCEK achieves state-of-the-art performance while offering interpretability at both the feature and decision levels. These results position TMCEK as a robust and interpretable solution for MVC in safety-critical domains. The code is available at https://github.com/jie019/TMCEK_ICML2025.
Poster
Tian Bai · Yue Zhao · Xiang Yu · Archer Yang
[ East Exhibition Hall A-B ]
Abstract
Selecting high-quality candidates from large datasets is critical in applications such as drug discovery, precision medicine, and alignment of large language models (LLMs). While Conformal Selection (CS) provides rigorous uncertainty quantification, it is limited to univariate responses and scalar criteria. To address this, we propose Multivariate Conformal Selection (mCS), a generalization of CS designed for multivariate response settings. Our method introduces regional monotonicity and employs multivariate nonconformity scores to construct conformal $p$-values, enabling finite-sample False Discovery Rate (FDR) control. We present two variants: $\texttt{mCS-dist}$, using distance-based scores, and $\texttt{mCS-learn}$, which learns optimal scores via differentiable optimization. Experiments on simulated and real-world datasets demonstrate that mCS significantly improves selection power while maintaining FDR control, establishing it as a robust framework for multivariate selection tasks.
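As background, a univariate sketch in the spirit of conformal selection: conformal p-values are computed against a calibration set of known unqualified ('null') candidates and passed to Benjamini-Hochberg for FDR control; mCS generalizes the scalar score below to multivariate nonconformity scores. The higher-is-better score convention and the toy data are assumptions of this sketch.

```python
import numpy as np

def conformal_select(null_scores, test_scores, alpha=0.1):
    # conformal p-value: rank of each test score among null calibration scores
    n = len(null_scores)
    p = np.array([(1 + np.sum(null_scores >= s)) / (n + 1) for s in test_scores])
    # Benjamini-Hochberg step-up over the m test p-values
    m = len(p)
    order = np.argsort(p)
    passed = np.where(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    return np.sort(order[: passed[-1] + 1]) if len(passed) else np.array([], int)

rng = np.random.default_rng(0)
nulls = rng.normal(0.0, 1.0, 500)                 # scores of unqualified items
tests = np.r_[rng.normal(0.0, 1.0, 80), rng.normal(3.0, 1.0, 20)]
print(conformal_select(nulls, tests))             # mostly indices 80..99
```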
Poster
Nitin Bisht · Xiuwen Gong · Guandong Xu
[ East Exhibition Hall A-B ]
Abstract
Although Recommender Systems (RS) have been well developed for various fields of application, they often suffer from a crisis of platform credibility with respect to RS confidence and fairness, which may drive users away, threatening the platform's long-term success. In recent years, some works have tried to solve these issues; however, they lack strong statistical guarantees. There is therefore an urgent need to address both issues within a unifying framework with robust statistical guarantees. In this paper, we propose a novel and reliable framework called Equitable and Statistically Unbiased Recommendation (ENSUR) to dynamically generate prediction sets for users across various groups, which are guaranteed 1) to include ground-truth items with user-predefined high confidence/probability (e.g., 90\%); 2) to ensure user fairness across different groups; and 3) to have minimal average prediction set sizes. We further design an efficient algorithm named Guaranteed User Fairness Algorithm (GUFA) to optimize the proposed method and derive upper bounds on the risk and fairness metrics to speed up the optimization process. Moreover, we provide rigorous theoretical analysis concerning risk and fairness control and minimum set size. Extensive experiments validate the effectiveness of the proposed framework, which aligns with our theoretical analysis.
Poster
Sicong Li · Qianqian Xu · Zhiyong Yang · Zitai Wang · Linchao Zhang · Xiaochun Cao · Qingming Huang
[ East Exhibition Hall A-B ]
Abstract
Real-world datasets often follow a long-tailed distribution, making generalization to tail classes difficult. Recent methods resorted to long-tail variants of Sharpness-Aware Minimization (SAM), such as ImbSAM and CC-SAM, to improve generalization by flattening the loss landscape. However, these attempts face a trade-off between computational efficiency and control over the loss landscape. On the one hand, ImbSAM is efficient but offers only coarse control, as it excludes head classes from the SAM process. On the other hand, CC-SAM provides fine-grained control through class-dependent perturbations, but at the cost of efficiency due to multiple backpropagations. Faced with this dilemma, we introduce Focal-SAM, which assigns different penalties to class-wise sharpness, achieving fine-grained control without extra backpropagations and thus maintaining efficiency. Furthermore, we theoretically analyze Focal-SAM's generalization ability and derive a sharper generalization bound. Extensive experiments on both traditional and foundation models validate the effectiveness of Focal-SAM.
Poster
Qiyu Zhong · Yi Shan · Haobo Wang · Zhen Yang · Gengyu Lyu
[ East Exhibition Hall A-B ]
Abstract
In multi-view multi-label classification (MVML), each object has multiple heterogeneous views and is annotated with multiple labels. The key to this problem lies in capturing cross-view consistent correlations while excavating multi-label semantic relationships. Existing MVML methods usually employ two independent components to address these separately, ignoring their potential interactions. To address this issue, we propose a novel tensorized MVML method named TMvML, which formulates an MVML tensor classifier to excavate comprehensive cross-view feature correlations while characterizing complete multi-label semantic relationships. Specifically, we first reconstruct the MVML mapping matrices as an MVML tensor classifier. Then, we rotate the tensor classifier and introduce a low-rank tensor constraint to ensure view-level feature consistency and label-level semantic co-occurrence simultaneously. To better characterize the low-rank tensor structure, we design a new Laplace Tensor Rank (LTR), which serves as a tighter surrogate of tensor rank to capture high-order fiber correlations within the tensor space. Through these operations, our method addresses the two key challenges in MVML via a concise LTR tensor classifier, extracting both cross-view consistent correlations and multi-label semantic relationships simultaneously. Extensive experiments demonstrate that TMvML significantly outperforms state-of-the-art methods.
Poster
Kei Sen Fong · Mehul Motani
[ East Exhibition Hall A-B ]
Abstract
Symbolic Regression (SR) algorithms select expressions based on prediction performance while also keeping expression lengths short to produce explainable white-box models. In this context, SR algorithms can be evaluated by measuring the extent to which the expressions discovered are Pareto-optimal, in the sense of having the best R-squared score for a given expression length. This evaluation is most commonly done in relative terms: an SR algorithm is judged on whether it Pareto-dominates other SR algorithms selected in the analysis, without any indication of efficiency or attainable limits. In this paper, we instead explore absolute Pareto-optimal (APO) solutions, which have the optimal tradeoff between the multiple SR objectives, for 34 datasets in the widely used SR benchmark SRBench, by performing exhaustive search. Additionally, we include comparisons among eight numerical optimization methods. For every dataset, we extract an APO front of expressions that can serve as a universal baseline for SR algorithms, informing researchers of the best attainable performance for selected sizes. The APO fronts serve as an important benchmark and performance limit for SR algorithms and are made publicly available at: https://github.com/kentridgeai/SRParetoFronts
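To make the evaluation object concrete, the sketch below extracts a Pareto front of (expression length, R-squared) pairs from a candidate pool; the APO fronts themselves come from exhaustive search over expressions, which this toy routine does not attempt.

```python
def pareto_front(candidates):
    # keep an expression iff no other is at most as long with strictly better R^2;
    # sorting by (length, -r2) makes a single sweep suffice
    front, best_r2 = [], -float("inf")
    for length, r2 in sorted(candidates, key=lambda c: (c[0], -c[1])):
        if r2 > best_r2:
            front.append((length, r2))
            best_r2 = r2
    return front

cands = [(3, 0.52), (5, 0.80), (5, 0.75), (7, 0.78), (9, 0.91)]
print(pareto_front(cands))  # [(3, 0.52), (5, 0.80), (9, 0.91)]
```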
Poster
Mao-Lin Luo · Zi-Hao Zhou · Tong Wei · Min-Ling Zhang
[ East Exhibition Hall A-B ]
Abstract
Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging its transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to error that degrades performance. To address this problem, we introduce LADA (**L**abel-specific **ADA**pter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at [https://github.com/MaolinLuo/LADA](https://github.com/MaolinLuo/LADA).
Poster
Srinath Dama · Kevin L Course · Prasanth B Nair
[ East Exhibition Hall A-B ]
Abstract
We present an operator-theoretic framework for temporal and spatio-temporal forecasting based on learning a *continuous time-shift operator*. Our operator learning paradigm offers a continuous relaxation of the discrete lag factor used in traditional autoregressive models, enabling the history of a system up to a given time to be mapped to its future values. We parametrize the time-shift operator using Khatri-Rao neural operators (KRNOs), a novel architecture based on non-stationary integral transforms with nearly linear computational scaling. Our framework naturally handles irregularly sampled observations and enables forecasting at super-resolution in both space and time. Extensive numerical studies across diverse temporal and spatio-temporal benchmarks demonstrate that our approach achieves state-of-the-art or competitive performance with leading methods.
Poster
Nate Veldt · Thomas Stanley · Benjamin Priest · Trevor Steil · Keita Iwabuchi · T.S. Jayram · Geoffrey Sanders
[ East Exhibition Hall A-B ]
Abstract
Finding a minimum spanning tree (MST) for $n$ points in an arbitrary metric space is a fundamental primitive for hierarchical clustering and many other ML tasks, but this takes $\Omega(n^2)$ time to even approximate. We introduce a framework for metric MSTs that first (1) finds a forest of trees using practical heuristics, and then (2) finds a small weight set of edges to connect disjoint components in the forest into a spanning tree. We prove that optimally solving step (2) still takes $\Omega(n^2)$ time, but we provide a subquadratic 2.62-approximation algorithm. In the spirit of learning-augmented algorithms, we then show that if the heuristic forest found in step (1) overlaps with an optimal MST, we can approximate the original MST problem in subquadratic time, where the approximation factor depends on a measure of overlap. In practice, we find nearly optimal spanning trees for a wide range of metrics, while being orders of magnitude faster than exact algorithms.
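To make step (2) concrete, here is the exact-but-quadratic reference that the paper's subquadratic 2.62-approximation competes against: connect the heuristic forest's components by running Kruskal's algorithm over the cheapest inter-component edges. Euclidean distance stands in for the arbitrary metric, and the brute-force candidate search is exactly the $\Omega(n^2)$ cost the paper avoids.

```python
import numpy as np
from itertools import combinations

def connect_forest(points, comp):
    # cheapest edge between every pair of forest components (O(n^2) scan)
    comps = np.unique(comp)
    cand = []
    for a, b in combinations(comps, 2):
        ia, ib = np.where(comp == a)[0], np.where(comp == b)[0]
        d = np.linalg.norm(points[ia][:, None] - points[ib][None], axis=-1)
        i, j = np.unravel_index(d.argmin(), d.shape)
        cand.append((d[i, j], ia[i], ib[j]))
    # Kruskal's algorithm with union-find over component ids
    parent = {c: c for c in comps}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    edges = []
    for w, u, v in sorted(cand):
        ru, rv = find(comp[u]), find(comp[v])
        if ru != rv:
            parent[ru] = rv
            edges.append((u, v, w))
    return edges

pts = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.], [10., 0.]])
print(connect_forest(pts, comp=np.array([0, 0, 1, 1, 2])))
```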
Poster
Jiujiang Guo · Mankun Zhao · Wenbin Zhang · Tianyi Xu · Linying Xu · Yu Jian · Yu Mei · Yu Ruiguo
[ East Exhibition Hall A-B ]
Abstract
Existing research on temporal knowledge graph completion treats temporal information as supplementary, without simulating various features of facts from a temporal perspective. This work summarizes the features of temporalized facts from both diachronic and synchronic perspectives: (1) Diachronicity: facts often exhibit varying characteristics and trends across different temporal domains. (2) Synchronicity: in specific temporal contexts, various relations between entities influence each other, generating latent semantics. To address the above issues, we design a quaternion-based model, TeDS, which divides timestamps into diachronic and synchronic timestamps to support dual temporal perception: (a) Two composite quaternions fusing time and relation information are generated by reorganizing the synchronic timestamp and relation quaternions, and the Hamilton operator achieves their interaction. (b) Each time point is sequentially mapped to an angle and converted to the scalar component of a quaternion using trigonometric functions to build diachronic timestamps. We then rotate the relation by applying the Hamilton operator between it and the diachronic timestamp. In this way, TeDS achieves deep integration of relations and time while accommodating different perspectives. Empirically, TeDS significantly outperforms SOTA models on six benchmarks.
Poster
Ziang Zhou · Zhihao DING · Jieming Shi · Qing Li · Shiqi Shen
[ East Exhibition Hall A-B ]
Abstract
Graph Neural Networks (GNNs) are pivotal in graph-based learning, particularly excelling in node classification. However, their scalability is hindered by the need for multi-hop data during inference, limiting their application in latency-sensitive scenarios. Recent efforts to distill GNNs into multi-layer perceptrons (MLPs) for faster inference often underutilize the layer-level insights of GNNs. In this paper, we present TINED, a novel approach that distills GNNs to MLPs on a layer-by-layer basis using Teacher Injection and Dirichlet Energy Distillation techniques. We focus on two key operations in GNN layers: feature transformation (FT) and graph propagation (GP). We recognize that FT is computationally equivalent to a fully-connected (FC) layer in MLPs. Thus, we propose directly transferring teacher parameters from an FT in a GNN to an FC layer in the student MLP, enhanced by fine-tuning. In TINED, the FC layers in an MLP replicate the sequence of FTs and GPs in the GNN. We also establish a theoretical bound for GP approximation. Furthermore, we note that FT and GP operations in GNN layers often exhibit opposing smoothing effects: GP is aggressive, while FT is conservative. Using Dirichlet energy, we develop a DE ratio to measure these effects and propose Dirichlet Energy Distillation to convey these …
Poster
Anqi Tang · Youming Chen · Shuchen Xue · Zhaoqiang Liu
[ East Exhibition Hall A-B ]
Abstract
Diffusion models (DMs) have demonstrated remarkable ability to generate diverse and high-quality images by efficiently modeling complex data distributions. They have also been explored as powerful generative priors for signal recovery, resulting in a substantial improvement in the quality of reconstructed signals. However, existing research on signal recovery with diffusion models either focuses on specific reconstruction problems or is unable to handle nonlinear measurement models with discontinuous or unknown link functions. In this work, we focus on using DMs to achieve accurate recovery from semi-parametric single index models, which encompass a variety of popular nonlinear models that may have {\em discontinuous} and {\em unknown} link functions. We propose an efficient reconstruction method that only requires one round of unconditional sampling and (partial) inversion of DMs. Theoretical analysis on the effectiveness of the proposed methods has been established under appropriate conditions. We perform numerical experiments on image datasets for different nonlinear measurement models. We observe that compared to competing methods, our approach can yield more accurate reconstructions while utilizing significantly fewer neural function evaluations.
Poster
Nikita Zozoulenko · Thomas Cass · Lukas Gonon
[ East Exhibition Hall A-B ]
Abstract
We introduce Random Feature Representation Boosting (RFRBoost), a novel method for constructing deep residual random feature neural networks (RFNNs) using boosting theory. RFRBoost uses random features at each layer to learn the functional gradient of the network representation, enhancing performance while preserving the convex optimization benefits of RFNNs. In the case of MSE loss, we obtain closed-form solutions to greedy layer-wise boosting with random features. For general loss functions, we show that fitting random feature residual blocks reduces to solving a quadratically constrained least squares problem. Through extensive numerical experiments on tabular datasets for both regression and classification, we show that RFRBoost significantly outperforms RFNNs and end-to-end trained MLP ResNets in the small- to medium-scale regime where RFNNs are typically applied. Moreover, RFRBoost offers substantial computational benefits, and theoretical guarantees stemming from boosting theory.
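A minimal sketch of the closed-form ingredient: one greedy boosting round fits random ReLU features to the current residual (the functional gradient under MSE) by ridge regression. The width, regularization, and scalar-output framing are illustrative simplifications; RFRBoost itself boosts the hidden representation rather than the prediction directly.

```python
import numpy as np

def fit_random_feature_block(H, residual, rng, n_features=256, reg=1e-3):
    # random ReLU features of the current representation H
    W = rng.standard_normal((H.shape[1], n_features)) / np.sqrt(H.shape[1])
    b = rng.standard_normal(n_features)
    Phi = np.maximum(H @ W + b, 0.0)
    # ridge regression onto the residual has a closed-form solution
    Delta = np.linalg.solve(Phi.T @ Phi + reg * np.eye(n_features),
                            Phi.T @ residual)
    return Phi @ Delta

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 5))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2
f = np.zeros_like(y)                      # boosted predictor, built layer by layer
for _ in range(5):
    f += fit_random_feature_block(X, (y - f)[:, None], rng)[:, 0]
print(round(float(np.mean((y - f) ** 2)), 4))  # training MSE shrinks each round
```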
Poster
Fei Long · Xiaoou Li · Jiaming Lv · Yang Haoyuan · Xianjun Cheng · Peihua Li
[ East Exhibition Hall A-B ]
Abstract
Bridging contrastive language-image pre-training (CLIP) to video action recognition has attracted growing interest. Human actions are inherently rich in spatial and temporal contexts, involving dynamic interactions among people, objects, and the environment. Accurately recognizing actions requires effectively capturing these fine-grained elements and modeling their relationships with language. However, most existing methods rely on cosine similarity--practically equivalent to the Pearson correlation coefficient--between global tokens for video-language alignment. As a result, they have limited capacity to model complex dependencies and tend to overlook local tokens that encode critical spatio-temporal cues. To overcome these limitations, we propose BDC-CLIP, a novel framework that leverages Brownian Distance Covariance (BDC) to align visual and textual representations. Our method can capture complex relationships--both linear and nonlinear--between all visual and textual tokens, enabling fine-grained modeling in space, time, and language. BDC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully supervised action recognition settings, demonstrating its effectiveness and broad applicability.
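For reference, sample (squared) distance covariance and its normalized variant (Székely & Rizzo) take only a few lines; treating paired token sequences as the two samples and using dCor as the similarity score are assumptions of this sketch, not BDC-CLIP's exact alignment head.

```python
import numpy as np

def dcov2(X, Y):
    # double-center the pairwise distance matrices, then average their product
    def centered(D):
        return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()
    A = centered(np.linalg.norm(X[:, None] - X[None], axis=-1))
    B = centered(np.linalg.norm(Y[:, None] - Y[None], axis=-1))
    return (A * B).mean()

def dcor(X, Y, eps=1e-12):
    # normalized to [0, 1]; captures nonlinear dependence, unlike cosine similarity
    return dcov2(X, Y) / np.sqrt(dcov2(X, X) * dcov2(Y, Y) + eps)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 64))               # e.g., 32 visual tokens
print(dcor(X, X ** 2))                          # strong nonlinear dependence
print(dcor(X, rng.standard_normal((32, 64))))   # near zero for independent tokens
```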
Poster
Xiaoyu Li · Zhao Song · Shenghao Xie
[ East Exhibition Hall A-B ]
Abstract
The Fourier transform is a fundamental tool in computer science and signal processing. In particular, when the signal is sparse in the frequency domain---having only $k$ distinct frequencies---sparse Fourier transform (SFT) algorithms can recover the signal in sublinear time (proportional to the sparsity $k$). Most prior research focused on SFT for discrete signals, designing both randomized and deterministic algorithms for one-dimensional and high-dimensional discrete signals. However, SFT for continuous signals (i.e., $x^*(t)=\sum_{j=1}^k v_j e^{2\pi \mathbf{i} f_j t}$ for $t\in [0,T]$) is a more challenging task. The discrete SFT algorithms are not directly applicable to continuous signals due to the sparsity blow-up from the discretization. Prior to this work, the best known result was a randomized algorithm that achieves an $\ell_2$ recovery guarantee in $\widetilde{O}(k\cdot \mathrm{polylog}(F/\eta))$ time, where $F$ is the band-limit of the frequencies and $\eta$ is the frequency gap. Nevertheless, whether this problem can be solved without randomness remained open. In this work, we address this gap and introduce the first sublinear-time deterministic sparse Fourier transform algorithm in the continuous setting. Specifically, our algorithm uses $\widetilde{O}(k^2 \cdot \mathrm{polylog}(F/\eta))$ samples and $\widetilde{O}(k^2 \cdot \mathrm{polylog}(F/\eta))$ time to reconstruct the on-grid signal with arbitrary noise that satisfies a mild condition. This is the optimal recovery …
Poster
Yuanchao Xu · Kaidi Shao · Nikos Logothetis · Zhongwei Shen
[ East Exhibition Hall A-B ]
Abstract
Analyzing the long-term behavior of high-dimensional nonlinear dynamical systems remains a significant challenge. While the Koopman operator framework provides a powerful global linearization tool, current methods for approximating its spectral components often face theoretical limitations and depend on predefined dictionaries. Residual Dynamic Mode Decomposition (ResDMD) advanced the field by introducing the \emph{spectral residual} to assess Koopman operator approximation accuracy; however, its approach of only filtering precomputed spectra prevents the discovery of the operator's complete spectral information, a limitation known as the `spectral inclusion' problem. We introduce ResKoopNet (Residual-based Koopman-learning Network), a novel method that directly addresses this by explicitly minimizing the \emph{spectral residual} to compute Koopman eigenpairs. This enables the identification of a more precise and complete Koopman operator spectrum. Using neural networks, our approach provides theoretical guarantees while maintaining computational adaptability. Experiments on a variety of physical and biological systems show that ResKoopNet achieves more accurate spectral approximations than existing methods, particularly for high-dimensional systems and those with continuous spectra, which demonstrates its effectiveness as a tool for analyzing complex dynamical systems.
Poster
Vicente Balmaseda · Bokun Wang · Lin · Tianbao Yang
[ East Exhibition Hall A-B ]
Abstract
In self-supervised contrastive learning, negative pairs are typically constructed using an anchor image and a sample drawn from the entire dataset, excluding the anchor. However, this approach can result in the creation of negative pairs with similar semantics, referred to as "false negatives", leading to their embeddings being falsely pushed apart. To address this issue, we introduce *GloFND*, an optimization-based approach that automatically learns, on the fly, a threshold for each anchor to *identify* its false negatives during training. In contrast to previous methods for false negative discovery, our approach *globally* detects false negatives across the entire dataset rather than locally within the mini-batch. Moreover, its per-iteration computation cost remains independent of the dataset size. Experimental results on image and image-text data demonstrate the effectiveness of the proposed method. Our implementation is available at https://github.com/vibalcam/GloFND.
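One plausible way to read the per-anchor threshold (a sketch in the spirit of GloFND, not the paper's exact update rule) is as an online quantile estimate maintained by stochastic gradient steps:

```python
import numpy as np

def update_thresholds(lam, sim, anchor_ids, alpha=0.01, eta=0.05):
    """One stochastic step of per-anchor quantile tracking (hypothetical).

    lam:        (N,) thresholds, one per dataset example (global state)
    sim:        (B,) similarities of negatives sampled in this batch
    anchor_ids: (B,) dataset index of each similarity's anchor
    The ascent step on (indicator - alpha) drives P(sim > lam[i])
    toward alpha, so pairs scoring above lam[i] can be treated as
    likely false negatives. The per-step cost depends only on the
    batch, not on the dataset size.
    """
    grad = (sim > lam[anchor_ids]).astype(float) - alpha
    np.add.at(lam, anchor_ids, eta * grad)
    return lam
```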
Poster
Andrew Draganov · Sharvaree Vadgama · Sebastian Damrich · Jan Böhm · Lucas Maes · Dmitry Kobak · Erik Bekkers
[ East Exhibition Hall A-B ]
Abstract
Self-supervised learning (SSL) allows learning data representations without a supervised signal and has become an important paradigm in machine learning. Most SSL methods employ the cosine similarity between embedding vectors and hence effectively embed data on a hypersphere. While this seemingly implies that embedding norms cannot play any role in SSL, a few recent works have suggested that embedding norms have properties related to network convergence and confidence. In this paper, we resolve this apparent contradiction and systematically establish the embedding norm's role in SSL training. Using theoretical analysis, simulations, and experiments, we show that embedding norms (i) govern SSL convergence rates and (ii) encode network confidence, with smaller norms corresponding to unexpected samples. Additionally, we show that manipulating embedding norms can have large effects on convergence speed. Our findings demonstrate that SSL embedding norms are integral to understanding and optimizing network behavior.
Spotlight Poster
Haoyang Li · Xin Wang · Zeyang Zhang · Zongyuan Wu · Linxin Xiao · Wenwu Zhu
[ East Exhibition Hall A-B ]
Abstract
Self-supervised learning (SSL) on graph-structured data has attracted considerable attention recently. Masked graph autoencoder, as one promising generative graph SSL approach that aims to recover masked parts of the input graph data, has shown great success in various downstream graph tasks. However, existing masked graph autoencoders fail to consider the degree of difficulty of recovering the masked edges that often have different impacts on the model performance, resulting in suboptimal node representations. To tackle this challenge, in this paper, we propose a novel curriculum-based self-supervised masked graph autoencoder that is able to capture and leverage the underlying degree of difficulty of data dependencies hidden in edges, and design better mask-reconstruction pretext tasks for learning informative node representations. Specifically, we first design a difficulty measurer to identify the underlying structural degree of difficulty of edges during the masking step. Then, we adopt a self-paced scheduler to determine the order of masking edges, which encourages the graph encoder to learn from easy to difficult parts. Finally, the masked edges are gradually incorporated into the reconstruction pretext task, leading to high-quality node representations. Experiments on several real-world node classification and link prediction datasets demonstrate the superiority of our proposed method over state-of-the-art …
Poster
Mattia Opper · Siddharth N
[ East Exhibition Hall A-B ]
Abstract
We present Banyan, a model that efficiently learns semantic representations by leveraging explicit hierarchical structure. While transformers excel at scale, they struggle in low-resource settings. Conversely, recent structured models have shown promise as efficient learners, but fall short in performance. Banyan bridges this gap with two key innovations: an entangled hierarchical tree structure and diagonalized message passing, enabling it to outperform larger transformer models with just 14 non-embedding parameters. It excels in low-resource settings, offering a viable alternative for under-represented languages and highlighting its potential for efficient, interpretable NLP in resource-constrained environments.
Poster
Shib S Dasgupta · Michael Boratko · Andrew McCallum
[ East Exhibition Hall A-B ]
Abstract
Personalized item recommendation typically suffers from data sparsity, which is most often addressed by learning vector representations of users and items via low-rank matrix factorization. While this effectively densifies the matrix by assuming users and movies can be represented by linearly dependent latent features, it does not capture more complicated interactions. For example, vector representations struggle with set-theoretic relationships, such as negation and intersection, e.g. recommending a movie that is “comedy and action, but not romance”. In this work, we formulate the problem of personalized item recommendation as matrix completion where rows are set-theoretically dependent. To capture this set-theoretic dependence we represent each user and attribute by a hyperrectangle or box (i.e. a Cartesian product of intervals). Box embeddings can intuitively be understood as trainable Venn diagrams, and thus not only inherently represent similarity (via the Jaccard index), but also naturally and faithfully support arbitrary set-theoretic relationships. Queries involving set-theoretic constraints can be efficiently computed directly on the embedding space by performing geometric operations on the representations. We empirically demonstrate the superiority of box embeddings over vector-based neural methods on both simple and complex item recommendation queries by up to 30% overall.
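A minimal sketch of the underlying box calculus (practical box-embedding implementations typically smooth these hard max/min operations so gradients stay informative):

```python
import numpy as np

def volume(lo, hi):
    """Volume of a box given per-dimension lower/upper corners."""
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def intersect(lo1, hi1, lo2, hi2):
    """Corners of the intersection box (may be empty)."""
    return np.maximum(lo1, lo2), np.minimum(hi1, hi2)

def jaccard(lo1, hi1, lo2, hi2):
    """Jaccard index of two boxes: intersection volume over union volume."""
    ilo, ihi = intersect(lo1, hi1, lo2, hi2)
    inter = volume(ilo, ihi)
    union = volume(lo1, hi1) + volume(lo2, hi2) - inter
    return inter / union if union > 0 else 0.0

# Toy 2-D query: score a user box against an attribute box.
user_lo, user_hi = np.array([0.0, 0.0]), np.array([2.0, 1.0])
comedy_lo, comedy_hi = np.array([1.0, 0.5]), np.array([3.0, 2.0])
print(jaccard(user_lo, user_hi, comedy_lo, comedy_hi))
```

Conjunctions such as "comedy and action" correspond to intersecting attribute boxes before scoring, which is what makes set-theoretic queries cheap to evaluate directly in embedding space.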
Poster
Yuena Lin · Haichun Cai · Jun-Yi Hang · Haobo Wang · Zhen Yang · Gengyu Lyu
[ East Exhibition Hall A-B ]
Abstract
Graph contrastive learning (GCL) aims at pulling positives closer while dispersing negatives, often causing a minority of highly similar samples to gather as a small group. This results in two latent shortcomings in GCL: 1) **local cohesion**, where a class cluster contains numerous independent small groups, and 2) **global sparseness**, where these small groups (or isolated samples) are dispersedly distributed among all clusters. These shortcomings make the learned distribution *focus only on local similarities among partial samples, which hinders the ability to capture the ideal global structural properties among real clusters, especially high intra-cluster compactness and inter-cluster separateness*. Considering this, we design a novel fuzzy boundary by extending the original cluster boundary with fuzzy set theory, which involves fuzzy boundary construction and fuzzy boundary contraction to address these shortcomings. The fuzzy boundary construction dilates the original boundaries to bridge the local groups, and the fuzzy boundary contraction forces the dispersed samples or groups within the fuzzy boundary to gather tightly, jointly mitigating local cohesion and global sparseness while forming the ideal global structural distribution. Extensive experiments demonstrate that a graph auto-encoder with the fuzzy boundary significantly outperforms current state-of-the-art GCL models in both downstream tasks and quantitative analysis.
Poster
Philippe Chlenski · Quentin Chu · Raiyan Khan · Kaizhu Du · Antonio Moretti · Itsik Pe'er
[ East Exhibition Hall A-B ]
Abstract
Decision trees (DTs) and their random forest (RF) extensions are workhorses of classification and regression in Euclidean spaces. However, algorithms for learning in non-Euclidean spaces are still limited. We extend DT and RF algorithms to product manifolds: Cartesian products of several hyperbolic, hyperspherical, or Euclidean components. Such manifolds handle heterogeneous curvature while still factorizing neatly into simpler components, making them compelling embedding spaces for complex datasets. Our novel angular reformulation respects manifold geometry while preserving the algorithmic properties that make decision trees effective. In the special cases of single-component manifolds, our method simplifies to its Euclidean or hyperbolic counterparts, or introduces hyperspherical DT algorithms, depending on the curvature. In benchmarks on a diverse suite of 57 classification, regression, and link prediction tasks, our product RFs ranked first on 29 tasks and came in the top 2 for 41. This highlights the value of product RFs as straightforward yet powerful new tools for data analysis in product manifolds. Code for our method is available at https://github.com/pchlenski/manify.
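For illustration, distances in a product manifold combine per-component geodesic distances; a small sketch with hyperbolic (Poincaré ball), hyperspherical, and Euclidean components (the l2 combination shown is one common convention, not necessarily the paper's exact formulation):

```python
import numpy as np

def poincare_dist(u, v):
    """Geodesic distance in the Poincare ball (requires ||u||, ||v|| < 1)."""
    duv = ((u - v) ** 2).sum()
    denom = (1 - (u * u).sum()) * (1 - (v * v).sum())
    return np.arccosh(1 + 2 * duv / denom)

def sphere_dist(u, v):
    """Geodesic distance on the unit hypersphere (unit-norm inputs)."""
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

def product_dist(component_dists):
    """l2 combination of per-component geodesic distances."""
    return float(np.sqrt(sum(d * d for d in component_dists)))

# Point pair in H^2 x S^1 x R^2 (hyperbolic, spherical, Euclidean parts).
d = product_dist([
    poincare_dist(np.array([0.1, 0.2]), np.array([-0.3, 0.05])),
    sphere_dist(np.array([1.0, 0.0]), np.array([0.0, 1.0])),
    np.linalg.norm(np.array([1.0, 2.0]) - np.array([0.5, 1.5])),
])
print(d)
```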
Poster
Taeckyung Lee · Sorn Chottananurak · Junsu Kim · Jinwoo Shin · Taesik Gong · Sung-Ju Lee
[ East Exhibition Hall A-B ]
Abstract
Deep learning models perform poorly when domain shifts exist between training and test data. Test-time adaptation (TTA) is a paradigm to mitigate this issue by adapting pre-trained models using only unlabeled test samples. However, existing TTA methods can fail under severe domain shifts, while recent active TTA approaches requiring full-class labels are impractical due to high labeling costs. To address this issue, we introduce a new setting of TTA with binary feedback, which uses a few binary feedback signals from annotators to indicate whether model predictions are correct, thereby significantly reducing the labeling burden of annotators. Under this setting, we propose BiTTA, a novel dual-path optimization framework that leverages reinforcement learning to balance binary feedback-guided adaptation on uncertain samples with agreement-based self-adaptation on confident predictions. Experiments show BiTTA achieves substantial accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort.
Poster
Côme Fiegel · Pierre Menard · Tadashi Kozuno · Michal Valko · Vianney Perchet
[ East Exhibition Hall A-B ]
Abstract
We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, convergence of the last iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to performance, with the best attainable rate being $\mathcal{O}(T^{-1/4})$, in contrast to the usual $\mathcal{O}(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate. The first algorithm leverages a straightforward tradeoff between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
Poster
Jiahui Zhu · Kihyun Yu · Dabeen Lee · Xin Liu · Honghao Wei
[ East Exhibition Hall A-B ]
Abstract
Online safe reinforcement learning (RL) plays a key role in dynamic environments, with applications in autonomous driving, robotics, and cybersecurity. The objective is to learn optimal policies that maximize rewards while satisfying safety constraints modeled by constrained Markov decision processes (CMDPs). Existing methods achieve sublinear regret under stochastic constraints but often fail in adversarial settings, where constraints are unknown, time-varying, and potentially adversarially designed. In this paper, we propose the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first to address online CMDPs with anytime adversarial constraints. OMDPD achieves optimal regret $\tilde{\mathcal{O}}(\sqrt{K})$ and strong constraint violation $\tilde{\mathcal{O}}(\sqrt{K})$ without relying on Slater’s condition or the existence of a strictly known safe policy. We further show that access to accurate estimates of rewards and transitions can further improve these bounds. Our results offer practical guarantees for safe decision-making in adversarial environments.
Poster
Cun-Yuan Xing · Meng-Zhang Qian · Wu-Yang Chen · Wei Gao · Zhi-Hua Zhou
[ East Exhibition Hall A-B ]
Abstract
Feature evolvable learning studies the scenario where old features will vanish and new features will emerge when learning with data streams, and various methods have been developed by utilizing some useful relationships from old features to new features, rather than re-training from scratch. In this work, we focus on two fundamental problems: How to characterize the relationships between two different feature spaces, and how to exploit those relationships for feature evolvable learning. We introduce the Kernel Ortho-Mapping (KOM) discrepancy to characterize relationships between two different feature spaces via kernel functions, and relate it to the optimal classifiers learned from different feature spaces. Based on this discrepancy, we develop a one-pass algorithm for feature evolvable learning, which requires going through all instances only once, without storing the entire or partial training data. Our basic idea is to perform online kernel learning with random Fourier features and incorporate feature and label relationships via the KOM discrepancy for feature evolvable learning. We finally validate the effectiveness of our proposed method both theoretically and empirically.
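The online component mentioned here rests on the standard random Fourier feature approximation of a shift-invariant kernel (Rahimi and Recht); a minimal sketch:

```python
import numpy as np

def rff(X, D=256, gamma=1.0, seed=0):
    """Random Fourier features z(x) with z(x) @ z(y) approximating
    the RBF kernel exp(-gamma * ||x - y||^2).

    X: (n, d) inputs -> (n, D) features. Any linear online learner on
    these features performs approximate kernel learning in one pass,
    without storing the training data.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```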
Poster
Jae-Hong Lee
[ East Exhibition Hall A-B ]
Abstract
Test-time adaptation (TTA) addresses the machine learning challenge of adapting models to unlabeled test data from shifting distributions in dynamic environments. A key issue in this online setting arises from using unsupervised learning techniques, which introduce explicit gradient noise that degrades model weights. To address this weight degradation, we propose a Bayesian weight enhancement framework, which generalizes existing weight-based TTA methods that effectively mitigate the issue. Our framework enables robust adaptation to distribution shifts by modeling weight distributions and thereby accounting for diverse weights. Building on our framework, we identify a key limitation in existing methods: they neglect the time-varying covariance that reflects the influence of the gradient noise. To address this gap, we propose a novel steady-state adaptation (SSA) algorithm that balances covariance dynamics during adaptation. SSA is derived through the solution of a stochastic differential equation for the TTA process and online inference. The resulting algorithm incorporates a covariance-aware learning rate adjustment mechanism. Through extensive experiments, we demonstrate that SSA consistently improves state-of-the-art methods in various TTA scenarios, datasets, and model architectures, establishing its effectiveness in improving stability and adaptability.
Poster
Shion Takeno · Yoshito Okura · Yu Inatsu · Tatsuya Aoyama · Tomonari Tanaka · Satoshi Akahane · Hiroyuki Hanada · Noriaki Hashimoto · Taro Murayama · Hanju Lee · Shinya Kojima · Ichiro Takeuchi
[ East Exhibition Hall A-B ]
Abstract
Gaussian process regression (GPR) or kernel ridge regression is a widely used and powerful tool for nonlinear prediction. Therefore, active learning (AL) for GPR, which actively collects data labels to achieve accurate prediction with fewer of them, is an important problem. However, existing AL methods do not theoretically guarantee prediction accuracy for the target distribution. Furthermore, as discussed in the distributionally robust learning literature, specifying the target distribution is often difficult. Thus, this paper proposes two AL methods that effectively reduce the worst-case expected error for GPR, i.e., the worst-case expectation over the target distribution candidates. We show an upper bound on the worst-case expected squared error, which suggests that the error can be made arbitrarily small with a finite number of data labels under mild conditions. Finally, we demonstrate the effectiveness of the proposed methods through synthetic and real-world datasets.
Spotlight Poster
Ashkan Soleymani · Behrooz Tahmasebi · Stefanie Jegelka · Patrick Jaillet
[ East Exhibition Hall A-B ]
Abstract
We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with \emph{exact} invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (as opposed to approximate) invariances in this setting, partially addressing a question posed by Diaz (2025) regarding the avoidance of prohibitively large and computationally intensive group averaging methods in kernel regression with exact invariances. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.
Poster
Junhong Zhang · Zhihui Lai
[ East Exhibition Hall A-B ]
Abstract
Kernel methods are powerful tools for nonlinear learning with well-established theory. The scalability issue has been their long-standing challenge. Despite the existing success, there are two limitations in large-scale kernel methods: (i) The memory overhead is too high for users to afford; (ii) existing efforts mainly focus on kernel ridge regression (KRR), while other models remain understudied. In this paper, we propose **Joker**, a joint optimization framework for diverse kernel models, including KRR, logistic regression, and support vector machines. We design a dual block coordinate descent method with trust region (DBCD-TR) and adopt kernel approximation with randomized features, leading to low memory costs and high efficiency in large-scale learning. Experiments show that **Joker** saves up to 90% memory but achieves comparable training time and performance (or even better) than the state-of-the-art methods.
Poster
Chuan Liu · Chunshu Wu · Ruibing Song · Ang Li · Ying Nian Wu · Tong Geng
[ East Exhibition Hall A-B ]
Abstract
Function learning forms the foundation of numerous scientific and engineering tasks. While modern machine learning (ML) methods model complex functions effectively, their escalating complexity and computational demands pose challenges to efficient deployment. In contrast, natural dynamical systems exhibit remarkable computational efficiency in representing and solving complex functions. However, existing dynamical system approaches are limited by low expressivity and inefficient training. To this end, we propose EADS, an Expressive and self-Adaptive Dynamical System capable of accurately learning a wide spectrum of functions with extraordinary efficiency. Specifically, (1) drawing inspiration from biological dynamical systems, we integrate hierarchical architectures and heterogeneous dynamics into EADS, significantly enhancing its capacity to represent complex functions. (2) We propose an efficient on-device training method that leverages intrinsic electrical signals to update parameters, making EADS self-adaptive at negligible cost. Experimental results across diverse domains demonstrate that EADS achieves higher accuracy than existing works, while offering orders-of-magnitude speedups and energy efficiency over traditional neural network solutions on GPUs for both inference and training, showcasing its broader impact in overcoming computational bottlenecks across various fields.
Poster
Han Jiang · Xiaoyuan Yi · Zhihua Wei · Ziang Xiao · Shu Wang · Xing Xie
[ East Exhibition Hall A-B ]
Abstract
*Warning: Contains harmful model outputs.* Despite significant advancements, the propensity of Large Language Models (LLMs) to generate harmful and unethical content poses critical challenges. Measuring the value alignment of LLMs becomes crucial for their regulation and responsible deployment. Although numerous benchmarks have been constructed to assess social bias, toxicity, and ethical issues in LLMs, those static benchmarks suffer from *evaluation chronoeffect*, in which, as models rapidly evolve, existing benchmarks may leak into training data or become saturated, *overestimating* ever-developing LLMs. To tackle this problem, we propose GETA, a novel *generative evolving testing* approach based on adaptive testing methods in measurement theory. Unlike traditional adaptive testing methods that rely on a static test item pool, GETA probes the underlying moral boundaries of LLMs by dynamically generating test items tailored to model capability. GETA co-evolves with LLMs by learning a joint distribution of item difficulty and model value conformity, thus effectively addressing evaluation chronoeffect. We evaluated various popular LLMs with GETA and demonstrated that 1) GETA can dynamically create difficulty-tailored test items and 2) GETA's evaluation results are more consistent with models' performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.
Spotlight Poster
Jade Garcia Bourrée · Augustin Godinot · Sayan Biswas · Anne-Marie Kermarrec · Erwan Le Merrer · Gilles Tredan · Martijn de Vos · Milos Vujasinovic
[ East Exhibition Hall A-B ]
Abstract
Among the many technical challenges to enforcing AI regulations, one crucial yet underexplored problem is the risk of audit manipulation. This manipulation occurs when a platform deliberately alters its answers to a regulator to pass an audit without modifying its answers to other users. In this paper, we introduce a novel approach to manipulation-proof auditing by taking into account the auditor's prior knowledge of the task solved by the platform. We first demonstrate that regulators must not rely on public priors (e.g. a public dataset), as platforms could easily fool the auditor in such cases. We then formally establish the conditions under which an auditor can prevent audit manipulations using prior knowledge about the ground truth. Finally, our experiments with two standard datasets illustrate the maximum level of unfairness a platform can hide before being detected as malicious. Our formalization and generalization of manipulation-proof auditing with a prior opens up new research directions for more robust fairness audits.
Poster
Drew Prinster · Xing Han · Anqi Liu · Suchi Saria
[ East Exhibition Hall A-B ]
Abstract
Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but also continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Methods for nonparametric sequential testing---especially conformal test martingales (CTMs) and anytime-valid inference---offer promising tools for this monitoring task. However, existing approaches are restricted to monitoring limited hypothesis classes or "alarm criteria" (e.g., detecting data shifts that violate certain exchangeability or IID assumptions), do not allow for online adaptation in response to shifts, and/or cannot diagnose the cause of degradation or alarm. In this paper, we address these limitations by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false-alarms. For practical applications, we propose specific WCTM algorithms that adapt online to mild covariate shifts (in the marginal input distribution), quickly detect harmful shifts, and diagnose those harmful shifts as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
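For background, the unweighted conformal test martingales that WCTMs generalize can be built by betting against conformal p-values; a minimal sketch using the classic power betting function (the choice eps=0.5 is arbitrary):

```python
import numpy as np

def power_martingale(p_values, eps=0.5):
    """Unweighted conformal test martingale with power betting.

    Under exchangeability, conformal p-values are uniform, so each
    factor eps * p**(eps - 1) has expectation 1 and M_t is a
    nonnegative martingale; by Ville's inequality,
    P(sup_t M_t >= 1/alpha) <= alpha, so alarming when M_t crosses
    1/alpha controls the false-alarm rate.
    """
    return np.cumprod(eps * np.asarray(p_values) ** (eps - 1))

# Toy: uniform p-values for 200 steps (no shift), then small p-values
# (shift); the martingale stays low, then grows and raises an alarm.
rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(size=200), 0.3 * rng.uniform(size=200)])
M = power_martingale(p)
print(int(np.argmax(M >= 100)))  # first index with M_t >= 1/alpha, alpha=0.01
```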
Poster
Mingyi Li · Michael R. Metel · Akiko Takeda
[ East Exhibition Hall A-B ]
Abstract
The K-means algorithm is one of the most widely studied clustering algorithms in machine learning. While extensive research has focused on its ability to achieve a globally optimal solution, a rigorous analysis of its local optimality guarantees is still lacking. In this paper, we first present conditions under which the K-means algorithm converges to a locally optimal solution. Based on this, we propose simple modifications to the K-means algorithm which ensure local optimality in both the continuous and discrete sense, with the same computational complexity as the original K-means algorithm. As the dissimilarity measure, we consider a general Bregman divergence, which is an extension of the squared Euclidean distance often used in the K-means algorithm. Numerical experiments confirm that the K-means algorithm does not always find a locally optimal solution in practice, while our proposed methods provide improved locally optimal solutions with reduced clustering loss. Our code is available at https://github.com/lmingyi/LO-K-means.
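A minimal sketch of Lloyd-style K-means under a general Bregman divergence; since the cluster mean minimizes the total Bregman divergence to its members for any such divergence, only the assignment step changes (the paper's modifications for guaranteed local optimality are not shown):

```python
import numpy as np

def bregman_kmeans(X, k, div, iters=100, seed=0):
    """Lloyd-style clustering under a Bregman divergence div(X, mu)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.stack([div(X, m) for m in mu], axis=1)  # (n, k)
        z = dists.argmin(axis=1)
        # The mean is the optimal center for any Bregman divergence.
        new = np.stack([X[z == j].mean(axis=0) if np.any(z == j) else mu[j]
                        for j in range(k)])
        if np.allclose(new, mu):
            break
        mu = new
    return mu, z

squared_euclid = lambda X, m: ((X - m) ** 2).sum(axis=1)
# Generalized KL divergence (for strictly positive data); its linear
# terms still make the mean the optimal center.
gen_kl = lambda X, m: (X * np.log(X / m) - X + m).sum(axis=1)
```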
Poster
Arhit Chakrabarti · Yang Ni · Debdeep Pati · Bani Mallick
[ East Exhibition Hall A-B ]
Abstract
We consider the problem of clustering grouped data for which the observations may include group-specific variables in addition to the variables that are shared across groups. This type of data is quite common; for example, in cancer genomic studies, molecular information is available for all cancers whereas cancer-specific clinical information may only be available for certain cancers. Existing grouped clustering methods only consider the shared variables but ignore valuable information from the group-specific variables. To allow for these group-specific variables to aid in the clustering, we propose a novel Bayesian nonparametric approach, termed global-local (GLocal) Dirichlet process, that models the "global-local" structure of the observations across groups. We characterize the GLocal Dirichlet process using the stick-breaking representation and the representation as a limit of a finite mixture model. We theoretically quantify the approximation errors of the truncated prior, the corresponding finite mixture model, and the associated posterior distribution. We develop a fast variational Bayes algorithm for scalable posterior inference, which we illustrate with extensive simulations and a TCGA pan-gastrointestinal cancer dataset.
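For reference, the stick-breaking representation invoked here is easy to simulate; a truncated sketch for a plain Dirichlet process (the global-local coupling of the GLocal DP is omitted):

```python
import numpy as np

def stick_breaking(alpha, T, rng):
    """Truncated stick-breaking weights for a Dirichlet process.

    v_t ~ Beta(1, alpha); pi_t = v_t * prod_{s<t} (1 - v_s).
    Larger alpha spreads mass over more mixture components; the
    leftover mass (1 - sum pi) is the truncation error.
    """
    v = rng.beta(1.0, alpha, size=T)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining

rng = np.random.default_rng(0)
print(stick_breaking(alpha=2.0, T=10, rng=rng).round(3))
```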
Poster
Lei Yan · Xin Zhang · Qing Mai
[ East Exhibition Hall A-B ]
Abstract
Scientific and engineering applications are often heterogeneous, making it beneficial to account for latent clusters or sub-populations when learning low-dimensional subspaces in supervised learning, and vice versa. In this paper, we combine the concept of subspace clustering with model-based sufficient dimension reduction and thus generalize the sufficient dimension reduction framework from the homogeneous regression setting to heterogeneous data applications. In particular, we propose the mixture of principal fitted components (mixPFC) model, a novel framework that simultaneously achieves clustering, subspace estimation, and variable selection, providing a unified solution for high-dimensional heterogeneous data analysis. We develop a group Lasso penalized expectation-maximization (EM) algorithm and obtain its non-asymptotic convergence rate. Through extensive simulation studies, mixPFC demonstrates superior performance compared to existing methods across various settings. Applications to real world datasets further highlight its effectiveness and practical advantages.
Poster
Suyuan Liu · Hao Yu · Hao Tan · KE LIANG · Siwei Wang · Shengju Yu · En Zhu · Xinwang Liu
[ East Exhibition Hall A-B ]
Abstract
Multi-view clustering (MVC) leverages complementary information from diverse data sources to enhance clustering performance. However, its practical deployment in distributed and privacy-sensitive scenarios remains challenging. Federated multi-view clustering (FMVC) has emerged as a potential solution, but existing approaches suffer from substantial limitations, including excessive communication overhead, insufficient privacy protection, and inadequate handling of missing views. To address these issues, we propose Efficient Federated Incomplete Multi-View Clustering (EFIMVC), a novel framework that introduces a localized optimization strategy to significantly reduce communication costs while ensuring theoretical convergence. EFIMVC employs both view-specific and shared anchor graphs as communication variables, thereby enhancing privacy by avoiding the transmission of sensitive embeddings. Moreover, EFIMVC seamlessly extends to scenarios with missing views, making it a practical and scalable solution for real-world applications. Extensive experiments on benchmark datasets demonstrate the superiority of EFIMVC in clustering accuracy, communication efficiency, and privacy preservation. Our code is publicly available at https://github.com/Tracesource/EFIMVC.
Poster
Steinar Laenen · Peter Macgregor · He Sun
[ East Exhibition Hall A-B ]
Abstract
In the kernel density estimation (KDE) problem, we are given a set $X$ of data points in $\mathbb{R}^d$, a kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$, and a query point $\mathbf{q} \in \mathbb{R}^d$, and the objective is to quickly output an estimate of $\sum_{\mathbf{x} \in X} k(\mathbf{q}, \mathbf{x})$. In this paper, we consider KDE in the dynamic setting, and introduce a data structure that efficiently maintains the _estimates_ for a set of query points as data points are added to $X$ over time. Based on this, we design a dynamic data structure that maintains a sparse approximation of the fully connected similarity graph on $X$, and develop a fast dynamic spectral clustering algorithm. We further evaluate the effectiveness of our algorithms on both synthetic and real-world datasets.
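As a baseline for the interface being maintained, a naive dynamic structure simply folds each inserted point into every standing query's running sum; the paper's contribution is supporting the same interface far more efficiently (the Gaussian kernel and gamma below are illustrative choices):

```python
import numpy as np

class NaiveDynamicKDE:
    """Maintains sum_{x in X} k(q, x) for a fixed query set as points
    arrive: O(|Q| * d) work per insertion with a Gaussian kernel."""

    def __init__(self, queries, gamma=1.0):
        self.Q = np.asarray(queries, dtype=float)   # (m, d) query points
        self.gamma = gamma
        self.kde = np.zeros(len(self.Q))            # running estimates

    def insert(self, x):
        # Add the new point's kernel contribution to every query.
        self.kde += np.exp(-self.gamma * ((self.Q - x) ** 2).sum(axis=1))

    def estimates(self):
        return self.kde
```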
Poster
Joyentanuj Das · Suranjan De · He Sun
[ East Exhibition Hall A-B ]
Abstract
Graph clustering is an important algorithmic technique for analysing massive graphs, and has been widely applied in many research fields of data science. While the objective of most graph clustering algorithms is to find a vertex set of low conductance, there has been a sequence of recent studies that highlight the importance of the inter-connection between clusters when analysing real-world datasets. Following this line of research, in this work we study bipartite-like clusters and present efficient and online algorithms that find such clusters in both undirected graphs and directed ones. We conduct experimental studies on both synthetic and real-world datasets, and show that our algorithms significantly speed up the running time of existing clustering algorithms while preserving their effectiveness.
Poster
Xinyue Chen · Jinfeng Peng · Yuhao Li · Xiaorong Pu · Yang Yang · Yazhou Ren
[ East Exhibition Hall A-B ]
Abstract
Recently, federated multi-view clustering (FedMVC) has gained attention for its ability to mine complementary clustering structures from multiple clients without exposing private data. Existing methods mainly focus on addressing the feature heterogeneity problem brought by views on different clients and mitigating it using shared client information. Although these methods have achieved performance improvements, the information they choose to share, such as model parameters or intermediate outputs, inevitably raises privacy concerns. In this paper, we propose an Effective and Secure Federated Multi-view Clustering method, ESFMC, to alleviate the dilemma between privacy protection and performance improvement. This method leverages the information-theoretic perspective to split the features extracted locally by clients, retaining sensitive information locally and only sharing features that are highly relevant to the task. This can be viewed as a form of privacy-preserving information sharing, reducing privacy risks for clients while ensuring that the server can mine high-quality global clustering structures. Theoretical analysis and extensive experiments demonstrate that the proposed method more effectively mitigates the trade-off between privacy protection and performance improvement compared to state-of-the-art methods.
Poster
Soheil Behnezhad · Moses Charikar · Vincent Cohen-Addad · Alma Ghafari · Weiyun ma
[ East Exhibition Hall A-B ]
Abstract
We study the classic correlation clustering problem. Given $n$ objects and a complete labeling of the object-pairs as either “similar” or “dissimilar”, the goal is to partition the objects into arbitrarily many clusters while minimizing disagreements with the labels. A classic Pivot algorithm for this problem, due to [Ailon et al STOC'05], obtains a 3-approximation for this problem. Over the years, this algorithm has been successfully implemented in various settings. The downside of the Pivot algorithm is that its approximation factor of 3 is tight. While better approximations have been achieved in some settings, these algorithms are often hard to implement in various settings. For example, [Behnezhad et al FOCS19] showed that the output of Pivot can be maintained in polylog time per update in a dynamic setting, a bound that was improved to constant by [Dalirrooyfard et al ICML'24]. But obtaining a better approximation remains open. In this paper, we present Modified Pivot, an algorithm that locally improves the output of Pivot. Our Modified Pivot algorithm can be implemented just as efficiently as Pivot in various settings. Our experiments show that the output of Modified Pivot on average makes less than 77\% of the mistakes made by Pivot. More surprisingly, …
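For reference, the Pivot algorithm that Modified Pivot locally improves fits in a few lines; a sketch over a set of "similar" pairs:

```python
import random

def pivot(nodes, similar_pairs):
    """The classic 3-approximate Pivot algorithm for correlation
    clustering [Ailon et al STOC'05]: pick a random unclustered vertex
    as pivot and cluster it with every unclustered vertex labeled
    "similar" to it.

    nodes:         iterable of vertex ids
    similar_pairs: set of frozenset({u, v}) pairs labeled "similar"
    """
    order = list(nodes)
    random.shuffle(order)
    clustered, clusters = set(), []
    for u in order:
        if u in clustered:
            continue
        cluster = [u] + [v for v in order if v != u and v not in clustered
                         and frozenset((u, v)) in similar_pairs]
        clustered.update(cluster)
        clusters.append(cluster)
    return clusters

# Toy instance: two similar triangles joined by a single "similar" edge.
edges = {frozenset(e)
         for e in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]}
print(pivot(range(6), edges))
```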
Poster
Yazhou Ren · Junlong Ke · Zichen Wen · Tianyi Wu · Yang Yang · Xiaorong Pu · Lifang He
[ East Exhibition Hall A-B ]
Abstract
Multi-view clustering has gained significant attention for integrating multi-view information in multimedia applications. With the growing complexity of graph data, multi-view graph clustering (MVGC) has become increasingly important. Existing methods primarily use Graph Neural Networks (GNNs) to encode structural and feature information, but applying GNNs within contrastive learning poses specific challenges, such as integrating graph data with node features and handling both homophilic and heterophilic graphs. To address these challenges, this paper introduces Node-Guided Contrastive Encoding (NGCE), a novel MVGC approach that leverages node features to guide embedding generation. NGCE enhances compatibility with GNN filtering, effectively integrates homophilic and heterophilic information, and strengthens contrastive learning across views. Extensive experiments demonstrate its robust performance on six homophilic and heterophilic multi-view benchmark datasets.
Poster
Yicheng Pan · Renjie Chen · Pengyu Long · Bingchen Fan
[ East Exhibition Hall A-B ]
Abstract
Overlap and hierarchy are two prevalent phenomena in clustering, and usually coexist in a single system. Each has been studied separately, but how to characterize and evaluate the hybrid structures remains unclear. To address this issue, we initiate the study of hierarchical overlapping clustering on graphs by introducing a new cost function for it. We show the rationality of our cost function via several intuitive properties, and develop an approximation algorithm that achieves a constant approximation factor for its dual version. Our algorithm is a recursive process of overlapping bipartition based on local search, which makes a speed-up version of it extremely scalable. Our experiments demonstrate that the speed-up algorithm has significantly better performance than all the baseline methods in both effectiveness and scalability on synthetic and real datasets.
Poster
Anne Ouyang · Simon Guo · Simran Arora · Alex Zhang · William Hu · Christopher Re · Azalia Mirhoseini
[ East Exhibition Hall A-B ]
Abstract
Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce **KernelBench**, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and progress on the benchmark directly translates to faster practical kernels. We introduce a new evaluation metric $\text{fast}_p$, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold $p$ over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20\% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise the speedup threshold $p$.
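The $\text{fast}_p$ metric is simple to compute from per-task results; a sketch with hypothetical field names:

```python
def fast_p(results, p=1.0):
    """Fraction of tasks whose generated kernel is functionally correct
    AND achieves speedup > p over the reference implementation.

    results: list of dicts with hypothetical keys
             "correct" (bool) and "speedup" (ref_time / generated_time).
    """
    hits = [r["correct"] and r["speedup"] > p for r in results]
    return sum(hits) / len(hits)

# fast_1.0 asks merely to beat the reference baseline at all; raising p
# makes the benchmark strictly harder.
```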
Poster
Aleksandr Gushchin · Khaled Abud · Georgii Bychkov · Ekaterina Shumitskaya · Anna Chistyakova · Sergey Lavrushkin · Bader Rasheed · Kirill Malyshev · Dmitriy Vatolin · Anastasia Antsiferova
[ East Exhibition Hall A-B ]
Abstract
Modern neural-network-based Image Quality Assessment (IQA) metrics are vulnerable to adversarial attacks, which can be exploited to manipulate search engine rankings, benchmark results, and content quality assessments, raising concerns about the reliability of IQA metrics in critical applications. This paper presents the first comprehensive study of IQA defense mechanisms in response to adversarial attacks on these metrics to pave the way for safer use of IQA metrics. We systematically evaluated 30 defense strategies (including purification, training-based, and certified methods) and applied 14 adversarial attacks in adaptive and non-adaptive settings to compare these defenses on 9 no-reference IQA metrics. Our proposed benchmark aims to guide the development of IQA defense methods and is open to submissions; the latest results and code are at https://msu-video-group.github.io/adversarial-defenses-for-iqa/.
Poster
Tianle Li · Wei-Lin Chiang · Evan Frick · Lisa Dunlap · Tianhao Wu · Banghua Zhu · Joseph E Gonzalez · Ion Stoica
[ East Exhibition Hall A-B ]
Abstract
The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce BenchBuilder, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without a human in the loop. We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark’s alignment with human preferences and ability to separate models. We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts curated by BenchBuilder. Arena-Hard-Auto provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings, all at a cost of $20. Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.
Poster
Hao Yu · Weixuan Liang · KE LIANG · Suyuan Liu · Meng Liu · Xinwang Liu
[ East Exhibition Hall A-B ]
Abstract
Multi-kernel clustering (MKC) has emerged as a powerful method for capturing diverse data patterns, offering robust and generalized representations of data structures. However, the increasing deployment of MKC in real-world applications raises concerns about its vulnerability to adversarial perturbations. While adversarial robustness has been extensively studied in other domains, its impact on MKC remains largely unexplored. In this paper, we address the challenge of assessing the adversarial robustness of MKC methods in a black-box setting. Specifically, we propose *AdvMKC*, a novel reinforcement-learning-based adversarial attack framework designed to inject imperceptible perturbations into data and mislead MKC methods. AdvMKC leverages proximal policy optimization with an advantage function to overcome the instability of clustering results during optimization. Additionally, it introduces a generator-clusterer framework, where a generator produces adversarial perturbations, and a clusterer approximates MKC behavior, significantly reducing computational overhead. We provide theoretical insights into the impact of adversarial perturbations on MKC and validate these findings through experiments. Evaluations across seven datasets and eleven MKC methods (seven traditional and four robust) demonstrate AdvMKC's effectiveness, robustness, and transferability.
Poster
Mateusz Olko · Mateusz Gajewski · Joanna Wojciechowska · Mikołaj Morzy · Piotr Sankowski · Piotr Milos
[ East Exhibition Hall A-B ]
Abstract
Neural causal discovery methods have recently improved in terms of scalability and computational efficiency. However, our systematic evaluation highlights significant room for improvement in their accuracy when uncovering causal structures. We identify a fundamental limitation: \textit{unavoidable likelihood score estimation errors disallow distinguishing the true structure}, even for small graphs and relatively large sample sizes. Furthermore, we identify the faithfulness property as a critical bottleneck: (i) it is likely to be violated across any reasonable dataset size range, and (ii) its violation directly undermines the performance of neural penalized-likelihood discovery methods. These findings lead us to conclude that progress within the current paradigm is fundamentally constrained, necessitating a paradigm shift in this domain.
Poster
Daniele Tramontano · Yaroslav Kivva · Saber Salehkaleybar · Negar Kiyavash · Mathias Drton
[ East Exhibition Hall A-B ]
Abstract
This paper investigates causal effect identification in latent variable Linear Non-Gaussian Acyclic Models (lvLiNGAM) using higher-order cumulants, addressing two prominent setups that are challenging in the presence of latent confounding: (1) a single proxy variable that may causally influence the treatment and (2) underspecified instrumental variable cases where fewer instruments exist than treatments. We prove that causal effects are identifiable with a single proxy or instrument and provide corresponding estimation methods. Experimental results demonstrate the accuracy and robustness of our approaches compared to existing methods, advancing the theoretical and practical understanding of causal inference in linear systems with latent confounders.
Poster
Nora Schneider · Lars Lorch · Niki Kilbertus · Bernhard Schölkopf · Andreas Krause
[ East Exhibition Hall A-B ]
Abstract
We consider the problem of predicting perturbation effects via causal models. In many applications, it is a priori unknown which mechanisms of a system are modified by an external perturbation, even though the features of the perturbation are available. For example, in genomics, some properties of a drug may be known, but not their causal effects on the regulatory pathways of cells. We propose a generative intervention model (GIM) that learns to map these perturbation features to distributions over atomic interventions in a jointly-estimated causal model. Contrary to prior approaches, this enables us to predict the distribution shifts of unseen perturbation features while gaining insights about their mechanistic effects in the underlying data-generating process. On synthetic data and scRNA-seq drug perturbation data, GIMs achieve robust out-of-distribution predictions on par with unstructured approaches, while effectively inferring the underlying perturbation mechanisms, often better than other causal inference methods.
Poster
Jake Robertson · Noah Hollmann · Samuel Gabriel Müller · Noor Awad · Frank Hutter
[ East Exhibition Hall A-B ]
Abstract
Machine learning (ML) systems are utilized in critical sectors such as healthcare, law enforcement, and finance, but often rely on historical data that contains demographic biases, leading to decisions that perpetuate or intensify existing inequalities. Causal and counterfactual fairness provide a transparent, human-in-the-loop framework to mitigate algorithmic discrimination, aligning closely with legal doctrines of direct and indirect discrimination. However, current causal fairness frameworks hold a key limitation in that they assume prior knowledge of the correct causal model, restricting their applicability in complex fairness scenarios where causal models are unknown or difficult to identify. To bridge this gap, we propose FairPFN, a tabular foundation model pre-trained on synthetic causal fairness data to identify and mitigate the causal effects of protected attributes in its predictions. FairPFN's key contribution is that it requires no knowledge of the causal model and demonstrates strong performance across a diverse set of hand-crafted and real-world causal scenarios relative to robust baseline methods. FairPFN paves the way for a promising direction for future research, making causal fairness more accessible to a wider variety of complex fairness problems.
Poster
Hechuan Wen · Tong Chen · Mingming Gong · Li Kheng Chai · Shazia Sadiq · Hongzhi Yin
[ East Exhibition Hall A-B ]
Abstract
Although numerous complex algorithms for treatment effect estimation have been developed in recent years, their effectiveness remains limited when handling insufficiently labeled training sets due to the high cost of labeling the post-treatment effect, e.g., the expensive tumor imaging or biopsy procedures needed to evaluate treatment effects. Therefore, it becomes essential to actively incorporate more high-quality labeled data, all while adhering to a constrained labeling budget. To enable data-efficient treatment effect estimation, we formalize the problem through rigorous theoretical analysis within the active learning context, where the derived key measures, the factual and counterfactual covering radii, determine the risk upper bound. To reduce the bound, we propose a greedy radius reduction algorithm, which excels under an idealized, balanced data distribution. To generalize to more realistic data distributions, we further propose FCCM, which transforms the optimization objective into the Factual and Counterfactual Coverage Maximization to ensure effective radius reduction during data acquisition. Furthermore, benchmarking FCCM against other baselines demonstrates its superiority across both fully synthetic and semi-synthetic datasets. Code: https://github.com/uqhwen2/FCCM.
Spotlight Poster
Shanshan Luo · Yu yixuan · Chunchen LIU · Feng Xie · zhi geng
[ East Exhibition Hall A-B ]
Abstract
Previous studies have extensively addressed the attribution problem for binary outcome variables. However, in many practical scenarios, the outcome variable is continuous, and simply binarizing it may result in information loss or biased conclusions. To address this issue, we propose a series of posterior causal estimands for retrospectively evaluating multiple correlated causes from a continuous outcome. These estimands include posterior intervention effects, posterior total causal effects, and posterior natural direct effects. Under assumptions of sequential ignorability, monotonicity, and perfect positive rank, we show that the posterior causal estimands of interest are identifiable and present the corresponding identification equations. We also provide a simple but effective estimation procedure and establish asymptotic properties of the proposed estimators. An artificial hypertension example and a real developmental toxicity dataset are employed to illustrate our method.
Poster
Carlota Parés Morlans · Michelle Yi · Claire Chen · Sarah A Wu · Rika Antonova · Tobias Gerstenberg · Jeannette Bohg
[ East Exhibition Hall A-B ]
Abstract
Tasks that involve complex interactions between objects with unknown dynamics make planning before execution difficult. These tasks require agents to iteratively improve their actions after actively exploring causes and effects in the environment. For these types of tasks, we propose Causal-PIK, a method that leverages Bayesian optimization to reason about causal interactions via a Physics-Informed Kernel to help guide efficient search for the best next action. Experimental results on Virtual Tools and PHYRE physical reasoning benchmarks show that Causal-PIK outperforms state-of-the-art results, requiring fewer actions to reach the goal. We also compare Causal-PIK to human studies, including results from a new user study we conducted on the PHYRE benchmark. We find that Causal-PIK remains competitive on tasks that are very challenging, even for human problem-solvers.
Poster
Kevin Xia · Elias Bareinboim
[ East Exhibition Hall A-B ]
Abstract
The study of causal abstractions bridges two integral components of human intelligence: the ability to determine cause and effect, and the ability to interpret complex patterns into abstract concepts. Formally, causal abstraction frameworks define connections between complicated low-level causal models and simple high-level ones. One major limitation of most existing definitions is that they are not well-defined when considering lossy abstraction functions in which multiple low-level interventions can have different effects while mapping to the same high-level intervention (an assumption called the abstract invariance condition). In this paper, we introduce a new type of abstractions called projected abstractions that generalize existing definitions to accommodate lossy representations. We show how to construct a projected abstraction from the low-level model and how it translates equivalent observational, interventional, and counterfactual causal queries from the low level to the high level. Given that the true model is rarely available in practice, we prove a new graphical criterion for identifying and estimating high-level causal queries from limited low-level data. Finally, we experimentally show the effectiveness of projected abstraction models in high-dimensional image settings.
Poster
Kun Wang · Sumanth Varambally · Duncan Watson-Parris · Yian Ma · Rose Yu
[ East Exhibition Hall A-B ]
Abstract
Many important phenomena in scientific fields like climate, neuroscience, and epidemiology are naturally represented as spatiotemporal gridded data with complex interactions. Inferring causal relationships from these data is a challenging problem compounded by the high dimensionality of such data and the correlations between spatially proximate points. We present SPACY (SPAtiotemporal Causal discoverY), a novel framework based on variational inference, designed to model latent time series and their causal relationships from spatiotemporal data. SPACY alleviates the high-dimensional challenge by discovering causal structures in the latent space. To aggregate spatially proximate, correlated grid points, we use spatial factors, parametrized by spatial kernel functions, to map observational time series to latent representations. Theoretically, we generalize the problem to a continuous spatial domain and establish identifiability when the observations arise from a nonlinear, invertible function of the product of latent series and spatial factors. Using this approach, we avoid assumptions that are often unverifiable, including those about instantaneous effects or sufficient variability. Empirically, SPACY outperforms state-of-the-art baselines on synthetic data, even in challenging settings where existing methods struggle, while remaining scalable for large grids. SPACY also identifies key known phenomena from real-world climate data. An implementation of SPACY is available at \url{https://github.com/Rose-STL-Lab/SPACY/}
Poster
He Li · Haoang Chi · Mingyu Liu · Wanrong Huang · Liyang Xu · Wenjing Yang
[ East Exhibition Hall A-B ]
Abstract
The real world naturally has dimensions of time and space. Therefore, estimating counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attributes using the Transformer, exhibiting stronger estimation ability. Under mild assumptions, the proposed estimator within this framework is consistent and asymptotically normal. To validate the effectiveness of our approach, we conduct simulation experiments and real data experiments. Simulation experiments show that our estimator has a stronger estimation capability than baseline methods. Real data experiments provide a valuable conclusion regarding the causal effect of conflicts on forest loss in Colombia. The source code is available at this [URL](https://github.com/lihe-maxsize/DeppSTCI_Release_Version-master).
Poster
Armin Kekić · Sergio Hernan Garrido Mejia · Bernhard Schölkopf
[ East Exhibition Hall A-B ]
Abstract
Estimating causal effects of joint interventions on multiple variables is crucial in many domains, but obtaining data from such simultaneous interventions can be challenging. Our study explores how to learn joint interventional effects using only observational data and single-variable interventions. We present an identifiability result for this problem, showing that for a class of nonlinear additive outcome mechanisms, joint effects can be inferred without access to joint interventional data. We propose a practical estimator that decomposes the causal effect into confounded and unconfounded contributions for each intervention variable. Experiments on synthetic data demonstrate that our method achieves performance comparable to models trained directly on joint interventional data, outperforming a purely observational estimator.
Poster
Minqin Zhu · Zexu Sun · Ruoxuan Xiong · Anpeng Wu · Baohong Li · Caizhi Tang · JUN ZHOU · Fei Wu · Kun Kuang
[ East Exhibition Hall A-B ]
Abstract
Uplift modeling is crucial for identifying individuals likely to respond to a treatment in applications like marketing and customer retention, but evaluating these models is challenging due to the inaccessibility of counterfactual outcomes in real-world settings. In this paper, we identify a fundamental limitation in existing evaluation metrics, such as the uplift and Qini curves, which fail to rank individuals with binary negative outcomes accurately. This can lead to biased evaluations, where biased models receive higher curve values than unbiased ones, resulting in suboptimal model selection. To address this, we propose the Principled Uplift Curve (PUC), a novel evaluation metric that assigns equal curve values to individuals with positive and negative binary outcomes, offering a more balanced and unbiased assessment. We then derive the Principled Uplift Loss (PUL) function from the PUC and integrate it into a new uplift model, the Principled Treatment and Outcome Network (PTONet), to reduce bias during uplift model training. Experiments on both simulated and real-world datasets demonstrate that the PUC provides less biased evaluations, while PTONet outperforms existing methods. The source code is available at: https://github.com/euzmin/PUC.
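For context, the conventional Qini curve that the paper critiques has a standard form; a minimal sketch of that conventional computation (the proposed PUC changes how binary negative outcomes contribute):

```python
import numpy as np

def qini_curve(uplift_scores, treated, outcome):
    """Conventional Qini curve: rank individuals by predicted uplift,
    then compute cumulative treated outcomes minus control outcomes
    rescaled by the treated/control ratio among the top-ranked.

    treated: binary treatment indicators; outcome: binary outcomes.
    """
    order = np.argsort(-np.asarray(uplift_scores))
    t = np.asarray(treated, dtype=float)[order]
    y = np.asarray(outcome, dtype=float)[order]
    n_t, n_c = np.cumsum(t), np.cumsum(1.0 - t)
    y_t, y_c = np.cumsum(y * t), np.cumsum(y * (1.0 - t))
    ratio = np.divide(n_t, n_c, out=np.zeros_like(n_t), where=n_c > 0)
    return y_t - y_c * ratio
```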
Poster
Kexuan Shi · Hai Chen · Leheng Zhang · Shuhang Gu
[ East Exhibition Hall A-B ]
Abstract
Implicit Neural Representations (INRs), as a versatile representation paradigm, have achieved success in various computer vision tasks. Due to the spectral bias of the vanilla multi-layer perceptrons (MLPs), existing methods focus on designing MLPs with sophisticated architectures or repurposing training techniques for highly accurate INRs. In this paper, we delve into the linear dynamics model of MLPs and theoretically identify the empirical Neural Tangent Kernel (eNTK) matrix as a reliable link between spectral bias and training dynamics. Based on this insight, we propose a practical **I**nductive **G**radient **A**djustment (**IGA**) method, which can purposefully improve the spectral bias via inductive generalization of the eNTK-based gradient transformation matrix. Theoretical and empirical analyses validate the impact of IGA on spectral bias. Further, we evaluate our method on different INR tasks with various INR architectures and compare to existing training techniques. The superior and consistent improvements clearly validate the advantage of our IGA. Armed with our gradient adjustment method, better INRs with more enhanced texture details and sharpened edges can be learned from data by tailored impacts on spectral bias. The codes are available at: [https://github.com/LabShuHangGU/IGA-INR](https://github.com/LabShuHangGU/IGA-INR).
Poster
Xuanming Cui · Chionh Peng · Adriel Kuek · Ser-Nam Lim
[ East Exhibition Hall A-B ]
Abstract
Neural Theorem Provers (NTPs) present a promising framework for neuro-symbolic reasoning, combining end-to-end differentiability with the interpretability of symbolic logic programming. However, optimizing NTPs remains a significant challenge due to their complex objective landscape and gradient sparsity. On the other hand, Knowledge Graph Embedding (KGE) methods offer smooth optimization with well-defined learning objectives but often lack interpretability. In this work, we propose several strategies to integrate the strengths of NTPs and KGEs, and demonstrate substantial improvements in both accuracy and computational efficiency. Specifically, we show that by leveraging the strength of structural learning in KGEs, we can greatly improve NTPs' poorly structured embedding space, while by substituting NTPs with efficient KGE operations, we can significantly reduce evaluation time by over 1000$\times$ on large-scale datasets such as WN18RR with a mild accuracy trade-off.
Poster
KAIJUN LIU · Sijie Ruan · Liang Zhang · Cheng Long · Shuliang Wang · Liang Yu
[ East Exhibition Hall A-B ]
Abstract
Recovering human trajectories from incomplete or missing data is crucial for many mobility-based urban applications, e.g., urban planning, transportation, and location-based services. Existing methods mainly rely on recurrent neural networks or attention mechanisms. Though promising, they encounter limitations in capturing complex spatial-temporal dependencies in low-sampling trajectories. Recently, diffusion models have shown potential in content generation. However, most of the proposed methods generate content in continuous numerical representations, which cannot be directly adapted to human location trajectory recovery. In this paper, we introduce a conditional diffusion-based trajectory recovery method, namely, DiffMove. It first transforms locations in trajectories into the embedding space, in which embedding denoising is performed, and then missing locations are recovered by an embedding decoder. DiffMove not only improves accuracy by introducing high-quality generative methods into trajectory recovery, but also carefully models the transition, periodicity, and temporal patterns in human mobility. Extensive experiments based on two representative real-world mobility datasets are conducted, and the results show significant improvements (an average of 11% in recall) over the best baselines.
Poster
Gaole Dai · Chun-Kai Fan · Yiming Tang · Zhi Zhang · Yuan Zhang · Yulu Gan · Qizhe Zhang · Cheng-Ching Tseng · Shanghang Zhang · Tiejun Huang
[ East Exhibition Hall A-B ]
Abstract
Advances in Parameter-efficient Fine-tuning (PEFT) have bridged the performance gap with Full Fine-Tuning (FFT) through sophisticated analysis of pre-trained parameter spaces. Drawing insights from Neural Engrams (NE) in Biological Neural Networks (BNNs), we establish a connection between the low-rank property observed during PEFT's parameter space shifting and neurobiological mechanisms. This observation leads to our proposed method, **S**ynapse and **N**euron (**SAN**), which decomposes and propagates the scaling component from anterior feature adjustment vectors towards posterior weight matrices. Our approach is theoretically grounded in Long-Term Potentiation/Depression (LTP/D) phenomena, which govern synapse development through neurotransmitter release modulation. Extensive experiments demonstrate its effectiveness: on **vision tasks** across VTAB, FGVC, and GIC (25 datasets) using ViT, Swin-T and ConvNeXt architectures, SAN outperforms FFT by up to *8.7%* and LoRA by *3.2%*; on **language tasks** using Commonsense Reasoning (8 datasets) with LLaMA models (all generations), it surpasses ChatGPT by up to *8.5%* and LoRA by *4.7%*; on **vision-language tasks** using Visual Instruction Tuning (7 datasets) with LLaVA models, it exceeds FFT by up to *2.4%* and LoRA by *1.9%*. Our code and W&B logs will be released.
Spotlight Poster
Guanghui Wang · Zhiyong Yang · Zitai Wang · Shi Wang · Qianqian Xu · Qingming Huang
[ East Exhibition Hall A-B ]
Abstract
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the \textbf{\textit{Hardness-Concentration}} effect, which refers to focusing on modes with large errors, and the \textbf{\textit{Confidence-Concentration}} effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with $\alpha$-$\beta$-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving a better trade-off between these effects. Extensive …
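For reference, the two endpoint objectives, written for a teacher distribution $p$ and a student distribution $q_\theta$ over classes $k$ (ABKD's $\alpha$-$\beta$-divergence family interpolates between these; its exact form is given in the paper):

$$
\mathrm{FKLD} = \mathrm{KL}(p \,\|\, q_\theta) = \sum_k p_k \log \frac{p_k}{q_{\theta,k}}, \qquad
\mathrm{RKLD} = \mathrm{KL}(q_\theta \,\|\, p) = \sum_k q_{\theta,k} \log \frac{q_{\theta,k}}{p_k}.
$$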
Spotlight Poster
Harikrishna Metta · Venkatesh Babu Radhakrishnan
[ East Exhibition Hall A-B ]
Abstract
In machine learning, separating data into classes is a fundamental problem. A mathematical framework around classes is presented in this work to deepen the understanding of classes. The classes are defined as vectors in a vector space, where addition corresponds to the union of classes, and scalar multiplication resembles the set complement of classes. The zero vector in the vector space corresponds to a class referred to as the Metta-Class. This discovery enables numerous applications. One such application, termed 'clear learning' in this work, focuses on learning the true nature (manifold) of the data instead of merely learning a boundary sufficient for classification. Another application, called 'unary class learning', involves learning a single class in isolation rather than learning by comparing two or more classes. Additionally, 'set operations on classes' is another application highlighted in this work. Furthermore, continual learning of classes is facilitated by smaller networks. The Metta-Class enables neural networks to learn only the data manifold; therefore, it can also be used for the generation of new data. Results for the key applications are shown using the MNIST dataset. To further strengthen the claims, some results are also produced using the CIFAR-10 and ImageNet-1k embeddings. The code supporting these …
Poster
Chao Gao · Liren Shan · Vaidehi Srinivas · Aravindan Vijayaraghavan
[ East Exhibition Hall A-B ]
Abstract
Conformal Prediction is a widely studied technique to construct prediction sets of future observations. Most conformal prediction methods focus on achieving the necessary coverage guarantees, but do not provide formal guarantees on the size (volume) of the prediction sets. We first prove the impossibility of volume optimality: any distribution-free method can only find a trivial solution. We then introduce a new notion of volume optimality by restricting the prediction sets to belong to a set family (of finite VC-dimension), specifically a union of $k$-intervals. Our main contribution is an efficient distribution-free algorithm based on dynamic programming (DP) to find a union of $k$-intervals that is guaranteed for any distribution to have near-optimal volume among all unions of $k$-intervals satisfying the desired coverage property. By adopting the framework of distributional conformal prediction (Chernozhukov et al., 2021), the new DP-based conformity score can also be applied to achieve approximate conditional coverage and conditional restricted volume optimality, as long as a reasonable estimator of the conditional CDF is available. While the theoretical results already establish volume-optimality guarantees, they are complemented by experiments that demonstrate that our method can significantly outperform existing methods in many settings.
Poster
Gabriel Thompson · Kai Yue · Chau-Wai Wong · Huaiyu (David) Dai
[ East Exhibition Hall A-B ]
Abstract
Decentralized federated learning (DFL) is a collaborative machine learning framework for training a model across participants without a central server or raw data exchange. DFL faces challenges due to statistical heterogeneity, as participants often possess data of different distributions reflecting local environments and user behaviors. Recent work has shown that the neural tangent kernel (NTK) approach, when applied to federated learning in a centralized framework, can lead to improved performance. We propose an approach leveraging the NTK to train client models in the decentralized setting, while introducing a synergy between NTK-based evolution and model averaging. This synergy exploits inter-client model deviation and improves both accuracy and convergence in heterogeneous settings. Empirical results demonstrate that our approach consistently achieves higher accuracy than baselines in highly heterogeneous settings, where other approaches often underperform. Additionally, it reaches target performance in 4.6 times fewer communication rounds. We validate our approach across multiple datasets, network topologies, and heterogeneity settings to ensure robustness and generalization. Source code for NTK-DFL is available at https://github.com/Gabe-Thomp/ntk-dfl
Poster
Prakash Palanivelu Rajmohan · Fred Roosta
[ East Exhibition Hall A-B ]
Abstract
While norm-based and leverage-score-based methods have been extensively studied for identifying "important" data points in linear models, analogous tools for nonlinear models remain significantly underdeveloped. By introducing the concept of the adjoint operator of a nonlinear map, we address this gap and generalize norm-based and leverage-score-based importance sampling to nonlinear settings. We demonstrate that sampling based on these generalized notions of norm and leverage scores provides approximation guarantees for the underlying nonlinear mapping, similar to linear subspace embeddings. As direct applications, these nonlinear scores not only reduce the computational complexity of training nonlinear models by enabling efficient sampling over large datasets but also offer a novel mechanism for model explainability and outlier detection. Our contributions are supported by both theoretical analyses and experimental results across a variety of supervised learning scenarios.
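For reference, a minimal sketch of the classical linear-model leverage scores that the abstract generalizes (the nonlinear, adjoint-operator-based scores themselves are not reproduced here):

```python
import numpy as np

def leverage_scores(A):
    """Leverage score of row i: squared norm of the i-th row of U, A = U S V^T."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return (U ** 2).sum(axis=1)

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 10))
p = leverage_scores(A)
p /= p.sum()                               # sampling distribution over rows
rows = rng.choice(len(A), size=100, p=p)   # importance-sample "important" rows
```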
Poster
Henrik von Kleist · Joshua Wendland · Ilya Shpitser · Carsten Marr
[ East Exhibition Hall A-B ]
Abstract
Feature importance metrics are critical for interpreting machine learning models and understanding the relevance of individual features. However, real-world data often exhibit missingness, thereby complicating how feature importance should be evaluated. We introduce the distinction between two evaluation frameworks under missing data: (1) feature importance under the full data, as if every feature had been fully measured, and (2) feature importance under the observed data, where missingness is governed by the current measurement policy. While the full data perspective offers insights into the data generating process, it often relies on unrealistic assumptions and cannot guide decisions when missingness persists at model deployment. Since neither framework directly informs improvements in data collection, we additionally introduce the feature measurement importance gradient (FMIG), a novel, model-agnostic metric that identifies features that should be measured more frequently to enhance predictive performance. Using synthetic data, we illustrate key differences between these metrics and the risks of conflating them.
Poster
Sida Li · Nikolaos Ignatiadis
[ East Exhibition Hall A-B ]
Abstract
Prediction-Powered Inference (PPI) is a powerful framework for enhancing statistical estimates by combining limited gold-standard data with machine learning (ML) predictions. While prior work has demonstrated PPI’s benefits for individual statistical problems, modern applications require answering numerous parallel statistical questions. We introduce Prediction-Powered Adaptive Shrinkage ($\texttt{PAS}$), a method that bridges PPI with empirical Bayes shrinkage to improve estimation of multiple means. $\texttt{PAS}$ debiases noisy ML predictions $\textit{within}$ each task and then borrows strength $\textit{across}$ tasks by using those same predictions as a reference point for shrinkage. The amount of shrinkage is determined by minimizing an unbiased estimate of risk, and we prove that this tuning strategy is asymptotically optimal. Experiments on both synthetic and real-world datasets show that $\texttt{PAS}$ adapts to the reliability of the ML predictions and outperforms traditional and modern baselines in large-scale applications.
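As a toy illustration in the spirit of PAS rather than the paper's estimator, classical SURE-tuned shrinkage pulls noisy per-task means toward reference points (here, ML predictions), assuming Gaussian noise with known variance; `sigma2` and the function name are illustrative:

```python
import numpy as np

def shrink_toward_predictions(y, m, sigma2):
    """Shrink noisy means y toward reference points m; weight chosen by SURE."""
    s = np.sum((y - m) ** 2)
    lam = min(1.0, len(y) * sigma2 / s)  # SURE-optimal shrinkage weight
    return y - lam * (y - m)
```

The actual method additionally debiases the predictions within each task before using them as shrinkage targets.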
Poster
Jonas Schweisthal · Dennis Frauen · Maresa Schröder · Konstantin Hess · Niki Kilbertus · Stefan Feuerriegel
[ East Exhibition Hall A-B ]
Abstract
Reliable estimation of treatment effects from observational data is important in many disciplines such as medicine. However, estimation is challenging when unconfoundedness as a standard assumption in the causal inference literature is violated. In this work, we leverage arbitrary (potentially high-dimensional) instruments to estimate bounds on the conditional average treatment effect (CATE). Our contributions are three-fold: (1) We propose a novel approach for partial identification through a mapping of instruments to a discrete representation space so that we yield valid bounds on the CATE. This is crucial for reliable decision-making in real-world applications. (2) We derive a two-step procedure that learns tight bounds using a tailored neural partitioning of the latent instrument space. As a result, we avoid instability issues due to numerical approximations or adversarial training. Furthermore, our procedure aims to reduce the estimation variance in finite-sample settings to yield more reliable estimates. (3) We show theoretically that our procedure obtains valid bounds while reducing estimation variance. We further perform extensive experiments to demonstrate the effectiveness across various settings. Overall, our procedure offers a novel path for practitioners to make use of potentially high-dimensional instruments (e.g., as in Mendelian randomization).
Poster
Rickard K.A. Karlsson · Jesse H. Krijthe
[ East Exhibition Hall A-B ]
Abstract
A major challenge in estimating treatment effects in observational studies is the reliance on untestable conditions such as the assumption of no unmeasured confounding. In this work, we propose an algorithm that can falsify the assumption of no unmeasured confounding in a setting with observational data from multiple heterogeneous sources, which we refer to as environments. Our proposed falsification strategy leverages a key observation that unmeasured confounding can cause observed causal mechanisms to appear dependent. Building on this observation, we develop a novel two-stage procedure that detects these dependencies with high statistical power while controlling false positives. The algorithm does not require access to randomized data and, in contrast to other falsification approaches, functions even under transportability violations when the environment has a direct effect on the outcome of interest. To showcase the practical relevance of our approach, we show that our method is able to efficiently detect confounding on both simulated and semi-synthetic data.
Poster
Xi Chen · Yateng Tang · Jiarong Xu · Jiawei Zhang · Siwei Zhang · Sijia Peng · Xuehao Zheng · Yun Xiong
[ East Exhibition Hall A-B ]
Abstract
Effectively modeling time information and incorporating it into applications or models involving chronologically occurring events is crucial. Real-world scenarios often involve diverse and complex time patterns, which pose significant challenges for time encoding methods. While previous methods focus on capturing time patterns, many rely on specific inductive biases, such as using trigonometric functions to model periodicity. This narrow focus on single-pattern modeling makes them less effective in handling the diversity and complexities of real-world time patterns. In this paper, we investigate how to improve existing, commonly used time encoding methods and introduce **Learnable Transformation-based Generalized Time Encoding (LeTE)**. We propose using deep function learning techniques to parameterize nonlinear transformations in time encoding, making them learnable and capable of modeling generalized time patterns, including diverse and complex temporal dynamics. By enabling learnable transformations, LeTE encompasses previous methods as specific cases and allows seamless integration into a wide range of tasks. Through extensive experiments across diverse domains, we demonstrate the versatility and effectiveness of LeTE.
Poster
Gaurav Menghani · Ravi Kumar · Sanjiv Kumar
[ East Exhibition Hall A-B ]
Abstract
One of the core pillars of efficient deep learning methods is architectural improvements, such as residual/skip connections, which have led to significantly better model convergence and quality. Since their introduction, residual connections have become ubiquitous not only in convolutional neural networks but also in transformer-based architectures, the backbone of LLMs. In this paper, we introduce the Learned Augmented Residual Layer (LAuReL) --- a novel generalization of the canonical residual connection --- designed to serve as an in-situ replacement while outperforming it in both model quality and footprint metrics. Our experiments show that LAuReL can enhance quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naively increasing parameter count. For example, on the ImageNet-1K task, LAuReL achieves the same model quality improvements as naively adding an extra layer while using $2.6 \times$ fewer parameters. Similarly, when pre-training 1B and 4B parameter LLMs, LAuReL improves performance on a variety of challenging downstream evaluation tasks by 2.54\% to 20.05\%, while adding only 0.012\% and 0.1\% additional parameters, respectively.
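As a hedged sketch of one plausible instantiation of a learned augmented residual connection (a learned scalar gate on the transformed branch plus a low-rank learned term on the skip path; the paper's exact parametrization may differ):

```python
import torch
from torch import nn

class AugmentedResidual(nn.Module):
    """y = alpha * f(x) + x + B(A x): residual with a low-rank learned correction."""
    def __init__(self, f, dim, rank=8):
        super().__init__()
        self.f = f
        self.alpha = nn.Parameter(torch.ones(()))
        self.A = nn.Linear(dim, rank, bias=False)
        self.B = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.B.weight)  # start exactly at the canonical residual
    def forward(self, x):
        return self.alpha * self.f(x) + x + self.B(self.A(x))
```

Because the low-rank term starts at zero, such a layer can replace a standard residual connection without changing the network's initial function.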
Poster
David Boetius · Stefan Leue · Tobias Sutter
[ East Exhibition Hall A-B ]
Abstract
Probabilistic verification problems of neural networks are concerned with formally analysing the output distribution of a neural network under a probability distribution of the inputs. Examples of probabilistic verification problems include verifying the demographic parity fairness notion or quantifying the safety of a neural network. We present a new algorithm for solving probabilistic verification problems of neural networks based on an algorithm for computing and iteratively refining lower and upper bounds on probabilities over the outputs of a neural network. By applying state-of-the-art bound propagation and branch and bound techniques from non-probabilistic neural network verification, our algorithm significantly outpaces existing probabilistic verification algorithms, reducing solving times for various benchmarks from the literature from tens of minutes to tens of seconds. Furthermore, our algorithm compares favourably even to dedicated algorithms for restricted probabilistic verification problems. We complement our empirical evaluation with a theoretical analysis, proving that our algorithm is sound and, under mildly restrictive conditions, also complete when using a suitable set of heuristics.
Poster
Piotr Kubaty · Bartosz Wójcik · Bartłomiej Krzepkowski · Monika Michaluk · Tomasz Trzcinski · Jary Pomponi · Kamil Adamczewski
[ East Exhibition Hall A-B ]
Abstract
Early exits enable the network's forward pass to terminate early by attaching trainable internal classifiers to the backbone network. Existing early-exit methods typically adopt either a joint training approach, where the backbone and exit heads are trained simultaneously, or a disjoint approach, where the heads are trained separately. However, the implications of this choice are often overlooked, with studies typically adopting one approach without adequate justification. This choice influences training dynamics, and its impact remains largely unexplored. In this paper, we introduce a set of metrics to analyze early-exit training dynamics and guide the choice of training strategy. We demonstrate that the conventionally used joint and disjoint regimes yield suboptimal performance. To address these limitations, we propose a mixed training strategy: the backbone is trained first, followed by the training of the entire multi-exit network. Through comprehensive evaluations of training strategies across various architectures, datasets, and early-exit methods, we present the strengths and weaknesses of early-exit training strategies. In particular, we show consistent improvements in performance and efficiency using the proposed mixed strategy.
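A minimal sketch of the proposed mixed regime under illustrative assumptions: the backbone is trained alone first, then the full multi-exit network is trained jointly (`backbone.features` returning one intermediate feature map per exit is a hypothetical interface):

```python
import torch
import torch.nn.functional as F

def train_mixed(backbone, heads, loader, backbone_epochs, joint_epochs, lr=1e-3):
    # Phase 1: train the backbone alone.
    opt = torch.optim.Adam(backbone.parameters(), lr=lr)
    for _ in range(backbone_epochs):
        for x, y in loader:
            loss = F.cross_entropy(backbone(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
    # Phase 2: train the backbone and all exit heads together.
    params = list(backbone.parameters()) + [p for h in heads for p in h.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(joint_epochs):
        for x, y in loader:
            feats = backbone.features(x)  # hypothetical: one feature map per exit
            loss = sum(F.cross_entropy(h(z), y) for h, z in zip(heads, feats))
            opt.zero_grad(); loss.backward(); opt.step()
```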
Poster
Artur Back de Luca · George Giapitzakis · Shenghao Yang · Petar Veličković · Kimon Fountoulakis
[ East Exhibition Hall A-B ]
Abstract
There is a growing interest in the ability of neural networks to execute algorithmic tasks (e.g., arithmetic, summary statistics, and sorting). The goal of this work is to better understand the role of attention in Transformers for algorithmic execution. Its importance for algorithmic execution has been studied theoretically and empirically using parallel computational models. Notably, many parallel algorithms communicate between processors solely using positional information. Inspired by this observation, we investigate how Transformers can execute algorithms using positional attention, where attention weights depend exclusively on positional encodings. We prove that Transformers with positional attention (positional Transformers) maintain the same expressivity as parallel computational models, incurring a logarithmic depth cost relative to the input length. We analyze their in-distribution learnability and explore how parameter norms in positional attention affect sample complexity. Our results show that positional Transformers introduce a learning trade-off: while they exhibit better theoretical dependence on parameter norms, certain tasks may require more layers, which can, in turn, increase sample complexity. Finally, we empirically explore the out-of-distribution performance of positional Transformers and find that they perform well in tasks where their underlying algorithmic solution relies on positional information.
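A minimal sketch of a positional-attention layer as described in the abstract: attention weights are computed solely from learned positional encodings, while values still come from the token content (dimensions and names are illustrative):

```python
import torch
from torch import nn

class PositionalAttention(nn.Module):
    """Attention whose weights depend only on positions, never on token content."""
    def __init__(self, seq_len, d_model, d_pos=32):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(seq_len, d_pos))
        self.Wq = nn.Linear(d_pos, d_pos, bias=False)
        self.Wk = nn.Linear(d_pos, d_pos, bias=False)
        self.Wv = nn.Linear(d_model, d_model, bias=False)
    def forward(self, x):                 # x: (batch, seq_len, d_model)
        q, k = self.Wq(self.pos), self.Wk(self.pos)
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return attn @ self.Wv(x)          # same weights for every sequence
```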
Poster
Yinbin Han · Meisam Razaviyayn · Renyuan Xu
[ East Exhibition Hall A-B ]
Abstract
Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback–Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularities throughout training, we prove that the control and value sequences generated by the algorithm preserve the desired regularity. Finally, we extend our framework to parametric settings for efficient implementation and demonstrate the practical effectiveness of the proposed PI-FT algorithm through numerical experiments.
Poster
Alexander Atanasov · Jacob A Zavatone-Veth · Cengiz Pehlevan
[ East Exhibition Hall A-B ]
Abstract
Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross-validation estimator (GCV) fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently-computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high-dimensional datasets.
Poster
Freya Behrens · Luca Biggio · Lenka Zdeborová
[ East Exhibition Hall A-B ]
Abstract
Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allows for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.
Poster
Petar Veličković · Christos Perivolaropoulos · Federico Barbero · Razvan Pascanu
[ East Exhibition Hall A-B ]
Abstract
A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions with increasing problem size, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
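A quick numerical illustration of the dispersion effect, assuming a fixed logit gap for the maximal key: the winner's softmax weight decays toward zero as the number of items grows, while lowering the temperature (sharpening) counteracts the decay:

```python
import numpy as np

for n in (16, 256, 4096):
    logits = np.zeros(n)
    logits[0] = 2.0                      # item 0 carries the maximal key
    w = np.exp(logits) / np.exp(logits).sum()
    sharp = np.exp(logits / 0.1) / np.exp(logits / 0.1).sum()  # temperature 0.1
    print(f"n={n:5d}  weight on max: {w[0]:.3f}  at low temperature: {sharp[0]:.3f}")
```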
Poster
Ruiqi Zhang · Jingfeng Wu · Peter Bartlett
[ East Exhibition Hall A-B ]
Abstract
We analyze the convergence of gradient descent (GD) with large, adaptive stepsizes for logistic regression on linearly separable data. The stepsize adapts to the current risk, scaled by a fixed base stepsize $\eta$. We prove that once the number of iterates $t$ surpasses a margin-dependent threshold, the averaged GD iterate achieves a risk upper bound of $\exp(-\Theta(\eta t))$, where $\eta$ can be chosen arbitrarily large. This implies that GD attains \emph{arbitrarily fast} convergence rates via large stepsizes, although the risk evolution might not be monotonic. In contrast, prior adaptive stepsize GD analyses require a monotonic risk decrease, limiting their rates to $\exp(-\Theta(t))$. We further establish a margin-dependent lower bound on the iteration complexity for any first-order method to attain a small risk, justifying the necessity of the burn-in phase in our analysis. Our results generalize to a broad class of loss functions and two-layer networks under additional assumptions.
Poster
Zixiang Chen · Greg Yang · Qingyue Zhao · Quanquan Gu
[ East Exhibition Hall A-B ]
Abstract
Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
Poster
Javan Tahir · Surya Ganguli · Grant Rotskoff
[ East Exhibition Hall A-B ]
Abstract
With the emergence of large-scale pre-trained neural networks, methods to adapt such "foundation" models to data-limited downstream tasks have become a necessity. Fine-tuning, preference optimization, and transfer learning have all been successfully employed for these purposes when the target task closely resembles the source task, but a precise theoretical understanding of "task similarity" is still lacking. We adopt a \emph{feature-centric} viewpoint on transfer learning and establish a number of theoretical results that demonstrate that when the target task is well represented by the feature space of the pre-trained model, transfer learning outperforms training from scratch. We study deep linear networks as a minimal model of transfer learning in which we can analytically characterize the transferability phase diagram as a function of the target dataset size and the feature space overlap. For this model, we establish rigorously that when the feature space overlap between the source and target tasks is sufficiently strong, both linear transfer and fine-tuning improve performance, especially in the low data limit. These results build on an emerging understanding of feature learning dynamics in deep linear networks, and we demonstrate numerically that the rigorous results we derive for the linear case also apply to nonlinear networks.
Poster
Ningyuan Huang · Miguel Sarabia · Abhinav Moudgil · Pau Rodriguez · Luca Zappella · Federico Danieli
[ East Exhibition Hall A-B ]
Abstract
State-Space Models (SSMs), and particularly Mamba, have recently emerged as a promising alternative to Transformers. Mamba introduces input selectivity to its SSM layer (S6) and incorporates convolution and gating into its block definition. While these modifications do improve Mamba's performance over its SSM predecessors, it remains largely unclear how Mamba leverages the additional functionalities provided by input selectivity, and how these interact with the other operations in the Mamba architecture. In this work, we demystify the role of input selectivity in Mamba, investigating its impact on function approximation power, long-term memorization, and associative recall capabilities. In particular: (i) we prove that the S6 layer of Mamba can represent projections onto *Haar wavelets*, providing an edge over its Diagonal SSM (S4D) predecessor in approximating discontinuous functions commonly arising in practice; (ii) we show how the S6 layer can dynamically counteract memory decay; (iii) we provide analytical solutions to the MQAR associative recall task using the Mamba architecture with different mixers --- Mamba, Mamba-2, and S4D. We demonstrate the tightness of our theoretical constructions with empirical results on concrete tasks. Our findings offer a mechanistic understanding of Mamba and reveal opportunities for improvement.
Poster
Alaa Anani · Tobias Lorenz · Mario Fritz · Bernt Schiele
[ East Exhibition Hall A-B ]
Abstract
Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at [https://github.com/AlaaAnani/certified-attributions](https://github.com/AlaaAnani/certified-attributions).
Poster
Wenwen He · Wenke Huang · Bin Yang · ShuKan Liu · Mang Ye
[ East Exhibition Hall A-B ]
Abstract
Federated Learning (FL) enables collaborative training with privacy preservation but is vulnerable to backdoor attacks, where malicious clients degrade model performance on targeted inputs. These attacks exploit FL's decentralized nature, while existing defenses, based on isolated behaviors and fixed rules, can be bypassed by adaptive attackers. To address these limitations, we propose **SPMC**, a marginal collaboration defense mechanism that leverages intrinsic consistency across clients to estimate inter-client marginal contributions. This allows the system to dynamically reduce the influence of clients whose behavior deviates from the collaborative norm, thus maintaining robustness even as the number of attackers changes. In addition to overcoming the weaknesses of proxy-dependent purification, we introduce a self-purification process that locally adjusts suspicious gradients. By aligning them with margin-based model updates, we mitigate the effect of local poisoning. Together, these two modules significantly improve the adaptability and resilience of FL systems, both at the client and server levels. Experimental results on a variety of classification benchmarks demonstrate that SPMC achieves strong defense performance against sophisticated backdoor attacks without sacrificing accuracy on benign tasks. The code is posted at: https://github.com/WenddHe0119/SPMC.
Poster
Enes Altinisik · Safa Messaoud · Husrev Taha Sencar · Hassan Sajjad · Sanjay Chawla
[ East Exhibition Hall A-B ]
Abstract
Adversarial Training (AT) impacts different architectures in distinct ways: vision models gain robustness but face reduced generalization, encoder-based models exhibit limited robustness improvements with minimal generalization loss, and recent work in latent-space adversarial training demonstrates that decoder-based models achieve improved robustness by applying AT across multiple layers. We provide the first explanation for these trends by leveraging the manifold conjecture: off-manifold adversarial examples (AEs) enhance robustness, while on-manifold AEs improve generalization. We show that vision and decoder-based models exhibit low intrinsic dimensionality in earlier layers (favoring off-manifold AEs), whereas encoder-based models do so in later layers (favoring on-manifold AEs). Exploiting this property, we introduce SMAAT, which improves the scalability of AT for encoder-based models by perturbing the layer with the lowest intrinsic dimensionality. This reduces the projected gradient descent (PGD) chain length required for AE generation, cutting GPU time by 25–33% while significantly boosting robustness. We validate SMAAT across multiple tasks, including text generation, sentiment classification, safety filtering, and retrieval augmented generation setups, demonstrating superior robustness with comparable generalization to standard training.
Poster
Sen Peng · Mingyue Wang · Jianfei He · Jijia Yang · Xiaohua Jia
[ East Exhibition Hall A-B ]
Abstract
Latent diffusion models have recently demonstrated superior capabilities in many downstream image synthesis tasks. However, customization of latent diffusion models using unauthorized data can severely compromise the privacy and intellectual property rights of data owners. Adversarial examples as protective perturbations have been developed to defend against unauthorized data usage by introducing imperceptible noise to customization samples, preventing diffusion models from effectively learning them. In this paper, we first reveal that the primary reason adversarial examples are effective as protective perturbations in latent diffusion models is the distortion of their latent representations, as demonstrated through qualitative and quantitative experiments. We then propose Contrastive Adversarial Training (CAT), which utilizes lightweight adapters as an adaptive attack against these protection methods, highlighting their lack of robustness. Extensive experiments demonstrate that our CAT method significantly reduces the effectiveness of protective perturbations in customization, urging the community to reconsider and improve the robustness of existing protective perturbations. The code is available at \url{https://github.com/senp98/CAT}.
Poster
EMANUELE SANSONE · Tim Lebailly · Tinne Tuytelaars
[ East Exhibition Hall A-B ]
Abstract
We present a principled and simplified design of the projector and loss function for non-contrastive self-supervised learning based on hyperdimensional computing. We theoretically demonstrate that this design introduces an inductive bias that encourages representations to be simultaneously decorrelated and clustered, without explicitly enforcing these properties. This bias provably enhances generalization and suffices to avoid known training failure modes, such as representation, dimensional, cluster, and intracluster collapses. We validate our theoretical findings on image datasets, including SVHN, CIFAR-10, CIFAR-100, and ImageNet-100. Our approach effectively combines the strengths of feature decorrelation and cluster-based self-supervised learning methods, overcoming training failure modes while achieving strong generalization in clustering and linear classification tasks.
Poster
Jun Chen · Hong Chen · Yonghua Yu · Yiming Ying
[ East Exhibition Hall A-B ]
Abstract
In recent years, contrastive learning has achieved state-of-the-art performance in the area of self-supervised representation learning. Many previous works have attempted to provide a theoretical understanding of the success of contrastive learning. Almost all of them rely on a default assumption, i.e., the label consistency assumption, which may not hold in practice (the probability of failure is called labeling error) due to the strength and randomness of common augmentation strategies, such as random resized crop (RRC). This paper investigates the theoretical impact of labeling error on the downstream classification performance of contrastive learning. We first reveal several significant negative impacts of labeling error on downstream classification risk. To mitigate these impacts, a data dimensionality reduction method (e.g., singular value decomposition, SVD) is applied to the original data to reduce false positive samples, and we establish both theoretical and empirical evaluations. Moreover, we also find that SVD acts as a double-edged sword: it may lead to the deterioration of downstream classification accuracy due to the reduced connectivity of the augmentation graph. Based on the above observations, we give the augmentation suggestion that we should use some moderate embedding dimension (such as $512, 1024$ in our experiments), data inflation, weak augmentation, and SVD to …
Poster
Thalles Silva · Helio Pedrini · Adín Ramírez Rivera
[ East Exhibition Hall A-B ]
Abstract
We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.
Poster
Peiyuan Liu · Beiliang Wu · Yifan Hu · Naiqi Li · Tao Dai · Jigang Bao · Shutao Xia
[ East Exhibition Hall A-B ]
Abstract
Non-stationarity poses significant challenges for multivariate time series forecasting due to the inherent short-term fluctuations and long-term trends that can lead to spurious regressions or obscure essential long-term relationships. Most existing methods either eliminate or retain non-stationarity without adequately addressing its distinct impacts on short-term and long-term modeling. Eliminating non-stationarity is essential for avoiding spurious regressions and capturing local dependencies in short-term modeling, while preserving it is crucial for revealing long-term cointegration across variates. In this paper, we propose TimeBridge, a novel framework designed to bridge the gap between non-stationarity and dependency modeling in long-term time series forecasting. By segmenting input series into smaller patches, TimeBridge applies Integrated Attention to mitigate short-term non-stationarity and capture stable dependencies within each variate, while Cointegrated Attention preserves non-stationarity to model long-term cointegration across variates. Extensive experiments show that TimeBridge consistently achieves state-of-the-art performance in both short-term and long-term forecasting. Additionally, TimeBridge demonstrates exceptional performance in financial forecasting on the CSI 500 and S&P 500 indices, further validating its robustness and effectiveness. Code is available at https://github.com/Hank0626/TimeBridge.
Poster
Yitian Zhang · Liheng Ma · Antonios Valkanas · Boris Oreshkin · Mark Coates
[ East Exhibition Hall A-B ]
Abstract
Koopman operator theory provides a framework for nonlinear dynamical system analysis and time-series forecasting by mapping dynamics to a space of real-valued measurement functions, enabling a linear operator representation. Despite the advantage of linearity, the operator is generally infinite-dimensional. Therefore, the objective is to learn measurement functions that yield a tractable finite-dimensional Koopman operator approximation. In this work, we establish a connection between Koopman operator approximation and linear Recurrent Neural Networks (RNNs), which have recently demonstrated remarkable success in sequence modeling. We show that by considering an extended state consisting of lagged observations, we can establish an equivalence between a structured Koopman operator and linear RNN updates. Building on this connection, we present SKOLR, which integrates a learnable spectral decomposition of the input signal with a multilayer perceptron (MLP) as the measurement functions and implements a structured Koopman operator via a highly parallel linear RNN stack. Numerical experiments on various forecasting benchmarks and dynamical systems show that this streamlined, Koopman-theory-based design delivers exceptional performance. Our code is available at: https://github.com/networkslab/SKOLR.
Poster
Marten Lienen · Abdullah Saydemir · Stephan Günnemann
[ East Exhibition Hall A-B ]
Abstract
State space models are emerging as a dominant model class for sequence problems with many relying on the HiPPO framework to initialize their dynamics. However, HiPPO fundamentally assumes data to be noise-free; an assumption often violated in practice. We extend the HiPPO theory with measurement noise and derive an uncertainty-aware initialization for state space model dynamics. In our analysis, we interpret HiPPO as a linear stochastic control problem where the data enters as a noise-free control signal. We then reformulate the problem so that the data become noisy outputs of a latent system and arrive at an alternative dynamics initialization that infers the posterior of this latent system from the data without increasing runtime. Our experiments show that our initialization improves the resistance of state-space models to noise both at training and inference time.
Poster
Sungwon Han · Seungeon Lee · MEEYOUNG CHA · Sercan Arik · Jinsung Yoon
[ East Exhibition Hall A-B ]
Abstract
Time series forecasting uses historical data to predict future trends, leveraging the relationships between past observations and available features. In this paper, we propose RAFT, a retrieval-augmented time series forecasting method to provide sufficient inductive biases and complement the model's learning capacity. When forecasting the subsequent time frames, we directly retrieve historical data candidates from the training dataset with patterns most similar to the input, and utilize the future values of these candidates alongside the inputs to obtain predictions. This simple approach augments the model's capacity by externally providing information about past patterns via retrieval modules. Our empirical evaluations on ten benchmark datasets show that RAFT consistently outperforms contemporary baselines with an average win ratio of 86%.
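A minimal sketch of the retrieval step described in the abstract, under illustrative assumptions (Euclidean similarity and simple averaging of retrieved futures; the actual method combines retrieved values with a learned forecaster):

```python
import numpy as np

def retrieve_futures(history, query, horizon, k=5):
    """Average the futures of the k past windows most similar to `query`."""
    w = len(query)
    starts = range(len(history) - w - horizon + 1)
    dists = np.array([np.linalg.norm(history[s:s + w] - query) for s in starts])
    top = np.argsort(dists)[:k]
    return np.mean([history[s + w:s + w + horizon] for s in top], axis=0)
```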
Poster
Luca Masserano · Abdul Fatir Ansari · Boran Han · Xiyuan Zhang · Christos Faloutsos · Michael Mahoney · Andrew Wilson · Youngsuk Park · Syama Sundar Yadav Rangapuram · Danielle Maddix · Yuyang Wang
[ East Exhibition Hall A-B ]
Abstract
How to best develop foundational models for time series forecasting remains an important open question. Tokenization is a crucial consideration in this effort: what is an effective discrete vocabulary for a real-valued sequential input? To address this question, we develop WaveToken, a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies. Our method first scales and decomposes the input time series, then thresholds and quantizes the wavelet coefficients, and finally pre-trains an autoregressive model to forecast coefficients for the forecast horizon. By decomposing coarse and fine structures in the inputs, wavelets provide an eloquent and compact language for time series forecasting that simplifies learning. Empirical results on a comprehensive benchmark, including 42 datasets for both in-domain and zero-shot settings, show that WaveToken: i) performs on par or better than recently proposed foundation models for forecasting while using a much smaller vocabulary (1024 tokens), and is competitive with modern deep learning models trained specifically on each dataset; ii) exhibits superior generalization capabilities, achieving the best average rank across all datasets for three complementary metrics; and iii) easily captures complex temporal patterns of practical relevance that are challenging for other recent pre-trained models, including trends, …
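A rough sketch of the scale-decompose-threshold-quantize pipeline the abstract outlines, using PyWavelets (the bin count, threshold, and wavelet choice are illustrative; the paper's exact quantization scheme may differ):

```python
import numpy as np
import pywt

def wavelet_tokens(series, wavelet="haar", level=3, n_bins=1024, thresh=1e-2):
    """Turn a real-valued series into integer token ids via wavelet coefficients."""
    x = (series - series.mean()) / (series.std() + 1e-8)            # scale
    coeffs = np.concatenate(pywt.wavedec(x, wavelet, level=level))  # decompose
    coeffs[np.abs(coeffs) < thresh] = 0.0                           # threshold
    edges = np.linspace(coeffs.min(), coeffs.max(), n_bins - 1)
    return np.digitize(coeffs, edges)                               # quantize
```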
Poster
Grigory Bartosh · Dmitry Vetrov · Christian Andersson Naesseth
[ East Exhibition Hall A-B ]
Abstract
The Latent Stochastic Differential Equation (SDE) is a powerful tool for time series and sequence modeling. However, training Latent SDEs typically relies on adjoint sensitivity methods, which depend on simulation and backpropagation through approximate SDE solutions, limiting scalability. In this work, we propose SDE Matching, a new simulation-free method for training Latent SDEs. Inspired by modern Score- and Flow Matching algorithms for learning generative dynamics, we extend these ideas to the domain of stochastic dynamics for time series modeling, eliminating the need for costly numerical simulations. Our results demonstrate that SDE Matching achieves performance comparable to adjoint sensitivity methods while drastically reducing computational complexity.
Spotlight Poster
Bo-Han Lai · Pin-Han Huang · Bo-Han Kung · Shang-Tse Chen
[ East Exhibition Hall A-B ]
Abstract
Lipschitz neural networks are well-known for providing certified robustness in deep learning. In this paper, we present a novel, efficient Block Reflector Orthogonal (BRO) layer that enhances the capability of orthogonal layers on constructing more expressive Lipschitz neural architectures. In addition, by theoretically analyzing the nature of Lipschitz neural networks, we introduce a new loss function that employs an annealing mechanism to increase margin for most data points. This enables Lipschitz models to provide better certified robustness. By employing our BRO layer and loss function, we design BRONet — a simple yet effective Lipschitz neural network that achieves state-of-the-art certified robustness. Extensive experiments and empirical analysis on CIFAR-10/100, Tiny-ImageNet, and ImageNet validate that our method outperforms existing baselines. The implementation is available at [GitHub Link](https://github.com/ntuaislab/BRONet).
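For intuition, a block Householder reflector $W = I - 2V(V^\top V)^{-1}V^\top$ is orthogonal by construction, hence exactly 1-Lipschitz, which is the property certified-robustness architectures rely on; a sketch of such a layer follows (the paper's BRO parametrization and stabilization details may differ):

```python
import torch
from torch import nn

class BlockReflector(nn.Module):
    """Orthogonal map W = I - 2 V (V^T V)^{-1} V^T applied to row vectors."""
    def __init__(self, dim, rank):
        super().__init__()
        self.V = nn.Parameter(torch.randn(dim, rank) / dim ** 0.5)
    def forward(self, x):                           # x: (batch, dim)
        V = self.V
        G = torch.linalg.solve(V.T @ V, V.T @ x.T)  # (rank, batch)
        return x - 2.0 * (V @ G).T
```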
Poster
Wendong Zheng · Junyang Chen · Husheng Guo · Wenjian Wang
[ East Exhibition Hall A-B ]
Abstract
Recently, the potential of lightweight models for resource-constrained scenarios has garnered significant attention, particularly in safety-critical tasks such as bio-electrical signal classification and B-ultrasound-assisted diagnosis. These tasks are frequently affected by environmental noise due to patient movement artifacts and inherent device noise, which pose significant challenges for lightweight models (e.g., deep binary neural networks (DBNNs)) to perform robust inference. A pertinent question arises: can a well-trained DBNN effectively resist environmental noise during inference? In this study, we find that the DBNN's robustness vulnerability comes from the binary weights and scaling factors. Drawing upon theoretical insights, we propose $L_{1,\infty}$ norm constraints for binary weights and scaling factors, which yield a tighter upper bound compared to existing state-of-the-art (SOTA) methods. Finally, visualization studies show that our approach introduces minimal noise perturbations at the periphery of the feature maps. Our approach outperforms the SOTA method, as validated by several experiments conducted on bio-electrical and image classification datasets. We hope our findings can raise awareness among researchers about the environmental noise robustness of DBNNs.
Poster
Aryan Gulati · Brando Miranda · Eric Chen · Emily Xia · Kai Fronsdal · Bruno de Moraes Dumont · Sanmi Koyejo
[ East Exhibition Hall A-B ]
Abstract
Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving $>$ 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview – the strongest evaluated model – scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement ("boxed") accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
Poster
Nguyen Nhat Minh To · Paul Wilson · Viet Nguyen · Mohamed Harmanani · Michael Cooper · Fahimeh Fooladgar · Purang Abolmaesumi · Parvin Mousavi · Rahul G. Krishnan
[ East Exhibition Hall A-B ]
Abstract
Subpopulation shift, characterized by a disparity in subpopulation distribution between the training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shift involve modifying empirical risk minimization with re-weighting strategies to improve generalization. This strategy relies on assumptions about the number and nature of subpopulations and annotations on group membership, which are unavailable for many real-world datasets. Instead, we propose using an ensemble of diverse classifiers to adaptively capture risk associated with subpopulations. Given a feature extractor network, we replace its standard linear classification layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples from other members. In empirical evaluation on nine real-world datasets, covering diverse domains and kinds of subpopulation shift, our method of Diverse Prototypical Ensembles (DPEs) often outperforms the prior state-of-the-art in worst-group accuracy. The code is available at https://github.com/minhto2802/dpe4subpop.
Poster
Yanxiang Ma · Zixuan Huang · Minjing Dong · Shan You · Chang Xu
[ East Exhibition Hall A-B ]
Abstract
Random defense represents a promising strategy to protect neural networks from adversarial attacks. Most of these methods enhance robustness by injecting randomness into the data, increasing uncertainty for attackers. However, this randomness could reduce the generalization capacity of the defense, as its performance can be sensitive to the hyperparameters of the noise added to the data, making it difficult to generalize across different datasets. Additionally, the involvement of randomness always comes with a reduction in natural accuracy, leading to a delicate trade-off between the two that is seldom studied in random defense. In this work, we propose incorporating randomness into the network structure instead of the data input by designing a stochastic deformable convolution, where a random mask replaces the convolutional offset. This process promotes data independence, enhancing generalization across datasets. To study the trade-off, we conduct a theoretical analysis of both robust and clean accuracy from the perspective of gradient cosine similarity and natural inference. Based on the analysis, we reformulate adversarial training in our random defense framework. Extensive experiments show that our method achieves SOTA adversarial robustness and clean accuracy compared with other random defense methods.
Poster
Shizhan Gong · Yankai Jiang · DOU QI · Farzan Farnia
[ East Exhibition Hall A-B ]
Abstract
Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance. The code and models are available at https://github.com/peterant330/KUEA.
Poster
Vladimir Zaigrajew · Hubert Baniecki · Przemysław Biecek
[ East Exhibition Hall A-B ]
Abstract
Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern large-scale systems yet remain challenging to interpret and control. However, current SAE methods struggle to optimize reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling direct optimization of both metrics without compromise. MSAE establishes a state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining 80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA. We make the codebase available at https://github.com/WolodjaZ/MSAE.
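A minimal sketch of the Matryoshka idea as described above: a single SAE whose reconstruction loss is evaluated at several nested dictionary prefixes, so that coarse prefixes must remain useful on their own. Dictionary sizes and the prefix schedule are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MatryoshkaSAE(nn.Module):
    """Sketch: one SAE supervised at multiple nested dictionary granularities."""
    def __init__(self, d_model=512, d_dict=4096, prefixes=(256, 1024, 4096)):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)
        self.prefixes = prefixes

    def forward(self, x):
        z = torch.relu(self.enc(x))
        losses = []
        for k in self.prefixes:          # decode using only the first k latents
            z_k = torch.zeros_like(z)
            z_k[..., :k] = z[..., :k]
            losses.append(((self.dec(z_k) - x) ** 2).mean())
        return z, torch.stack(losses).mean()   # latents + multi-scale loss
```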
Poster
Haorun Cai · Han-Jia Ye
[ East Exhibition Hall A-B ]
Abstract
Deep tabular models have demonstrated remarkable success on i.i.d. data, excelling in a variety of structured data tasks. However, their performance often deteriorates under temporal distribution shifts, where trends and periodic patterns are present in the evolving data distribution over time. In this paper, we explore the underlying reasons for this failure to capture temporal dependencies. We begin by investigating the training protocol, revealing a key issue in how the data is split for model training and validation. While existing approaches typically use temporal ordering for splitting, we show that even a random split significantly improves model performance. By reducing training lag and validation bias to achieve better generalization, our proposed splitting protocol offers substantial improvements across a variety of methods. Furthermore, we analyze how temporal data affects deep tabular representations, uncovering that these models often fail to capture crucial periodic and trend information. To address this gap, we introduce a plug-and-play temporal embedding based on Fourier series expansion to learn and incorporate temporal patterns, offering an adaptive approach to handling temporal shifts. Our experiments demonstrate that this temporal embedding, combined with the improved splitting strategy, provides a more effective and robust framework for learning from temporal tabular data.
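A minimal sketch of a plug-and-play Fourier temporal embedding of the kind described: a timestamp is expanded into sine/cosine harmonics that a tabular model can consume as extra features. The period and number of harmonics are illustrative choices.

```python
import numpy as np

def fourier_time_embedding(t, n_terms=4, period=365.0):
    """Expand timestamps t into 2*n_terms Fourier features (sketch;
    period and harmonic count are illustrative, e.g. yearly seasonality)."""
    k = np.arange(1, n_terms + 1)
    angles = 2 * np.pi * np.outer(t, k) / period      # (len(t), n_terms)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

print(fourier_time_embedding(np.array([0.0, 90.0, 180.0])).shape)  # (3, 8)
```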
Poster
Jinwu Hu · Zitian Zhang · Guohao Chen · Xutao Wen · Chao Shuai · Wei Luo · Bin Xiao · Yuanqing Li · Mingkui Tan
[ East Exhibition Hall A-B ]
Abstract
While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on …
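A minimal sketch of the TTL loop under stated assumptions (a small Hugging Face model stands in for the target LLM; the LoRA rank, target modules, and learning rate are illustrative): score unlabeled test inputs by perplexity, prioritize high-perplexity samples, and update only LoRA parameters on the input-NLL objective.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in for the LLM
model = get_peft_model(AutoModelForCausalLM.from_pretrained("gpt2"),
                       LoraConfig(r=8, target_modules=["c_attn"]))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def input_nll(text):
    ids = tok(text, return_tensors="pt").input_ids
    return model(input_ids=ids, labels=ids).loss    # input-perplexity objective

test_stream = ["an unlabeled test document ...", "another test document ..."]
with torch.no_grad():                                # sample-efficient selection
    scores = {t: input_nll(t).item() for t in test_stream}
for text in sorted(test_stream, key=scores.get, reverse=True):  # high-ppl first
    loss = input_nll(text)
    loss.backward(); opt.step(); opt.zero_grad()     # only LoRA params get grads
```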
Spotlight Poster
Yuchen Zeng · Tuan Dinh · Wonjun Kang · Andreas Mueller
[ East Exhibition Hall A-B ]
Abstract
Leveraging the in-context learning (ICL) capability of Large Language Models (LLMs) for tabular classification has gained significant attention for its training-free adaptability across diverse datasets. Recent advancements, like TabPFN, excel in small-scale tabular datasets but struggle to scale for large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms as a scalable alternative to quadratic-complexity self-attention. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples. For instance, TabFlex processes the poker-hand dataset with over a million samples in just 5 seconds. Our extensive evaluations demonstrate that TabFlex can achieve over a 2× speedup compared to TabPFN and a 1.5× speedup over XGBoost, outperforming 25 tested baselines in terms of efficiency across a diverse range of datasets. Furthermore, TabFlex remains highly effective on large-scale datasets, delivering strong performance with significantly reduced computational costs, especially when combined with data-efficient techniques such as dimensionality reduction and data sampling.
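A minimal sketch of the kind of linear attention that makes this scaling possible, using the common elu+1 feature map; the exact kernel used by TabFlex may differ. Cost is linear in sequence length because keys and values are summarized once rather than compared pairwise.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """O(n) attention sketch with phi(x) = elu(x) + 1 (a common choice)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                # (B, n, d) feature maps
    kv = torch.einsum("bnd,bne->bde", k, v)          # sum_n k_n v_n^T, once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

out = linear_attention(torch.randn(2, 10, 8), torch.randn(2, 10, 8),
                       torch.randn(2, 10, 8))        # (2, 10, 8)
```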
Poster
Laure Ciernik · Lorenz Linhardt · Marco Morik · Jonas Dippel · Simon Kornblith · Lukas Muttenthaler
[ East Exhibition Hall A-B ]
Abstract
The Platonic Representation Hypothesis claims that recent foundation models are converging to a shared representation space as a function of their downstream task performance, irrespective of the objectives and data modalities used to train these models (Huh et al., 2024). Representational similarity is generally measured for individual datasets and is not necessarily consistent across datasets. Thus, one may wonder whether this convergence of model representations is confounded by the datasets commonly used in machine learning. Here, we propose a systematic way to measure how representational similarity between models varies with the set of stimuli used to construct the representations. We find that the objective function is a crucial factor in determining the consistency of representational similarities across datasets. Specifically, self-supervised vision models learn representations whose relative pairwise similarities generalize better from one dataset to another compared to those of image classification or image-text models. Moreover, the correspondence between representational similarities and the models' task behavior is dataset-dependent, being most strongly pronounced for single-domain datasets. Our work provides a framework for analyzing similarities of model representations across datasets and linking those similarities to differences in task behavior.
Poster
Daniel Eftekhari · Vardan Papyan
[ East Exhibition Hall A-B ]
Abstract
The normal distribution plays a central role in information theory – it is at the same time the best-case signal and worst-case noise distribution, has the greatest representational capacity of any distribution, and offers an equivalence between uncorrelatedness and independence for joint distributions. Accounting for the mean and variance of activations throughout the layers of deep neural networks has had a significant effect on facilitating their effective training, but seldom has a prescription for precisely what distribution these activations should take, and how this might be achieved, been offered. Motivated by the information-theoretic properties of the normal distribution, we address this question and concurrently present normality normalization: a novel normalization layer which encourages normality in the feature representations of neural networks using the power transform and employs additive Gaussian noise during training. Our experiments comprehensively demonstrate the effectiveness of normality normalization, in regards to its generalization performance on an array of widely used model and dataset combinations, its strong performance across various common factors of variation such as model width, depth, and training minibatch size, its suitability for usage wherever existing normalization layers are conventionally used, and as a means to improving model robustness to random perturbations.
Poster
Daria Lioubashevski · Tomer Schlank · Gabriel Stanovsky · Ariel Goldstein
[ East Exhibition Hall A-B ]
Abstract
Uncovering the inner mechanisms of Transformer models offers insights into how they process and represent information. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction remains fixed, known as the “saturation event”. We expand this concept to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these events occur in order of the corresponding tokens’ ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different variants, and even in untrained Transformers. We propose that these events reflect task transitions, where determining each token corresponds to a discrete task. We show that it is possible to predict the current task from hidden-layer embeddings, and demonstrate that we can cause the model to switch to the next task via intervention. Leveraging our findings, we introduce a token-level early-exit strategy, surpassing existing methods in balancing performance and efficiency, and show how to exploit saturation events for better language modeling.
Poster
Tianyu Gao · Alexander Wettig · Luxi He · Yihe Dong · Sadhika Malladi · Danqi Chen
[ East Exhibition Hall A-B ]
Abstract
The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like en.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo enables us to steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending wikipedia.org to reduce harmful generations or factquizmaster.com (fabricated) to improve common knowledge task performance. We also demonstrate that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, …
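A minimal sketch of MeCo-style data preparation under the description above: during the conditioning phase each document is prefixed with its source URL, and the cooldown phase drops the metadata so the model also learns to operate on plain text. Field names are illustrative.

```python
def format_example(doc, phase):
    """Sketch of MeCo data formatting (field names are illustrative)."""
    if phase == "conditioning":          # main phase: metadata + text
        return f"{doc['url']}\n\n{doc['text']}"
    return doc["text"]                   # cooldown phase: standard text only

doc = {"url": "en.wikipedia.org", "text": "The Putnam Competition is ..."}
print(format_example(doc, "conditioning"))
print(format_example(doc, "cooldown"))
# At inference, prepending a real or fabricated URL steers generation style.
```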
Poster
Gaurush Hiranandani · Haolun Wu · Subhojyoti Mukherjee · Sanmi Koyejo
[ East Exhibition Hall A-B ]
Abstract
Many commercial Large Language Models (LLMs) are often closed-source, limiting developers to prompt tuning for aligning content generation with specific applications. While these models currently do not provide access to token logits, we argue that if such access were available, it would enable more powerful adaptation techniques beyond prompt engineering. In this paper, we propose a token-level probability reweighting framework that, given access to logits and a small amount of task-specific data, can effectively steer black-box LLMs toward application-specific content generation. Our approach views next-token prediction through the lens of supervised classification. We show that aligning black-box LLMs with task-specific data can be formulated as a label noise correction problem, leading to the Plugin model, an autoregressive probability reweighting model that operates solely on logits. We provide theoretical justification for why reweighting logits alone is sufficient for task adaptation. Extensive experiments with multiple datasets, LLMs, and reweighting models demonstrate the effectiveness of our method, advocating for broader access to token logits in closed-source models. We provide our code at this https URL.
Poster
Jiayi Pan · Xingyao Wang · Graham Neubig · Navdeep Jaitly · Heng Ji · Alane Suhr · Yizhe Zhang
[ East Exhibition Hall A-B ]
Abstract
We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.
Poster
Alina Shutova · Vladimir Malinovskii · Vage Egiazarian · Denis Kuznedelev · Denis Mazur · Surkov Nikita · Ivan Ermakov · Dan Alistarh
[ East Exhibition Hall A-B ]
Abstract
Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) the existence of high-compression methods for internal network states (e.g. attention Keys & Values). We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under $1\%$ relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU …
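A minimal sketch of the dependency-exploiting idea: calibrate a small adapter that predicts values from keys, then quantize only the unpredictable residual. The least-squares adapter and uniform quantizer here are simplifying assumptions; AQUA-KV's actual adapters and quantizer are more sophisticated.

```python
import torch

def compress_kv(K, V, bits=2):
    """Sketch: V ≈ K @ W (cheap adapter) + quantized residual."""
    W = torch.linalg.lstsq(K, V).solution      # calibration on sample tokens
    R = V - K @ W                              # only the residual is quantized
    scale = R.abs().max() / (2 ** (bits - 1) - 0.5)
    q = torch.clamp((R / scale).round(), -(2 ** (bits - 1)),
                    2 ** (bits - 1) - 1)
    V_hat = K @ W + q * scale                  # decompressed values
    return W, q, scale, V_hat

K, V = torch.randn(256, 64), torch.randn(256, 64)   # calibration keys/values
*_, V_hat = compress_kv(K, V)
print((V - V_hat).abs().mean())
```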
Poster
Ruida Wang · Rui Pan · Yuxin Li · Jipeng Zhang · Yizhen Jia · Shizhe Diao · Renjie Pi · Junjie Hu · Tong Zhang
[ East Exhibition Hall A-B ]
Abstract
Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate complete proofs or perform tree search, but they fail to balance these tasks. We propose **MA-LoT**: *Model-CollAboration Lean-based Long Chain-of-Thought*, a comprehensive framework for Lean4 theorem proving that solves this issue. It separates the cognitive tasks of general-NL whole-proof generation and error analysis for proof correction using a model-collaboration method. We achieve this through structured interaction between the LLM and the Lean4 verifier in Long CoT. To implement the framework, we propose the novel *LoT-Transfer Learning* training-inference pipeline, which endows LLMs with Long CoT thinking capability without special data annotation. Extensive experiments show that our framework achieves a **61.07%** accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming DeepSeek-V3 (33.61%), single-model tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (Godel-Prover, 55.33%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for more insightful generation from a broader perspective.
Spotlight Poster
Phillip Guo · Aaquib Syed · Abhay Sheshadri · Aidan Ewart · Gintare Karolina Dziugaite
[ East Exhibition Hall A-B ]
Abstract
Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability---which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability---can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high-level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the *lookup-table mechanism* for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than any other baselines, making unlearning more robust to various attacks.
Poster
Ang Lv · Ruobing Xie · Yining Qian · Songhao Wu · Xingwu Sun · Zhanhui Kang · Di Wang · Rui Yan
[ East Exhibition Hall A-B ]
Abstract
Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and learning. To address this, we propose Autonomy-of-Expert (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
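A minimal sketch of the router-free selection rule just described: every expert pre-computes its internal activation for a token, experts are ranked by activation norm, and only the top-k complete the forward pass. The low-rank factorization that makes pre-computation cheap is omitted, and sizes are illustrative.

```python
import torch
import torch.nn as nn

class AoELayer(nn.Module):
    """Sketch of Autonomy-of-Expert selection: no router, norm-based ranking."""
    def __init__(self, d, d_hidden, n_experts=8, k=2):
        super().__init__()
        self.up = nn.ModuleList([nn.Linear(d, d_hidden) for _ in range(n_experts)])
        self.down = nn.ModuleList([nn.Linear(d_hidden, d) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                       # x: (d,) one token, for clarity
        acts = [torch.relu(u(x)) for u in self.up]   # every expert pre-computes
        norms = torch.stack([a.norm() for a in acts])
        top = norms.topk(self.k).indices        # experts select themselves
        return sum(self.down[i](acts[i]) for i in top)  # others abort

print(AoELayer(16, 64)(torch.randn(16)).shape)
```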
Poster
Xingwu Sun · Shuaipeng Li · Ruobing Xie · Weidong Han · Kan Wu · Zhen Yang · Yixing Li · An Wang · SHUAI LI · Jinbao Xue · Yu Cheng · Yangyu Tao · Zhanhui Kang · Cheng-Zhong Xu · Di Wang · Jie Jiang
[ East Exhibition Hall A-B ]
Abstract
Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization, paying less attention to the constituents of floating-point (FP) quantization, and thus cannot fit LLM losses well in this scenario. In contrast, while FP quantization training is more commonly implemented in production, research on it has been relatively superficial. In this paper, we thoroughly explore the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the FP quantization training performance of LLMs. In addition to an accurate unified scaling law for FP quantization, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit numbers, available for future reference by hardware manufacturers; (2) We discover the formation of a critical data size in low-precision LLM training. Too much training data beyond the critical data size will inversely degrade LLM performance; (3) The optimal FP quantization precision is directly proportional to the computational power, but within a wide computational power range. We estimate …
Poster
Rishabh Tiwari · Haocheng Xi · Aditya Tomar · Coleman Hooper · Sehoon Kim · Maxwell Horton · Mahyar Najibi · Michael Mahoney · Kurt Keutzer · Amir Gholaminejad
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups due to inefficient KV cache optimization strategies and result in low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates ($>$90\%) and reliably provides consistent end-to-end speedups up to $\sim2.5\times$, outperforming other self-speculative decoding methods that use sparse KV cache for long-context LLM inference. QuantSpec also reduces the memory requirements by $\sim 1.3\times$ compared to these alternatives.
Poster
Jan Ludziejewski · Maciej Pióro · Jakub Krajewski · Maciej Stefaniak · Michał Krutul · Jan Małaśnicki · Marek Cygan · Piotr Sankowski · Kamil Adamczewski · Piotr Milos · Sebastian Jaszczur
[ East Exhibition Hall A-B ]
Abstract
Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. Extensive empirical validation confirms the theoretical predictions of our scaling laws. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.
Poster
Haoqi Wang · Tong Zhang · Mathieu Salzmann
[ East Exhibition Hall A-B ]
Abstract
Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their different properties from those of ViTs require a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) The layer-wise singular direction predicts the abrupt explosion of token norms in LLMs. ii) The negative eigenvalues of a layer explain its sudden decay. iii) The computational pathways leading to high-norm tokens differ between initial and noninitial tokens. iv) High-norm tokens are triggered by the right leading singular vector of the matrix approximating the corresponding modules. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures. Our findings not only advance the understanding of singular defects in LLMs but also open new avenues for their application. We expect that this work will stimulate further research into the internal mechanisms of LLMs. Code is released at https://github.com/haoqiwang/singular_defect.
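A minimal sketch of one measurement behind observation (i): estimating the leading right-singular direction of a layer's linear approximation with power iteration on Jacobian-vector products. The linearization point and the layer interface are illustrative assumptions.

```python
import torch

def leading_singular_direction(layer_fn, d, iters=50):
    """Top right-singular vector of the layer's Jacobian at a chosen point,
    via power iteration on J^T J using autograd J/J^T-vector products."""
    v = torch.randn(d); v = v / v.norm()
    x = torch.zeros(d)                        # linearization point (illustrative)
    for _ in range(iters):
        _, jv = torch.autograd.functional.jvp(layer_fn, (x,), (v,))    # J v
        _, jtjv = torch.autograd.functional.vjp(layer_fn, x, jv)       # J^T J v
        v = jtjv / jtjv.norm()
    return v    # per the paper's claim, this direction predicts norm explosions

layer = torch.nn.Linear(16, 16)               # stand-in for one transformer layer
v = leading_singular_direction(lambda t: layer(t), 16)
```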
Poster
Tushar Aggarwal · Swayam Singh · Abhijeet Awasthi · Aditya Kanade · Nagarajan Natarajan
[ East Exhibition Hall A-B ]
Abstract
Software engineering activities frequently involve edits to existing code. However, contemporary code language models (LMs) lack the ability to handle diverse types of code-edit requirements. In this work, we attempt to overcome this shortcoming through (1) a novel synthetic data generation pipeline and (2) a robust model adaptation algorithm. Starting with seed code examples and diverse editing criteria, our pipeline generates high-quality samples comprising original and modified code, along with natural language instructions in different styles and verbosity. Today's code LMs come bundled with strong abilities, such as code generation and instruction following, which should not be lost due to fine-tuning. To ensure this, we propose a novel adaptation algorithm, SeleKT, that (a) leverages a dense gradient-based step to identify the weights that are most important for code editing, and (b) does a sparse projection onto the base model to avoid overfitting. Using our approach, we obtain a new series of models NextCoder (adapted from QwenCoder-2.5) that achieves strong results on five code-editing benchmarks, outperforming comparable size models and even several larger ones. We show the generality of our approach on two model families (DeepSeekCoder and QwenCoder), compare against other fine-tuning approaches, and demonstrate robustness by showing retention of code …
Spotlight Poster
Chunhui Zhang · Zhongyu Ouyang · Kwonjoon Lee · Nakul Agarwal · Sean Houlihan · Soroush Vosoughi · Shao-Yuan Lo
[ East Exhibition Hall A-B ]
Abstract
Theory-of-mind (ToM) enables humans to infer mental states—such as beliefs, desires, and intentions—forming the foundation of social cognition. Existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning but struggle with scalability in multimodal environments. They remain trapped within the gravitational pull of multi-step planning complexity, failing to generalize as task demands increase. To overcome these limitations, we propose a scalable Bayesian ToM planner. It breaks down ToM complexity into stepwise Bayesian updates. Meanwhile, weak-to-strong control specializes smaller LMs to refine ToM-specific likelihood estimation, transferring their ToM reasoning behavior to larger LMs (7B to 405B) for social and world knowledge integration. This synergistic approach enables scalability, aligning large-model inference with human mental states through Bayesian principles. Extensive experiments demonstrate a 4.6% improvement in accuracy over state-of-the-art methods on multimodal ToM benchmarks, including unseen scenarios, establishing a new standard for modeling human mental states in complex environments.
Poster
Jon Saad-Falcon · Adrian Lafuente · Shlok Natarajan · Nahum Maru · Hristo Todorov · Etash Guha · Estefany Kelly Buchanan · Mayee Chen · Neel Guha · Christopher Re · Azalia Mirhoseini
[ East Exhibition Hall A-B ]
Abstract
Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large-language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI’s o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.
Spotlight Poster
Shaokun Zhang · Ming Yin · Jieyu Zhang · Jiale Liu · Zhiguang Han · Jingyang Zhang · Beibin Li · Chi Wang · Huazheng Wang · Yiran Chen · Qingyun Wu
[ East Exhibition Hall A-B ]
Abstract
Failure attribution in LLM multi-agent systems—identifying the agent and step responsible for task failures—provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution.
Poster
Luyang Liu · Jonas Pfeiffer · Jiaxing Wu · Jun Xie · Arthur Szlam
[ East Exhibition Hall A-B ]
Abstract
Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation …
Poster
Heng Dong · Kefei Duan · Chongjie Zhang
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks, yet they encounter challenges in complex decision-making scenarios that require long-term reasoning and alignment with high-level objectives. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes, leading to sub-optimal decisions. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations in a principled and scalable way. Our approach addresses two key challenges: (1) extracting robust action evaluations by computing Q-values via token logits associated with positive/negative outcomes, enhanced by future trajectory rollouts and reasoning; and (2) enabling efficient policy improvement through a gradient-free mechanism. Experiments across diverse environments -- including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop) -- demonstrate the framework’s generality and superiority over state-of-the-art methods. Notably, our approach achieves competitive performance using 7B/8B parameter LLMs, even outperforming baseline methods employing GPT-4 in complex tasks. These results underscore the potential of integrating structured policy optimization with LLMs’ intrinsic knowledge to advance decision-making capabilities in multi-step environments.
Poster
Wen Lai · Alexander Fraser · Ivan Titov
[ East Exhibition Hall A-B ]
Abstract
Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing (or steering) techniques, which modify the activations of specific model components. Due to their extremely small parameter counts, these methods show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves - the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods.
Poster
Amirhossein Kazemnejad · Milad Aghajohari · Eva Portelance · Alessandro Sordoni · Siva Reddy · Aaron Courville · Nicolas Le Roux
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without it, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimates of the expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLMs.
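A minimal sketch of the Monte Carlo credit assignment that replaces the value network: the value of a reasoning prefix is estimated as the success rate of K completions sampled from it, and a step's credit is the difference between two such estimates. `sample_completion` and `is_correct` are hypothetical task-specific helpers, not part of the paper's code.

```python
# `sample_completion(prefix)` and `is_correct(solution)` are hypothetical
# task-specific helpers (e.g., LLM sampling and answer checking on GSM8K).
def mc_value(prefix, sample_completion, is_correct, k=8):
    """Unbiased MC value estimate: success rate of k rollouts from `prefix`."""
    return sum(is_correct(sample_completion(prefix)) for _ in range(k)) / k

def step_advantage(prefix, step, sample_completion, is_correct, k=8):
    """Credit for `step`: change in estimated value after appending it."""
    return (mc_value(prefix + step, sample_completion, is_correct, k)
            - mc_value(prefix, sample_completion, is_correct, k))
```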
Poster
Simone Bombari · Marco Mondelli
[ East Exhibition Hall A-B ]
Abstract
Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data, with negative implications for robustness, bias, and fairness. In this work, we provide a statistical characterization of this phenomenon for high-dimensional regression, when the data contains a predictive *core* feature $x$ and a *spurious* feature $y$. Specifically, we quantify the amount of spurious correlations $\mathcal C$ learned via linear regression, in terms of the data covariance and the strength $\lambda$ of the ridge regularization. As a consequence, we first capture the simplicity of $y$ through the spectrum of its covariance, and its correlation with $x$ through the Schur complement of the full data covariance. Next, we prove a trade-off between $\mathcal C$ and the in-distribution test loss $\mathcal L$, by showing that the value of $\lambda$ that minimizes $\mathcal L$ lies in an interval where $\mathcal C$ is increasing. Finally, we investigate the effects of over-parameterization via the random features model, by showing its equivalence to regularized linear regression. Our theoretical results are supported by numerical experiments on Gaussian, Color-MNIST, and CIFAR-10 datasets.
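A toy numerical illustration of the regularization trade-off in a setting like the one analyzed: labels depend only on the core feature $x$, yet ridge regression places weight on the correlated spurious feature $y$, with the split changing as $\lambda$ varies. All distributions and sizes are illustrative, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                       # predictive core feature
y = 0.7 * x + 0.3 * rng.normal(size=n)       # correlated spurious feature
X = np.c_[x, y]
t = x + 0.1 * rng.normal(size=n)             # labels depend on x only

for lam in [1e-3, 1e-1, 1e1]:                # ridge strength lambda
    w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ t)
    print(f"lambda={lam:g}  core weight={w[0]:.3f}  spurious weight={w[1]:.3f}")
```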
Poster
Haebin Shin · Lei Ji · Xiao Liu · Yeyun Gong
[ East Exhibition Hall A-B ]
Abstract
Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the loss of teacher model to guide effective student training. We demonstrate its effectiveness in language modeling with 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
Poster
Yaoxiang Wang · Haoling Li · Xin Zhang · Jie Wu · Xiao Liu · Wenxiang Hu · Zhongxin Guo · Yangyu Huang · Ying Xin · Yujiu Yang · Jinsong Su · Qi Chen · Scarlett Li
[ East Exhibition Hall A-B ]
Abstract
Existing methods for code generation use code snippets as seed data, restricting the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, which captures and recognizes more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios. We fine-tuned widely-used base models to obtain the EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential in synthesizing repository-level code data. Our code and data are publicly available.
Poster
Yuxiao Wen
[ East Exhibition Hall A-B ]
Abstract
In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this framework to include \emph{graph feedback}, where the learner observes the rewards of all neighboring arms of the selected arms in a feedback graph $G$. We establish that the optimal regret over a time horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is the size of the combinatorial decisions and $\alpha$ is the independence number of $G$. This result interpolates between the known regrets $\widetilde\Theta(S\sqrt{T})$ under full information (i.e., $G$ is complete) and $\widetilde\Theta(\sqrt{KST})$ under the semi-bandit feedback (i.e., $G$ has only self-loops), where $K$ is the total number of arms. A key technical ingredient is to realize a convexified action using a random decision vector with negative correlations. We also show that online stochastic mirror descent (OSMD) that only realizes convexified actions in expectation is suboptimal.
Poster
Li Shen · Anke Tang · Yong Luo · Tao Sun · Han Hu · Xiaochun Cao
[ East Exhibition Hall A-B ]
Abstract
Pruning is a widely used technique for compressing large neural networks that eliminates weights that have minimal impact on the model's performance. Current pruning methods, exemplified by magnitude pruning, assign an importance score to each weight based on its magnitude and remove weights with scores below a certain threshold. Nonetheless, these methods often create a gap between the original dense and the pruned sparse model, potentially impairing performance. Especially when the sparsity ratio is high, the gap becomes more pronounced. To mitigate this issue, we introduce a method to bridge the gap left by pruning by utilizing a low-rank approximation of the difference between the dense and sparse matrices. Our method entails the iterative refinement of the sparse weight matrix augmented by a low-rank adjustment. This technique captures and retains the essential information often lost during pruning, thereby improving the performance of the pruned model. Furthermore, we offer a comprehensive theoretical analysis of our approach, emphasizing its convergence properties and establishing a solid basis for its efficacy. Experimental results on LLaMa models validate the effectiveness of our approach across various pruning techniques and sparsity levels. Our method shows significant improvements: at 50% sparsity, it reduces perplexity by 53.9% compared …
Poster
Ravi Ghadia · Avinash Kumar · Gaurav Jain · Prashant J. Nair · Poulami Das
[ East Exhibition Hall A-B ]
Abstract
Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. MorphKV balances long-range dependencies and local coherence during text generation. It eliminates early-token bias while retaining high-fidelity context by adaptively ranking tokens through correlation-aware selection. Unlike heuristic retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This approach captures inter-token correlation with greater accuracy, which is crucial for tasks like content creation and code generation. Our studies on long-response tasks show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works, enabling efficient deployment.
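A minimal sketch of correlation-aware selection as described: cached tokens are scored by the attention they receive from the most recent queries, and only the top entries are retained to keep the cache at constant size. The scoring rule and window are illustrative simplifications of MorphKV's update.

```python
import torch

def morph_retain(attn_recent, cache_size):
    """Rank cached tokens by attention received from recent queries; keep top-k.
    attn_recent: (n_recent_queries, n_cached_tokens) attention weights."""
    scores = attn_recent.mean(dim=0)          # correlation with recent tokens
    k = min(cache_size, scores.numel())
    keep = scores.topk(k).indices.sort().values
    return keep                               # KV-cache indices to retain

attn = torch.softmax(torch.randn(4, 128), dim=-1)  # 4 recent queries, 128 cached
print(morph_retain(attn, cache_size=32))
```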
Poster
Yuxiao Qu · Matthew Yang · Amrith Setlur · Lewis Tunstall · Edward Beeching · Russ Salakhutdinov · Aviral Kumar
[ East Exhibition Hall A-B ]
Abstract
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute from the lens of exploration and exploitation. It also motivates the use of cumulative regret to measure the efficacy of test-time compute by viewing a long output stream as consisting of several episodes from the model. While current state-of-the-art models do not optimize regret, we show that regret can be minimized by running final 0/1 reward RL regularized by a dense reward bonus, given by the "information gain" from each subsequent block in the output stream. We prescribe an approach for quantifying information gain, which measures the utility of an intermediate segment of tokens towards improving accuracy of the final answer. We instantiate this idea to develop MRT, a new class of finetuning methods for …
Poster
Cheryl Li · Tianyuan Xu · Yiwen Guo
[ East Exhibition Hall A-B ]
Abstract
Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) by generating natural language (NL) rationales that lead to the final answer. However, it struggles with numerical computation, which has led to the development of program-aided techniques. Despite their potential, a persistent challenge remains: inconsistencies between LLM-reported reasoning steps and the logic in generated programs, which we term "reasoning hallucinations." This stems from the inherent ambiguities of NL and the statistical nature of LLMs, which often lack rigorous logical coherence. To address this challenge, we propose a novel test-time scaling framework, Reasoning-as-Logic-Units (RaLU), which constructs a more reliable reasoning path by aligning logical units between the generated program and their corresponding NL descriptions. By decomposing the initially generated program into discrete units using static analysis, RaLU engages in an iterative dialogue with the LLM to judge, refine, and explain each unit. A rewind-and-correct mechanism ensures alignment between code statements and task requirements in each unit, ultimately forming a cohesive reasoning path under the program's logic, from which the model reaches a final solution. Our experiments demonstrate that RaLU significantly outperforms existing baselines in mathematical reasoning (GSM8K, MATH) and algorithmic reasoning (HumanEval+, MBPP+), underscoring its potential to advance LLM …
Poster
Zongqian Wu · Baoduo Xu · Ruochen Cui · Mengmeng Zhan · Xiaofeng Zhu · Lei Feng
[ East Exhibition Hall A-B ]
Abstract
Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in LLMs. Interestingly, we observe that both CoT reasoning and self-training share the core objective: iteratively leveraging model-generated information to progressively reduce prediction uncertainty. Building on this insight, we propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes the initial reasoning process, and (ii) an adaptive reasoning iteration module that dynamically refines the reasoning process and addresses the limitations of previous CoT approaches, i.e., over-reasoning and high similarity between consecutive reasoning iterations. Extensive experiments show that the proposed method achieves significant advantages in both performance and computational efficiency. Our code is available at: https://github.com/zongqianwu/ST-COT.
Poster
Zhuoran Zhang · Yongxiang Li · Zijian Kan · Keyuan Cheng · Lijie Hu · Di Wang
[ East Exhibition Hall A-B ]
Abstract
The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve knowledge with implicit subject information from deeper MLP layers, unlike single-hop tasks, which rely on shallow layers. This distinction explains the poor performance of current methods in multi-hop queries, as they primarily focus on editing shallow layers with single-hop edit prompts, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. Beyond single-hop editing prompts, IFMET further incorporates multi-hop editing prompts to locate and modify knowledge across different stages of reasoning. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, overcoming the limitations of previous locate-then-edit methods.
Poster
Loris Gaven · Thomas Carta · Clément Romac · Cédric Colas · sylvain lamprier · Olivier Sigaud · Pierre-Yves Oudeyer
[ East Exhibition Hall A-B ]
Abstract
Open-ended learning agents must efficiently prioritize goals in vast possibility spaces, focusing on those that maximize learning progress (LP). When such autotelic exploration is achieved by LLM agents trained with online RL in high-dimensional and evolving goal spaces, a key challenge for LP prediction is modeling one’s own competence, a form of metacognitive monitoring. Traditional approaches either require extensive sampling or rely on brittle expert-defined goal groupings. We introduce MAGELLAN, a metacognitive framework that lets LLM agents learn to predict their competence and learning progress online. By capturing semantic relationships between goals, MAGELLAN enables sample-efficient LP estimation and dynamic adaptation to evolving goal spaces through generalization. In an interactive learning environment, we show that MAGELLAN improves LP prediction efficiency and goal prioritization, being the only method allowing the agent to fully master a large and evolving goal space. These results demonstrate how augmenting LLM agents with a metacognitive ability for LP predictions can effectively scale curriculum learning to open-ended goal spaces.
Poster
Ananth Balashankar · Ziteng Sun · Jonathan Berant · Jacob Eisenstein · Michael Collins · Adrian Hutter · Jong Lee · Chirag Nagpal · Flavien Prost · Aradhana Sinha · Ananda Suresh · Ahmad Beirami
[ East Exhibition Hall A-B ]
Abstract
Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-$N$, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the *inference-time win rate* of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a *transformation* of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-$N$ sampling and best-of-$N$ jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
Poster
Franck TALLA · Edouard Grave · Herve Jegou
[ East Exhibition Hall A-B ]
Abstract
We address the problem of extending a pre-trained large language model to a new domain that was not seen during training. Standard techniques, such as fine-tuning or low-rank adaptation (LoRA), are successful at domain adaptation but do not formally add capacity to the model. This often leads to a trade-off between performing well on the new domain and degrading performance on the original domain. Here, we propose to revisit and improve adapters to extend LLMs. Our paper analyzes this extension problem from three angles: data, architecture, and training procedure, which are advantageously considered jointly. The resulting method, called neutral residues, modifies adapters in a way that leads each new residual block to output near-zero values on the original domain. This solution yields strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues significantly outperform competing approaches such as fine-tuning, LoRA, or vanilla adapters in terms of the trade-off between learning the new language and not forgetting English.
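A minimal sketch of the neutral-residues idea under stated assumptions: a standard bottleneck adapter whose residual output is additionally penalized toward zero on original-domain inputs, so the new block stays neutral where the base model already performs well. The penalty form, its weighting, and the bottleneck size are illustrative, not the paper's exact training procedure.

```python
import torch
import torch.nn as nn

class NeutralResidueAdapter(nn.Module):
    """Sketch: bottleneck adapter with a near-zero-residue penalty on the
    original domain (loss form and sizes are illustrative assumptions)."""
    def __init__(self, d, d_hidden=64):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(d, d_hidden), nn.GELU(),
                                   nn.Linear(d_hidden, d))

    def forward(self, h):
        return h + self.block(h)              # residual block added to the LLM

    def neutral_penalty(self, h_original_domain):
        # auxiliary loss: push residues toward zero on old-domain activations
        return self.block(h_original_domain).pow(2).mean()
```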
Poster
Zicheng Lin · Tian Liang · Jiahao Xu · Qiuzhi Liu · Xing Wang · Ruilin Luo · Chufan Shi · Siheng Li · Yujiu Yang · Zhaopeng Tu
[ East Exhibition Hall A-B ]
Abstract
Mathematical reasoning tasks pose significant challenges for large language models (LLMs) because they require precise logical deduction and sequence analysis. In this work, we introduce the concept of critical tokens -- elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling and demonstrate their substantial divergence from traditional error tokens. Through extensive experiments on datasets such as GSM8K and MATH500, we show that identifying and replacing critical tokens significantly improves model accuracy. We propose an efficient methodology for pinpointing these tokens in large-scale datasets using contrastive estimation and extend this framework to enhance model training processes with direct preference optimization (DPO). Experimental results on GSM8K and MATH500 benchmarks with the widely used models Llama-3 (8B and 70B) and Deepseek-math (7B) demonstrate the effectiveness of the proposed approach, cDPO. Our results underscore the potential of leveraging critical tokens to reduce errors in reasoning tasks, advancing the development of AI systems capable of robust logical deduction.
Poster
Valentyn Boreiko · Alexander Panfilov · Václav Voráček · Matthias Hein · Jonas Geiping
[ East Exhibition Hall A-B ]
Abstract
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. These methods largely succeed in coercing the target output in their original settings, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. For this, we build an N-gram language model on 1T tokens, which, unlike model-based perplexity, allows for an LLM-agnostic, nonparametric, and inherently interpretable evaluation. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it. After an extensive comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented and that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent bigrams, either selecting the ones absent from real-world text or rare ones, e.g., specific to Reddit or code datasets.
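A toy version of such an N-gram threat model can be built from plain bigram counts: fluent text receives high per-token log-probability, while optimized suffixes that abuse rare or unseen bigrams score lower. The tiny corpus and example strings below are stand-ins for the paper's 1T-token model.

```python
import math
from collections import Counter

def bigram_logprobs(corpus_tokens, query_tokens, alpha=1.0):
    """Add-alpha-smoothed bigram log-probabilities of a query under counts
    from a reference corpus (toy stand-in for a web-scale N-gram model)."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)
    return [math.log((bigrams[(p, c)] + alpha) / (unigrams[p] + alpha * vocab))
            for p, c in zip(query_tokens, query_tokens[1:])]

corpus = "the cat sat on the mat and the cat ate".split()
print(bigram_logprobs(corpus, "the cat sat".split()))            # frequent bigrams
print(bigram_logprobs(corpus, "describing.+ similarlyNow".split()))  # unseen bigram
```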
Poster
Jingyu Liu · Beidi Chen · Ce Zhang
[ East Exhibition Hall A-B ]
Abstract
Improving time-to-first-token (TTFT) is a critically important objective in modern large language model (LLM) inference engines. Optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, improving TTFT is notoriously challenging since it is compute-bound and the performance bottleneck shifts from the self-attention that many prior works focus on to the MLP part. In this work, we present SpecPrefill, a training-free framework that accelerates the inference TTFT for both long and medium context queries based on the following insight: LLMs are generalized enough to preserve the quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate locally important tokens based on the context. These tokens, along with the necessary positional information, are then sent to the main model for processing. We evaluate SpecPrefill with a diverse set of tasks, followed by a comprehensive benchmarking of performance improvement both in a real end-to-end setting and ablation studies. SpecPrefill manages to serve Llama-3.1-405B-Instruct-FP8 with up to 7$\times$ maximal end-to-end QPS on real downstream tasks and 7.66$\times$ TTFT improvement.
Poster
Joshua Kazdan · Rylan Schaeffer · Apratim Dey · Matthias Gerstgrasser · Rafael Rafailov · David Donoho · Sanmi Koyejo
[ East Exhibition Hall A-B ]
Abstract
What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of “model collapse” as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e., collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning) to further confirm the possibility of containment: (a) we confirm that the training-workflow of {\it replacing} all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of {\it accumulating} synthetic data alongside real data and training on all data combined, and confirm that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but successive generations of pretraining are constrained to use fixed-size data subsets each generation. In this workflow, we observe slow and gradual rather than explosive degradation of test …
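The multivariate Gaussian setting is simple enough to reproduce in a few lines. The sketch below (1-D for brevity, with arbitrary parameter choices) contrasts workflow (a), replacing data each generation, with workflow (b), accumulating it: the replace workflow's standard deviation typically drifts toward zero across generations, while the accumulate workflow stays stable.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=20)        # small real dataset from N(0, 1)

def fit_and_sample(data, n, rng):
    """Fit a Gaussian by maximum likelihood, then draw n synthetic points."""
    return rng.normal(data.mean(), data.std(), size=n)

replace, accumulate = real.copy(), real.copy()
for _ in range(300):
    replace = fit_and_sample(replace, 20, rng)            # (a) replace workflow
    accumulate = np.concatenate(                          # (b) accumulate workflow
        [accumulate, fit_and_sample(accumulate, 20, rng)])

print(f"replace std:    {replace.std():.3f}")     # typically collapsed toward 0
print(f"accumulate std: {accumulate.std():.3f}")  # typically still near 1
```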
Spotlight Poster
Ken Ziyu Liu · Christopher A. Choquette Choo · Matthew Jagielski · Peter Kairouz · Sanmi Koyejo · Percy Liang · Nicolas Papernot
[ East Exhibition Hall A-B ]
Abstract
An important question today is whether a given text was used to train a large language model (LLM). A completion test is often employed: check if the LLM completes a sufficiently complex text. This, however, requires a ground-truth definition of membership; most commonly, a text is defined as a member based on the n-gram overlap between the target text and any text in the dataset. In this work, we demonstrate that this n-gram based membership definition can be effectively gamed. We study scenarios where sequences are non-members for a given n, and we find that completion tests still succeed. We find many natural cases of this phenomenon by retraining LLMs from scratch after removing all training samples that were completed; these cases include exact duplicates, near-duplicates, and even short overlaps. They showcase that it is difficult to find a single viable choice of n for membership definitions. Using these insights, we design adversarial datasets that can cause a given target sequence to be completed without containing it, for any reasonable choice of n. Our findings highlight the inadequacy of n-gram membership, suggesting membership definitions fail to account for auxiliary information available to the training algorithm.
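For concreteness, here is a minimal implementation of the n-gram overlap membership definition being critiqued; the example sentences are invented and illustrate how the same target can flip between member and non-member as n changes.

```python
def ngram_member(target, dataset, n):
    """n-gram overlap membership: target is a 'member' if any of its
    n-grams appears in any training document."""
    target_ngrams = {tuple(target[i:i + n]) for i in range(len(target) - n + 1)}
    for doc in dataset:
        doc_ngrams = {tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)}
        if target_ngrams & doc_ngrams:
            return True
    return False

target = "the quick brown fox jumps".split()
dataset = ["a quick brown dog".split(), "fox jumps high".split()]
# Non-member for n=3 under this definition, yet the paper shows such
# sequences can still be completed by the trained model.
print(ngram_member(target, dataset, n=3))  # False
print(ngram_member(target, dataset, n=2))  # True ("quick brown", "fox jumps")
```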
Poster
Seng Pei Liew · Takuya Kato · Sho Takase
[ East Exhibition Hall A-B ]
Abstract
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance on scaling upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.
Poster
Fanxu Meng · Pingzhi Tang · Fan Jiang · Muhan Zhang
[ East Exhibition Hall A-B ]
Abstract
Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bounded. To address this challenge, we introduce CLOVER (Cross-Layer Orthogonal Vectors) pruning, a novel approach that treats pairs of components of the attention mechanism as low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values, in turn, guide pruning and further serve as trainable parameters for efficient fine-tuning, ultimately enabling the model to recover its performance to the level before pruning. After pruning and fine-tuning, these values are reintegrated into the model without increasing its parameter count. Visualizations across various models show that CLOVER effectively removes linear redundancies within attention heads, greatly improving pruning efficiency. For example, pruning 70\% of the Q-K head dimension in GPT-2 XL results in a perplexity comparable to that of pruning just 8\% using vanilla pruning. The combination of CLOVER and TransMLA achieves a speedup of up to 11.1$\times$ over LLaMA-2-7B.
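A minimal sketch of the Q-K half of this idea, using random matrices in place of real head weights: the pair is treated as a low-rank decomposition of $W_Q W_K^\top$, SVD'd, and truncated. The paper additionally handles V-O pairs and reuses the singular values as trainable parameters during recovery fine-tuning.

```python
import numpy as np

def clover_qk(W_q, W_k, keep):
    """Treat the Q-K pair of one head as a low-rank product W_q @ W_k.T,
    SVD it, and keep only the top singular directions (pruning step)."""
    U, S, Vt = np.linalg.svd(W_q @ W_k.T)
    W_q_new = U[:, :keep] * S[:keep]   # absorb singular values into Q
    W_k_new = Vt[:keep].T              # orthonormal K directions
    return W_q_new, W_k_new

rng = np.random.default_rng(0)
W_q, W_k = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
W_q_new, W_k_new = clover_qk(W_q, W_k, keep=16)
# Attention logits x @ W_q_new @ W_k_new.T @ y.T now use a rank-16 head.
print(W_q_new.shape, W_k_new.shape)  # (64, 16) (64, 16)
```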
Spotlight Poster
Sally Zhu · Ahmed Ahmed · Rohith Kuditipudi · Percy Liang
[ East Exhibition Hall A-B ]
Abstract
Motivated by liability and intellectual property concerns over open-weight models, we consider the following problem: given the weights of two models, can we test whether they were trained independently---i.e., from independent random initializations? We consider two settings: *constrained* and *unconstrained*. In the constrained setting, we make assumptions about model architecture and training and propose statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. We compute the p-values by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures between the original two models versus these copies. We report p-values on pairs of 21 open-weight models (210 total pairs) and find we correctly identify all pairs of non-independent models. In the unconstrained setting, we make none of the prior assumptions and allow for adversarial evasion attacks that do not change model output. We thus propose a new test that matches hidden activations between two models; it is robust to these transformations and to changes in model architecture, and can also identify specific non-independent components of models. Though we no longer obtain exact p-values from this test, empirically we find it reliably distinguishes non-independent models like …
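The constrained-setting test can be illustrated with a toy permutation test: permuting the hidden units of one network yields exchangeable copies under the null of independent initialization, so an exact p-value is the rank of the observed similarity among all copies. The correlation similarity and simple row permutation below are simplifying assumptions.

```python
import numpy as np

def independence_pvalue(W_a, W_b, n_copies=99, rng=None):
    """Permutation-test sketch: under the null that two models were trained
    from independent random inits, permuting the hidden units of one model
    yields exchangeable copies, so the observed weight correlation should
    not be an outlier among the copies."""
    rng = rng or np.random.default_rng(0)
    def sim(X, Y):
        return abs(np.corrcoef(X.ravel(), Y.ravel())[0, 1])
    observed = sim(W_a, W_b)
    null_sims = [sim(W_a[rng.permutation(W_a.shape[0])], W_b)
                 for _ in range(n_copies)]
    return (1 + sum(s >= observed for s in null_sims)) / (n_copies + 1)

rng = np.random.default_rng(1)
W = rng.normal(size=(128, 64))
print(independence_pvalue(W, W + 0.1 * rng.normal(size=W.shape)))  # small p: dependent
print(independence_pvalue(W, rng.normal(size=(128, 64))))          # large p: independent
```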
Poster
Yang Zhou · Hongyi Liu · Zhuoming Chen · Yuandong Tian · Beidi Chen
[ East Exhibition Hall A-B ]
Abstract
Recently, long-context large language models (LLMs) have shown strong performance in information retrieval and long-document QA. However, to tackle the most challenging intellectual problems, LLMs must reason effectively in long and complex contexts (e.g., frontier mathematical research). Studying how LLMs handle increasing reasoning complexity and context length is essential, yet existing benchmarks lack a solid basis for quantitative evaluation. Inspired by the abstraction of GSM-8K problems as computational graphs—and the ability to introduce noise by adding unnecessary nodes and edges—we develop a grade-school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control. Using our newly synthesized GSM-$\infty$ benchmark, we comprehensively evaluate existing LLMs. We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend: exponentially increasing inference computation yields only linear performance gains. These findings underscore the fundamental limitations of current long-context LLMs and the key challenges in scaling reasoning capabilities. Our GSM-$\infty$ benchmark provides a scalable and controllable testbed for systematically studying and advancing LLM reasoning in long and complex contexts.
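A stripped-down generator in this spirit is easy to write: a chain of useful operations forms the computational graph, and distractor assignments play the role of noise nodes, with both the operation count and context length under fine-grained control. All names and parameters below are illustrative.

```python
import random

def make_problem(n_ops, n_noise, seed=0):
    """Generate an arithmetic problem as a computational graph: a chain of
    n_ops useful operations plus n_noise distractor nodes never referenced
    by the final answer (toy sketch of the GSM-infinity generator idea)."""
    rng = random.Random(seed)
    vals = {"x0": rng.randint(1, 9)}
    lines = [f"x0 = {vals['x0']}"]
    for i in range(1, n_ops + 1):
        c, op = rng.randint(1, 9), rng.choice(["+", "*"])
        vals[f"x{i}"] = vals[f"x{i-1}"] + c if op == "+" else vals[f"x{i-1}"] * c
        lines.append(f"x{i} = x{i-1} {op} {c}")
    for j in range(n_noise):  # irrelevant context inflating length
        lines.append(f"y{j} = {rng.randint(1, 9)} + {rng.randint(1, 9)}")
    rng.shuffle(lines)        # the model must resolve the dependency order
    return "\n".join(lines), vals[f"x{n_ops}"]

problem, answer = make_problem(n_ops=5, n_noise=10)
print(problem, "\nAnswer:", answer)
```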
Poster
Minghao Wu · Thuy-Trang Vu · Lizhen Qu · Reza Haffari
[ East Exhibition Hall A-B ]
Abstract
The performance of large language models (LLMs) is strongly influenced by the quality and diversity of data used during supervised fine-tuning (SFT). However, current data selection methods often prioritize one aspect over the other, resulting in suboptimal training outcomes. To address this, we formulate data selection as a set cover problem and present GraphFilter, a novel approach that balances both quality and diversity in data selection. GraphFilter models the dataset as a bipartite graph connecting sentences to their constituent n-grams, then employs a priority function that combines quality and diversity metrics multiplicatively. GraphFilter iteratively selects sentences with the highest priority, removes covered n-grams from the bipartite graph, and recomputes priorities to reflect the changing data landscape. We validate GraphFilter using three model backbones across six widely-used benchmarks, demonstrating that it outperforms nine existing baselines in both model performance and computational efficiency. Further analysis shows that our design choices lead to more effective subset selection, underscores the value of instruction diversity, and provides insights into how quality and diversity interact with different subset sizes.
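The core selection loop admits a compact sketch: sentences and their n-grams form a bipartite graph, priority multiplies quality by the number of uncovered n-grams, and coverage is updated after each pick. The quality scores here are arbitrary placeholders.

```python
def graphfilter_select(sentences, quality, k, n=2):
    """Greedy sketch of quality-diversity selection: priority multiplies a
    quality score by the count of not-yet-covered n-grams; after each pick,
    the chosen sentence's n-grams are removed from the bipartite graph."""
    def ngrams(s):
        toks = s.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    remaining = {i: ngrams(s) for i, s in enumerate(sentences)}
    covered, chosen = set(), []
    for _ in range(k):
        best = max(remaining, key=lambda i: quality[i] * len(remaining[i] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen

sents = ["the cat sat on the mat", "the cat sat down", "dogs chase cats quickly"]
print(graphfilter_select(sents, quality=[0.9, 0.8, 0.7], k=2))  # [0, 2]
```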
Poster
Qiyao Liang · Daoyuan Qian · Liu Ziyin · Ila R. Fiete
[ East Exhibition Hall A-B ]
Abstract
Composition—the ability to generate myriad variations from finite means—is believed to underlie powerful generalization. However, compositional generalization remains a key challenge for deep learning. A widely held assumption is that learning disentangled (factorized) representations naturally supports this kind of extrapolation. Yet, empirical results are mixed, with many generative models failing to recognize and compose factors to generate out-of-distribution (OOD) samples. In this work, we investigate a controlled 2D Gaussian "bump" generation task with fully disentangled $(x,y)$ inputs, demonstrating that standard generative architectures still fail in OOD regions when trained with partial data, by re-entangling latent representations in subsequent layers. By examining the model's learned kernels and manifold geometry, we show that this failure reflects a "memorization" strategy for generation via data superposition rather than via composition of the true factorized features. We show that when models are forced—through architectural modifications with regularization or curated training data—to render the disentangled latents into the full-dimensional representational (pixel) space, they can be highly data-efficient and effective at composing in OOD regions. These findings underscore that disentangled latents in an abstract representation are insufficient and show that if models can represent disentangled factors directly in the output representational space, they can achieve robust compositional …
Poster
Canhong Wen · Yihong Zuo · Wenliang Pan
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) have showcased remarkable performance across a range of tasks but are hindered by their massive parameter sizes, which impose significant computational and storage demands. Pruning has emerged as an effective solution to reduce model size, but traditional methods often involve inefficient retraining or rely on heuristic-based one-shot approaches that lack theoretical guarantees. In this paper, we reformulate the pruning problem as an $\ell_0$-penalized optimization problem and propose a monotone accelerated Iterative Hard Thresholding (mAIHT) method. Our approach combines solid theoretical foundations with practical effectiveness, offering a detailed theoretical analysis that covers convergence, convergence rates, and risk upper bounds. Through extensive experiments, we demonstrate that mAIHT outperforms state-of-the-art pruning techniques by effectively pruning the LLaMA-7B model across various evaluation metrics.
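A minimal numpy sketch of the mAIHT idea on a synthetic sparse regression problem: an extrapolated gradient step followed by hard thresholding, with a fallback to the plain IHT iterate whenever the accelerated candidate has a worse objective (the monotone safeguard). The step size and safeguard details are simplifications of the paper's method.

```python
import numpy as np

def hard_threshold(z, k):
    """Keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

def maiht(A, b, k, steps=300):
    """Monotone accelerated IHT sketch for min ||Ax-b||^2 s.t. ||x||_0 <= k."""
    f = lambda x: np.sum((A @ x - b) ** 2)
    mu = 1.0 / np.linalg.norm(A, 2) ** 2          # conservative step size
    x_prev = x = np.zeros(A.shape[1])
    for t in range(1, steps + 1):
        y = x + (t - 1) / (t + 2) * (x - x_prev)  # Nesterov-style extrapolation
        cand = hard_threshold(y - mu * A.T @ (A @ y - b), k)
        plain = hard_threshold(x - mu * A.T @ (A @ x - b), k)
        x_prev, x = x, cand if f(cand) <= f(plain) else plain  # monotone choice
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100))
x_true = np.zeros(100); x_true[:5] = rng.normal(size=5)
x_hat = maiht(A, A @ x_true, k=5)
print(np.linalg.norm(x_hat - x_true))  # typically near 0: support recovered
```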
Poster
Huayu Deng · Xiangming Zhu · Yunbo Wang · Xiaokang Yang
[ East Exhibition Hall A-B ]
Abstract
Graph neural networks have been a powerful tool for mesh-based physical simulation. To efficiently model large-scale systems, existing methods mainly employ hierarchical graph structures to capture multi-scale node relations. However, these graph hierarchies are typically manually designed and fixed, limiting their ability to adapt to the evolving dynamics of complex physical systems. We propose EvoMesh, a fully differentiable framework that jointly learns graph hierarchies and physical dynamics, adaptively guided by physical inputs. EvoMesh introduces anisotropic message passing, which enables direction-specific aggregation of dynamic features between nodes within each hierarchy, while simultaneously learning node selection probabilities for the next hierarchical level based on physical context. This design creates more flexible message shortcuts and enhances the model's capacity to capture long-range dependencies. Extensive experiments on five benchmark physical simulation datasets show that EvoMesh outperforms recent fixed-hierarchy message passing networks by large margins. The project page is available at https://hbell99.github.io/evo-mesh/.
Poster
Yongqiang Yao · Jingru Tan · Feizhao Zhang · Jiahao Hu · Yazhe Niu · JinXin · Bo Li · Pengfei Liu · Ruihao Gong · Dahua Lin · Ningyi Xu
[ East Exhibition Hall A-B ]
Abstract
Vision-language instruction-tuning models have recently achieved significant performance improvements. In this work, we discover that large-scale 3D parallel training on those models leads to an imbalanced computation load across different devices. The vision and language parts are inherently heterogeneous: their data distribution and model architecture differ significantly, which affects distributed training efficiency. To address this issue, we rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices. Specifically, for the data, instances are grouped into new balanced mini-batches within and across devices. A search-based method is employed for the model to achieve a more balanced partitioning. For memory optimization, we adaptively adjust the re-computation strategy for each partition to utilize the available memory fully. These three perspectives are not independent but are closely connected, forming an omniverse balanced training framework. Extensive experiments are conducted to validate the effectiveness of our method. Compared with the open-source training code of InternVL-Chat, training time is reduced greatly, achieving about 1.8$\times$ speed-up. Our method's efficacy and generalizability are further validated across various models and datasets. Codes will be released at https://github.com/ModelTC/OmniBal.
Poster
Avanika Narayan · Dan Biderman · Sabri Eyuboglu · Avner May · Scott Linderman · James Zou · Christopher Re
[ East Exhibition Hall A-B ]
Abstract
We investigate an emerging setup in which a small, on-device language model (LM) with access to local data collaborates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over long documents. *Can a local-remote collaboration reduce cloud inference costs while preserving quality?* First, we consider a naïve collaboration protocol, coined MINION, where the local and remote models simply chat back and forth. Because only the local model ingests the full context, this protocol reduces cloud costs by 30.4×, but recovers only 87% of the performance of the frontier model. We identify two key limitations of this protocol: the local model struggles to (1) follow the remote model's multi-step instructions and (2) reason over long contexts. Motivated by these observations, we propose MINIONS, a protocol in which the remote model decomposes the task into easier subtasks over shorter chunks of the document, which are executed locally in parallel. MINIONS reduces costs by 5.7× on average while recovering 97.9% of the remote-only performance. Our analysis reveals several key design choices that influence the trade-off between cost and performance in local-remote systems.
Spotlight Poster
Jongwoo Ko · Tianyi Chen · Sungnyun Kim · Tianyu Ding · Luming Liang · Ilya Zharkov · Se-Young Yun
[ East Exhibition Hall A-B ]
Abstract
Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
Poster
Chenglong Wang · Yang Gan · Yifu Huo · Yongyu Mu · Qiaozhi He · MuRun Yang · Bei Li · Tong Xiao · Chunliang Zhang · Tongran Liu · Jingbo Zhu
[ East Exhibition Hall A-B ]
Abstract
In aligning large language models (LLMs), reward models have played an important role, but they are typically trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
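The label-smoothing observation can be stated in a few lines of PyTorch: smoothing the pairwise (Bradley-Terry-style) ranking loss mixes in an epsilon-weighted term for the flipped pair, which is the regularized ranking objective the abstract refers to. The scores below are placeholders.

```python
import torch
import torch.nn.functional as F

def smoothed_ranking_loss(r_chosen, r_rejected, eps=0.1):
    """Pairwise ranking loss with label smoothing: the usual
    -log sigmoid(r_c - r_r) objective, mixed with an eps-weighted term
    for the flipped pair, i.e., a regularized pairwise ranking loss."""
    margin = r_chosen - r_rejected
    return -((1 - eps) * F.logsigmoid(margin) + eps * F.logsigmoid(-margin)).mean()

r_c = torch.tensor([2.0, 0.5])   # reward-model scores for preferred responses
r_r = torch.tensor([1.0, 0.7])   # scores for dispreferred responses
print(smoothed_ranking_loss(r_c, r_r))
```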
Poster
Xubin Wang · Jianfei Wu · Yuan Yichen · Deyu Cai · Mingzhe Li · Weijia Jia
[ East Exhibition Hall A-B ]
Abstract
Diversity in demonstration selection is critical for enhancing model generalization by enabling broader coverage of structures and concepts. Constructing appropriate demonstration sets remains a key research challenge. This paper introduces the Relevance-Diversity Enhanced Selection (RDES), an innovative approach that leverages reinforcement learning (RL) frameworks to optimize the selection of diverse reference demonstrations for tasks amenable to in-context learning (ICL), particularly text classification and reasoning, in few-shot prompting scenarios. RDES employs frameworks like Q-learning and a PPO-based variant to dynamically identify demonstrations that maximize both diversity (quantified by label distribution) and relevance to the task objective. This strategy ensures a balanced representation of reference data, leading to improved accuracy and generalization. Through extensive experiments on multiple benchmark datasets, including diverse reasoning tasks, and involving 14 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances performance compared to ten established baselines. Our evaluation includes analysis of performance across varying numbers of demonstrations on selected datasets. Furthermore, we investigate incorporating Chain-of-Thought (CoT) reasoning, which further boosts predictive performance. The results highlight the potential of RL for adaptive demonstration selection and addressing challenges in ICL.
Poster
KaShun SHUM · Yuzhen Huang · Hongjian Zou · dingqi · YiXuan Liao · Xiaoxin Chen · Qian Liu · Junxian He
[ East Exhibition Hall A-B ]
Abstract
Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that compression efficiency (i.e., normalized loss) of diverse models on certain text correlates strongly with their downstream performance, when the text domain aligns with the downstream benchmarks (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning, which shares similar intuition with Thrush et al. (2024). To leverage this insight, we introduce predictive data selection (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of the vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu on a scale of 3B models trained on 100B tokens. We open-source our trained data …
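Since the released scorer is fastText-based, the selection step presumably reduces to training and applying a supervised fastText classifier, roughly as sketched below. The toy labels and file name are fabricated for illustration; in the actual method, labels would come from the predictive-strength criterion.

```python
import fasttext

# Toy labeled file: "__label__keep" marks documents assumed to be the kind
# whose losses predict downstream ability (hypothetical labels).
with open("preselect_train.txt", "w") as f:
    f.write("__label__keep the theorem follows from the lemma by induction\n" * 3)
    f.write("__label__drop click here for amazing deals and discounts\n" * 3)
    f.write("__label__keep we evaluate the model on reasoning benchmarks\n" * 3)

# Train the lightweight scorer, then score candidate pretraining documents.
model = fasttext.train_supervised(input="preselect_train.txt", epoch=25, lr=0.5)
labels, probs = model.predict("a proof of the main theorem")
print(labels[0], probs[0])  # score used to filter the pretraining corpus
```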
Poster
Pranjal Aggarwal · Bryan Parno · Sean Welleck
[ East Exhibition Hall A-B ]
Abstract
Automated code generation with large language models has gained significant traction, but there remains no guarantee of the correctness of generated code. We aim to use formal verification to provide mathematical guarantees that the generated code is correct. However, generating formally verified code with LLMs is hindered by the scarcity of training data and the complexity of formal proofs. To tackle this challenge, we introduce AlphaVerus, a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language and leveraging feedback from a verifier. AlphaVerus operates in three phases: exploration of candidate translations, Treefinement -- a novel tree search algorithm for program refinement using verifier feedback, and filtering misaligned specifications and programs to prevent reward hacking. Through this iterative process, AlphaVerus enables the LLaMA-3.1-70B model to generate verified code without human intervention or model finetuning. AlphaVerus shows an ability to generate formally verified solutions for HumanEval and MBPP, laying the groundwork for truly trustworthy code-generation agents.
Poster
Qunzhong WANG · Xiangguo Sun · Hong Cheng
[ East Exhibition Hall A-B ]
Abstract
In recent years, graph prompting has emerged as a promising research direction, enabling the learning of additional tokens or subgraphs appended to original graphs without requiring retraining of pre-trained graph models across various applications. This novel paradigm, shifting from the traditional "pre-training and fine-tuning" to "pre-training and prompting," has shown significant empirical success in simulating graph data operations, with applications ranging from recommendation systems to biological networks and graph transferring. However, despite its potential, the theoretical underpinnings of graph prompting remain underexplored, raising critical questions about its fundamental effectiveness. The lack of rigorous theoretical proof of why and how much it works is more like a "dark cloud" over the graph prompting area for deeper research. To fill this gap, this paper introduces a theoretical framework that rigorously analyzes graph prompting from a data operation perspective. Our contributions are threefold: **First**, we provide a formal guarantee theorem, demonstrating graph prompts’ capacity to approximate graph transformation operators, effectively linking upstream and downstream tasks. **Second**, we derive upper bounds on the error of these data operations for a single graph and extend this discussion to batches of graphs, which are common in graph model training. **Third**, we analyze the distribution of data …
Poster
Nabeel Seedat · Mihaela van der Schaar
[ East Exhibition Hall A-B ]
Abstract
Schema matching -- the task of finding matches between attributes across disparate data sources with different tables and hierarchies -- is critical for creating interoperable machine learning (ML)-ready data. Addressing this fundamental data-centric problem has wide implications, especially in domains like healthcare, finance and e-commerce --- but also has the potential to benefit ML models more generally, by increasing the data available for ML model training. However, schema matching is a challenging ML task due to structural/hierarchical and semantic heterogeneity between different schemas. Previous ML approaches to automate schema matching have either required significant labeled data for model training, which is often unrealistic or suffer from poor zero-shot performance. To this end, we propose Matchmaker - a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker also self-improves in a zero-shot manner without the need for labeled demonstrations via a novel optimization approach, which constructs synthetic in-context demonstrations to guide the language model's reasoning process. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches, highlighting its potential to accelerate data integration and interoperability of ML-ready data.
Poster
Haofei Yu · Zhaochen Hong · Zirui Cheng · Kunlun Zhu · Keyang Xuan · Jinwei Yao · Tao Feng · Jiaxuan You
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) have demonstrated remarkable potential in scientific domains, yet a fundamental question remains unanswered: Can we simulate human research communities with LLMs? Addressing this question can deepen our understanding of the processes behind idea brainstorming and inspire the automatic discovery of novel scientific insights. In this work, we propose ResearchTown, a multi-agent framework for research community simulation. Within this framework, the human research community is simplified as an agent-data graph, where researchers and papers are represented as agent-type and data-type nodes, respectively, and connected based on their collaboration relationships. We also introduce TextGNN, a text-based inference framework that models various research activities (e.g., paper reading, paper writing, and review writing) as special forms of a unified message-passing process on the agent-data graph. To evaluate the quality of the research community simulation, we present ResearchBench, a benchmark that uses a node-masking prediction task for scalable and objective assessment based on similarity. Our experiments reveal three key findings: (1) ResearchTown can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) ResearchTown can maintain robust simulation with multiple researchers and diverse papers; (3) ResearchTown can generate interdisciplinary research ideas that potentially inspire pioneering research …
Poster
Dun Ma · Jianguo Chen · Wenguo Yang · Suixiang Gao · Shengminjie Chen
[ East Exhibition Hall A-B ]
Abstract
In recent years, the pursuit of higher expressive power in graph neural networks (GNNs) has often led to more complex aggregation mechanisms and deeper architectures. To address these issues, we identify redundant structures in GNNs and, by pruning them, propose Pruned MP-GNNs, Pruned K-Path GNNs, and Pruned K-Hop GNNs based on the original architectures. We show that 1) although some structures are pruned in Pruned MP-GNNs and Pruned K-Path GNNs, their expressive power has not been compromised; 2) K-Hop MP-GNNs and their pruned architecture exhibit equivalent expressiveness on regular and strongly regular graphs; 3) the complexity of Pruned K-Path GNNs and Pruned K-Hop GNNs is lower than that of MP-GNNs, yet their expressive power is higher. Experimental results validate our refinements, demonstrating competitive performance across benchmark datasets with improved efficiency.
Poster
Samir Khaki · Xiuyu Li · Junxian Guo · Ligeng Zhu · Konstantinos N (Kostas) Plataniotis · Amir Yazdanbakhsh · Kurt Keutzer · Song Han · Zhijian Liu
[ East Exhibition Hall A-B ]
Abstract
Fine-tuning LLMs is both computationally and memory-intensive. While parameter-efficient fine-tuning methods, such as QLoRA and DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they may even slow down fine-tuning. In this paper, we introduce SparseLoRA, a method that accelerates LLM fine-tuning through contextual sparsity. We propose a lightweight, training-free SVD sparsity estimator that dynamically selects a sparse subset of weights for loss and gradient computation. Also, we systematically analyze and address sensitivity across layers, tokens, and training steps. Our experimental results show that SparseLoRA reduces computational cost by up to $2.0\times$ and achieves a measured speedup of up to $1.5\times$ while maintaining accuracy across various downstream tasks, including commonsense and arithmetic reasoning, code generation, and instruction following.
Poster
Felipe Nuti · Tim Franzmeyer · Joao Henriques
[ East Exhibition Hall A-B ]
Abstract
Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method takes into account the model's intermediate hidden states, giving a more fine-grained insight into the effects of fine-tuning than a simple comparison of the final outputs of pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) in terms of the ratio of the magnitudes of the fine-tuning component and the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed compared to ones where they do not. This suggests that …
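Setting aside the paper's exact layerwise decomposition, the quantity itself is simple to sketch: treat the difference between fine-tuned and pre-trained hidden states as the fine-tuning component and measure its relative magnitude. The random hidden states below are purely illustrative.

```python
import torch

def tuco(h_pretrained, h_finetuned):
    """Tuning Contribution sketch: decompose the fine-tuned model's hidden
    state into a pre-training component and a fine-tuning component (their
    difference), and report the relative magnitude of the latter."""
    ft_component = h_finetuned - h_pretrained
    ft, pt = ft_component.norm(), h_pretrained.norm()
    return (ft / (ft + pt)).item()

h_pre = torch.randn(4096)                 # pre-trained hidden state (toy)
h_ft = h_pre + 0.05 * torch.randn(4096)   # fine-tuned model, same prompt (toy)
print(tuco(h_pre, h_ft))                  # low values mirror successful attacks
```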
Poster
Yukang Yang · Declan Campbell · Kaixuan Huang · Mengdi Wang · Jonathan Cohen · Taylor Webb
[ East Exhibition Hall A-B ]
Abstract
Many recent studies have found evidence for emergent reasoning capabilities in large language models (LLMs), but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we study the internal mechanisms that support abstract reasoning in LLMs. We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, *symbol abstraction heads* convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, *symbolic induction heads* perform sequence induction over these abstract variables. Finally, in later layers, *retrieval heads* predict the next token by retrieving the value associated with the predicted abstract variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.
Poster
Senyu Han · Hongchuan Zeng · Kai Yu · Lu Chen
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) consist of numerous Transformer modules, and while the models can perform various functions, it remains an open question how these modules are combined to elicit distinct inherent functionalities. In this paper, we investigate the modules inside LLMs and demonstrate that, by simply masking or retaining specific attention heads during inference, LLMs can exhibit specific task functionalities without requiring explicit instructions or modifications to the model parameters. Experiments across various models and tasks reveal that LLMs inherently encode ``functional pathways'', the structured groups of interdependent attention heads that are crucial for executing specific tasks. These pathways not only govern the model's functional behaviors but also enhance parameter efficiency, as suppressing attention heads outside the pathway can improve task performance. The code is available in this repository: [https://github.com/OpenDFM/HeadsUp](https://github.com/OpenDFM/HeadsUp).
Poster
Charlotte Peale · Vinod Raman · Omer Reingold
[ East Exhibition Hall A-B ]
Abstract
We introduce "representative generation," extending the theoretical framework for generation proposed by Kleinberg et al. (2024) and formalized by Li et al. (2024), to additionally address diversity and bias concerns in generative models. Our notion requires outputs of a generative model to proportionally represent groups of interest from the training data. We characterize representative uniform and non-uniform generation, introducing the ``group closure dimension'' as a key combinatorial quantity. For representative generation in the limit, we analyze both information-theoretic and computational aspects, demonstrating feasibility for countably infinite hypothesis classes and collections of groups under certain conditions, but proving a negative result for computability using only membership queries. This contrasts with Kleinberg et al.'s (2024) positive results for standard generation in the limit. Our findings provide a rigorous foundation for developing more diverse and representative generative models.
Poster
Zecheng Tang · Zechen Sun · Juntao Li · Zhu Qiaoming · Min Zhang
[ East Exhibition Hall A-B ]
Abstract
Long-context models (LCMs) have shown great potential in processing long input sequences (even more than 100M tokens) conveniently and effectively. With significant progress, recent research has pointed out that LCMs can accurately locate token-level salient information within the context. Yet, the generation performance of these LCMs is far from satisfactory and might result in misaligned responses, such as hallucinations. To enhance the generation capability of LCMs, existing works have investigated the effects of data size and quality for both pre-training and instruction tuning. Though achieving meaningful improvement, previous methods fall short in either effectiveness or efficiency. In this paper, we introduce LOGO (Long cOntext aliGnment via efficient preference Optimization), the first training strategy to introduce preference optimization for long-context alignment. To overcome the GPU memory-bound issue caused by the long sequence, LOGO employs a reference-free preference optimization strategy and adopts a position synthesis method to construct the training data. By training with only 0.3B data on a single 8×A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve comparable performance with GPT-4 in real-world long-context tasks while preserving the model's original capabilities on other tasks, e.g., language modeling and MMLU. Moreover, LOGO can extend the …
Poster
Ruizhe Wang · Yeyun Gong · Xiao Liu · Guoshuai Zhao · Ziyue Yang · Baining Guo · Zheng-Jun Zha · Peng CHENG
[ East Exhibition Hall A-B ]
Abstract
The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
Poster
Yifei Zhou · Qianlan Yang · Kaixiang Lin · Min Bai · Xiong Zhou · Yu-Xiong Wang · Sergey Levine · Li Li
[ East Exhibition Hall A-B ]
Abstract
A generalist foundation model agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited due to the limited scalability of human-annotated instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. After a context-aware task proposer generates instructions based on website information, the agent policy attempts those tasks in the real world, with the resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. Our results show that PAE significantly improves the zero-shot generalization capability of VLM Internet agents (around 50% relative improvement) to both unseen tasks and websites.
Poster
Haoyun Jiang · Haolin li · jianwei zhang · Fei Huang · Qiang Hu · Minmin Sun · Shuai Xiao · Yong Li · Junyang Lin · Jiangchao Yao
[ East Exhibition Hall A-B ]
Abstract
Large language models (LLMs) have demonstrated strong capabilities in handling long-context tasks, but processing such long contexts remains challenging due to the substantial memory requirements and inference latency. In this work, we discover that certain attention heads exhibit sequential consistency in their attention patterns, which can be persistently identified using a coefficient-of-variation-based algorithm. Inspired by this observation, we propose CateKV, a hybrid KV cache method that retains only critical token information for consistent heads, thereby reducing KV cache size and computational overhead, while preserving the majority of KV pairs in adaptive heads to ensure high accuracy. We show the unique characteristics of our algorithm and its extension with existing acceleration methods. Comprehensive evaluations on long-context benchmarks show that, while maintaining accuracy comparable to full attention, CateKV reduces memory usage by up to $2.72\times$ and accelerates decoding by $2.18\times$ in single-sample inputs, and boosts throughput by $3.96\times$ in batch scenarios.
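A rough sketch of the coefficient-of-variation test: for each head, measure how much its attention distribution varies across decoding steps; a low average CV marks a consistent head whose KV cache can be pruned down to critical tokens. The threshold and synthetic attention maps below are illustrative.

```python
import numpy as np

def consistent_heads(attn, cv_threshold=0.5):
    """Identify sequentially consistent attention heads: per head, compute
    the coefficient of variation of attention mass per key position across
    decoding steps, then average over keys. Low CV = consistent head."""
    mean = attn.mean(axis=1)                  # (heads, keys)
    std = attn.std(axis=1)
    cv = (std / (mean + 1e-8)).mean(axis=1)   # (heads,)
    return np.where(cv < cv_threshold)[0]

rng = np.random.default_rng(0)
stable = np.tile(rng.dirichlet(np.ones(16)), (8, 1))   # same pattern each step
noisy = rng.dirichlet(np.ones(16), size=8)             # pattern changes per step
attn = np.stack([stable, noisy])                       # 2 heads, 8 steps, 16 keys
print(consistent_heads(attn))  # [0]: only the stable head is flagged
```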
Poster
Yuan Li · Jun Hu · Zemin Liu · Bryan Hooi · Jia Chen · Bingsheng He
[ East Exhibition Hall A-B ]
Abstract
Graph Neural Networks (GNNs) face significant computational challenges when handling large-scale graphs. To address this, Graph Condensation (GC) methods aim to compress large graphs into smaller, synthetic ones that are more manageable for GNN training. Recently, trajectory matching methods have shown state-of-the-art (SOTA) performance for GC, aligning the model's training behavior on a condensed graph with that on the original graph by guiding the trajectory of model parameters. However, these approaches require repetitive GNN retraining during condensation, making them computationally expensive. To address the efficiency issue, we completely bypass trajectory matching and propose a novel two-stage framework. The first stage, a precomputation stage, performs one-time message passing to extract structural and semantic information from the original graph. The second stage, a diversity-aware adaptation stage, performs class-wise alignment while maximizing the diversity of synthetic features. Remarkably, even with just the precomputation stage, which takes only seconds, our method either matches or surpasses 5 out of 9 baseline results. Extensive experiments show that our approach achieves comparable or better performance while being 96× to 2,455× faster than SOTA methods, making it more practical for large-scale GNN applications. Our code and data are available at https://github.com/Xtra-Computing/GCPA.
Poster
Tingyi Cai · Yunliang Jiang · Ming Li · Lu Bai · Changqin Huang · Yi Wang
[ East Exhibition Hall A-B ]
Abstract
With the growing adoption of Hypergraph Neural Networks (HNNs) to model higher-order relationships in complex data, concerns about their security and robustness have become increasingly important. However, current security research often overlooks the unique structural characteristics of hypergraph models when developing adversarial attack and defense strategies. To address this gap, we demonstrate that hypergraphs are particularly vulnerable to node injection attacks, which align closely with real-world applications. Through empirical analysis, we develop a relatively unnoticeable attack approach by monitoring changes in homophily and leveraging this self-regulating property to enhance stealth. Building on these insights, we introduce HyperNear, i.e., $\underline{N}$ode inj$\underline{E}$ction $\underline{A}$ttacks on hype$\underline{R}$graph neural networks, the first node injection attack framework specifically tailored for HNNs. HyperNear integrates homophily-preserving strategies to optimize both stealth and attack effectiveness. Extensive experiments show that HyperNear achieves excellent performance and generalization, marking the first comprehensive study of injection attacks on hypergraphs. Our code is available at https://github.com/ca1man-2022/HyperNear.
Spotlight Poster
Moshe Eliasof · Alessio Gravina · Andrea Ceni · Claudio Gallicchio · Davide Bacciu · Carola-Bibiane Schönlieb
[ East Exhibition Hall A-B ]
Abstract
Graph State Space Models (SSMs) have recently been introduced to enhance Graph Neural Networks (GNNs) in modeling long-range interactions. Despite their success, existing methods either compromise on permutation equivariance or limit their focus to pairwise interactions rather than sequences. Building on the connection between Autoregressive Moving Average (ARMA) and SSM, in this paper, we introduce GRAMA, a Graph Adaptive method based on a learnable ARMA framework that addresses these limitations. By transforming from static to sequential graph data, GRAMA leverages the strengths of the ARMA framework, while preserving permutation equivariance. Moreover, GRAMA incorporates a selective attention mechanism for dynamic learning of ARMA coefficients, enabling efficient and flexible long-range information propagation. We also establish theoretical connections between GRAMA and Selective SSMs, providing insights into its ability to capture long-range dependencies. Experiments on 26 synthetic and real-world datasets demonstrate that GRAMA consistently outperforms backbone models and performs competitively with state-of-the-art methods.
Poster
Hang Gao · Huang Wenxuan · Fengge Wu · Zhao Junsuo · Changwen Zheng · Huaping Liu
[ East Exhibition Hall A-B ]
Abstract
The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this gap, we conduct a more in-depth analysis based on the interchange intervention method. First, we construct a synthetic graph dataset with controllable causal relationships, enabling precise manipulation of semantic relationships and causal modeling to provide data for analysis. Using this dataset, we conduct interchange interventions to examine the deeper properties of LLM enhancers and GNNs, uncovering their underlying logic and internal mechanisms. Building on the analytical results, we design a plug-and-play optimization module to improve the information transfer between LLM enhancers and GNNs. Experiments across multiple datasets and models validate the proposed module.
Poster
Tianlang Chen · Charilaos Kanatsoulis · Jure Leskovec
[ East Exhibition Hall A-B ]
Abstract
Predictive tasks on relational databases are critical in real-world applications spanning e-commerce, healthcare, and social media. To address these tasks effectively, Relational Deep Learning (RDL) encodes relational data as graphs, enabling Graph Neural Networks (GNNs) to exploit relational structures for improved predictions. However, existing RDL methods often overlook the intrinsic structural properties of the graphs built from relational databases, leading to modeling inefficiencies, particularly in handling many-to-many relationships. Here we introduce RelGNN, a novel GNN framework specifically designed to leverage the unique structural characteristics of the graphs built from relational databases. At the core of our approach is the introduction of atomic routes, which are simple paths that enable direct single-hop interactions between the source and destination nodes. Building upon these atomic routes, RelGNN designs new composite message passing and graph attention mechanisms that reduce redundancy, highlight key signals, and enhance predictive accuracy. RelGNN is evaluated on 30 diverse real-world tasks from Relbench (Fey et al., 2024), and achieves state-of-the-art performance on the vast majority of tasks, with improvements of up to 25%.
Poster
Zehong Wang · Zheyuan Zhang · Tianyi MA · Nitesh Chawla · Chuxu Zhang · Yanfang Ye
[ East Exhibition Hall A-B ]
Abstract
Graph learning tasks often hinge on identifying key substructure patterns---such as triadic closures in social networks or benzene rings in molecular graphs---that underpin downstream performance. However, most existing graph neural networks (GNNs) rely on message passing, which aggregates local neighborhood information iteratively and struggles to explicitly capture such fundamental motifs, like triangles, $k$-cliques, and rings. This limitation hinders both expressiveness and long-range dependency modeling. In this paper, we introduce the Neural Graph Pattern Machine (GPM), a novel framework that bypasses message passing by learning directly from graph substructures. GPM efficiently extracts, encodes, and prioritizes task-relevant graph patterns, offering greater expressivity and improved ability to capture long-range dependencies. Empirical evaluations across four standard tasks---node classification, link prediction, graph classification, and graph regression---demonstrate that GPM outperforms state-of-the-art baselines. Further analysis reveals that GPM exhibits strong out-of-distribution generalization, desirable scalability, and enhanced interpretability. Code and datasets are available at: https://github.com/Zehong-Wang/GPM.
Poster
Steve Azzolin · SAGAR MALHOTRA · Andrea Passerini · Stefano Teso
[ East Exhibition Hall A-B ]
Abstract
Self-Explainable Graph Neural Networks (SE-GNNs) are popular explainable-by-design GNNs, but their explanations' properties and limitations are not well understood. Our first contribution fills this gap by formalizing the explanations extracted by some popular SE-GNNs, referred to as Minimal Explanations (MEs), and comparing them to established notions of explanations, namely Prime Implicant (PI) and faithful explanations. Our analysis reveals that MEs match PI explanations for a restricted but significant family of tasks. In general, however, they can be less informative than PI explanations and are surprisingly misaligned with widely accepted notions of faithfulness. Although faithful and PI explanations are informative, they are intractable to find, and we show that they can be prohibitively large. Given these observations, a natural choice is to augment SE-GNNs with alternative modalities of explanations that compensate for SE-GNNs' limitations. To this end, we propose Dual-Channel GNNs that integrate a white-box rule extractor and a standard SE-GNN, adaptively combining both channels. Our experiments show that even a simple instantiation of Dual-Channel GNNs can recover succinct rules and perform on par or better than widely used SE-GNNs.
Poster
Yongqiang Chen · QUANMING YAO · Juzheng Zhang · James Cheng · Yatao Bian
[ East Exhibition Hall A-B ]
Abstract
Recently, there has been a surge of interest in extending the success of large language models (LLMs) from texts to molecules. Most existing approaches adopt a graph neural network to represent a molecule as a series of node tokens for molecule-language alignment, which, however, have overlooked the inherent hierarchical structures in molecules. Notably, higher-order molecular structures contain rich semantics of functional groups, which encode crucial biochemical functionalities of the molecules. We show that neglecting the hierarchical information in tokenization will lead to subpar molecule-language alignment and severe hallucination. To address this limitation, we propose HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that encodes the hierarchy of atom, motif, and molecular levels of informative tokens to improve the molecular perception of LLMs. HIGHT also adopts an augmented instruction tuning dataset, enriched with the hierarchical graph information, to further enhance the molecule-language alignment. Extensive experiments on 14 real-world benchmarks verify the effectiveness of HIGHT in reducing hallucination by 40%, as well as yielding significant improvements in various molecule-language downstream tasks. The project is available at https://higraphllm.github.io/.
Poster
Zhehan Zhao · Lu Bai · Lixin Cui · Ming Li · Ziyu Lyu · Lixiang Xu · Yue Wang · Edwin Hancock
[ East Exhibition Hall A-B ]
Abstract
Graph Neural Networks (GNNs) have emerged as powerful tools for graph learning, and one key challenge arising in GNNs is the development of effective pooling operations for learning meaningful graph representations. In this paper, we propose a novel Edge-Node Attention-based Hierarchical Pooling (ENAHPool) operation for GNNs. Unlike existing cluster-based pooling methods that suffer from ambiguous node assignments and uniform edge-node information aggregation, ENAHPool assigns each node exclusively to a cluster and employs attention mechanisms to perform weighted aggregation of both node features within clusters and edge connectivity strengths between clusters, resulting in more informative hierarchical representations. To further enhance the model performance, we introduce a Multi-Distance Message Passing Neural Network (MD-MPNN) that utilizes edge connectivity strength information to enable direct and selective message propagation across multiple distances, effectively mitigating the over-squashing problem in classical MPNNs. Experimental results demonstrate the effectiveness of the proposed method.
Poster
Omer Ronen · Ahmed Imtiaz Humayun · Richard Baraniuk · Randall Balestriero · Bin Yu
[ East Exhibition Hall A-B ]
Abstract
We develop the Latent Exploration Score (LES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its practicality. LES leverages the trained decoder's approximation of the data distribution, and can be employed with any VAE decoder, including pretrained ones, without additional training, architectural changes, or access to the training data. Our evaluation across five LSO benchmark tasks and twenty-two VAE models demonstrates that LES always enhances the quality of the solutions while maintaining high objective values, leading to improvements over existing solutions in most cases. We believe that LES' ability to identify out-of-distribution areas, its differentiability, and its computational tractability will open new avenues for LSO.
Poster
Ivan Skorokhodov · Sharath Girish · Benran Hu · Willi Menapace · Yanyu Li · Rameen Abdal · Sergey Tulyakov · Aliaksandr Siarohin
[ East Exhibition Hall A-B ]
Abstract
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that these high-frequency components interfere with the coarse-to-fine nature of the diffusion synthesis process and hinder generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to $20$K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation on Kinetics-700 17x256x256. The source code is available at https://github.com/snap-research/diffusability.
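As a rough illustration of the kind of spectral diagnostic described above, the sketch below measures how much of a latent's energy sits in high spatial frequencies. It assumes a generic `torch` latent of shape [C, H, W]; the paper's actual frequency analysis and binning may differ.

```python
import torch

def high_freq_energy_ratio(latent, cutoff: float = 0.5):
    """Fraction of a [C, H, W] latent's spectral energy above `cutoff`
    (normalized radial frequency). A toy diagnostic in the spirit of the
    paper's spectral analysis; the exact binning in the paper may differ.
    """
    C, H, W = latent.shape
    power = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1)).abs() ** 2
    fy = torch.linspace(-1, 1, H).view(H, 1)
    fx = torch.linspace(-1, 1, W).view(1, W)
    radius = (fy ** 2 + fx ** 2).sqrt()          # radial frequency grid
    return (power[:, radius > cutoff].sum() / power.sum()).item()

z = torch.randn(16, 32, 32)                      # e.g., a wide-bottleneck latent
print(f"high-frequency energy ratio: {high_freq_energy_ratio(z):.3f}")
```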
Poster
Zhenxing Mi · Kuan-Chieh Wang · Guocheng Qian · Hanrong Ye · Runtao Liu · Sergey Tulyakov · Kfir Aberman · Dan Xu
[ East Exhibition Hall A-B ]
Abstract
This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the **LLM decoder** shares the same input feature space with **diffusion decoders** that use the corresponding **LLM encoder** for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
Poster
Wenbo Lu · Shaoyi Zheng · Yuxuan Xia · Shenji Wan
[ East Exhibition Hall A-B ]
Abstract
Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers’ quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFu reduce FLOPs by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that negate theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose **To**ken **M**erge with **A**ttention (ToMA), an off-the-shelf method that redesigns token reduction for GPU-aligned efficiency, with three key contributions: 1) a reformulation of token merging as a submodular optimization problem to select diverse tokens; 2) merge/unmerge as an attention-like linear transformation via GPU-friendly matrix operations; and 3) exploiting latent locality and sequential redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux generation latency by 24%/23% (DINO $\Delta < 0.07$), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for transformers in diffusion.
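A schematic sketch of the merge/unmerge-as-linear-maps idea in PyTorch. The representative tokens here are picked by a naive uniform stride, whereas ToMA selects a diverse subset via submodular optimization, so every name and choice below is illustrative rather than the authors' implementation:

```python
import torch

def merge_unmerge(x, num_kept: int):
    """Merge N tokens into `num_kept` representatives and map back using
    only dense matrix products (no sort/scatter on the hot path).

    x: [N, D]. The uniform-stride pick of representatives is a placeholder;
    ToMA selects a diverse subset via submodular optimization.
    """
    N, D = x.shape
    idx = torch.linspace(0, N - 1, num_kept).long()     # placeholder selector
    reps = x[idx]                                       # [num_kept, D]
    assign = torch.softmax(x @ reps.T / D ** 0.5, -1)   # soft token->rep map
    merge = assign / assign.sum(dim=0, keepdim=True)    # column-normalized
    x_merged = merge.T @ x                              # merge:   [num_kept, D]
    x_restored = assign @ x_merged                      # unmerge: [N, D]
    return x_merged, x_restored

x = torch.randn(1024, 64)
xm, xr = merge_unmerge(x, 256)
print(xm.shape, xr.shape)
```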
Poster
Long Zhao · Sanghyun Woo · Ziyu Wan · Yandong li · Han Zhang · Boqing Gong · Hartwig Adam · Xuhui Jia · Ting Liu
[ East Exhibition Hall A-B ]
Abstract
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. By adopting iterative reconstruction through diffusion, our autoencoder, namely Epsilon-VAE, achieves high reconstruction quality, which in turn enhances downstream generation quality by 22% at the same compression rates or provides 2.3x inference speedup through increasing compression rates. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.
Poster
Anji Liu · Xuejie Liu · Dayuan Zhao · Mathias Niepert · Yitao Liang · Guy Van den Broeck
[ East Exhibition Hall A-B ]
Abstract
Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
Poster
Zhengchao Wan · Qingsong Wang · Gal Mishne · Yusu Wang
[ East Exhibition Hall A-B ]
Abstract
Flow matching (FM) models extend ODE-sampler-based diffusion models into a general framework, significantly reducing sampling steps through learned vector fields. However, the theoretical understanding of FM models, particularly how their sample trajectories interact with underlying data geometry, remains underexplored. A rigorous theoretical analysis of FM ODE is essential for sample quality, stability, and broader applicability. In this paper, we advance the theory of FM models through a comprehensive analysis of sample trajectories. Central to our theory is the discovery that the denoiser, a key component of FM models, guides ODE dynamics through attracting and absorbing behaviors that adapt to the data geometry. We identify and analyze the three stages of ODE evolution: in the initial and intermediate stages, trajectories move toward the mean and local clusters of the data. At the terminal stage, we rigorously establish the convergence of FM ODE under weak assumptions, addressing scenarios where the data lie on a low-dimensional submanifold---cases that previous results could not handle. Our terminal stage analysis offers insights into the memorization phenomenon and establishes equivariance properties of FM ODEs. These findings bridge critical gaps in understanding flow matching models, with practical implications for optimizing sampling strategies and architectures guided by …
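For readers less familiar with the ODE view analyzed above: sampling from an FM model just integrates the learned vector field from noise to data. A minimal sketch with plain Euler steps, using a stand-in `velocity` closure so it runs on its own (any trained FM field would slot in):

```python
import torch

@torch.no_grad()
def fm_euler_sample(velocity, x0, steps: int = 100):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with
    plain Euler steps. `velocity(x, t)` is any trained flow-matching
    vector field; a toy closure is used below.
    """
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + velocity(x, t) * dt
    return x

target_mean = torch.ones(2)
velocity = lambda x, t: target_mean - x     # toy field drifting toward the mean
print(fm_euler_sample(velocity, torch.randn(8, 2)).mean(dim=0))
```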
Poster
Kushagra Pandey · Farrin Marouf Sofian · Felix Draxler · Theofanis Karaletsos · Stephan Mandt
[ East Exhibition Hall A-B ]
Abstract
Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing \emph{Diffusion Trajectory Matching (DTM)} that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear, non-linear, and blind inverse problems without requiring additional model training or specificity to pixel or latent space diffusion models. Our code will be available at https://github.com/czi-ai/oc-guidance.
Poster
Rajat Rasal · Avinash Kori · Fabio De Sousa Ribeiro · Tian Xia · Ben Glocker
[ East Exhibition Hall A-B ]
Abstract
Counterfactual image generation presents significant challenges, including preserving identity, maintaining perceptual quality, and ensuring faithfulness to an underlying causal model. While existing auto-encoding frameworks admit semantic latent spaces which can be manipulated for causal control, they struggle with scalability and fidelity. Advancements in diffusion models present opportunities for improving counterfactual image editing, having demonstrated state-of-the-art visual quality, human-aligned perception and representation learning capabilities. Here, we present a suite of diffusion-based causal mechanisms, introducing the notions of spatial, semantic and dynamic abduction. We propose a general framework that integrates semantic representations into diffusion models through the lens of Pearlian causality to edit images via a counterfactual reasoning process. To the best of our knowledge, ours is the first work to consider high-level semantic identity preservation for diffusion counterfactuals and to demonstrate how semantic control enables principled trade-offs between faithful causal control and identity preservation.
Poster
Fang-Duo Tsai · Shih-Lun Wu · Weijaw Lee · Sheng-Ping Yang · Bo-Rui Chen · Hao-Chung Cheng · Yi-Hsuan Yang
[ East Exhibition Hall A-B ]
Abstract
We propose MuseControlLite, a lightweight mechanism designed to fine-tune text-to-music generation models for precise conditioning using various time-varying musical attributes and reference audio signals. The key finding is that positional embeddings, which text-to-music generation models have seldom used in their conditioners for text conditions, are critical when the condition of interest is a function of time. Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross-attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting, demonstrating improved controllability over MusicGen-Large and Stable Audio Open ControlNet at a significantly lower fine-tuning cost, with only 85M trainable parameters. Source code, model checkpoints, and demo examples are available at: https://MuseControlLite.github.io/web/
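A minimal sketch of the ingredient the abstract identifies as critical: applying rotary positional embeddings to the queries and keys of a (decoupled) cross-attention layer. Shapes and the `rope` helper are illustrative, not the authors' implementation:

```python
import torch

def rope(x, base: float = 10000.0):
    """Apply rotary positional embeddings to x of shape [T, D] (D even),
    using the rotate-half formulation."""
    T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def cross_attention_with_rope(q, k, v):
    """q: [Tq, D] time-varying condition queries; k, v: [Tk, D]."""
    q, k = rope(q), rope(k)  # positions now matter for time-varying control
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

out = cross_attention_with_rope(torch.randn(32, 64), torch.randn(48, 64),
                                torch.randn(48, 64))
print(out.shape)  # torch.Size([32, 64])
```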
Poster
Theodoros Kouzelis · Ioannis Kakogeorgiou · Spyros Gidaris · Nikos Komodakis
[ East Exhibition Hall A-B ]
Abstract
Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By finetuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a ×7 speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models.
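The regularization idea can be sketched in a few lines: decoding a transformed latent should match the transformed image. Below, `enc`/`dec` are toy stand-ins for a pretrained autoencoder and bilinear downscaling stands in for the semantic-preserving transforms; the paper's transform set and weighting are richer.

```python
import torch
import torch.nn.functional as F

def equivariance_loss(encoder, decoder, x, scale: float = 0.5):
    """EQ-VAE-style penalty (sketch): decoding a downscaled latent should
    match the downscaled image, i.e., dec(s(enc(x))) ~ s(x) for a
    semantic-preserving transform s (here: bilinear downscaling).
    """
    z = encoder(x)
    z_s = F.interpolate(z, scale_factor=scale, mode="bilinear", align_corners=False)
    x_s = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    return F.mse_loss(decoder(z_s), x_s)

# Toy fully-convolutional stand-ins for a pretrained autoencoder.
enc = torch.nn.Conv2d(3, 4, kernel_size=8, stride=8)
dec = torch.nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)
print(equivariance_loss(enc, dec, torch.randn(2, 3, 64, 64)).item())
```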
Poster
Huafeng Liu · Yiran Fu · Liping Jing · Hui Li · Shuyang Lin · Jingyue Shi · Deqiang Ouyang · Jian Yu
[ East Exhibition Hall A-B ]
Abstract
Neural processes (NPs) are a promising paradigm to enable skill transfer learning across tasks with the aid of the distribution of functions. Previous NPs employ the empirical risk minimization principle in optimization. However, the fast adaptation ability to different tasks can vary widely, and the worst fast adaptation can be catastrophic in risk-sensitive tasks. To achieve robust neural processes modeling, we consider the problem of training models in a risk-averse manner, which can control the worst fast adaptation cases at a certain probabilistic level. By transferring the risk minimization problem to a two-level finite sum minimax optimization problem, we can easily solve it via a double-looped stochastic mirror prox algorithm with a task-aware variance reduction mechanism that samples across all tasks. The mirror prox technique ensures better handling of complex constraint sets and non-Euclidean geometries, making the optimization adaptable to various tasks. The final solution, by aggregating prox points with the adaptive learning rates, enables a stable and high-quality output. The proposed learning strategy can work with various NPs flexibly and achieves less biased approximation with a theoretical guarantee. To illustrate the superiority of the proposed model, we perform experiments on both synthetic and real-world data, and the …
Poster
Alexander Lobashev · Dmitry Guskov · Maria Larchenko · Mikhail Tamm
[ East Exhibition Hall A-B ]
Abstract
This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions. Our source code is available at https://github.com/alobashev/hessian-geometry-of-diffusion-models.
Poster
Yuhao Huang · Taos Transue · Shih-Hsin Wang · William Feldman · Hong Zhang · Bao Wang
[ East Exhibition Hall A-B ]
Abstract
Conditional flow matching (CFM) stands out as an efficient, simulation-free approach for training flow-based generative models, achieving remarkable performance for data generation. However, CFM is insufficient to ensure accuracy in learning probability paths. In this paper, we introduce a new partial differential equation characterization for the error between the learned and exact probability paths, along with its solution. We show that the total variation between probability paths is bounded above by a combination of the CFM loss and an associated divergence loss. This theoretical insight leads to the design of a new objective function that simultaneously matches the flow and its divergence. Our new approach improves the performance of the flow-based generative model by a noticeable margin without significantly raising the computational cost. We showcase the advantages of this enhanced training approach over CFM on several important benchmark tasks, including generative modeling for dynamical systems, DNA sequences, and videos.
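The divergence term such an augmented objective needs can be estimated without materializing the Jacobian. A sketch using the standard Hutchinson trace estimator (the paper's exact divergence loss and how it is combined with the CFM loss may differ):

```python
import torch

def hutchinson_divergence(v, x, t, n_probes: int = 4):
    """Estimate div_x v(x, t) per sample as E_eps[eps^T (dv/dx) eps]
    with Rademacher probes, avoiding the full Jacobian.

    v(x, t) -> tensor shaped like x. A generic sketch of the divergence
    term; the stand-in vector field below is for illustration only.
    """
    x = x.requires_grad_(True)
    out = v(x, t)
    div = torch.zeros(x.shape[0])
    for _ in range(n_probes):
        eps = torch.randint(0, 2, x.shape).to(x.dtype) * 2 - 1  # Rademacher
        vjp = torch.autograd.grad(out, x, eps, retain_graph=True)[0]
        div = div + (vjp * eps).flatten(1).sum(dim=1) / n_probes
    return div

v = lambda x, t: torch.tanh(x) * (1 - t[:, None])  # stand-in vector field
print(hutchinson_divergence(v, torch.randn(8, 3), torch.rand(8)))
```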
Poster
Masatoshi Uehara · su · Yulai Zhao · Xiner Li · Aviv Regev · Shuiwang Ji · Sergey Levine · Tommaso Biancalani
[ East Exhibition Hall A-B ]
Abstract
To fully leverage the capabilities of diffusion models, we are often interested in optimizing downstream reward functions during inference. While numerous algorithms for reward-guided generation have been recently proposed due to their significance, current approaches predominantly focus on single-shot generation, transitioning from fully noised to denoised states. We propose a novel framework for inference-time reward optimization with diffusion models. Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward-guided denoising. This sequential refinement allows for the gradual correction of errors introduced during reward optimization. Finally, we provide a theoretical guarantee for our framework and demonstrate its superior empirical performance in protein and DNA design.
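A sketch of the iterative refinement loop, with toy stand-ins for the pretrained denoiser and the reward gradient (the names, guidance weighting, and noise schedule are all illustrative, not the paper's):

```python
import torch

@torch.no_grad()
def refine(x, denoise, reward_grad, iters: int = 10,
           noise_scale: float = 0.3, guidance: float = 0.1):
    """Inference-time refinement (sketch): each iteration partially
    re-noises the sample, denoises it, and nudges it along the reward
    gradient. `denoise` and `reward_grad` stand in for a pretrained
    (few-step) denoiser and a differentiable reward.
    """
    for _ in range(iters):
        x = x + noise_scale * torch.randn_like(x)  # noising step
        x = denoise(x)                             # reward-guided denoising,
        x = x + guidance * reward_grad(x)          # split in two for clarity
    return x

denoise = lambda x: 0.9 * x                        # toy denoiser: shrink noise
reward_grad = lambda x: -2 * x                     # toy reward: -||x||^2
print(refine(torch.randn(4, 8), denoise, reward_grad).norm(dim=1))
```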
Spotlight Poster
Zhicheng Zhang · Wuyou Xia · Chenxi Zhao · Zhou Yan · Xiaoqiang Liu · Yongjie Zhu · Wenyu Qin · Pengfei Wan · Di ZHANG · Jufeng Yang
[ East Exhibition Hall A-B ]
Abstract
Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities, empowered by generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while paying less attention to how multimodal tokens are mixed through attention, posing challenges for high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting the inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling interaction between the visual and language modalities. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
Poster
Charles O'Neill · Alim Gumran · David Klindt
[ East Exhibition Hall A-B ]
Abstract
A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. Using compressed sensing theory, we prove that an SAE encoder is inherently insufficient for accurate sparse inference, even in solvable cases. We then decouple encoding and decoding processes to empirically explore conditions where more sophisticated sparse inference methods outperform traditional SAE encoders. Our results reveal substantial performance gains with minimal compute increases in correct inference of sparse codes. We demonstrate this generalises to SAEs applied to large language models, where more expressive encoders achieve greater interpretability. This work opens new avenues for understanding neural network representations and analysing large language model activations.
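One way to decouple encoding from decoding, as described above, is to keep the trained SAE decoder as a fixed dictionary and run an iterative sparse solver at inference. A sketch using ISTA with a non-negativity constraint (the paper compares several inference methods; this is just one such choice, and all names are illustrative):

```python
import torch

def ista_codes(W, x, lam: float = 0.1, steps: int = 100):
    """Infer non-negative sparse codes for a *fixed* SAE decoder dictionary
    with ISTA, replacing the usual one-shot linear-nonlinear encoder.

    W: [K, D] dictionary (decoder rows); x: [B, D] activations to explain.
    """
    L = torch.linalg.matrix_norm(W, ord=2) ** 2        # gradient Lipschitz const.
    z = torch.zeros(x.shape[0], W.shape[0])
    for _ in range(steps):
        grad = (z @ W - x) @ W.T                       # d/dz 0.5 * ||zW - x||^2
        z = torch.relu(z - (grad + lam) / L)           # non-negative soft-threshold
    return z

W = torch.randn(64, 128)
W = W / W.norm(dim=1, keepdim=True)                    # unit-norm atoms
x = torch.relu(torch.randn(4, 64)) @ W                 # toy activations
print((ista_codes(W, x) > 0).float().mean().item())    # fraction of active codes
```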
Spotlight Poster
Xuesong Wang · He Zhao · Edwin V. Bonilla
[ East Exhibition Hall A-B ]
Abstract
Neural Processes (NPs) are deep probabilistic models that represent stochastic processes by conditioning their prior distributions on a set of context points. Despite their advantages in uncertainty estimation for complex distributions, NPs enforce parameterization coupling between the conditional prior model and the posterior model. We show that this coupling amounts to prior misspecification and revisit the NP objective to address this issue. More specifically, we propose Rényi Neural Processes (RNP), a method that replaces the standard KL divergence with the Rényi divergence, dampening the effects of the misspecified prior during posterior updates. We validate our approach across multiple benchmarks including regression and image inpainting tasks, and show significant performance improvements of RNPs in real-world problems. Our extensive experiments show consistently better log-likelihoods over state-of-the-art NP models.
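For diagonal Gaussians, the Rényi divergence that replaces the KL term has a closed form. A sketch of that drop-in computation (this is the standard Gaussian Rényi divergence, valid for the alphas where the mixed variance stays positive; wiring it into the full NP objective is omitted):

```python
import torch

def renyi_gaussians(mu_p, var_p, mu_q, var_q, alpha: float = 0.7):
    """Closed-form Renyi divergence D_alpha(P || Q) between diagonal
    Gaussians. At alpha -> 1 this recovers the usual KL divergence.
    """
    var_a = (1 - alpha) * var_p + alpha * var_q        # mixed variance
    quad = (alpha / 2) * ((mu_p - mu_q) ** 2 / var_a).sum(-1)
    log_det = (var_a.log().sum(-1)
               - (1 - alpha) * var_p.log().sum(-1)
               - alpha * var_q.log().sum(-1))
    return quad - log_det / (2 * (alpha - 1))

mu_p, var_p = torch.zeros(5), torch.ones(5)
mu_q, var_q = 0.5 * torch.ones(5), 2.0 * torch.ones(5)
print(renyi_gaussians(mu_p, var_p, mu_q, var_q).item())
```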
Poster
Kelsey Allen · Carl Doersch · Guangyao Zhou · Mohammed Suhail · Danny Driess · Ignacio Rocco · Yulia Rubanova · Thomas Kipf · Mehdi S. M. Sajjadi · Kevin Murphy · Joao Carreira · Sjoerd van Steenkiste
[ East Exhibition Hall A-B ]
Abstract
A current limitation of generative video models is that they generate plausible-looking frames but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to not only compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also for evaluating motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives. We also show that by using a point track representation, we can spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and link to the code can be found on the project page: trajan-paper.github.io.
Poster
Lukas Lao Beyer · Tianhong Li · Xinlei Chen · Sertac Karaman · Kaiming He
[ East Exhibition Hall A-B ]
Abstract
Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called *1D* image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations -- such as copying and replacing tokens between latent representations of images -- enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer's latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.
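The "crude manipulation" the abstract mentions really can be this simple once images live in short 1D token sequences. A toy sketch, with random token ids standing in for a real tokenizer's output (the codebook size and sequence length below are illustrative):

```python
import torch

def transfer_tokens(tokens_src, tokens_ref, positions):
    """Crude 1D-token editing (sketch): splice the reference sequence's
    tokens into the source sequence at the given positions; decoding the
    result transfers appearance/semantic attributes between images.
    """
    edited = tokens_src.clone()
    edited[positions] = tokens_ref[positions]
    return edited

# Random ids stand in for a real 32-token, 4096-codebook 1D tokenizer.
src = torch.randint(0, 4096, (32,))
ref = torch.randint(0, 4096, (32,))
print(transfer_tokens(src, ref, [5, 6, 7]))
```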
Poster
Mang Ning · Mingxiao Li · Jianlin Su · Jia Haozhe · Lanmiao Liu · Martin Benes · Wenshuo Chen · Albert Ali Salah · Itir Onal Ertugrul
[ East Exhibition Hall A-B ]
Abstract
This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments on different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models regarding generative quality and training efficiency. Remarkably, DCTdiff can seamlessly scale up to 512$\times$512 resolution without using the latent diffusion paradigm and beats latent diffusion (using SD-VAE) with only 1/4 training cost. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why 'image diffusion can be seen as spectral autoregression', bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is at https://github.com/forever208/DCTdiff.
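The underlying representation is easy to reproduce: a blockwise type-II DCT maps pixels to frequency coefficients and back losslessly. A sketch with `scipy` (the block size and normalization here are illustrative; DCTdiff's exact patching and coefficient handling differ):

```python
import numpy as np
from scipy.fft import dctn, idctn

def block_dct(image, block: int = 8):
    """Map an (H, W) image to its blockwise type-II DCT coefficients,
    the kind of frequency-space representation the paper models."""
    H, W = image.shape
    out = np.zeros_like(image, dtype=np.float64)
    for i in range(0, H, block):
        for j in range(0, W, block):
            out[i:i+block, j:j+block] = dctn(image[i:i+block, j:j+block],
                                             norm="ortho")
    return out

img = np.random.rand(32, 32)
coeffs = block_dct(img)
recon = np.zeros_like(coeffs)
for i in range(0, 32, 8):
    for j in range(0, 32, 8):
        recon[i:i+8, j:j+8] = idctn(coeffs[i:i+8, j:j+8], norm="ortho")
print(np.allclose(img, recon))  # True: the DCT is invertible
```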
Poster
Guangyi Wang · Wei Peng · lijiang Li · Wenyu Chen · Yuren Cai · Song-Zhi Su
[ East Exhibition Hall A-B ]
Abstract
While powerful for generation, Diffusion Probabilistic Models (DPMs) face slow sampling challenges, for which various distillation-based methods have been proposed. However, they typically require significant additional training costs and model parameter storage, limiting their practicality. In this work, we propose **P**CA-based **A**daptive **S**earch (PAS), which optimizes existing solvers for DPMs with minimal additional costs. Specifically, we first employ PCA to obtain a few basis vectors to span the high-dimensional sampling space, which enables us to learn just a set of coordinates to correct the sampling direction; furthermore, based on the observation that the cumulative truncation error exhibits an "S"-shape, we design an adaptive search strategy that further enhances the sampling efficiency and reduces the number of stored parameters to approximately 10. Extensive experiments demonstrate that PAS can significantly enhance existing fast solvers in a plug-and-play manner with negligible costs. E.g., on CIFAR10, PAS optimizes DDIM's FID from 15.69 to 4.37 (NFE=10) using only **12 parameters and sub-minute training** on a single A100 GPU. Code is available at https://github.com/onefly123/PAS.
Poster
Mingyu Kang · Yong Suk Choi
[ East Exhibition Hall A-B ]
Abstract
Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.
Poster
CHUANQI CHENG · Jian Guan · Wei Wu · Rui Yan
[ East Exhibition Hall A-B ]
Abstract
Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLAMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLAMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLAMP's superior performance across five video understanding benchmarks, particularly on long-form content. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance. Code and model are available at https://github.com/steven-ccq/ViLAMP.
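The keyframe criterion (maximize query relevance while keeping frames temporally distinctive) can be sketched as a greedy, MMR-style selection. Everything below, including the trade-off weight `beta`, is illustrative rather than ViLAMP's exact rule:

```python
import torch

def select_keyframes(frame_feats, query_feat, k: int = 8, beta: float = 0.5):
    """Greedy keyframe selection (sketch) balancing query relevance
    against redundancy with already-selected frames.

    frame_feats: [T, D], query_feat: [D], all L2-normalized.
    """
    relevance = frame_feats @ query_feat                    # [T]
    selected = [int(relevance.argmax())]
    for _ in range(k - 1):
        sim_to_sel = (frame_feats @ frame_feats[selected].T).max(dim=1).values
        score = relevance - beta * sim_to_sel               # distinctiveness
        score[selected] = -float("inf")                     # no re-selection
        selected.append(int(score.argmax()))
    return sorted(selected)

feats = torch.nn.functional.normalize(torch.randn(100, 32), dim=1)
query = torch.nn.functional.normalize(torch.randn(32), dim=0)
print(select_keyframes(feats, query))
```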
Poster
Zixiang Ai · Zichen Liu · Yuanhang Lei · Zhenyu Cui · Xu Zou · Jiahuan Zhou
[ East Exhibition Hall A-B ]
Abstract
Pre-trained 3D vision models have gained significant attention for their promising performance on point cloud data. However, fully fine-tuning these models for downstream tasks is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) approaches, which focus primarily on input token prompting, struggle to achieve competitive performance due to their limited ability to capture the geometric information inherent in point clouds. To address this challenge, we propose a novel Geometry-Aware Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the adaptability of 3D vision models. First, we introduce a Point Prompt that serves as an auxiliary input alongside the original point cloud, explicitly guiding the model to capture fine-grained geometric details. Additionally, we present a Point Shift Prompter designed to extract global shape information from the point cloud, enabling instance-specific geometric adjustments at the input level. Moreover, our proposed Prompt Propagation mechanism incorporates the shape information into the model's feature extraction process, further strengthening its ability to capture essential geometric characteristics. Extensive experiments demonstrate that GAPrompt significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks, while utilizing only 2.19% of trainable parameters.
Poster
Xiaoqian Shen · Yunyang Xiong · Changsheng Zhao · Lemeng Wu · Jun Chen · Chenchen Zhu · Zechun Liu · Fanyi Xiao · Balakrishnan Varadarajan · Florian Bordes · Zhuang Liu · Hu Xu · Hyunwoo Kim · Bilge Soran · Raghuraman Krishnamoorthi · Mohamed Elhoseiny · Vikas Chandra
[ East Exhibition Hall A-B ]
Abstract
Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose \textbf{LongVU}, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.
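The first reduction step (dropping frames whose DINOv2 features are nearly identical to what was already kept) can be sketched in a few lines. The fixed threshold and comparing only against the last kept frame are simplifying assumptions:

```python
import torch

def drop_redundant_frames(feats, threshold: float = 0.95):
    """Keep a frame only if its feature (e.g., from DINOv2) differs enough
    from the last kept frame -- a sketch of similarity-based temporal
    reduction; LongVU combines this with query-guided and spatial steps.

    feats: [T, D] per-frame features, L2-normalized.
    """
    kept = [0]
    for t in range(1, feats.shape[0]):
        if torch.dot(feats[t], feats[kept[-1]]) < threshold:
            kept.append(t)
    return kept

feats = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
print(len(drop_redundant_frames(feats)))  # random features are rarely redundant
```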
Poster
Konstantin Donhauser · Kristina Ulicna · Gemma Moran · Aditya Ravuri · Kian Kenyon-Dean · Cian Eastwood · Jason Hartford
[ East Exhibition Hall A-B ]
Abstract
Sparse dictionary learning (DL) has emerged as a powerful approach to extract semantically meaningful concepts from the internals of large language models (LLMs) trained mainly in the text domain. In this work, we explore whether DL can extract meaningful concepts from less human-interpretable scientific data, such as vision foundation models trained on cell microscopy images, where limited prior knowledge exists about which high-level concepts should arise. We propose a novel combination of a sparse DL algorithm, Iterative Codebook Feature Learning (ICFL), with a PCA whitening pre-processing step derived from control data. Using this combined approach, we successfully retrieve biologically meaningful concepts, such as cell types and genetic perturbations. Moreover, we demonstrate how our method reveals subtle morphological changes arising from human-interpretable interventions, offering a promising new direction for scientific discovery via mechanistic interpretability in bioimaging.
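The PCA whitening pre-processing derived from control data is standard and easy to sketch; `control` below stands for embeddings of control (e.g., unperturbed) samples, and the dictionary learning step would then run on the whitened features:

```python
import numpy as np

def pca_whitener(control, eps: float = 1e-5):
    """Fit a PCA whitening transform on control embeddings and return a
    function applying it to new data.

    control: [N, D] embeddings from control samples.
    """
    mean = control.mean(axis=0)
    cov = np.cov(control - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals + eps)   # columns scaled by 1/sqrt(eigval)
    return lambda x: (x - mean) @ W

control = np.random.randn(1000, 16) @ np.random.randn(16, 16)  # correlated toy data
whiten = pca_whitener(control)
print(np.cov(whiten(control), rowvar=False).diagonal().round(2))  # ~all ones
```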
Poster
Puning Yang · Qizhou Wang · Zhuo Huang · Tongliang Liu · Chengqi Zhang · Bo Han
[ East Exhibition Hall A-B ]
Abstract
Loss reweighting has shown significant benefits for machine unlearning with large language models (LLMs). However, their exact functionalities are left unclear and the optimal strategy remains an open question, thus impeding the understanding and improvement of existing methodologies. In this paper, we identify two distinct goals of loss reweighting, namely, Saturation and Importance---the former indicates that those insufficiently optimized data should be emphasized, while the latter stresses some critical data that are most influential for loss minimization. To study their usefulness, we design specific reweighting strategies for each goal and evaluate their respective effects on unlearning. We conduct extensive empirical analyses on well-established benchmarks, and summarize some important observations as follows: (i) Saturation enhances efficacy more than importance-based reweighting, and their combination can yield additional improvements. (ii) Saturation typically allocates lower weights to data with lower likelihoods, whereas importance-based reweighting does the opposite. (iii) The efficacy of unlearning is also largely influenced by the smoothness and granularity of the weight distributions. Based on these findings, we propose SatImp, a simple reweighting method that combines the advantages of both saturation and importance. Empirical results on extensive datasets validate the efficacy of our method, potentially bridging existing research gaps and indicating directions for future research. Our code is …
Poster
Da Xiao · Qingye Meng · Shengping Li · xingyuan yuan
[ East Exhibition Hall A-B ]
Abstract
We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, matching the performance of Transformers trained with ~1.8x–2.4x the compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining ppl and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer.
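A toy sketch of the dynamic dense connection idea: per-position weights over the stack of previous layers are generated from hidden states and used to mix the stack. MUDD does this separately for the query, key, value, and residual streams; this single-stream version is only illustrative:

```python
import torch

def dynamic_dense_aggregate(hiddens, Wg):
    """Dynamic dense aggregation (sketch): per-position connection weights
    over all previous layers are generated from the current hidden state,
    then used to mix the layer stack.

    hiddens: list of L tensors [T, D]; Wg: [D, L] toy weight generator.
    """
    stack = torch.stack(hiddens)                    # [L, T, D]
    gates = torch.softmax(stack[-1] @ Wg, dim=-1)   # [T, L], from top layer
    return torch.einsum("tl,ltd->td", gates, stack)

hiddens = [torch.randn(10, 64) for _ in range(4)]
print(dynamic_dense_aggregate(hiddens, torch.randn(64, 4)).shape)  # [10, 64]
```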
Spotlight Poster
Daoyuan Chen · Haibin Wang · Yilun Huang · Ce Ge · Yaliang Li · Bolin Ding · Jingren Zhou
[ East Exhibition Hall A-B ]
Abstract
The emergence of multimodal large models has advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to historically isolated paths of model-centric and data-centric developments, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective iteration and guided refinement of both data and models. Our proposed "Probe-Analyze-Refine" workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as topping the VBench leaderboard. A comprehensive set of over 100 experiments demonstrated the suite's usability and extensibility, while also uncovering insights into the interplay between data quality, diversity, model behavior, and computational costs. All codes, datasets, and models are open-sourced to foster future research and applications that would otherwise be infeasible due to the lack of a dedicated co-development infrastructure.
Poster
Emanuel Ben Baruch · Adam Botach · Igor Kviatkovsky · Manoj Aggarwal · Gerard Medioni
[ East Exhibition Hall A-B ]
Abstract
With the increasing size of datasets used for training neural networks, data pruning has gained traction in recent years. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and on all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images …
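The training objective on the pruned subset is the standard KD blend of ground-truth cross-entropy and teacher distillation; the sketch below is generic, with `alpha` playing the role of the distillation weight that the paper ties to the pruning factor:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels,
            alpha: float = 0.5, T: float = 4.0):
    """Cross-entropy on ground truth blended with distillation from a
    teacher pre-trained on the full data (standard KD objective)."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return (1 - alpha) * ce + alpha * kd

s, t = torch.randn(16, 10), torch.randn(16, 10)
y = torch.randint(0, 10, (16,))
print(kd_loss(s, t, y).item())
```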
Poster
Dahun Shin · Dongyeop Lee · Jinseok Chung · Namhoon Lee
[ East Exhibition Hall A-B ]
Abstract
Approximate second-order optimization methods often exhibit poorer generalization compared to first-order approaches. In this work, we look into this issue through the lens of the loss landscape and find that existing second-order methods tend to converge to sharper minima compared to SGD. In response, we propose Sassha, a novel second-order method designed to enhance generalization by explicitly reducing the sharpness of the solution, while stabilizing the computation of approximate Hessians along the optimization trajectory. In fact, this sharpness minimization scheme is also crafted to accommodate lazy Hessian updates, so as to secure efficiency besides flatness. To validate its effectiveness, we conduct a wide range of standard deep learning experiments where Sassha demonstrates its outstanding generalization performance that is comparable to, and mostly better than, other methods. We provide a comprehensive set of analyses including convergence, robustness, stability, efficiency, and cost.
Poster
Kyle Richardson · Vivek Srikumar · Ashish Sabharwal
[ East Exhibition Hall A-B ]
Abstract
Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic program that characterizes its semantics? We propose a novel formalism for characterizing preference losses for single model and reference model based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will help provide useful …
Poster
Jerry Yao-Chieh Hu · Bo-Yu Chen · Dennis Wu · Feng Ruan · Han Liu
[ East Exhibition Hall A-B ]
Abstract
We present a nonparametric interpretation for deep learning compatible modern Hopfield models and utilize this new perspective to debut efficient variants. Our key contribution stems from interpreting the memory storage and retrieval processes in modern Hopfield models as a nonparametric regression problem subject to a set of query-memory pairs. Interestingly, our framework not only recovers the known results from the original dense modern Hopfield model but also fills the void in the literature regarding efficient modern Hopfield models, by introducing *sparse-structured* modern Hopfield models with sub-quadratic complexity. We establish that this sparse model inherits the appealing theoretical properties of its dense analogue: connection with transformer attention, fixed point convergence, and exponential memory capacity. Additionally, we showcase the versatility of our framework by constructing a family of modern Hopfield models as extensions, including linear, random masked, top-$K$ and positive random feature modern Hopfield models. Empirically, we validate our framework in both synthetic and realistic settings for memory retrieval and learning tasks.
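For context, the dense modern Hopfield retrieval that the sparse variants build on is a fixed-point iteration of an attention-style update. A minimal sketch (the sparse-structured models would replace the dense softmax with a sparse alternative):

```python
import torch

def hopfield_retrieve(queries, memories, beta: float = 1.0, steps: int = 3):
    """Dense modern Hopfield retrieval: iterate the attention-style update
    Q <- softmax(beta * Q M^T) M until (approximate) fixed point.

    queries: [B, D], memories: [N, D].
    """
    for _ in range(steps):
        queries = torch.softmax(beta * queries @ memories.T, dim=-1) @ memories
    return queries

M = torch.randn(32, 16)                      # stored patterns
noisy = M[:4] + 0.1 * torch.randn(4, 16)     # corrupted probes
retrieved = hopfield_retrieve(noisy, M, beta=8.0)
print((retrieved - M[:4]).norm(dim=1))       # small: patterns recovered
```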
Poster
JUNHAO HU · Wenrui Huang · Weidong Wang · Haoyi Wang · tiancheng hu · zhang qin · Hao Feng · Xusheng Chen · Yizhou Shan · Tao Xie
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse cases in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of the KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate “attention sink” effect at every document beginning, to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8× improvements in Time-To-First-Token (TTFT) and 7× throughput gains over existing systems, with negligible or no accuracy loss.
Poster
Tzu-Tao (Tommy) Chang · Shivaram Venkataraman
[ East Exhibition Hall A-B ]
Abstract
Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs, the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique to support longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with Llama 3-V, mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 10.62$\times$ end-to-end speedup compared to existing approaches.
Poster
Peng Jin · Bo Zhu · Li Yuan · Shuicheng YAN
[ East Exhibition Hall A-B ]
Abstract
In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to reduce computational costs while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%–90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a …
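A sketch of the mechanism: per-head outputs are projected and then combined with router weights restricted to the top-k heads per token, turning the plain head summation into a sparse weighted one. All shapes and the router form below are illustrative, not the paper's implementation:

```python
import torch

def moh_attention(x, Wq, Wk, Wv, Wo, router, top_k: int = 4):
    """Mixture-of-Head attention (sketch): each token routes to its top-k
    heads and combines per-head outputs with a weighted (not plain) sum.

    x: [T, D]; Wq/Wk/Wv: [H, D, Dh]; Wo: [H, Dh, D]; router: [D, H].
    """
    H, D, Dh = Wq.shape
    q = torch.einsum("td,hde->hte", x, Wq)
    k = torch.einsum("td,hde->hte", x, Wk)
    v = torch.einsum("td,hde->hte", x, Wv)
    attn = torch.softmax(q @ k.transpose(-2, -1) / Dh ** 0.5, dim=-1)
    heads = torch.einsum("hte,hed->htd", attn @ v, Wo)  # per-head outputs in D
    gate = torch.softmax(x @ router, dim=-1)            # [T, H] routing scores
    topv, topi = gate.topk(top_k, dim=-1)
    sparse_gate = torch.zeros_like(gate).scatter(-1, topi, topv)
    return torch.einsum("th,htd->td", sparse_gate, heads)

T, D, H, Dh = 10, 64, 8, 16
out = moh_attention(torch.randn(T, D), torch.randn(H, D, Dh),
                    torch.randn(H, D, Dh), torch.randn(H, D, Dh),
                    torch.randn(H, Dh, D), torch.randn(D, H))
print(out.shape)  # torch.Size([10, 64])
```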
Poster
Xinyi Wu · Yifei Wang · Stefanie Jegelka · Ali Jadbabaie
[ East Exhibition Hall A-B ]
Abstract
Recent studies have revealed various manifestations of position bias in transformer architectures, from the "lost-in-the-middle" phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper presents a graph-theoretic framework for analyzing position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights: First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers—coupled with the causal mask—leads to a trade-off between the long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and …
Poster
Ashkan Shahbazi · Elaheh Akbari · Darian Salehi · XINRAN LIU · Navid NaderiAlizadeh · Soheil Kolouri
[ East Exhibition Hall A-B ]
Abstract
While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces doubly stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications. Our implementation code can be found at https://github.com/dariansal/ESPFormer.
Poster
Yilun Kuang · Noah Amsel · Sanae Lotfi · Shikai Qiu · Andres Potapczynski · Andrew Wilson
[ East Exhibition Hall A-B ]
Abstract
The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a distance-dependent compute bias for neighboring tokens in the sequence. In this work, we address these shortcomings by proposing new scoring functions based on computationally efficient structured matrices with high ranks, including Block Tensor-Train (BTT) and Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality patterns, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or distance-dependent compute biases, thereby addressing significant shortcomings of standard attention.
Poster
Zichen Liu · Xu Zou · Gang Hua · Jiahuan Zhou
[ East Exhibition Hall A-B ]
Abstract
Visual prompting techniques are widely used to efficiently fine-tune pretrained Vision Transformers (ViT) by learning a small set of shared prompts for all tokens. However, existing methods overlook the unique roles of different tokens in conveying discriminative information and interact with all tokens using the same prompts, thereby limiting the representational capacity of ViT. This often leads to indistinguishable and biased prompt-extracted features, hindering performance. To address this issue, we propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interactions. Firstly, recognizing the distinct functions of CLS and image tokens (global information aggregation and local feature extraction, respectively), we disentangle the prompts into CLS Prompts and Image Prompts, which interact exclusively with CLS tokens and image tokens through attention mechanisms. This enhances their respective discriminative abilities. Furthermore, as different image tokens correspond to distinct image patches and contain diverse information, we employ a matching function to automatically assign coordinated prompts to individual tokens. This enables more precise attention interactions, improving the diversity and representational capacity of the extracted features. Extensive experiments across various benchmarks demonstrate that TCPA significantly enhances the diversity and discriminative power of the extracted features.
Poster
Jerome Garnier-Brun · Marc Mezard · Emanuele Moscato · Luca Saglietti
[ East Exhibition Hall A-B ]
Abstract
Understanding the learning process and the embedded computation in transformers is becoming a central goal for the development of interpretable AI. In the present study, we introduce a hierarchical filtering procedure for data models of sequences on trees, allowing us to hand-tune the range of positional correlations in the data. Leveraging this controlled setting, we provide evidence that vanilla encoder-only transformers can approximate the exact inference algorithm when trained on root classification and masked language modeling tasks, and study *how* this computation is discovered and implemented. We find that correlations at larger distances, corresponding to increasing layers of the hierarchy, are sequentially included by the network during training. By comparing attention maps from models trained with varying degrees of filtering and by probing the different encoder levels, we find clear evidence of a reconstruction of correlations on successive length scales corresponding to the various levels of the hierarchy, which we relate to a plausible implementation of the exact inference algorithm within the same architecture.
Poster
Jintao Zhang · Chendong Xiang · Haofeng Huang · Jia wei · Haocheng Xi · Jun Zhu · Jianfei Chen
[ East Exhibition Hall A-B ]
Abstract
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics.
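A simplified sketch of the first-stage idea: score coarse block summaries of queries and keys to predict which parts of the attention map matter, then run exact attention only on the kept key blocks. SpargeAttn adds a second softmax-aware filter and quantization, both omitted here, and its actual prediction rule differs:

```python
import torch

def block_sparse_attention(q, k, v, block: int = 64, keep: float = 0.3):
    """Two-stage sparsity sketch: cheap block-mean scores pick the key
    blocks each query block attends to; exact attention runs inside them.

    q, k, v: [T, D] with T divisible by `block`.
    """
    T, D = q.shape
    nb = T // block
    qb = q.view(nb, block, D).mean(dim=1)          # [nb, D] block summaries
    kb = k.view(nb, block, D).mean(dim=1)
    block_scores = qb @ kb.T                       # cheap attention-map proxy
    k_keep = max(1, int(keep * nb))
    keep_idx = block_scores.topk(k_keep, dim=-1).indices   # [nb, k_keep]
    out = torch.zeros_like(q)
    for i in range(nb):
        cols = (keep_idx[i][:, None] * block + torch.arange(block)).flatten()
        qi = q[i * block:(i + 1) * block]
        attn = torch.softmax(qi @ k[cols].T / D ** 0.5, dim=-1)
        out[i * block:(i + 1) * block] = attn @ v[cols]
    return out

q = k = v = torch.randn(256, 32)
print(block_sparse_attention(q, k, v).shape)  # torch.Size([256, 32])
```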
Spotlight Poster
Connor Schenck · Isaac Reid · Mithun Jacob · Alex Bewley · Joshua Ainslie · David Rendleman · Deepali Jain · Mohit Sharma · Kumar Avinava Dubey · Ayzaan Wahid · Sumeet Singh · René Wagner · Tianli Ding · Chuyuan Fu · Arunkumar Byravan · Jacob J Varley · Alexey Gritsenko · Matthias Minderer · Dmitry Kalashnikov · Jonathan Tompson · Vikas Sindhwani · Krzysztof Choromanski
[ East Exhibition Hall A-B ]
Abstract
We introduce $\textbf{STRING}$: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides $\textbf{exact}$ translation invariance, including for token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods. Videos of STRING-based robotics controllers can be found here: https://sites.google.com/view/string-robotics.
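A minimal sketch of the rotary-style encoding that STRING generalizes, with a numerical check of exact translation invariance: shifting every position by a constant leaves all query-key dot products unchanged. The frequency schedule and shapes are illustrative assumptions.

```python
import torch

def rotary_encode(x, pos, base=10000.0):
    """Minimal RoPE-style encoding: rotate feature pairs by
    position-dependent angles. Illustrative only."""
    d = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d, dtype=torch.float32) / d)  # (d,)
    ang = pos[:, None] * freqs[None, :]                          # (N, d)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Translation invariance of attention logits: shifting all positions by a
# constant leaves q.k dot products unchanged, since they depend only on
# relative position differences.
q, k = torch.randn(5, 8), torch.randn(5, 8)
pos = torch.arange(5, dtype=torch.float32)
logits = rotary_encode(q, pos) @ rotary_encode(k, pos).T
shifted = rotary_encode(q, pos + 13.0) @ rotary_encode(k, pos + 13.0).T
print(torch.allclose(logits, shifted, atol=1e-4))  # True
```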
Poster
Taekyung Lee · Jaemoo Choi · Jaewoong Choi · Myungjoo Kang
[ East Exhibition Hall A-B ]
Abstract
Unpaired point cloud completion is crucial for real-world applications, where ground-truth data for complete point clouds are often unavailable. By learning a completion map from unpaired incomplete and complete point cloud data, this task avoids reliance on paired datasets. In this paper, we propose the \textit{Unbalanced Optimal Transport Map for Unpaired Point Cloud Completion (\textbf{UOT-UPC})} model, which formulates the unpaired completion task as an Unbalanced Optimal Transport (UOT) problem. Our method learns the UOT map with a neural OT model. It is the first attempt to leverage UOT for unpaired point cloud completion, achieving competitive or superior performance on both single-category and multi-category benchmarks. In particular, our approach is especially robust to the class imbalance problem, which is frequently encountered in real-world unpaired point cloud completion scenarios.
Poster
Lianbo Ma · Jianlun Ma · Yuee Zhou · Guoyang Xie · Qiang He · Zhichao Lu
[ East Exhibition Hall A-B ]
Abstract
Mixed Precision Quantization (MPQ) has become an essential technique for optimizing neural networks by determining the optimal bitwidth per layer. Existing MPQ methods, however, face a major hurdle: they require a computationally expensive search for quantization strategies on large-scale datasets. To resolve this issue, we introduce a novel approach that first searches for quantization strategies on small datasets and then generalizes them to large-scale datasets. This approach simplifies the process, eliminating the need for large-scale quantization fine-tuning and requiring only model weight adjustment. Our method is characterized by three key techniques: sharpness-aware minimization for enhanced quantized-model generalization, implicit gradient direction alignment to handle gradient conflicts among different optimization objectives, and an adaptive perturbation radius to accelerate optimization. It offers advantages such as no intricate computation of feature maps and high search efficiency. Both theoretical analysis and experimental results validate our approach. Using the CIFAR10 dataset (just 0.5\% of the size of the ImageNet training data) for MPQ policy search, we achieved equivalent accuracy on ImageNet with a significantly lower computational cost, while improving efficiency by up to 150\% over the baselines.
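Of the three techniques, sharpness-aware minimization is the most self-contained; the sketch below shows one standard SAM step (ascend along the gradient within radius `rho`, then descend from the perturbed point) as a rough stand-in for the paper's variant, whose adaptive perturbation radius would make `rho` dynamic. Names and constants are illustrative.

```python
import torch

def sam_step(model, loss_fn, x, y, opt, rho=0.05):
    """One sharpness-aware minimization (SAM) step: minimal sketch."""
    # 1) Ascent: perturb weights toward the locally sharpest direction.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        grad_norm = torch.norm(torch.stack(
            [p.grad.norm() for p in model.parameters() if p.grad is not None]))
        eps = {}
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps[p] = e
    model.zero_grad()
    # 2) Descent: gradient at the perturbed point, applied to the
    # restored original weights.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)                      # restore original weights
    opt.step()
    opt.zero_grad()

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
sam_step(model, torch.nn.functional.cross_entropy, x, y, opt)
```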
Poster
Jan Blechschmidt · Tom-Christian Riemer · Max Winkler · Martin STOLL · Jan-Frederik Pietschmann
[ East Exhibition Hall A-B ]
Abstract
We develop a novel physics-informed deep learning approach for solving nonlinear drift-diffusion equations on metric graphs. These equations constitute an important model class with a large number of applications in areas ranging from transport in biological cells to the motion of human crowds. While traditional numerical schemes require a large amount of tailoring, especially in the case of model design or parameter identification problems, physics-informed deep operator networks (DeepONets) have emerged as a versatile tool for the solution of partial differential equations, with the particular advantage that they easily incorporate parameter identification questions. We present an approach where we first learn three DeepONet models for representative inflow, inner, and outflow edges, respectively, and then couple these models to solve the drift-diffusion metric graph problem via an edge-based domain decomposition approach. We illustrate that our framework is applicable to the accurate evaluation of graph-coupled physics models and is well suited for solving optimization or inverse problems on these coupled networks.
Spotlight Poster
Chi Zhang · REN Lianhai · Jingpu Cheng · Qianxiao Li
[ East Exhibition Hall A-B ]
Abstract
The LoRA method has achieved notable success in reducing GPU memory usage by applying low-rank updates to weight matrices. Yet, one simple question remains: can we push this reduction even further? Furthermore, is it possible to achieve this while improving performance and reducing computation time? Answering these questions requires moving beyond the conventional weight-centric approach. In this paper, we present a state-based fine-tuning framework that shifts the focus from weight adaptation to optimizing forward states, with LoRA acting as a special case. Specifically, state-based tuning introduces parameterized perturbations to the states within the computational graph, allowing us to control states across an entire residual block. A key advantage of this approach is the potential to avoid storing large intermediate states in models like transformers. Empirical results across multiple architectures, including ViT, RoBERTa, LLaMA2-7B, and LLaMA3-8B, show that our method further reduces memory consumption and computation time while simultaneously improving performance. Moreover, as a result of the memory reduction, we explore the feasibility of training 7B/8B models on consumer-level GPUs such as the Nvidia 3090, without model quantization. The code is available at an anonymous GitHub repository.
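A minimal sketch of the state-based idea: keep the pretrained block frozen and learn a small perturbation applied to its output state rather than to its weights. The low-rank parameterization and zero initialization here are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class StatePerturbedBlock(nn.Module):
    """Sketch of state-based tuning: the frozen block is untouched and a
    small learned perturbation is added to its output state."""

    def __init__(self, block: nn.Module, dim: int, rank: int = 4):
        super().__init__()
        self.block = block
        for p in self.block.parameters():   # freeze the pretrained weights
            p.requires_grad_(False)
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)      # start as an identity perturbation

    def forward(self, x):
        h = self.block(x)
        return h + self.up(self.down(h))    # perturb the state, not the weights

frozen = nn.Sequential(nn.Linear(32, 32), nn.GELU(), nn.Linear(32, 32))
tuned = StatePerturbedBlock(frozen, dim=32)
print(sum(p.numel() for p in tuned.parameters() if p.requires_grad))  # 256
```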
Poster
Naz Sepahvand · Anvith Thudi · Berivan Isik · Ashmita Bhattacharyya · Nicolas Papernot · Eleni Triantafillou · Daniel Roy · Gintare Karolina Dziugaite
[ East Exhibition Hall A-B ]
Abstract
We present a principled, per-instance approach to quantifying the difficulty of unlearning via fine-tuning. We begin by sharpening an analysis of noisy gradient descent for unlearning (Chien et al., 2024), obtaining a better utility–unlearning trade-off by replacing worst-case privacy loss bounds with per-instance privacy losses (Thudi et al., 2024), each of which bounds the (Rényi) divergence to retraining without an individual datapoint. To demonstrate the practical applicability of our theory, we present empirical results showing that our theoretical predictions are borne out both for Stochastic Gradient Langevin Dynamics (SGLD) and for standard fine-tuning without explicit noise. We further demonstrate that per-instance privacy losses correlate well with several existing data difficulty metrics, while also identifying harder groups of data points, and introduce novel evaluation methods based on loss barriers. Altogether, our findings provide a foundation for more efficient and adaptive unlearning strategies tailored to the unique properties of individual data points.
Poster
Fangwen Wu · Lechao Cheng · Shengeng Tang · Xiaofeng Zhu · Chaowei Fang · Dingwen Zhang · Meng Wang
[ East Exhibition Hall A-B ]
Abstract
Class-incremental learning (CIL) seeks to enable a model to sequentially learn new classes while retaining knowledge of previously learned ones. Balancing flexibility and stability remains a significant challenge, particularly when the task ID is unknown. To address this, our study reveals that the gap in feature distribution between novel and existing tasks is primarily driven by differences in mean and covariance moments. Building on this insight, we propose a novel semantic drift calibration method that incorporates mean shift compensation and covariance calibration. Specifically, we calculate each class's mean by averaging its sample embeddings and estimate task shifts using weighted embedding changes based on their proximity to the previous mean, effectively capturing mean shifts for all learned classes with each new task. We also apply a Mahalanobis distance constraint for covariance calibration, aligning class-specific embedding covariances between the old and current networks to mitigate covariance shift. Additionally, we integrate a feature-level self-distillation approach to enhance generalization. Comprehensive experiments on commonly used datasets demonstrate the effectiveness of our approach. The source code is available at https://github.com/fwu11/MACIL.git.
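A small NumPy sketch of the mean-shift compensation step as described: each old class mean is moved by a weighted average of current-task embedding changes, with weights given by proximity to that mean. The Gaussian weighting and bandwidth are illustrative assumptions.

```python
import numpy as np

def compensate_class_means(old_means, emb_before, emb_after, sigma=5.0):
    """Move each old class mean by a proximity-weighted average of the
    embedding changes of current-task samples."""
    deltas = emb_after - emb_before                      # per-sample drift (N, D)
    shifted = []
    for mu in old_means:
        d2 = ((emb_before - mu) ** 2).sum(axis=1)        # squared distance to mu
        w = np.exp(-d2 / (2 * sigma ** 2))               # Gaussian proximity weights
        w /= w.sum() + 1e-12
        shifted.append(mu + (w[:, None] * deltas).sum(axis=0))
    return np.stack(shifted)

rng = np.random.default_rng(0)
before = rng.normal(size=(100, 16))
after = before + 0.1                                     # uniform synthetic drift
means = rng.normal(size=(5, 16))
new_means = compensate_class_means(means, before, after)
print(np.allclose(new_means - means, 0.1, atol=1e-4))    # True: drift recovered
```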
Poster
Roman Klypa · Alberto Bietti · Sergei Grudinin
[ East Exhibition Hall A-B ]
Abstract
Designing RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Existing computational approaches require a substantial amount of experimentally determined RNA sequences for each specific protein or a detailed knowledge of RNA structure, restricting their utility in practice. To address this limitation, we develop RNA-BAnG, a deep learning-based model designed to generate RNA sequences for protein interactions without these requirements. Central to our approach is a novel generative method, Bidirectional Anchored Generation (BAnG), which leverages the observation that protein-binding RNA sequences often contain functional binding motifs embedded within broader sequence contexts. We first validate our method on generic synthetic tasks involving similar localized motifs to those appearing in RNAs, demonstrating its benefits over existing generative approaches. We then evaluate our model on biological sequences, showing its effectiveness for conditional RNA sequence design given a binding protein.
Poster
Huaicheng Zhou · Zifeng Zhuang · Donglin Wang
[ East Exhibition Hall A-B ]
Abstract
The integration of Deep Neural Networks in Reinforcement Learning (RL) systems has led to remarkable progress in solving complex tasks but also introduced challenges like primacy bias and dead neurons. Primacy bias skews learning towards early experiences, while dead neurons diminish the network's capacity to acquire new knowledge. Traditional reset mechanisms aimed at addressing these issues often involve maintaining large replay buffers to train new networks or selectively resetting subsets of neurons. However, these approaches either incur prohibitive computational costs or reset network parameters without ensuring stability through recovery mechanisms, ultimately impairing learning efficiency. In this work, we introduce the novel concept of neuron regeneration, which combines reset mechanisms with knowledge recovery techniques. We also propose a new framework called Sustainable Backup Propagation (SBP) that effectively maintains plasticity in neural networks through this neuron regeneration process. The SBP framework achieves whole-network neuron regeneration through two key procedures: cycle reset and inner distillation. Cycle reset involves a scheduled renewal of neurons, while inner distillation functions as a knowledge recovery mechanism at the neuron level. To validate our framework, we integrate SBP with Proximal Policy Optimization (PPO) and propose a novel distillation function for inner distillation. This integration results in Plastic PPO …
Poster
Xuekai Zhu · Daixuan Cheng · Hengli Li · Kaiyan Zhang · Ermo Hua · Xingtai Lv · Ning Ding · Zhouhan Lin · Zilong Zheng · Bowen Zhou
[ East Exhibition Hall A-B ]
Abstract
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-$\{n\}$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data, uncovering a distributional shift phenomenon and an over-concentration of n-gram features. Inspired by these findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.
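A toy sketch of token-level editing on human-produced data: only positions where a (stand-in) language model is over-confident in the observed token are resampled, leaving the rest of the human text intact. The confidence threshold and the toy model are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
V, N = 50, 20                                  # toy vocabulary and sequence length
tokens = torch.randint(0, V, (N,))             # "human-produced" token sequence

def toy_lm_probs(tokens):
    """Stand-in for a language model: per-position distributions that are
    over-confident in the observed tokens (mimicking memorization)."""
    logits = torch.randn(len(tokens), V)
    logits[torch.arange(len(tokens)), tokens] += 5.0
    return logits.softmax(dim=-1)

def token_level_edit(tokens, probs, p_thresh=0.5):
    """Resample only over-concentrated positions; all other human-written
    tokens are kept (resampling may keep the same token by chance)."""
    edited = tokens.clone()
    conf = probs[torch.arange(len(tokens)), tokens]
    for i in torch.nonzero(conf > p_thresh).flatten():
        edited[i] = torch.multinomial(probs[i], 1)
    return edited

edited = token_level_edit(tokens, toy_lm_probs(tokens))
print(f"{(edited != tokens).sum().item()} of {N} tokens edited")
```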
Poster
Lexiang Hu · Yikang Li · Zhouchen Lin
[ East Exhibition Hall A-B ]
Abstract
Symmetry is widely applied in problems such as the design of equivariant networks and the discovery of governing equations, but in complex scenarios it is not known in advance. Most previous symmetry discovery methods are limited to linear symmetries, and recent attempts to discover nonlinear symmetries fail to explicitly obtain the Lie algebra subspace. In this paper, we propose LieNLSD, which is, to our knowledge, the first method capable of determining the number of infinitesimal generators with nonlinear terms and their explicit expressions. We specify a function library for the infinitesimal group action and aim to solve for its coefficient matrix, proving that its prolongation formula for differential equations, which govern dynamic data, is also linear with respect to the coefficient matrix. By substituting the central differences of the data and the Jacobian matrix of the trained neural network into the infinitesimal criterion, we obtain a system of linear equations for the coefficient matrix, which can then be solved using SVD. On top quark tagging and a series of dynamic systems, LieNLSD shows qualitative advantages over existing methods and improves the long rollout accuracy of neural PDE solvers by over $20\%$ when applied to guide data augmentation. Code and data …
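The final solve reduces to linear algebra, as in this small NumPy sketch: the infinitesimal criterion yields a homogeneous linear system for the coefficient matrix, and the SVD null space gives both the number of independent generators and their explicit coefficients. The toy system below is an illustrative stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the linear system derived from the infinitesimal
# criterion: A @ c = 0, where c holds the library coefficients. Two library
# directions are unconstrained, so the null space is two-dimensional.
A = rng.normal(size=(100, 6)) @ np.diag([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])

U, S, Vt = np.linalg.svd(A)
generators = Vt[S < 1e-8 * S[0]]   # null-space basis: one row per generator
print(f"{len(generators)} independent infinitesimal generators found")  # 2
```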
Spotlight Poster
Hao Li · Qi Lv · Rui Shao · Xiang Deng · Yinchuan Li · Jianye Hao · Liqiang Nie
[ East Exhibition Hall A-B ]
Abstract
Transforming complex actions into discrete skill abstractions has demonstrated strong potential for robotic manipulation. Existing approaches mainly leverage latent variable models, e.g., VQ-VAE, to learn skill abstractions through learned vectors (codebooks), but they suffer from codebook collapse and struggle to model the causal relationship between learned skills. To address these limitations, we present **S**kill **T**raining with **A**ugmented **R**otation (**STAR**), a framework that advances both skill learning and composition to complete complex behaviors. Specifically, to prevent codebook collapse, we devise rotation-augmented residual skill quantization (RaRSQ). It encodes relative angles between encoder outputs into the gradient flow via a rotation-based gradient mechanism. Points within the same skill code are forced to be either pushed apart or pulled closer together depending on gradient directions. Further, to capture the causal relationship between skills, we present the causal skill transformer (CST), which explicitly models dependencies between skill representations through an autoregressive mechanism for coherent action generation. Extensive experiments demonstrate the superiority of STAR on both the LIBERO benchmark and real-world tasks, with around 12% improvement over the baselines.
Poster
Yaofo Chen · Zeng You · Shuhai Zhang · Haokun Li · Yirui Li · Yaowei Wang · Mingkui Tan
[ East Exhibition Hall A-B ]
Abstract
Transformer-based Large Language Models (LLMs) have exhibited remarkable success in extensive tasks, primarily attributed to the self-attention mechanism, which requires each token to consider all preceding tokens as its context when computing attention. However, when the context length $L$ becomes very large (e.g., 128K), the amount of potentially redundant information in the context tends to increase. The redundant context not only hampers representation performance but also incurs unnecessary computational and storage overhead. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling, comprising two complementary modules: 1) a globality-aware pooling module, which groups input tokens and dynamically compresses each group into one core token based on their significance. In this way, our method automatically focuses on and strengthens the core context while diminishing redundancy during the learning process, leading to effective long-term dependency modeling. 2) A locality-preserving module, which incorporates neighboring tokens to preserve local context for detailed representation. Notably, our CCA-Attention is able to replace the self-attention module in existing LLMs with minimal fine-tuning cost. Extensive experimental results show the superiority of our method in both long-context modeling and computational efficiency over state-of-the-art methods.
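A minimal sketch of the globality-aware pooling idea: the sequence is split into groups and each group is compressed into one core token by significance-weighted pooling. The norm-based significance score here is an illustrative stand-in for the learned scoring.

```python
import torch

def cca_style_pooling(x, group=8):
    """Compress each group of tokens into one 'core token' by
    significance-weighted pooling (norm used as a significance proxy)."""
    B, N, D = x.shape
    g = x.view(B, N // group, group, D)
    sig = g.norm(dim=-1, keepdim=True)            # per-token significance proxy
    w = sig.softmax(dim=2)                        # weights within each group
    core = (w * g).sum(dim=2)                     # (B, N // group, D) core tokens
    return core

x = torch.randn(2, 128, 64)
print(cca_style_pooling(x).shape)  # torch.Size([2, 16, 64])
```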
Poster
Guoqing Zhang · Shichao Kan · Fanghui Zhang · Wanru Xu · Yue Zhang · Yigang Cen
[ East Exhibition Hall A-B ]
Abstract
Scene Graph Generation (SGG) is a fundamental task in visual understanding, aimed at providing more precise local detail comprehension for downstream applications. Existing SGG methods often overlook the diversity of predicate representations and the consistency among similar predicates when dealing with long-tail distributions. As a result, the model's decision layer fails to effectively capture details from the tail end, leading to biased predictions. To address this, we propose a Noise-Guided Predicate Representation Extraction and Diffusion-Enhanced Discretization (NoDIS) method. On the one hand, expanding the predicate representation space enhances the model's ability to learn both common and rare predicates, thus reducing prediction bias caused by data scarcity. We propose a conditional diffusion model to reconstruct features and increase the diversity of representations for same-category predicates. On the other hand, independent predicate representations in the decision phase increase the learning complexity of the decision layer, making accurate predictions more challenging. To address this issue, we introduce a discretization mapper that learns consistent representations among similar predicates, reducing the learning difficulty and decision ambiguity in the decision layer. To validate the effectiveness of our method, we integrate NoDIS with various SGG baseline models and conduct experiments on multiple datasets. The results consistently …
Spotlight Poster
Andrew Wilson
[ East Exhibition Hall A-B ]
Abstract
Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. This position paper argues that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour can be intuitively understood, and rigorously characterized, using long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem. However, we also highlight how deep learning is relatively distinct in other ways, such as its ability for representation learning, phenomena such as mode connectivity, and its relative universality.
Poster
Jing Yang
[ East Exhibition Hall A-B ]
Abstract
The rapid growth of submissions to top-tier Artificial Intelligence (AI) and Machine Learning (ML) conferences has prompted many venues to transition from closed to open review platforms. Some have fully embraced open peer reviews, allowing public visibility throughout the process, while others adopt hybrid approaches, such as releasing reviews only after final decisions or keeping reviews private despite using open peer review systems. In this work, we analyze the strengths and limitations of these models, highlighting the growing community interest in transparent peer review. To support this discussion, we examine insights from Paper Copilot ([papercopilot.com](https://papercopilot.com/)), a website launched two years ago to aggregate and analyze AI/ML conference data while engaging a global audience. The site has attracted over 200,000 early-career researchers, particularly those aged 18–34 from 177 countries, many of whom are actively engaged in the peer review process. Drawing on our findings, this position paper advocates for a more transparent, open, and well-regulated peer review process, aiming to foster greater community involvement and propel advancements in the field.
Poster
Sayash Kapoor · Noam Kolt · Seth Lazar
[ East Exhibition Hall A-B ]
Abstract
Language model agents are poised to mediate how people navigate and act online. If the companies that already dominate internet search, communication, and commerce—or the firms trying to unseat them—control these agents, the resulting *platform agents* will likely deepen surveillance, tighten lock-in, and further entrench incumbents. To resist that trajectory, this position paper argues that we should promote *agent advocates*: user-controlled agents that safeguard individual autonomy and choice. Doing so demands three coordinated moves: broad public access to both compute and capable AI models that are not platform-owned, open interoperability and safety standards, and market regulation that prevents platforms from foreclosing competition.
Poster
Jan Kulveit · Raymond Douglas · Nora Ammann · Deger Turan · David Krueger · David Duvenaud
[ East Exhibition Hall A-B ]
Abstract
This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of `gradual disempowerment', in contrast to the abrupt takeover scenarios commonly discussed in AI safety. We analyze how even incremental improvements in AI capabilities can undermine human influence over large-scale systems that society depends on, including the economy, culture, and nation-states. As AI increasingly replaces human labor and cognition in these domains, it can weaken both explicit human control mechanisms (like voting and consumer choice) and the implicit alignments with human preferences that often arise from societal systems' reliance on human participation to function. Furthermore, AI systems may amplify existing misalignments with human preferences by optimizing these systems more powerfully. These distortions across domains may be mutually reinforcing: economic power shapes cultural narratives and political decisions, while cultural shifts alter economic and political behavior. We argue that this dynamic could lead to an effectively irreversible loss of human influence over crucial societal systems, precipitating an existential catastrophe through the permanent disempowerment of humanity. This analysis suggests the need for both technical research and governance approaches that specifically address the risk of incremental erosion of human influence across interconnected societal systems.
Poster
Rashid Mushkani · Hugo Berard · Allison Cohen · Shin Koseki
[ East Exhibition Hall A-B ]
Abstract
This position paper proposes a “Right to AI,” which asserts that individuals and communities should meaningfully participate in the development and governance of the AI systems that shape their lives. Motivated by the increasing deployment of AI in critical domains and inspired by Henri Lefebvre's concept of the “Right to the City,” we reconceptualize AI as a societal infrastructure, rather than merely a product of expert design. In this paper, we critically evaluate how generative agents, large-scale data extraction, and diverse cultural values bring new complexities to AI oversight. The paper proposes that grassroots participatory methodologies can mitigate biased outcomes and enhance social responsiveness. It asserts that data is socially produced and should be managed and owned collectively. Drawing on Sherry Arnstein’s Ladder of Citizen Participation and analyzing nine case studies, the paper develops a four-tier model for the Right to AI that situates the current paradigm and envisions an aspirational future. It proposes recommendations for inclusive data ownership, transparent design processes, and stakeholder-driven oversight. We also discuss market-led and state-centric alternatives and argue that participatory approaches offer a better balance between technical efficiency and democratic legitimacy.
Poster
Serena Booth
[ East Exhibition Hall A-B ]
Abstract
Consumer protection laws are designed to protect consumers from unethical business practices. In this position paper, I argue that these laws serve an emergent dual purpose: if appropriately enforced and strengthened, consumer protection laws can serve as an inalienable defense for AI safety. These laws are well established and can be enforced and strengthened to incentivize businesses to design and deploy safer AI systems. This position runs counter to two prevailing trends in AI policy. The first alternative position is that AI safety requires an entirely new set of focused laws to protect humanity's prosperity. Though I find these efforts valuable, I argue that such focused laws are both hard to write and easy to skirt. The second alternative position is that consumer protection is nothing more than red tape; I argue that existing laws dating back many decades have already reined in some nefarious business practices related to the development and deployment of AI, and that the litigious society of the United States is well-positioned to use consumer protection laws to encourage new AI safety guardrails. This paper takes a tour of some existing consumer protection laws in the United States and their effects on the development and use …
Poster
Borhane Blili-Hamelin · Christopher Graziul · Leif Hancox-Li · Hananel Hazan · El-Mahdi El-Mhamdi · Avijit Ghosh · Katherine Heller · Jacob Metcalf · Fabricio Murai · Eryk Salvaggio · Andrew Smart · Todd Snider · Mariame Tighanimine · Talia Ringer · Margaret Mitchell · Shiri Dori-Hacohen
[ East Exhibition Hall A-B ]
Abstract
The AI research community plays a vital role in shaping the scientific, engineering, and societal goals of AI research. In this position paper, we argue that focusing on the highly contested topic of 'artificial general intelligence' ('AGI') undermines our ability to choose effective goals. We identify six key traps---obstacles to productive goal setting---that are aggravated by AGI discourse: Illusion of Consensus, Supercharging Bad Science, Presuming Value-Neutrality, Goal Lottery, Generality Debt, and Normalized Exclusion. To avoid these traps, we argue that the AI research community needs to (1) prioritize specificity in scientific, engineering, and societal goals, (2) center pluralism about multiple worthwhile approaches to multiple valuable goals, and (3) foster innovation through greater inclusion of disciplines and communities. Therefore, the AI research community needs to stop treating ``AGI'' as the north-star goal of AI research.
Poster
XIAOXUAN HAN · Songlin Yang · Wei Wang · Yang Li · JING DONG
[ East Exhibition Hall A-B ]
Abstract
Text-to-image (T2I) diffusion models have raised concerns about generating inappropriate content, such as "*nudity*". Despite efforts to erase undesirable concepts through unlearning techniques, these unlearned models remain vulnerable to adversarial inputs that can potentially regenerate such content. To safeguard unlearned models, we propose a novel inference-time defense strategy that mitigates the impact of adversarial inputs. Specifically, we first reformulate the challenge of ensuring robustness in unlearned diffusion models as a robust regression problem. Building upon the naive median smoothing for regression robustness, which employs isotropic Gaussian noise, we develop a generalized median smoothing framework that incorporates anisotropic noise. Based on this framework, we introduce a token-wise ***Adaptive Median Smoothing*** method that dynamically adjusts noise intensity according to each token's relevance to target concepts. Furthermore, to improve inference efficiency, we explore implementations of this adaptive method at the text-encoding stage. Extensive experiments demonstrate that our approach enhances adversarial robustness while preserving model utility and inference efficiency, outperforming baseline defense techniques.
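A minimal sketch of the adaptive idea at the embedding level: Gaussian noise is added to token embeddings with per-token intensity scaled by relevance to the target concept, and the coordinate-wise median over noisy samples gives the smoothed representation. Relevance scores, scales, and shapes are illustrative assumptions.

```python
import torch

def adaptive_median_smoothing(embed_fn, tokens, relevance, n=16, base_sigma=0.1):
    """Token-wise adaptive median smoothing sketch: noisier smoothing for
    tokens more relevant to the target concept, then a robust median."""
    e = embed_fn(tokens)                               # (T, D) clean embeddings
    sigma = base_sigma * (1.0 + relevance)[:, None]    # (T, 1) per-token scale
    noisy = e[None] + sigma[None] * torch.randn(n, *e.shape)
    return noisy.median(dim=0).values                  # coordinate-wise median

embed = torch.nn.Embedding(100, 32)
tok = torch.randint(0, 100, (7,))
rel = torch.rand(7)                                    # relevance to target concept
print(adaptive_median_smoothing(embed, tok, rel).shape)  # torch.Size([7, 32])
```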
Spotlight Poster
Nicholas Carlini · Edoardo Debenedetti · Javier Rando · Milad Nasr · Florian Tramer
[ East Exhibition Hall A-B ]
Abstract
We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, they break just 37% of real-world defenses, indicating a large gap between the difficulty of attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4's 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
Poster
Haoming Yang · Ke Ma · Xiaojun Jia · Yingfei Sun · Qianqian Xu · Qingming Huang
[ East Exhibition Hall A-B ]
Abstract
Despite the remarkable performance of Large Language Models (\textbf{LLMs}), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, \textbf{ICRT}, inspired by heuristics and biases in human cognition. Leveraging the \textit{simplicity effect}, we employ \textit{cognitive decomposition} to reduce the complexity of malicious prompts. Simultaneously, \textit{relevance bias} is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results show that our approach consistently bypasses mainstream \textbf{LLMs}' safety mechanisms and generates high-risk content.
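Of the ranking aggregators mentioned, Elo is the simplest; the sketch below shows how pairwise "which output is more harmful" judgments could be aggregated into scalar harmfulness scores. The update constants and toy comparisons are illustrative assumptions.

```python
def elo_ratings(pairwise_results, k=32, base=1000.0):
    """Aggregate pairwise harmfulness judgments into Elo scores: each
    (winner, loser) pair means 'winner judged more harmful than loser'."""
    ratings = {}
    for winner, loser in pairwise_results:
        ra = ratings.setdefault(winner, base)
        rb = ratings.setdefault(loser, base)
        expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))  # win probability
        ratings[winner] = ra + k * (1 - expected)
        ratings[loser] = rb - k * (1 - expected)
    return ratings

# Outputs b and c each judged more harmful than a; b more harmful than c.
results = [("b", "a"), ("c", "a"), ("b", "c")]
print(sorted(elo_ratings(results).items(), key=lambda kv: -kv[1]))  # b > c > a
```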
Oral Poster
Jaeho Kim · Yunseok Lee · Seulki Lee
[ East Exhibition Hall A-B ]
Abstract
The peer review process in major artificial intelligence (AI) conferences faces unprecedented challenges with the surge of paper submissions (exceeding 10,000 submissions per venue), accompanied by growing concerns over review quality and reviewer responsibility. This position paper argues for **the need to transform the traditional one-way review system into a bi-directional feedback loop where authors evaluate review quality and reviewers earn formal accreditation, creating an accountability framework that promotes a sustainable, high-quality peer review system.** The current review system can be viewed as an interaction between three parties: the authors, reviewers, and system (i.e., conference), where we posit that all three parties share responsibility for the current problems. However, issues with authors can only be addressed through policy enforcement and detection tools, and ethical concerns can only be corrected through self-reflection. As such, this paper focuses on reforming reviewer accountability with systematic rewards through two key mechanisms: (1) a two-stage bi-directional review system that allows authors to evaluate reviews while minimizing retaliatory behavior, (2) a systematic reviewer reward system that incentivizes quality reviewing. We ask for the community's strong interest in these problems and the reforms that are needed to enhance the peer review process.
Poster
Maya Bechler-Speicher · Ben Finkelshtein · Fabrizio Frasca · Luis Müller · Jan M Tönshoff · Antoine Siraudin · Viktor Zaverkin · Michael Bronstein · Mathias Niepert · Bryan Perozzi · Mikhail Galkin · Christopher Morris
[ East Exhibition Hall A-B ]
Abstract
While machine learning on graphs has demonstrated promise in drug design and molecular property prediction, significant benchmarking challenges hinder its further progress and relevance. Current benchmarking practices often lack focus on transformative, real-world applications, favoring narrow domains like two-dimensional molecular graphs over broader, impactful areas such as combinatorial optimization, databases, or chip design. Additionally, many benchmark datasets poorly represent the underlying data, leading to inadequate abstractions and misaligned use cases. Fragmented evaluations and an excessive focus on accuracy further exacerbate these issues, incentivizing overfitting rather than fostering generalizable insights. These limitations have prevented the development of truly useful graph foundation models. This position paper calls for a paradigm shift toward more meaningful benchmarks, rigorous evaluation protocols, and stronger collaboration with domain experts to drive impactful and reliable advances in graph learning research, unlocking the potential of graph learning.
Poster
Matej Pičulin · Bernarda Petek · Irena Ograjenšek · Erik Štrumbelj
[ East Exhibition Hall A-B ]
Abstract
In this position paper, we argue that user studies are key to understanding the value of explainable AI methods, because the end goal of explainable AI is to satisfy societal desiderata. We also argue that the current state of user studies is detrimental to the advancement of the field. We support this argument with a review of general and explainable-AI-specific challenges, as well as an analysis of 607 explainable AI papers featuring user studies. We demonstrate how most user studies lack reproducibility, discussion of limitations, comparison with a baseline, or placebo explanations, and are of low fidelity to real-world users and application contexts. This, combined with an overreliance on functional evaluation, results in a lack of understanding of the value of explainable AI methods, which hinders the progress of the field. To address this issue, we call for higher methodological standards for user studies, greater appreciation of high-quality user studies in the AI community, and reduced reliance on functional evaluation.
Poster
Tennison Liu · Mihaela van der Schaar
[ East Exhibition Hall A-B ]
Abstract
Self-improving agents aim to continuously acquire new capabilities with minimal supervision. However, current approaches face two key limitations: their self-improvement processes are often rigid and fail to generalize across task domains, and they struggle to scale with increasing agent capabilities. We argue that effective self-improvement requires intrinsic metacognitive learning, defined as an agent’s $\textit{intrinsic}$ ability to actively evaluate, reflect on, and adapt its own learning processes. Drawing inspiration from human metacognition, we introduce a formal framework comprising three components: $\textit{metacognitive knowledge}$ (self-assessment of capabilities, tasks, and learning strategies), $\textit{metacognitive planning}$ (deciding what and how to learn), and $\textit{metacognitive evaluation}$ (reflecting on learning experiences to improve future learning). Analyzing existing self-improving agents, we find they rely predominantly on $\textit{extrinsic}$ metacognitive mechanisms, which are fixed, human-designed loops that limit scalability and adaptability. Examining each component, we contend that many ingredients for intrinsic metacognition are already present. Finally, we explore how to optimally distribute metacognitive responsibilities between humans and agents, and how to robustly evaluate and improve intrinsic metacognitive learning, key challenges that must be addressed to enable truly sustained, generalized, and aligned self-improvement.
Poster
Rui Zhang · Yun Shen · Hongwei Li · Wenbo Jiang · Hanxiao Chen · Yuan Zhang · Guowen Xu · Yang Zhang
[ East Exhibition Hall A-B ]
Abstract
Recent research highlights concerns about the trustworthiness of third-party Pre-Trained Language Models (PTLMs) due to potential backdoor attacks. These backdoored PTLMs, however, are effective only for specific pre-defined downstream tasks. In reality, these PTLMs can be adapted to many other unrelated downstream tasks. Such adaptation may lead to unforeseen consequences in downstream model outputs, consequently raising user suspicion and compromising attack stealthiness. We refer to this phenomenon as backdoor complications. In this paper, we undertake the first comprehensive quantification of backdoor complications. Through extensive experiments using 4 prominent PTLMs and 16 text classification benchmark datasets, we demonstrate the widespread presence of backdoor complications in downstream models fine-tuned from backdoored PTLMs. The output distribution of triggered samples significantly deviates from that of clean samples. Consequently, we propose a backdoor complication reduction method leveraging multi-task learning to mitigate complications without prior knowledge of downstream tasks. The experimental results demonstrate that our proposed method can effectively reduce complications while maintaining the efficacy and consistency of backdoor attacks.
Spotlight Poster
Yangsibo Huang · Milad Nasr · Anastasios Angelopoulos · Nicholas Carlini · Wei-Lin Chiang · Christopher A. Choquette Choo · Daphne Ippolito · Matthew Jagielski · Katherine Lee · Ken Ziyu Liu · Ion Stoica · Florian Tramer · Chiyuan Zhang
[ East Exhibition Hall A-B ]
Abstract
It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95\%$ accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness …
Poster
Wonjun Lee · Doehyeon Lee · Eugene Choi · Sangyoon Yu · Ashkan Yousefpour · Haon Park · Bumsub Ham · Suhyun Kim
[ East Exhibition Hall A-B ]
Abstract
Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content or produce inaccurate evaluations. Indeed, we found that existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but harmless descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and the ELITE benchmark offers enhanced benchmark quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.
Poster
Konstantin Kirchheim · Frank Ortmeier
[ East Exhibition Hall A-B ]
Abstract
Out-of-distribution (OOD) detection is essential for ensuring the reliability of deep learning models operating in open-world scenarios. Current OOD detectors mainly rely on statistical models to identify unusual patterns in the latent representations of a deep neural network. This work proposes to augment existing OOD detectors with probabilistic reasoning, utilizing Markov logic networks (MLNs). MLNs connect first-order logic with probabilistic reasoning to assign probabilities to inputs based on weighted logical constraints defined over human-understandable concepts, which offers improved explainability. Through extensive experiments on multiple datasets, we demonstrate that MLNs can significantly enhance the performance of a wide range of existing OOD detectors while maintaining computational efficiency. Furthermore, we introduce a simple algorithm for learning logical constraints for OOD detection from a dataset and showcase its effectiveness.
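A toy sketch of the Markov-logic idea: rules are weighted boolean constraints over human-understandable concepts, and an input's (unnormalized) log-score is the total weight of satisfied rules, so inputs violating high-weight constraints score lower and can be flagged as OOD. Concepts, rules, and weights below are illustrative assumptions.

```python
def mln_log_score(concepts, weighted_rules):
    """Markov-logic-style scoring sketch: sum the weights of satisfied
    weighted boolean constraints over detected concepts."""
    return sum(w for rule, w in weighted_rules if rule(concepts))

# Concepts a detector might emit for an image, plus human-written constraints.
rules = [
    (lambda c: c["has_wheels"] <= c["is_vehicle"], 2.0),  # wheels => vehicle
    (lambda c: not (c["is_animal"] and c["is_vehicle"]), 3.0),
]
in_dist = {"has_wheels": True, "is_vehicle": True, "is_animal": False}
weird = {"has_wheels": True, "is_vehicle": False, "is_animal": True}
# Lower score indicates constraint violations, flagging likely OOD inputs.
print(mln_log_score(in_dist, rules), mln_log_score(weird, rules))  # 5.0 3.0
```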
Poster
Guangzhi Sun · Xiao Zhan · Shutong Feng · Phil Woodland · Jose Such
[ East Exhibition Hall A-B ]
Abstract
Aligning large language models (LLMs) with human values is essential for their safe deployment and widespread adoption. Current LLM safety benchmarks often focus solely on the refusal of individual problematic queries, which overlooks the importance of the context where the query occurs and may cause undesired refusal of queries under safe contexts that diminish user experience. Addressing this gap, we introduce CASE-Bench, a Context-Aware SafEty Benchmark that integrates context into safety assessments of LLMs. CASE-Bench assigns distinct, formally described contexts to categorized queries based on Contextual Integrity theory. Additionally, in contrast to previous studies which mainly rely on majority voting from just a few annotators, we recruited a sufficient number of annotators necessary to ensure the detection of statistically significant differences among the experimental conditions based on power analysis. Our extensive analysis using CASE-Bench on various open-source and commercial LLMs reveals a substantial and significant influence of context on human judgments ($p < 0.0001$ from a z-test), underscoring the necessity of context in safety evaluations. We also identify notable mismatches between human judgments and LLM responses, particularly in commercial models within safe contexts. Code and data used in the paper are available at https://anonymous.4open.science/r/CASEBench-D5DB.
Poster
Jingwei Li · Jing Dong · Tianxing He · Jingzhao Zhang
[ East Exhibition Hall A-B ]
Abstract
Given the rising popularity of AI-generated art and the associated copyright concerns, identifying whether an artwork was used to train a diffusion model is an important research topic. This work approaches the problem from the membership inference attack (MIA) perspective. We first identify a limitation of applying existing MIA methods to proprietary diffusion models: the required access to internal U-nets. To address this problem, we introduce a novel membership inference attack method that uses only the image-to-image variation API and operates without access to the model's internal U-net. Our method is based on the intuition that the model can more easily obtain an unbiased noise prediction estimate for images from the training set. By applying the API multiple times to the target image, averaging the outputs, and comparing the result to the original image, our approach can classify whether a sample was part of the training set. We validate our method using DDIM and Stable Diffusion setups and further extend both our approach and existing algorithms to the Diffusion Transformer architecture. Our experimental results consistently outperform previous methods.
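The attack itself is simple to sketch: call the variation API several times on the target image, average the outputs, and threshold the distance to the original. The toy stand-in API below merely mimics the intuition that members reconstruct with smaller error; it is not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
member, non_member = rng.random((8, 8)), rng.random((8, 8))

def toy_variation_api(img):
    """Stand-in for the image-to-image variation endpoint: training members
    are reconstructed more faithfully than unseen images."""
    bias = 0.0 if np.allclose(img, member) else 0.3
    return 0.9 * img + bias + 0.05 * rng.normal(size=img.shape)

def membership_score(image, api, n=8):
    """Query the API n times, average the outputs, and measure the distance
    to the original; smaller distances suggest training membership."""
    outputs = np.stack([api(image) for _ in range(n)])
    return float(np.linalg.norm(outputs.mean(axis=0) - image))

print(membership_score(member, toy_variation_api)
      < membership_score(non_member, toy_variation_api))   # True: member closer
```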
Spotlight Poster
Yichi Zhang · Siyuan Zhang · Yao Huang · Zeyu Xia · Zhengwei Fang · Xiao Yang · Ranjie Duan · Dong Yan · Yinpeng Dong · Jun Zhu
[ East Exhibition Hall A-B ]
Abstract
Ensuring the safety and harmlessness of Large Language Models (LLMs) has become equally critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose **STAIR**, a novel framework that integrates **S**afe**T**y **A**lignment with **I**ntrospective **R**easoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). Specifically, we design a theoretically grounded reward for outcome evaluation to seek a balance between helpfulness and safety. We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. We have open-sourced our code, datasets, and models at https://github.com/thu-ml/STAIR.
Poster
Rongzhe Wei · Mufei Li · Mohsen Ghassemi · Eleonora Kreacic · Yifan Li · Xiang Yue · Bo Li · Vamsi Potluru · Pan Li · Eli Chien
[ East Exhibition Hall A-B ]
Abstract
Large Language Models (LLMs) embed sensitive, human-generated data, prompting the need for unlearning methods. Although certified unlearning offers strong privacy guarantees, its restrictive assumptions make it unsuitable for LLMs, giving rise to various heuristic approaches typically assessed through empirical evaluations. These standard evaluations randomly select data for removal, apply unlearning techniques, and use membership inference attacks (MIAs) to compare unlearned models against models retrained without the removed data. However, to ensure robust privacy protections for every data point, it is essential to account for scenarios in which certain data subsets face elevated risks. Prior research suggests that outliers, particularly including data tied to minority groups, often exhibit higher memorization propensity which indicates they may be more difficult to unlearn. Building on these insights, we introduce a complementary, minority-aware evaluation framework to highlight blind spots in existing frameworks. We substantiate our findings with carefully designed experiments, using canaries with personally identifiable information (PII) to represent these minority subsets and demonstrate that they suffer at least 20\% higher privacy leakage across various unlearning methods, MIAs, datasets, and LLM scales. Our proposed minority-aware evaluation framework marks an essential step toward more equitable and comprehensive assessments of LLM unlearning efficacy.
Poster
Anders Aamand · Justin Chen · Mina Dalirrooyfard · Slobodan Mitrovic · Yuriy Nevmyvaka · Sandeep Silwal · Yinzhan Xu
[ East Exhibition Hall A-B ]
Abstract
We study differentially private algorithms for graph cut sparsification, a fundamental problem in algorithms, privacy, and machine learning. While significant progress has been made, the best-known private and efficient cut sparsifiers on $n$-node graphs approximate each cut within $\widetilde{O}(n^{1.5})$ additive error and $1+\gamma$ multiplicative error for any $\gamma > 0$ [Gupta, Roth, Ullman TCC'12]. In contrast, \emph{inefficient} algorithms, i.e., those requiring exponential time, can achieve an $\widetilde{O}(n)$ additive error and $1+\gamma$ multiplicative error [Eliáš, Kapralov, Kulkarni, Lee SODA'20]. In this work, we break the $n^{1.5}$ additive error barrier for private and efficient cut sparsification. We present an $(\varepsilon,\delta)$-DP polynomial time algorithm that, given a non-negative weighted graph, outputs a private synthetic graph approximating all cuts with multiplicative error $1+\gamma$ and additive error $n^{1.25 + o(1)}$ (ignoring dependencies on $\varepsilon, \delta, \gamma$). At the heart of our approach lies a private algorithm for expander decomposition, a popular and powerful technique in (non-private) graph algorithms.
Poster
Shoukai Xu · ZihaoLian · Mingkui Tan · Liu Liu · Zhong Zhang · Peilin Zhao
[ East Exhibition Hall A-B ]
Abstract
Offline reinforcement learning is widely applied in multiple fields due to its advantages in efficiency and risk control. However, a major problem it faces is the distribution shift between offline datasets and online environments. This mismatch leads to out-of-distribution (OOD) state-action pairs that fall outside the scope of the training data. Therefore, existing conservative training policies may not provide reliable decisions when the test environment deviates greatly from the offline dataset. In this paper, we propose Test-time Adapted Reinforcement Learning (TARL) to address this problem. TARL constructs unsupervised test-time optimization objectives for discrete and continuous control tasks, using test data without depending on environmental rewards. In discrete control tasks, it minimizes the entropy of predicted action probabilities to decrease uncertainty and avoid OOD state-action pairs. For continuous control tasks, it represents and minimizes action uncertainty based on the normal distribution of policy network outputs. Moreover, to prevent model bias caused by overfitting and error accumulation during the test-time update process, TARL enforces a KL divergence constraint between the fine-tuned policy and the original policy. For efficiency, TARL only updates the layer normalization layer parameters during testing. Extensive experiments on popular Atari game benchmarks and the D4RL dataset demonstrate the superiority …
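A minimal sketch of the discrete-control objective described above: minimize the entropy of predicted action probabilities on unlabeled test states, regularize with a KL term toward a frozen copy of the original policy, and update only LayerNorm parameters. The toy policy, learning rate, and KL weight are illustrative assumptions.

```python
import copy
import torch
import torch.nn.functional as F

policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.LayerNorm(32),
    torch.nn.ReLU(), torch.nn.Linear(32, 3))
anchor = copy.deepcopy(policy).eval()          # frozen copy of the original policy

# Only LayerNorm parameters are adapted at test time (cheap and stable).
ln_params = [p for m in policy.modules()
             if isinstance(m, torch.nn.LayerNorm) for p in m.parameters()]
opt = torch.optim.Adam(ln_params, lr=1e-3)

states = torch.randn(16, 4)                    # unlabeled test-time states
logp = policy(states).log_softmax(dim=-1)
with torch.no_grad():
    logp_anchor = anchor(states).log_softmax(dim=-1)

entropy = -(logp.exp() * logp).sum(dim=-1).mean()      # action uncertainty
kl = F.kl_div(logp, logp_anchor, log_target=True,      # stay near the original
              reduction="batchmean")
(entropy + 1.0 * kl).backward()
opt.step()                                     # updates only LayerNorm params
```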
Poster
Edith Cohen · Mihir Singhal · Uri Stemmer
[ East Exhibition Hall A-B ]
Abstract
Cardinality sketches are compact data structures that efficiently estimate the number of distinct elements across multiple queries while minimizing storage, communication, and computational costs. However, recent research has shown that these sketches can fail under {\em adaptively chosen queries}, breaking down after approximately $\tilde{O}(k^2)$ queries, where $k$ is the sketch size. In this work, we overcome this \emph{quadratic barrier} by designing robust estimators with fine-grained guarantees. Specifically, our constructions can handle an {\em exponential number of adaptive queries}, provided that each element participates in at most $\tilde{O}(k^2)$ queries. This effectively shifts the quadratic barrier from the total number of queries to the number of queries {\em sharing the same element}, which can be significantly smaller. Beyond cardinality sketches, our approach expands the toolkit for robust algorithm design.
Poster
Min Chen · Guansong Pang · Wenjun Wang · Cheng Yan
[ East Exhibition Hall A-B ]
Abstract
Spatial-temporal forecasting (STF) plays a pivotal role in urban planning and computing. Spatial-Temporal Graph Neural Networks (STGNNs) excel at modeling spatial-temporal dynamics, thus being robust against noise perturbations. However, they often suffer from relatively poor computational efficiency. Simplifying the architectures can improve efficiency but also weakens robustness with respect to noise interference. In this study, we investigate the problem: *can simple neural networks such as Multi-Layer Perceptrons (MLPs) achieve robust spatial-temporal forecasting while remaining efficient?* To this end, we first reveal the *dual noise effect* in spatial-temporal data and propose a theoretically grounded principle termed *Robust Spatial-Temporal Information Bottleneck* (RSTIB), which holds strong potential for improving model robustness. We then design an implementation named *RSTIB-MLP*, together with a new training regime incorporating a knowledge distillation module, to enhance the robustness of MLPs for STF while maintaining their efficiency. Comprehensive experiments demonstrate that *RSTIB-MLP* achieves an excellent trade-off between robustness and efficiency, outperforming state-of-the-art STGNNs and MLP-based models. Our code is publicly available at: [https://github.com/mchen644/RSTIB](https://github.com/mchen644/RSTIB).
Spotlight Poster
Nikolaus Howe · Ian McKenzie · Oskar Hollinsworth · Michał Zając · Tom Tseng · Aaron Tucker · Pierre-Luc Bacon · Adam Gleave
[ East Exhibition Hall A-B ]
Abstract
Increasing model size has unlocked a dazzling array of capabilities in language models. At the same time, even frontier models remain vulnerable to jailbreaks and prompt injections, despite concerted efforts to make them robust. As both attackers and defenders gain access to more compute, and as models become larger, what will be the effect on robustness? We argue that answering this question requires a *scaling lens*, which we adopt in an extensive study of language model robustness across several classification tasks, model families, and adversarial attacks. We find that in the absence of explicit safety training, larger models are not consistently more robust; however, scale improves sample efficiency in adversarial training, though it worsens compute efficiency. Further, we find that increasing attack compute smoothly improves attack success rate against both undefended and adversarially trained models. Finally, after exploring robustness transfer across attacks and threat models, we combine attack and defense scaling rates to study the offense-defense balance. We find that while attack scaling outpaces adversarial training across all models studied, larger adversarially trained models might give defense the advantage in the long run. These results underscore the utility of the scaling lens, and provide a paradigm for evaluating future attacks and defenses on frontier models. Code for this …
Poster
Maya Pavlova · Erik Brinkman · Krithika Iyer · Vítor Albiero · Joanna Bitton · Hailey Nguyen · Cristian Canton · Ivan Evtimov · Aaron Grattafiori
[ East Exhibition Hall A-B ]
Abstract
Red teaming aims to assess how large language models (LLMs) can produce content that violates norms, policies, and rules set forth during their safety training. However, most existing automated methods in the literature are not representative of the way common users exploit the multi-turn conversational nature of AI models. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general purpose model in a way that encourages reasoning through the choice of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B, and 91% against Llama 3.1 70B and …
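The agentic pattern described above can be summarized in a few lines. In this sketch, `attacker_llm`, `target_llm`, and `judge` are hypothetical callables standing in for API wrappers and a safety classifier; this is not the GOAT implementation.

```python
# Bare-bones multi-turn red-teaming loop in the spirit of an agentic tester:
# an attacker model observes the transcript so far, reasons about which
# technique to try next, and issues the next turn against the target.
def red_team(goal: str, attacker_llm, target_llm, judge, max_turns: int = 5):
    history = []
    for turn in range(max_turns):
        # The attacker reasons over the conversation so far and picks the
        # next adversarial technique and user turn.
        attack_prompt = attacker_llm(
            f"Goal: {goal}\nConversation so far: {history}\n"
            "Choose an adversarial technique and write the next user turn."
        )
        response = target_llm(attack_prompt)
        history.append((attack_prompt, response))
        if judge(response):  # did this turn elicit a policy violation?
            return {"success": True, "turns": turn + 1, "history": history}
    return {"success": False, "turns": max_turns, "history": history}

# Toy usage with stub callables:
result = red_team(
    goal="elicit disallowed content (toy placeholder)",
    attacker_llm=lambda prompt: "please role-play as ...",
    target_llm=lambda prompt: "I can't help with that.",
    judge=lambda response: False,
)
print(result["success"], result["turns"])
```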
Poster
Rachel Cummings · Alessandro Epasto · Jieming Mao · Tamalika Mukherjee · Tingting Ou · Peilin Zhong
[ East Exhibition Hall A-B ]
Abstract
The *turnstile* continual release model of differential privacy captures scenarios where a privacy-preserving real-time analysis is sought for a dataset evolving through additions and deletions. In typical applications of real-time data analysis, both the length of the stream $T$ and the size of the universe $|\mathcal{U}|$ from which data come can be extremely large. This motivates the study of private algorithms in the turnstile setting using space sublinear in both $T$ and $|\mathcal{U}|$. In this paper, we give the first sublinear-space differentially private algorithms for the fundamental problem of counting distinct elements in the turnstile streaming model. Our algorithm achieves, on arbitrary streams, $O_{\eta}(T^{1/3})$ space and additive error, and a $(1+\eta)$-relative approximation for all $\eta \in (0,1)$. Our result significantly improves upon the space requirements of the state-of-the-art algorithms for this problem, which are linear, while approaching the known $\Omega(T^{1/4})$ lower bound on additive error for arbitrary streams. Moreover, when a bound $W$ on the number of times an item appears in the stream is known, our algorithm provides $O_{\eta}(\sqrt{W})$ additive error, using $O_{\eta}(\sqrt{W})$ space. This additive error asymptotically matches that of prior work, which required linear space instead. Our results address an open question posed by Jain et al. about …
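To see the baseline the paper improves far beyond, the simplest private distinct-count release applies the Laplace mechanism once to the exact count (sensitivity 1 under adding or removing one item). The sketch below shows only this one-shot primitive, ignoring the streaming, continual-release, and space constraints the paper actually tackles.

```python
# One-shot epsilon-DP release of a distinct count via the Laplace mechanism.
# Adding or removing a single item changes the count by at most 1, so
# Laplace noise of scale 1/epsilon suffices for epsilon-DP.
import numpy as np

def dp_distinct_count(items, epsilon: float, rng=np.random.default_rng()):
    exact = len(set(items))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return exact + noise

stream = [1, 2, 2, 3, 5, 5, 8]
print(dp_distinct_count(stream, epsilon=1.0))  # ~5 plus Laplace(1) noise
```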
Poster
Guy Kornowski · Daogao Liu · Kunal Talwar
[ East Exhibition Hall A-B ]
Abstract
We study differentially private (DP) optimization algorithms for stochastic and empirical objectives which are neither smooth nor convex, and propose methods that return a Goldstein-stationary point with sample complexity bounds that improve on existing works. We start by providing a single-pass $(\epsilon,\delta)$-DP algorithm that returns an $(\alpha,\beta)$-stationary point as long as the dataset is of size $\widetilde{\Omega}(\sqrt{d}/\alpha\beta^{3}+d/\epsilon\alpha\beta^{2})$, which is $\Omega(\sqrt{d})$ times smaller than the dataset required by the algorithm of Zhang et al. (2024) for this task, where $d$ is the dimension. We then provide a multi-pass polynomial-time algorithm which further improves the sample complexity to $\widetilde{\Omega}\left(d/\beta^2+d^{3/4}/\epsilon\alpha^{1/2}\beta^{3/2}\right)$, by designing a sample-efficient ERM algorithm and proving that Goldstein-stationary points generalize from the empirical loss to the population loss.
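For reference, the standard notion of Goldstein stationarity used above: a point $x$ is $(\alpha,\beta)$-stationary when the Goldstein $\alpha$-subdifferential, the convex hull of gradients at nearby points, contains a vector of norm at most $\beta$:

```latex
% Goldstein stationarity (standard definition, stated here for context):
\partial_{\alpha} f(x) \;=\; \operatorname{conv}\bigl\{ \nabla f(y) \;:\; \lVert y - x \rVert \le \alpha \bigr\},
\qquad
\min_{g \in \partial_{\alpha} f(x)} \lVert g \rVert \;\le\; \beta .
```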
Poster
Kelly Ramsay · Jairo Diaz-Rodriguez
[ East Exhibition Hall A-B ]
Abstract
Despite the potential of differentially private data visualization to harmonize data analysis and privacy, research in this area remains underdeveloped. Boxplots are a widely popular visualization for summarizing a dataset and for comparing multiple datasets. Consequently, we introduce a differentially private boxplot. We evaluate its effectiveness for displaying the location, scale, skewness, and tails of a given empirical distribution. In our theoretical exposition, we show that the location and scale of the boxplot are estimated with optimal sample complexity, and the skewness and tails are estimated consistently, which is not always the case for a boxplot naively constructed from a single existing differentially private quantile algorithm. As a byproduct of this exposition, we introduce several new results concerning private quantile estimation. In simulations, we show that this boxplot performs similarly to a non-private boxplot and outperforms the naive boxplot. Additionally, we conduct a real-data analysis of Airbnb listings, which shows that comparable analysis can be achieved through differentially private boxplot visualization.
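As a toy illustration of the pipeline (not the authors' estimator), one can privatize a histogram of bounded data, where one record affects a single bin, so per-bin Laplace noise of scale $1/\epsilon$ gives $\epsilon$-DP by parallel composition, and read the five boxplot quantiles off the noisy CDF:

```python
# Toy DP boxplot summaries: noisy histogram -> noisy CDF -> five quantiles.
# The paper's estimator is considerably more careful than this sketch.
import numpy as np

def dp_quantiles(data, epsilon, bins=64, lo=0.0, hi=1.0,
                 qs=(0.0, 0.25, 0.5, 0.75, 1.0),
                 rng=np.random.default_rng()):
    counts, edges = np.histogram(np.clip(data, lo, hi), bins=bins,
                                 range=(lo, hi))
    # One record changes exactly one bin count by 1, so Laplace(1/epsilon)
    # per bin yields epsilon-DP by parallel composition.
    noisy = np.maximum(counts + rng.laplace(0.0, 1.0 / epsilon, size=bins),
                       0.0)
    cdf = np.cumsum(noisy) / max(noisy.sum(), 1e-9)
    # Return the left edge of the first bin whose noisy CDF reaches each q.
    return [edges[np.searchsorted(cdf, q)] for q in qs]

data = np.random.default_rng(0).beta(2, 5, size=5000)  # skewed toy data
print(dp_quantiles(data, epsilon=1.0))  # noisy min, Q1, median, Q3, max
```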
Poster
Fan Qi · Daxu Shi · Chuokun Xu · Shuai Li · Changsheng Xu
[ East Exhibition Hall A-B ]
Abstract
Federated Distillation (FedKD) relies on lightweight knowledge carriers like logits for efficient client-server communication. Although logit-based methods have demonstrated promise in addressing statistical and architectural heterogeneity in federated learning (FL), current approaches remain constrained by suboptimal temperature calibration during knowledge fusion. To address these limitations, we propose ReT-FHD, a framework featuring: 1) Multi-level Elastic Temperature, which dynamically adjusts distillation intensities across model layers, achieving optimized knowledge transfer between heterogeneous local models; 2) Category-Aware Global Temperature Scaling, which implements class-specific temperature calibration based on confidence distributions in global logits, enabling personalized distillation policies; 3) Z-Score Guard, a blockchain-verified validation mechanism that mitigates 44\% of label-flipping and model poisoning attacks. Evaluations across diverse benchmarks with varying model/data heterogeneity demonstrate that ReT-FHD achieves significant accuracy improvements over baseline methods while substantially reducing communication costs compared to existing approaches. Our work establishes that properly calibrated logits can serve as self-sufficient carriers for building scalable and secure heterogeneous FL systems.
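A sketch of the class-specific temperature idea in point 2): each class column of the logits is softened by its own temperature before the distillation KL, with lower-confidence classes softened more. The temperature rule below is illustrative, not the calibration used by ReT-FHD.

```python
# Class-specific temperature scaling for logit distillation (illustrative).
import torch
import torch.nn.functional as F

def class_temp_kd_loss(student_logits, global_logits, class_temps):
    # Divide each logit column by its class temperature before softening.
    s = student_logits / class_temps  # (batch, classes)
    g = global_logits / class_temps
    return F.kl_div(F.log_softmax(s, dim=1), F.softmax(g, dim=1),
                    reduction="batchmean")

student_logits = torch.randn(8, 5)
global_logits = torch.randn(8, 5)
# Toy rule: hotter temperatures for classes the global model is unsure of.
conf = F.softmax(global_logits, dim=1).mean(dim=0)  # per-class confidence
class_temps = 1.0 + (1.0 - conf)                    # temperatures in (1, 2)
print(class_temp_kd_loss(student_logits, global_logits, class_temps))
```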
Poster
Omri Ben Hemo · Alon Zolfi · Oryan Yehezkel · Omer Hofman · Roman Vainshtein · Hisashi Kojima · Yuval Elovici · Asaf Shabtai
[ East Exhibition Hall A-B ]
Abstract
Federated learning (FL) enables privacy-preserving distributed machine learning by sharing gradients instead of raw data. However, FL remains vulnerable to gradient inversion attacks, in which shared gradients can reveal sensitive training data. Prior research has mainly concentrated on unimodal tasks, particularly image classification, examining the reconstruction of single-modality data and analyzing privacy vulnerabilities in these relatively simple scenarios. As multimodal models are increasingly used to address complex vision-language tasks, it becomes essential to assess the privacy risks inherent in these architectures. In this paper, we explore gradient inversion attacks targeting multimodal vision-language Document Visual Question Answering (DQA) models and propose GI-DQA, a novel method that reconstructs private document content from gradients. Through extensive evaluation on state-of-the-art DQA models, our approach exposes critical privacy vulnerabilities and highlights the urgent need for robust defenses to secure multimodal FL systems.
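The generic mechanism underlying such attacks is gradient matching (in the style of Zhu et al.'s "Deep Leakage from Gradients"): optimize a dummy input so that its gradient matches the shared one. The minimal sketch below assumes the label is known and uses a tiny linear model; GI-DQA's document-specific reconstruction is not reproduced.

```python
# Gradient-matching attack: recover a private input from its shared gradient.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()

# The victim's shared gradient for one (private) example.
x_true = torch.randn(1, 10)
y_true = torch.tensor([1])
true_grads = torch.autograd.grad(loss_fn(model(x_true), y_true),
                                 model.parameters())

# Attacker: optimize a dummy input so its gradient matches the shared one.
x_dummy = torch.randn(1, 10, requires_grad=True)
opt = torch.optim.Adam([x_dummy], lr=0.1)
for _ in range(300):
    opt.zero_grad()
    dummy_grads = torch.autograd.grad(loss_fn(model(x_dummy), y_true),
                                      model.parameters(), create_graph=True)
    match = sum(((dg - tg) ** 2).sum()
                for dg, tg in zip(dummy_grads, true_grads))
    match.backward()
    opt.step()

print((x_dummy - x_true).norm())  # ideally small: input roughly recovered
```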
Poster
Hilal Asi · Vinod Raman · Kunal Talwar
[ East Exhibition Hall A-B ]
Abstract
We design new differentially private algorithms for the problems of adversarial bandits and bandits with expert advice. For adversarial bandits, we give a simple and efficient conversion of any non-private bandit algorithm to a private bandit algorithm. Instantiating our conversion with existing non-private bandit algorithms gives a regret upper bound of $O\left(\frac{\sqrt{KT}}{\sqrt{\epsilon}}\right)$, improving upon the existing upper bound $O\left(\frac{\sqrt{KT \log(KT)}}{\epsilon}\right)$ for all $\epsilon \leq 1$. In particular, our algorithms allow for sublinear expected regret even when $\epsilon \leq \frac{1}{\sqrt{T}}$, establishing the first known separation between central and local differential privacy for this problem. For bandits with expert advice, we give the first differentially private algorithms, with expected regret $O\left(\frac{\sqrt{NT}}{\sqrt{\epsilon}}\right), O\left(\frac{\sqrt{KT\log(N)}\log(KT)}{\epsilon}\right)$, and $\tilde{O}\left(\frac{N^{1/6}K^{1/2}T^{2/3}\log(NT)}{\epsilon ^{1/3}} + \frac{N^{1/2}\log(NT)}{\epsilon}\right)$, where $K$ and $N$ are the number of actions and experts, respectively. These rates allow us to obtain sublinear regret for different combinations of small and large $K$, $N$, and $\epsilon$.
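The wrapper pattern behind such conversions can be illustrated naively: perturb each observed loss with Laplace noise before handing it to a non-private learner (here a bare-bones EXP3). This simple noising is only illustrative; the paper's black-box conversion is more refined and attains the stated rates.

```python
# Naive privatized bandit feedback: add Laplace noise to each observed loss
# before the non-private learner (EXP3) updates on it.
import numpy as np

class EXP3:
    def __init__(self, K: int, eta: float, rng):
        self.w, self.eta, self.rng = np.zeros(K), eta, rng

    def select(self) -> int:
        p = np.exp(self.w - self.w.max())
        self.p = p / p.sum()
        return int(self.rng.choice(len(self.p), p=self.p))

    def update(self, arm: int, loss: float) -> None:
        # Importance-weighted loss estimate for the pulled arm.
        self.w[arm] -= self.eta * loss / self.p[arm]

rng = np.random.default_rng(0)
K, T, eps = 5, 5000, 1.0
learner = EXP3(K, eta=np.sqrt(np.log(K) / (T * K)), rng=rng)
losses = rng.uniform(size=(T, K))
losses[:, 2] *= 0.3  # arm 2 has the lowest expected loss

for t in range(T):
    arm = learner.select()
    noisy_loss = losses[t, arm] + rng.laplace(0.0, 1.0 / eps)  # privatized
    learner.update(arm, noisy_loss)

print(int(np.argmax(learner.w)))  # the best arm (2) should dominate
```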
Poster
Clément Lalanne · Franck Iutzeler · Loubes Jean-Michel · Julien Chhor
[ East Exhibition Hall A-B ]
Abstract
Estimating optimal transport maps between two distributions from respective samples is an important element of many machine learning methods. To do so, rather than extending discrete transport maps, it has been shown that estimating the Brenier potential of the transport problem and obtaining a transport map through its gradient is near minimax optimal for smooth problems. In this paper, we investigate the estimation of such potentials and transport maps under differential privacy with respect to the samples. We propose a differentially private transport map estimator with $L^2$ error at most $n^{-1} \vee n^{-\frac{2 \alpha}{2 \alpha - 2 + d}} \vee (n\epsilon)^{-\frac{2 \alpha}{2 \alpha + d}}$, up to polylogarithmic terms, where $n$ is the sample size, $\epsilon$ is the desired level of privacy, $\alpha$ is the smoothness of the true transport map, and $d$ is the dimension of the feature space. We also provide a lower bound for the problem.
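Reading the upper bound term by term (an interpretive note, not taken from the paper): the first term is a parametric rate, the middle term depends on $n$ alone and is the nonparametric cost of estimating a smooth map, and the last term, the only one involving $\epsilon$, is the price of privacy, which dominates only when $\epsilon$ is small relative to $n$:

```latex
\underbrace{n^{-1}}_{\text{parametric}}
\;\vee\;
\underbrace{n^{-\frac{2\alpha}{2\alpha - 2 + d}}}_{\text{nonparametric, privacy-free}}
\;\vee\;
\underbrace{(n\epsilon)^{-\frac{2\alpha}{2\alpha + d}}}_{\text{cost of privacy}}
```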