Registration Desk: Registration West Mon 14 Jul 07:30 a.m.
Registration Desk: Registration East Mon 14 Jul 07:30 a.m.
Meetup: ICML Lounge Area Mon 14 Jul 07:30 a.m.
This meeting room is for ICML delegates to relax and recharge in a comfortable environment.
Expo Talk Panel: Improving LLM Benchmarks: Making AI Work for Real-World Needs Mon 14 Jul 08:00 a.m.
To make AI models truly useful in real-world settings, we need better ways to measure their performance. This talk will focus on how we can improve benchmarks, ensuring LLMs are tested in ways that reflect actual business challenges.
Jonathan will discuss how real user feedback and industry-specific examples can be used to create more meaningful tests for AI models. We’ll explore ways to measure AI performance on practical tasks that require applying a model’s conceptual understanding, complementing the many existing benchmarks that evaluate that conceptual understanding in isolation.
By designing evaluation methods that reflect real-world use, we can help bridge the gap between research and business, making AI more effective and reliable in everyday applications.
Expo Talk Panel: Situating principles in context for synthetic data Mon 14 Jul 08:00 a.m.
Codifying context in data represents not just a technical challenge, but a necessary evolution in how we imbue artificial systems with the nuanced understanding that defines human intelligence.
As machine learning systems grow increasingly complex, the demand for high-quality data continues to rise dramatically, particularly in domains where real-world data is scarce or where expert annotations are prohibitively expensive. Despite significant advancements in synthetic data generation techniques, a fundamental challenge persists: synthetic data often lacks the rich contextual dimensions found in naturally occurring data.
Synthetic data generation must evolve beyond non-robust performance metrics to incorporate crucial contextual elements—historical, social, human, and physical—that give data meaning in real-world applications. Current approaches to synthetic data frequently produce technically valid but contextually impoverished datasets, limiting their effectiveness when deployed in complex environments.
Emerging strategies for codifying context include the use of personas, AI constitutions, value systems, and expert/domain knowledge. One such strategy is the Situated Principles (SPRI) framework, which demonstrates how context-situated principles, generated dynamically for each input query, can guide large language models to produce responses that align with complex human values without extensive human oversight. This approach suggests task-agnostic pathways for embedding contextual richness in synthetic data generation pipelines.
As the field moves from synthetic data toward synthetic experiences—particularly in reinforcement learning environments—the need for contextual fidelity will only intensify. Industry practitioners can bridge the contextual gap in synthetic data generation today while preparing for the more complex challenge of creating nuanced synthetic environments tomorrow.
Expo Talk Panel: Otter: Generating Tests from Issues to Validate SWE Patches Mon 14 Jul 08:00 a.m.
Recent SWE agents generate code to resolve issues. While great for productivity, such systems make good tests even more important. Unfortunately, most prior work on test generation assumes that the code under test already exists. Instead, we are looking at the case where the code patch that resolves the issue has not yet been written. We introduce Otter, an LLM-based solution for generating tests from issues. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planning stage. As of March 9, 2025, Otter is the SOTA for this scenario, topping the SWT-Bench Verified leaderboard.
Expo Talk Panel: The Next Frontier in Enterprise AI: A Vision for Generalist Agents Mon 14 Jul 08:00 a.m.
How can AI revolutionize how businesses use and interact with software at every level? Today’s emerging computer-using generalist agents offer a glimpse of the future, acting as autonomous operators capable of orchestrating tasks across browsers, desktop applications, and enterprise APIs—whether for customer support, data analytics, or regulatory compliance. In this talk, we share IBM’s broader vision for universal, enterprise-ready AI agents that unify end-to-end workflows, seamlessly adapt to complex digital environments, and operate with minimal specialized programming. We highlight how flexible yet robust frameworks, embedded safety and reliability, and far-reaching business impact come together to enable end-to-end automation that reduces operational costs, enhances human productivity, and unlocks entirely new categories of innovation. We also discuss how these agents fit into IBM’s overarching AI roadmap, ensuring alignment with trustworthiness, scalability, smaller models, and human-machine collaboration.
Attendees will leave with a holistic understanding of the breakthroughs and the challenges ahead—and why universal generalist agents may represent the next great leap in enterprise AI.
Affinity Workshop: LatinX in AI Mon 14 Jul 08:00 a.m.
The LatinX in AI (LXAI) Workshop will be co-located with ICML 2025, bringing together Latinx researchers, practitioners, and enthusiasts in machine learning to foster collaboration, mentorship, and knowledge exchange.
The LXAI Workshop will feature invited speakers, contributed talks, and poster presentations, highlighting cutting-edge research conducted by Latinx individuals in academia and industry. Additionally, a mentorship session will provide participants with valuable insights into career development, research trends, and pathways for increased Latinx representation in AI.
This event aims to create a welcoming and inclusive environment where Latinx researchers at all career stages can connect, share ideas, and inspire future generations. We strongly encourage participation from underrepresented minorities and allies who support diversity in machine learning and AI.
Affinity Workshop: 4th MusIML workshop at ICML’25 Mon 14 Jul 08:00 a.m.
The Muslim in ML Workshop will showcase an inspiring program featuring invited talks, oral presentations, and poster sessions. This event provides a vibrant platform for researchers, practitioners, and students from the Muslim community to connect, exchange ideas, and foster collaborations in the field of machine learning. The workshop aims to celebrate and amplify the contributions of Muslim individuals in machine learning while promoting inclusivity and community engagement. While the workshop highlights contributions from Muslim individuals, people of all backgrounds and faiths are welcome to attend.
Expo Talk Panel: Foundation Models for Automated Trading Mon 14 Jul 08:00 a.m.
Hudson River Trading is a global trading firm and the leader in applying deep learning to financial markets. Every day our models process terabytes of market data from every financial product in the world, and must predict and act robustly under regime shifts, against adversarial participants, and under tight latency constraints. In this talk, we'll describe the problem that researchers at HRT actually solve. We'll introduce the rich but noisy market data that HRT processes and describe the businesses of providing liquidity and price discovery. Using data equivalent to trillions of tokens, we'll talk about the novel modeling, regularization, and engineering challenges we've solved to build the most predictive foundation models for financial markets in the world.
Tutorial: Mingzhi Wang · Chengdong Ma · Yaodong Yang
CANCELED: Alignment Methods for Large Language Models
Large Language Model (LLM) alignment has become an increasingly critical topic in contemporary AI research, especially as LLMs continue to scale and integrate into real-world applications. Ensuring that LLMs generate outputs aligned with human values, preferences, and ethical considerations is essential for their safe and effective deployment. This tutorial aims to provide a comprehensive introduction to LLM alignment methods, offering a structured and accessible entry point for researchers and practitioners interested in the field. It will present key concepts and challenges, introduce fundamental approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), and, building on these foundations, review a spectrum of refinements and variants. In addition, it will cover recent advancements in game-theoretic approaches to alignment and theoretical frameworks that provide a deeper understanding of alignment methodologies. Beyond theoretical insights, the tutorial will emphasize the practical aspects of LLM alignment, illustrating how these techniques are applied in real-world scenarios and guiding participants in building intuition about alignment strategies. By the end of the tutorial, attendees will gain a solid foundation in LLM alignment, equipping them with the knowledge needed to critically engage with the field, understand current research trends, and explore future directions.
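As a concrete taste of the DPO objective mentioned above, here is a minimal PyTorch-style sketch; the function, tensor values, and beta are illustrative and not taken from the tutorial's materials:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization (Rafailov et al., 2023).
    Inputs are per-sequence log-probabilities (summed over tokens)
    of the preferred (chosen) and dispreferred (rejected) responses
    under the trainable policy and the frozen reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer chosen over rejected responses.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy call with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.0]),
                torch.tensor([-12.9]), torch.tensor([-14.1]))
```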
Tutorial: Mark Tygert
There are many different notions of bias and fairness. When comparing subpopulations, an especially important dichotomy is between (1) equal or equitable average outcomes and (2) equal or equitable treatment. In the particular context considered here, "equal treatment" and "equal opportunity" are not too different. However, comparing the average outcome of one subpopulation to another is different and sometimes less desirable than comparing the outcomes of pairs of individuals (one individual from each subpopulation) for which the individuals in each pair are similar. The latter requires comparing outcomes via "conditioning on" or "controlling for" confounding covariates.
Conditioning on or controlling for covariates helps compare only those who are comparable. That often means matching up people by their age or income, for example, and then looking at differences in results between people with similar ages or similar incomes. Yet that raises the question: how many people with exactly the same age or exactly the same income are in the data? If there are too few, they will be unrepresentative, and the randomness in the results fails to average away. This would seem to call for matching up people whose ages or incomes are only close, but not exactly the same. How close is "close"? Does it matter?
Choosing how close is "close" turns out to make all the difference. In many cases, the data can be made to support any arbitrary conclusion simply by manipulating how close is considered "close." Conventionally, adjusting data for covariates such as age and income often ends up fudging the numbers, spinning facts or figures. Even the well-intentioned are susceptible to confirmation bias, cherry-picking, or otherwise making the data merely confirm expectations.
This tutorial shows how to avoid setting how close is "close." Without any parameter to tune, the tutorial's graphical methods and scalar summary statistics cannot mislead, not even in principle. These methods are thus well-suited for assessing bias, fairness, reliability, the calibration of predicted probabilities, and other treatment effects. The analysis applies to observational studies as well as to randomized controlled trials, including A/B tests. The most common use case is for analyzing the predictions or other outputs of machine-learned models.
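To give a flavor of such parameter-free diagnostics, here is a minimal sketch in the spirit of cumulative-differences methods (our illustration with hypothetical names, not the tutorial's code):

```python
import numpy as np

def cumulative_difference(covariate, outcomes, predictions):
    """Parameter-free diagnostic: sort by the covariate, accumulate
    the gap between observed outcomes and predictions, and report the
    curve plus its maximum absolute deviation, a Kolmogorov-Smirnov-
    style scalar summary with nothing like a bin width to tune."""
    order = np.argsort(covariate)
    gaps = np.asarray(outcomes)[order] - np.asarray(predictions)[order]
    curve = np.cumsum(gaps) / len(gaps)
    return curve, np.abs(curve).max()
```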
Tutorial: Qing Qu · Yuxin Chen · Liyue Shen
Harnessing Low Dimensionality in Diffusion Models: From Theory to Practice
Diffusion models have recently gained attention as a powerful class of deep generative models, achieving state-of-the-art results in data generation tasks. In a nutshell, they are designed to learn an unknown data distribution starting from Gaussian noise, mimicking the process of non-equilibrium thermodynamic diffusion. Despite their outstanding empirical successes, the mathematical and algorithmic foundations of diffusion models remain far from mature. For instance: (i) Generalization: it remains unclear how diffusion models, trained on finite samples, can generate new and meaningful data that differ from the training set; (ii) Efficiency: due to the enormous model capacity and the requirement of many sampling steps, they often suffer from slow training and sampling speeds; (iii) Controllability: it remains computationally challenging and unclear how to guide and control the content generated by diffusion models, raising challenges for controllability and safety as well as for solving inverse problems across many scientific imaging applications.
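As background, the standard denoising-diffusion training setup the tutorial builds on can be written as follows (common notation, not the presenters'):

```latex
% DDPM-style forward corruption and denoising objective:
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, \epsilon,\, t}
  \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|_2^2 .
```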
This tutorial will introduce a mathematical framework for understanding the generalization and improving the efficiency of diffusion models, through exploring the low-dimensional structures in both the data and model. We show how to overcome fundamental barriers to improve the generalization, efficiency, and controllability in developing diffusion models, by exploring how these models adaptively learn underlying data distributions, how to achieve faster convergence at the sampling stage, and what intrinsic properties the learned denoiser exhibits. Leveraging these theoretical studies, we will show how to effectively employ these properties to control the generation of diffusion models.
Expo Talk Panel: Structured Foundation Models Meets AutoML: Shattering the SOTA with AutoGluon & GraphStorm Mon 14 Jul 09:30 a.m.
Real-world data is messy, heterogeneous, and increasingly complex. Simultaneously, production systems must operate at scale with consistent performance. This paradox creates a significant challenge: how do we build sophisticated models that can handle complex data while maintaining production reliability? We present AWS's OSS advancements to bridge this gap by automating critical but time-consuming aspects of the ML pipeline. By providing low-code, easy-to-use frameworks that can handle tabular, graph, time series, and multi-modal data, we're democratizing access to sophisticated ML capabilities. This means businesses of all sizes - not just tech giants - can leverage ML for competitive advantage. In this talk, we will showcase state-of-the-art algorithms and research advancements, such as techniques for automatic graph construction from tabular data and efficient tabular model selection. Furthermore, we will share our recent approaches to push the boundaries of the AutoML domain with AutoGluon, including the integration of foundation models for improved time series forecasting and tabular data prediction, and an LLM-powered agent system for automated data science.
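For context, the entry point to AutoGluon's tabular AutoML is only a few lines; the file and column names below are placeholders:

```python
# Minimal AutoGluon tabular workflow; "train.csv" and "target" are illustrative.
from autogluon.tabular import TabularDataset, TabularPredictor

train = TabularDataset("train.csv")              # any labeled tabular data
predictor = TabularPredictor(label="target").fit(train)
print(predictor.leaderboard())                   # compare the trained models
```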
Tutorial: Amy Zhang · Benjamin Eysenbach
Generative AI Meets Reinforcement Learning
This tutorial explores the intersection of generative AI and reinforcement learning, demonstrating how generative models can be understood as RL agents and environments, and conversely, how RL can be viewed as generative modeling. It aims to bridge the gap between these fields, showing how insights from each can enhance the other. The tutorial will cover topics such as reinterpreting generative AI training through an RL lens, adapting generative AI to build new RL algorithms, and understanding how AI agents interacting with tools and humans create a new generative model. It will also discuss future directions and open problems, focusing on how RL can shape the future of foundation model training and enable generative AI systems to construct their own knowledge.
Tutorial: Dmitry Krotov · Benjamin Hoover · Parikshit Ram
Modern Methods in Associative Memory
Associative Memories like the famous Hopfield Networks are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they have experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities, and their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical hands-on mathematical derivations and coding notebooks.
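As a hint of the Transformer connection, the dense associative memory retrieval update can be sketched in a few lines (a standard formulation, not the presenters' notebooks):

```python
import numpy as np

def hopfield_retrieve(patterns, query, beta=8.0, steps=3):
    """Dense associative memory retrieval (Ramsauer et al., 2020):
    repeatedly apply xi <- X softmax(beta * X^T xi), an attention-like
    convex recombination of the stored patterns (columns of X)."""
    xi = np.array(query, dtype=float)
    for _ in range(steps):
        sims = beta * patterns.T @ xi        # similarity to each memory
        w = np.exp(sims - sims.max())
        w /= w.sum()                         # softmax over stored patterns
        xi = patterns @ w                    # move toward the nearest memory
    return xi
```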
Tutorial: Natalia Ponomareva · Sergei Vassilvitskii · Peter Kairouz · Alex Bie
DP-fy your DATA: How to (and why) synthesize Differentially Private Synthetic Data
This tutorial focuses on the increasingly important area of differentially private (DP) synthetic data generation, addressing the need for robust anonymization in machine learning. Creating DP synthetic data allows for data sharing and analysis without compromising individuals' privacy, opening up possibilities for collaborative research and model training. The tutorial aims to bridge the gap between various related fields, such as DP training, DP inference, and empirical privacy testing, providing a comprehensive guide for generating DP synthetic data across different data types.
The tutorial will cover various aspects of DP synthetic data generation, starting with an introduction to different types of synthetic data and their benefits. It will then provide a brief overview of differential privacy, focusing on the key concepts needed to understand the subsequent sections. The core of the tutorial will delve into specific methods for generating DP synthetic data for tabular, image, and text data, with a significant emphasis on text data generation. The tutorial will elaborate on the main components of a DP synthetic data generation system, including what privacy guarantees to aim for and what contribution constraints to apply to the data. It will also review best practices for handling sensitive data, and empirical privacy testing. Finally, the tutorial will conclude with a discussion of open questions and challenges in the field.
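As one concrete building block, the classic Gaussian mechanism underlying many DP pipelines can be sketched as follows (standard calibration, valid for epsilon < 1; full synthetic-data systems compose many such noisy releases or use DP training instead):

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta):
    """Release `value` with (epsilon, delta)-differential privacy by
    adding Gaussian noise scaled to the query's L2 sensitivity."""
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return np.asarray(value) + np.random.normal(0.0, sigma, np.shape(value))
```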

Tutorial: Jiaoda Li · Ryan Cotterell · Franz Nowak · Anej Svete
The Underlying Logic of Language Models
The formal basis of the theory of computation lies in the study of languages, subsets of Σ*, the set of all strings over an alphabet Σ. Models of computation can be taxonomized by the languages they can decide, i.e., for which languages a model can be used to determine membership. For instance, finite-state automata can decide membership in the regular languages. Language models are probabilistic generalizations of languages, where the notion of a set is relaxed into a probability distribution over Σ*. Recently, language models parameterized using recurrent neural networks, transformers, and state-space models have achieved enormous success in natural language processing. Similarly to how theorists have taxonomized models of deterministic computation, researchers have sought to taxonomize the expressivity of language models based on various architectures in terms of the distributions over strings they can represent. This tutorial presents a self-contained overview of the formal methods used to taxonomize the expressivity of language models, which encompass formal language and automata theory, various forms of formal logic, circuit complexity, and programming languages such as RASP. For example, we illustrate how transformers, under varying assumptions, can be characterized by different fragments of formal logic.
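Concretely, in this framing a language model is a distribution over Σ* that factorizes autoregressively (standard definitions, not notation specific to the tutorial):

```latex
% A language model as a distribution over strings:
p : \Sigma^* \to [0, 1],
\qquad \sum_{w \in \Sigma^*} p(w) = 1,
\qquad
p(w) = p(\mathrm{EOS} \mid w) \prod_{t=1}^{|w|} p(w_t \mid w_{<t}).
```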
Tutorial: Qiang Liu
Flowing Through Continuous-Time Generative Models: A Clear and Systematic Tour
Continuous-time generative models—particularly diffusion- and flow-based models—have emerged as a dominant paradigm in generative AI, with applications in image, video, molecular, and audio synthesis, as well as scientific modeling. Despite their success, the field’s rich mathematical structure, varied terminology, and subtle theoretical foundations often lead to confusion and fragmented understanding.
This tutorial offers a clear, unified, and accessible introduction to continuous-time generative models. Beginning with the simplified lens of rectified flow, we build a streamlined conceptual framework to support systematic exploration of the algorithmic landscape, while minimizing unnecessary mathematical overhead. We clarify commonly confused ideas and untangle key relationships—such as flow vs. diffusion, and the interplay between interpolation, noise schedules, and samplers. We also touch on advanced topics including distillation, control, and discrete and constrained generation in flow and diffusion models.
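The rectified-flow lens can be stated in one line: interpolate linearly between noise X₀ and data X₁, regress a velocity field onto the slope, then sample by solving an ODE (standard formulation):

```latex
X_t = (1 - t)\, X_0 + t\, X_1,
\qquad
\min_{v} \; \mathbb{E} \left\| (X_1 - X_0) - v(X_t, t) \right\|^2,
\qquad
\mathrm{d}Z_t = v(Z_t, t)\, \mathrm{d}t .
```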
Tutorial: Leena Chennuru Vankadara · Volkan Cevher
Training Neural Networks at Any Scale
At the heart of deep learning’s transformative impact lies the concept of scale—encompassing both data and computational resources, as well as their interaction with neural network architectures.
Scale, however, presents critical challenges, such as increased instability during training and prohibitively expensive model-specific tuning. Given the substantial resources required to train such models, formulating high-confidence scaling hypotheses backed by rigorous theoretical research has become paramount. The first part of the tutorial will provide an overview of significant advances in the theory of scaling in deep learning, covering its historical foundations, recent breakthroughs, and practical implications for training large-scale models.
To bridge theory and practice, the tutorial explores another key mathematical ingredient of scaling: the numerical solution algorithms commonly employed in deep learning, spanning domains from vision to language models. We unify these algorithms under a common master template, making their foundational principles transparent. In doing so, we reveal the interplay between adaptation to smoothness structures via online learning and the exploitation of optimization geometry through non-Euclidean norms.
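One way such a master template is often written, as a hedged sketch rather than the presenters' exact formulation, is steepest descent with respect to a chosen (possibly non-Euclidean) norm:

```latex
% Different norm choices recover different practical optimizers.
x_{k+1} = x_k - \eta_k\, u_k,
\qquad
u_k = \arg\max_{\|u\| \le 1} \, \langle \nabla f(x_k),\, u \rangle .
```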
Our exposition moves beyond simply building larger models—it emphasizes strategic scaling, offering insights that promise to advance the field while economizing on resources.
Tutorial: Hamed Hassani · Amin Karbasi · Alexander Robey
Jailbreaking LLMs and Agentic Systems: Attacks, Defenses, and Evaluations
Since the inception of large language models (LLMs), considerable attention has been directed toward the field of AI safety. These efforts aim to identify a range of best practices—including evaluation protocols, defense algorithms, and content filters—that facilitate the ethical, trustworthy, and reliable deployment of LLMs and related technologies. A key component of AI safety is model alignment, a broad concept referring to algorithms that optimize the outputs of LLMs to align with human values. And yet, despite these efforts, recent research has identified several failure modes—referred to as jailbreaks—that circumvent LLM alignment by eliciting unsafe content from a targeted model. And while initial jailbreaks targeted the generation of harmful information (e.g., copyrighted or illegal material), modern attacks seek to elicit domain-specific harms, such as digital agents violating user privacy and LLM-controlled robots performing harmful actions in the physical world. In the worst case, future attacks may target self-replication or power-seeking behaviors. The insidious nature of jailbreaking attacks represents a substantial obstacle to the broad adoption of LLMs. Therefore, it is critical for the machine learning community to study these failure modes and develop effective defense strategies that counteract them.
Over the past two years, research in both academia and industry has sought to design new attacks that stress test model safeguards, and to develop stronger defenses against these attacks. And, in general, this concerted work has resulted in safer models. Notably, highly performant LLMs such as OpenAI's o-series and Anthropic's Claude 3 models demonstrate significant robustness against a variety of jailbreaking attacks. However, the evolving arms race between jailbreaking attacks and defenses demonstrates that meeting acceptable standards of safety remains a work in progress. To provide a comprehensive overview of the evolving landscape in this field, this tutorial aims to present a unified perspective on recent progress in the jailbreaking community. In line with this goal, the primary objectives of this tutorial are as follows: (i) We will review cutting-edge advances in jailbreaking, covering new algorithmic frameworks and mathematical foundations, with particular emphasis on attacks, defenses, evaluations, and applications in robotics and agentic systems; (ii) Noting that the foundations of jailbreaking are still in their infancy, we will discuss the plethora of new directions, opportunities, and challenges that have recently been brought to bear by the identification of jailbreaking attacks; (iii) We will walk through a range of open-source Python implementations of state-of-the-art algorithms.


Tutorial: Aaditya Ramdas
Game-theoretic Statistics and Sequential Anytime-Valid Inference
Sequential anytime-valid inference (SAVI) provides measures of statistical evidence and uncertainty—e-values and e-processes for testing and confidence sequences for estimation—that remain valid at all stopping times. These allow for continuous monitoring and analysis of accumulating data and optional stopping for any reason. These methods crucially rely on nonnegative martingales, which are wealth processes of a player in a betting game, thus yielding the area of "game-theoretic statistics". This tutorial will present the game-theoretic philosophy, intuition, language and mathematics behind SAVI, summarized in a new book (https://arxiv.org/pdf/2410.23614), to be published before ICML as the first edition of the new book series, Foundations and Trends in Statistics.
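A one-line example of the betting picture (a standard toy case, not the book's notation): to test whether a coin with flips X_t ∈ {0,1} is fair, wager a predictable fraction λ_t of current wealth each round:

```latex
% H_0: P(X_t = 1) = 1/2; the constraint on lambda_t keeps wealth nonnegative.
K_0 = 1,
\qquad
K_t = K_{t-1}\,\bigl(1 + \lambda_t\,(X_t - \tfrac{1}{2})\bigr),
\qquad \lambda_t \in [-2, 2].
```

Under the null, K_t is a nonnegative martingale, so Ville's inequality bounds the probability that the wealth ever reaches 1/α by α; rejecting when K_t ≥ 1/α is therefore valid at any stopping time.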
Tutorial: Ziyu Yao · Daking Rai
Tutorial on Mechanistic Interpretability for Language Models
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. Given how fast this topic is now attracting the ML/AI community's attention, the goal of this tutorial is to provide a comprehensive overview of MI for LMs, including its historical contexts, the various techniques to implement and evaluate MI, findings and applications based on MI, and future challenges. The tutorial will particularly be presented following an innovative Beginner's Roadmap that the presenters carefully curated, aiming to enable researchers new to MI to quickly pick up this field and leverage MI techniques in their LM applications.
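As one example of the techniques covered, activation patching can be sketched with ordinary forward hooks; `model`, `layer`, and `metric` below are hypothetical handles, not a specific library's API:

```python
import torch

def activation_patching(model, layer, clean_ids, corrupt_ids, metric):
    """Run a 'corrupted' prompt while overwriting one layer's activation
    with the activation cached from a 'clean' prompt, then measure how
    much of the clean behavior is restored (a core MI technique)."""
    cache = {}
    def save_hook(mod, inp, out):
        cache["h"] = out                     # stash the clean activation
    def patch_hook(mod, inp, out):
        return cache["h"]                    # returning a value replaces out

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_ids)                     # cache clean activation
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupt_ids)  # corrupted run, patched layer
    handle.remove()
    return metric(patched_logits)
```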
Exhibit Hall: Exhibits Mon 14 Jul 04:00 p.m.
Expo Demonstration: WCA4Z Platform - Accelerating Legacy Code Modernization Using AI Agents, Deep Program Analysis, and Scalable AI Pipelines Mon 14 Jul 04:00 p.m.
Modernizing legacy mainframe applications is critical for enterprises seeking agility, security, and competitiveness. While these systems remain robust, they are becoming increasingly difficult to maintain due to skill shortages and integration complexity. We present WCA4Z, an AI-powered framework for understanding and transforming legacy code. WCA4Z combines static analysis, multi-agent systems, and large language models to automate key stages of the modernization lifecycle, including semantic code comprehension, modular refactoring, COBOL-to-Java transformation, and equivalence validation. The framework also offers strong support for LLM lifecycle management through configurable, reproducible workflows and an integrated dashboard.
Expo Demonstration: Prompt Declaration Language (PDL) Mon 14 Jul 04:00 p.m.
Programming with LLMs requires careful orchestration of prompts with workflow logic and agentic patterns. Unfortunately, previous frameworks rely on deeply buried prompts that work well for a specific model and pattern but are difficult to adapt to new settings. We designed PDL, a new programming language for LLMs. PDL keeps prompts at the forefront using YAML, with a small set of simple but powerful logic blocks to assemble workflows or agents. Prompt contexts are accumulated implicitly, simplifying model chaining. PDL comes with visual tools for observability and experimentation and the implementation automatically parallelizes model calls. PDL supports a variety of model providers, including but not limited to IBM's watsonx-ai with Granite models and the new granite-io library.
Expo Demonstration: Building Advanced Software Engineering LLM Agents with Codellm-Devkit (CLDK) Mon 14 Jul 04:00 p.m.
Large Language Model (LLM) agents are increasingly being used for code intelligence, automated reasoning, and software analysis. However, their effectiveness depends on their ability to perform deep, structured analysis across multiple programming languages. This workshop will introduce CLDK (CodeLLM DevKit), a powerful multilingual static and dynamic analysis framework designed to supercharge LLM-based coding agents. The demos will illustrate how CLDK enables fine-grained program understanding, reasoning, and transformation across diverse programming languages. We will demonstrate how CLDK can be integrated with LLM agents to enhance tasks such as code refactoring, automated debugging, test generation, and vulnerability detection. The workshop will feature interactive discussions, live demos, and hands-on exercises to help researchers and developers build next-generation AI-powered software agents.
Expo Demonstration: Excitement-Driven AI Sports Commentary Generation Mon 14 Jul 04:00 p.m.
Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody, a capability that many existing works fail to achieve satisfactorily. We propose a speech language model that explicitly represents prosody information and its relationship with text, and is thus surprisingly capable of generating expressive speech appropriate to the context.
In this demo, we combine our speech modeling technology with multi-modal language models into an expressive AI sports commentary generation system. The system analyzes tennis game videos and generates expressive play-by-play speech commentary. Notably, the system can detect the excitement level of the play from crowd and player reactions and adjust the excitement level of the generated speech accordingly.
Expo Demonstration: Production Strength Sales Agent Mon 14 Jul 04:00 p.m.
AI agents have been growing in significance as the next-generation technology for increasing user productivity. However, state-of-the-art agents have mostly been restricted to proofs of concept and great demos. We take a step toward production-ready AI agents by showcasing a number of innovative agentic middleware components. These components integrate seamlessly with the LLM to improve the robustness and efficiency of the agent.
Expo Demonstration: IBM software agent for code (ISAC) Mon 14 Jul 04:00 p.m.
Resolving issues from an issue tracker on a source-code repository is tedious and expensive when done by hand. Recently, the SWE-bench leaderboard has seen submissions by several LLM-based agents that do this automatically. Unfortunately, these agents rely on closed-source frontier models, making them expensive and raising data-sharing concerns for industrial use. In contrast, we built IBM software agent for code, which works with a variety of open-source models such as Llama, Granite, and Mistral. ISAC uses sub-agents that are specialized for sub-tasks of localization, editing, and testing. Each sub-task is within reach of the capabilities of an open-source model. Furthermore, ISAC uses automated checking and repair of various common mistakes made by models, uses structured formats for data passed between sub-agents, and uses ensembling at multiple levels.
Expo Demonstration: Knobs of the Mind: Dopamine, Serotonin, and a Maze-Running Rover Mon 14 Jul 04:00 p.m.
Modern deep-RL policies crack under distribution shifts because every new environment demands another slog of back-prop. We flip the script: first train once, then lock every weight and steer behaviour with three neuromodulatory “mood knobs.”
Dopamine-like reward gain fires up or damps down the urge to chase pay-offs.
Serotonin 5-HT2-like exploration gain widens or narrows the agent’s repertoire.
Serotonin 5-HT1-like risk penalty injects real-time caution when danger spikes.
These scalars mimic the way real neuromodulators gate cortical circuits: they change a neuron’s responsiveness in milliseconds without rewriting the synapse. That gives us a clean separation between slow structural learning (the frozen network) and fast functional adaptation (the gains). Shifting the knobs costs almost nothing computationally yet lets one policy jump across grid mazes, procedurally generated dungeons, and even onto a Jetson Nano quadruped robot dog, all while dodging the usual “catastrophic forgetting” trap.
The takeaway: treating reinforcement learning agents like brains - plastic weights plus fluid neuro-chemistry - delivers instant, reversible behavioural tuning and makes real-world deployment far less brittle.
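A minimal sketch of how such gain knobs could modulate a frozen policy (our illustration under assumed semantics, not the presenters' implementation):

```python
import numpy as np

def act(q_values, danger, reward_gain=1.0, explore_gain=1.0, risk_penalty=0.0):
    """Hypothetical gain-modulated action selection over frozen values:
    reward_gain scales payoff seeking (dopamine-like), explore_gain acts
    as a softmax temperature widening the repertoire (5-HT2-like), and
    risk_penalty subtracts a danger estimate (5-HT1-like)."""
    utilities = (reward_gain * np.asarray(q_values)
                 - risk_penalty * np.asarray(danger))
    logits = utilities / max(explore_gain, 1e-8)  # higher gain, flatter policy
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```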
Expo Demonstration: Real-World Autonomy: Building Modular, Voice-Guided Embodied Agents with SLMs and Vision Mon 14 Jul 04:00 p.m.
We present a new approach to embodied intelligence—one grounded in modular AI systems combining small language models (SLMs), vision models, and speech interfaces. This architecture enables fast, intuitive agent behavior—even in low-resource, real-world environments.
Our prototype, an AI-powered exoskeleton, performs physical tasks through natural human interaction. It operates in three modes: Shadow (mimic gestures), Command (respond to voice), and Training (learn by demonstration). High-level reasoning is handled by SLMs, while fast, modular controllers manage low-level control.
This approach removes the need for heavy simulations and makes it easier for engineers and researchers to build real-world systems with limited resources.
Expo Workshop: Uncertainty Estimation in LLM-Generated Content Mon 14 Jul 04:30 p.m.
The ability of Large Language Models (LLMs) to accurately estimate uncertainty is not just a theoretical concern; it’s a fundamental bottleneck hindering their safe and effective deployment in high-stakes, industrial-scale applications. The gap between model confidence and actual correctness poses an immediate and escalating risk. To mitigate these risks, this workshop convenes leading industry experts and academic researchers to confront the urgent challenges in LLM uncertainty estimation. We must define the state-of-the-art, establish rigorous evaluation standards, and forge a path toward reliable AI. This workshop will focus on:
- Calibration: How can we ensure LLMs’ confidence levels align with their true accuracy? (A minimal metric for this is sketched after this list.)
- Confidence-Aware Generation: What novel methods can enable LLMs to express their own uncertainty during content creation?
- Out-of-Distribution Detection: How do we equip LLMs to recognize and flag inputs that lie outside their training data?
- Uncertainty Communication: What are the most effective techniques for conveying LLM uncertainty to end-users, fostering trust and informed decision-making?
- Benchmarking: What metrics can measure how well models express and quantify uncertainty?
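On the calibration point above, a common starting metric is the Expected Calibration Error, a binned gap between confidence and accuracy; variable names in this sketch are illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence and average the
    |accuracy - confidence| gap, weighted by bin size."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```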
The insights and collaborations generated here will directly shape the future of LLM development and deployment.
Expo Talk Panel: Graph Foundation Models: Thoughts and Results Mon 14 Jul 04:30 p.m.
This will be a 30-60 minute presentation covering our ongoing work to generalize graph learning models across tasks. We’ll provide an overview of Graph Foundation Models (GFMs), defining them as single models designed to learn transferable representations for generalization across diverse graphs and tasks, contrasting them with traditional graph learning approaches. Then we’ll discuss the motivation for GFMs, advocating the need for transferable learning and generalization. We’ll highlight successful GFM examples in link prediction and node classification, while also acknowledging open challenges such as feature heterogeneity and task generalization. Finally, we’ll briefly explore the intersection of GFMs and Large Language Models (LLMs), including text-space approaches and enhancing LLM reasoning with graph structures.
We expect that this Expo will be of broad interest to ICML attendees.
All presenters are experts currently working in this area.
Bryan Perozzi has been working on graph machine learning for 10+ years, and has 20000+ citations in the area.
Michael Galkin has 2800 citations and an h-index of 28.
Expo Workshop: AI in Finance: Innovation & Emerging Opportunities Mon 14 Jul 04:30 p.m.
This dynamic 90-minute session features a series of engaging lightning talks showcasing the forefront of AI and Machine Learning within the financial services industry. Discover novel in-house innovations, including Grembe, a Capital One-developed system leveraging graph embeddings on transactional data for enhanced financial understanding across fraud detection and customer behavior modeling. Explore MACAW, an Advanced Quantitative Method employing multi-agentic Large Language Model workflows to tackle complex financial reasoning. Learn about our research on Fortifying Conversational AI through intelligent input guardrails that enhance the security and reliability of LLM-driven interactions. A key highlight will be spotlights on emerging research and innovative concepts from our academic partners as they address critical challenges in AI for finance. We aim to feature several distinct examples of this cutting-edge work, providing insights into novel approaches and potential future directions. Examples of explored areas include trustworthy and responsible AI (e.g., with Columbia University), pioneering reliable and ethical decision-making (e.g., with USC), and building world models for financial decision-making (e.g., with the University of Maryland). This session will provide ICML participants with a concise yet comprehensive overview of impactful AI research and applications in finance, fostering conversation and dialogue about notable advances, ongoing research, and critical challenges shaping the future of AI in this domain.
Expo Talk Panel: Building Production Ready Agentic Systems: Architecture, LLM-based Evaluation, and GRPO Training Mon 14 Jul 04:30 p.m.
In this talk we will discuss how we leverage the latest LLM and agentic patterns to create a Shopify assistant, Sidekick, with multiple skills to perform actions on the Shopify platform on behalf of merchants. We’ll talk about curation of datasets, tooling, MCP, post-training techniques (SFT and reinforcement learning with GRPO), prompting, structured generation through CFG, agent evaluation, experimentation, and even peek down several briefly-explored rabbit holes. We will also demonstrate the orchestration of the models, systems, and solutions that serve and improve Sidekick as it is currently offered to Shopify merchants.
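Since GRPO comes up above: its core trick replaces a learned value baseline with group-relative reward normalization, sketched here (standard formulation from Shao et al., 2024; the reward values are toy):

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages as in GRPO: sample several responses
    per prompt, score each, and normalize within the group, removing
    the need for a separate learned value model."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g., four sampled completions for one prompt, scored by an evaluator:
print(grpo_advantages([1.0, 0.0, 0.5, 0.0]))
```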
Expo Workshop: Evaluation of GenAI models Mon 14 Jul 04:30 p.m.
This workshop will explore cutting-edge research in evaluating and ensuring the trustworthiness of Generative AI, including Large Language Models (LLMs) and Diffusion Models. As these models become increasingly integrated into decision-making, robust evaluation is crucial. We'll delve into diverse strategies for building more reliable Generative AI across various applications. Topics include:
- Holistic Evaluation: Datasets, metrics, and methodologies.
- Trustworthiness:
  - Truthfulness: Addressing misinformation, hallucinations, inconsistencies, and biases.
  - Safety & Security: Preventing harmful and toxic content, and protecting privacy.
  - Ethics: Aligning with social norms, values, regulations, and laws.
- User-Centric Assessment: Evaluating models from a user perspective.
- Multi-Perspective Evaluation: Focusing on reasoning, knowledge, problem-solving, and user alignment.
- Cross-Modal Evaluation: Integrating text, image, audio, and other modalities.
This workshop aims to bring together researchers from machine learning, data mining, and related fields to foster interdisciplinary collaboration. Through invited talks, paper presentations, and panel discussions, we aim to share insights and spark collaborations between academia and industry. Researchers from various fields, including Data Mining, Machine Learning, NLP, and Information Retrieval, are encouraged to participate.