VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
A standard approach to representing a video is via a fixed spatiotemporal grid of tokens corresponding to the original 3D structure of the signal. These tokenization approaches, however, result in a fixed-length token sequence that is independent of the underlying input complexity. In addition, this grid structure biases tokens to focus on and capture local information from the original signal. In this work, we develop a tokenizer that learns to represent an input video in a coarse-to-fine manner, where early tokens encode the most salient semantic features of the whole video, while later tokens incrementally refine the representation with more fine-grained details. Additionally, we introduce an autoregressive temporal loss over the learned tokens that serves two purposes: first, it makes the tokens more suitable for subsequent autoregressive video modeling; second, it encourages the learning of higher-level abstractions that are more predictable over time. We study the representations learned through this process and evaluate their usefulness for downstream applications such as video modeling.
Strands Robots: Unifying Robot Control, Simulation, and Training Behind Natural Language
Despite rapid advances in vision-language-action (VLA) models, deploying robot intelligence remains fragmented: different SDKs for different robots, different policy frameworks with incompatible interfaces, an unbridged simulation-to-reality gap, and training pipelines that demand specialist expertise. We present Strands Robots, an open-source Python SDK that unifies the complete robot lifecycle (simulation, control, training, and deployment) behind natural language. Our central contribution is a Policy abstraction layer with a plugin registry supporting 18 VLA/WFM providers (50+ aliases) under a single three-method interface, enabling zero-code-change transfer between simulation and real hardware. We scale this abstraction across three axes: robot diversity (35 bundled models from MuJoCo Menagerie spanning arms, humanoids, quadrupeds, and dexterous hands), simulation fidelity (three backends: MuJoCo CPU, Newton GPU-differentiable with 4,096+ parallel environments, and Isaac Sim with RTX rendering), and policy ecosystem breadth (from 42M-parameter ONNX humanoid controllers at 135 Hz to 14B-parameter world action models). Built on the Strands Agents framework, every capability is exposed as a tool callable via natural language, enabling AI agents to autonomously design scenes, run experiments, collect data, train policies, and deploy to hardware. We demonstrate that this unified approach achieves practical results: GEAR-SONIC humanoid whole-body control at 135 Hz on Jetson AGX Thor, and seamless integration with NVIDIA's GR00T N1.7 and Cosmos 3. Strands Robots establishes a practical foundation for autonomous robot development where the barrier between idea and physical action is a single line of Python.
Multi-Agent System Design and Evaluation for Quantitative Finance
Quantitative finance imposes constraints that stress-test general-purpose agent architectures: data is non-stationary, latency budgets are tight, and subtle errors in temporal reasoning can invalidate an entire research pipeline. At Jump Trading, we build multi-agent systems that operate under these constraints across thousands of instruments and terabytes of daily market data, searching for structure in a regime characterized by extremely low signal-to-noise ratios and adversarial selection against all but the most rigorously designed strategies.
In this talk, we present results from firmwide and trading-specific benchmarks evaluating multi-agent architectures. Starting from a baseline of frontier single-agent systems in commonly used terminal harnesses, we compare variations across harness design, context management, inter-agent communication, parallel execution, and post-training for task specialization. We characterize the architectural choices in task decomposition, context scoping, and workflow structure that justify the additional complexity of harness design and post-training, highlighting environments where multi-agent systems and domain-focused subagents outperform a single context-rich frontier model. Finally, we discuss methods to improve evaluation quality in complex domains such as quantitative finance where proprietary data, scarce human labels, and heterogeneous composition of both tasks and technical environments preclude reliance on publicly available benchmarks. Additionally, we present critical steps towards deriving information-theoretic bounds as a function of entropy that guide the convergence of agent-based processes.
Frontiers in Evaluation, Rewards, and Agent Environments
As agents move toward real-world tasks with economic impact, evaluation and reward design are becoming increasingly complex. Scale AI will share insights from its research into recent trends at the frontier of evaluation, reward design, and agent environments. In particular, LLM evaluation is critical to model development: it defines the direction of improvement and unlocks RL scaling through automated feedback.
Unifying Attention and Diffusion with Kan Extension Transformers: Structured Deep Learning with Diagrammatic Backpropagation
Modern foundation models are powerful, but their representations, training dynamics, and agentic workflows remain difficult to audit, compose, and trust. This tutorial presents a categorical and geometric framework for trustworthy foundation-model systems. The major scientific components of the tutorial include:
- Diagrammatic Backpropagation (DB), which generalizes deep learning to include curvature loss functions over categorical diagrams
- Infinitesimal Causality (IC), which generalizes the chain rule in calculus to functors in tangent categories
- Kan Extension Transformers (KET), which define a structured computation substrate, unifying attention and diffusion, and providing a universal machine learning framework for mapping finite experience into infinite futures
- Universal Decision Learning (UDL), which is a rigorous categorical framework for building foundries, or building blocks of foundation models
- Lie-algebra-based neural adapters (ALLORA), which show how to compose LoRA adapters by detecting non-commutativity using Lie brackets
- Agentic skill optimization using Lie Algebroids (LASKO), which formalizes optimization over tangent Markdown categories
- Odyssey: a demonstration system for automatic foundry construction.
The tutorial is designed as a conceptual 2.5-hour overview. Technical details are deferred to associated arXiv papers and the Categories for AGI book. Participants will leave with a solid understanding of a powerful categorical and geometric design language for foundation-model systems that learn locally, transfer cautiously, expose obstructions, and glue global conclusions only when the evidence permits.
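As a rough illustration of what "detecting non-commutativity using Lie brackets" can mean for weight updates, the sketch below computes the matrix commutator [A, B] = AB - BA of two rank-1 updates. It assumes updates are composed multiplicatively as (I + A)(I + B), in which case the two orders differ by exactly the bracket. The 2x2 matrices and all helper names are hypothetical and are not taken from the tutorial or from ALLORA.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def lie_bracket(a, b):
    """Matrix commutator [a, b] = ab - ba."""
    ab, ba = matmul(a, b), matmul(b, a)
    return [[ab[i][j] - ba[i][j] for j in range(len(ab[0]))]
            for i in range(len(ab))]


def frobenius_norm(m):
    """Size of the bracket; zero means the two updates commute."""
    return sum(x * x for row in m for x in row) ** 0.5


def outer(u, v):
    """Rank-1 matrix u v^T, the shape of a rank-1 low-rank update."""
    return [[ui * vj for vj in v] for ui in u]


delta_a = outer([1.0, 0.0], [0.0, 1.0])  # [[0, 1], [0, 0]]
delta_b = outer([0.0, 1.0], [1.0, 0.0])  # [[0, 0], [1, 0]]
delta_c = outer([1.0, 0.0], [1.0, 0.0])  # [[1, 0], [0, 0]]
delta_d = outer([0.0, 1.0], [0.0, 1.0])  # [[0, 0], [0, 1]]

conflict = frobenius_norm(lie_bracket(delta_a, delta_b))    # non-zero: order matters
compatible = frobenius_norm(lie_bracket(delta_c, delta_d))  # zero: order is irrelevant
```

A non-zero norm flags a pair of updates whose result depends on the order of application; a zero norm means they can be merged in either order.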
Diffusion and Flow-Matching: From Memorization to Generalization & Beyond
Unlearning Data at Scale
Probabilistic Numerics — Computation is Machine Learning
Machine learning is the process of estimating latent representations or variables from finite data. If the data is insufficient, this inference process leaves a finite estimation error. Probabilistic (Bayesian) machine learning attempts to capture this empirical uncertainty in a probability distribution.
But what actually happens inside of a Learning Machine, the computational side of ML, is invariably the solution of a numerical problem: Optimisation for deep learning, solving differential equations for diffusion, flow matching, and scientific simulation, or even just (large-scale, approximate) numerical linear algebra. These numerical tasks have no analytic solution in reach. The computational resources are insufficient, and so the computation leaves a finite computational error. Probabilistic numerical methods attempt to capture this computational uncertainty in a probability distribution.
By matching the mathematical modelling language of the empirical and the computational side of machine learning in this way, probabilistic numerical methods open new opportunities for computational savings, and new functionality in the ML stack: Computational and data uncertainty can be controlled in relation to each other, and information from data can flow "backwards" through a computation to solve inverse problems. A growing research community within ML is developing this toolchain, typically by building on established, highly efficient, classic numerical methods.
The tutorial is split into three parts. We will start with a simple worked example to establish key concepts and patterns. A second part will generalise these insights into a design pattern across a large class of numerical tasks. Finally, a hands-on code demo will demonstrate how probabilistic numerical methods work in practice.
Proving Theorems with Lean and Machine Learning
AI agents can now write mathematics, including proofs of theorems relevant to Machine Learning, but we can’t trust them yet. Subtle errors might be hidden deep in the reasoning steps, and checking the proofs manually takes a lot of time and expertise.
The Lean theorem prover provides a way to write formal, machine-checkable proofs, giving us high confidence in their correctness. AI systems have managed to reach gold medal level at the International Mathematical Olympiad while producing Lean-checked proofs. Could we get them to write research-level, verified mathematics?
In this tutorial, we introduce Lean and its mathematical library Mathlib, and show how they can be used to write trusted proofs, in particular machine learning theory proofs. We then show how machine learning can help with theorem proving, and present recent advances in AI-assisted formalization.
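For readers who have not seen Lean before, the following is a minimal sketch of what a machine-checkable statement and proof look like. It is written in Lean 4 and uses only the core library rather than Mathlib; the theorem name `my_add_comm` is hypothetical, while `Nat.add_comm` is an existing core lemma.

```lean
-- A statement with a term-mode proof: addition of natural numbers commutes.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- The same fact proved with a tactic: rewriting with the lemma turns the
-- goal into `b + a = b + a`, which `rw` then closes by reflexivity.
example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]
```

If either proof were wrong, Lean would reject the file, which is the property that makes AI-generated proofs checkable without manual review.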
Seoul World Model: Grounding World Simulation Models in a Real-World Metropolis
What if world simulation models could operate not in imagined environments, but directly on real, living cities?
In this talk, we present Seoul World Model (SWM), a city-scale world simulation model grounded in real-world geospatial data. Unlike prior world models that generate plausible yet fictional environments, SWM leverages large-scale street-view imagery and retrieval-augmented generation to produce temporally consistent, spatially faithful simulations of an actual metropolis.
We will briefly introduce the core technical ideas behind SWM -- including cross-temporal pairing, synthetic trajectory augmentation, and the Virtual Lookahead Sink for long-horizon stability -- and demonstrate how these enable controllable, kilometer-scale simulation with realistic geometry, motion, and user-driven scenarios.
Beyond the technical contribution, this session aims to situate SWM within the rapidly evolving landscape of world models and physical AI. We will discuss concurrent efforts from both academia and industry, highlighting key differences between imagined-world generation and real-world grounded simulation.
The session will conclude with a panel discussion focusing on:
* Open research challenges, including dynamic object modeling, data quality, and scaling real-world grounding
* Business opportunities for companies operating large-scale street-view or map platforms (e.g., simulation-as-a-service, digital twins, autonomous driving data generation)
* Strategic implications of geospatial foundation models in the era of sovereign AI
By bringing together perspectives from research and industry, this session explores how world models can evolve from generative media technologies into core infrastructure for understanding and simulating the physical world.
The End-to-End AI Scientist: Automating Discovery and the Research Pipeline
Deploying AI for highly complex, open-ended discovery tasks exposes the limitations of current context windows and reasoning capabilities. This talk explores our latest multi-agent system designed to automate the deep scientific research pipeline. We will detail the "AI Scientist," an end-to-end architecture that autonomously generates novel hypotheses, translates concepts into experimental ML pipelines, and synthesizes verifiable research papers. We will unpack the design of specialized sub-agents, particularly focusing on our expert reviewer agent (ScholarPeer) which uses a novel "historian" approach to verify true novelty against historical literature, and the automated visual generator (PaperBanana) which tackles the complex multi-modal challenge of synthesizing accurate methodology figures. Attendees will gain insights into how we orchestrate and evaluate these workflows to reliably accelerate the pace of AI discovery.
From Thinking to Doing: Design Principles for Scaling Native Agent Capabilities with the Open-Source MiMo-V2.5 Model Family
The competitive frontier of foundation models has shifted from single-turn reasoning to sustained autonomous execution. The central question is no longer how well a model thinks, but whether it can operate as a reliable agent — maintaining coherence across thousands of decision steps, coordinating multimodal perception and action, and doing so within practical cost budgets. This talk distills three transferable design principles from our experience building and deploying the MiMo-V2.5 open-source model family.
Long-horizon stability requires architectural guarantees. A hybrid sliding-window / global attention scheme (6:1 ratio) compresses KV-cache by ~7× and enables native million-token context. MiMo-V2.5-Pro (1.02T / 42B active) sustains coherent trajectories over nearly 2,000 tool calls — autonomously completing a full SysY compiler in Rust (4.3 h) and an 8,192-line video editor (11.5 h), both passing all tests on first submission.
Token efficiency is the binding constraint on deployability. Architectural compression, 3-layer multi-token prediction, and Multi-Teacher On-Policy Distillation (MOPD) jointly yield 40–60% token savings over frontier models at matched performance (SWE-Bench Verified 78.9, TerminalBench 2.0 68.4).
Omnimodality closes the perception-expression loop. MiMo-V2.5 (310B / 15B active, 48T training tokens) unifies vision, audio, and language in a single sparse MoE. MiMo-V2.5-TTS enables instruction-steered emotion and timbre control with zero-shot voice design and few-second cloning. MiMo-V2.5-ASR achieves state-of-the-art recognition across dialects, code-switching, and noisy conditions via RL-augmented training.
All models are MIT-licensed. We share the trade-offs, failure modes, and scaling lessons behind each principle, and conclude with open questions toward agents capable of narrative-level planning and closed-loop embodied action.
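The "~7×" KV-cache figure quoted above for the 6:1 hybrid attention scheme can be sanity-checked with simple arithmetic. The sketch below assumes that every layer stores the same amount of cache per token, that sliding-window layers keep only the most recent `window` tokens, and that global layers keep the full context. The window size of 128 is a hypothetical value, not a number from the abstract.

```python
def kv_cache_compression(context_len: int, window: int, swa_per_global: int = 6) -> float:
    """Ratio of an all-global-attention KV cache to the hybrid KV cache.

    Counts cached tokens for one repeating group of layers:
    `swa_per_global` sliding-window layers plus one global layer.
    """
    layers_per_group = swa_per_global + 1
    full_cache = layers_per_group * context_len
    # Sliding-window layers never cache more than `window` tokens.
    hybrid_cache = swa_per_global * min(window, context_len) + context_len
    return full_cache / hybrid_cache


short_ratio = kv_cache_compression(context_len=100, window=128)       # context fits in the window
long_ratio = kv_cache_compression(context_len=1_000_000, window=128)  # million-token context
```

Under these assumptions the ratio is 7N / (6W + N), which tends to 7 as the context length N grows far beyond the window W, and equals 1 when the context still fits inside the window.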
vLLM-Hook: Live Programming of Model Internals on vLLM
vLLM-Hook is a modular plug-in library for vLLM that lets developers and researchers inspect, analyze, and intervene on internal model states during inference. The talk will present the core design of vLLM-Hook, including its configuration-driven hook interface, support for passive programming and active programming, and compatibility with practical deployment workflows. We will show how the system exposes internal signals such as attentions, attention heads, and activations, and how these signals can be used for real-time monitoring and controlled intervention without requiring model retraining. The session will highlight three concrete use cases from the project: prompt-injection detection through in-model monitoring, retrieval enhancement through selective retrieval and reranking signals, and activation steering for controlled generation. The goal of the talk is to give practitioners a clear view of how model-internal programming can become a practical capability in modern LLM serving stacks built on vLLM.
Model Optimization Flywheel: Continuously Self-Improving LLMs in Production
We present Shopify's Model Optimization Flywheel, a practical methodology for turning frontier-quality LLM behavior into faster, cheaper, and continuously improving production systems. The flywheel starts with reliable evaluation: LLM-as-judge evaluators grounded in human-labeled data become the canonical metrics for prompt optimization, distillation, and production regressions.
Using Tangle-powered experimentation workflows, we optimize frontier-model system prompts, collect training data from production A/B traffic and synthetic merchant/user rollouts, and distill smaller models with SFT, on-policy distillation, and GRPO. These models can replicate, and in some cases exceed, frontier-model behavior at much lower serving cost. We then compress prompts with gist tokens to reduce context overhead and improve latency. After deployment, the loop continues by sampling low-scoring production conversations, using stronger reasoning models to critique and "heal" them, folding repaired examples back into training, and re-running distillation. This flywheel has reduced serving cost and latency while improving production quality. We will share concrete recipes, quality-cost-latency trade-offs, and a blueprint for building self-improving LLM systems that get better and cheaper over time.
Tangent: Autonomous Auto-Research Agent for ML Pipelines
Tangent is an autonomous ML research agent built on Tangle, an open-source cross-cloud ML pipeline orchestrator (available on HuggingFace). Unlike hyperparameter search tools, Tangent automates the full iterative research cycle: given a goal — reproduce a result, fine-tune a model, push a metric — it designs a Tangle pipeline, selects components and hyperparameters, launches runs on CPU or GPU clusters, diagnoses its own failures, and iterates, with or without a human in the loop. Findings are captured in a persistent learnings corpus so each experiment builds on the last. In production use at Shopify, Tangent has reduced ML experiment cycles from ~3-7 days to ~1 day. We will demo Tangent live: starting from a research prompt, watching the agent build and submit a pipeline to a real compute cluster, and discussing results as they arrive.
AI for International E-Commerce: Generative Recommendation, Conversational Shopping Agents, and Multimodal Supply Chain Models
We will present three AI systems for international e-commerce, deployed across AliExpress, Lazada, and the supply chain. (1) Generative Recommendation at AliExpress: a generative recommendation system that reshapes traditional product discovery pipelines with generative AI techniques. It delivers substantial improvements in core business metrics and system compute utilization, provides highly adaptable support for diverse business requirements, and unlocks richer shopping experiences. Attendees can interactively input queries or categories to see real-time AI-generated personalized recommendations. (2) LazzieChat at Lazada: an AI chat assistant and intelligent shopping guide system integrating large models, agents, multimodal understanding, and e-commerce knowledge. Built around "user intent recognition" and "intent fulfillment," it covers product comparison, bundling, alternative recommendations, and list recommendations. It addresses long-tail queries through a Query Intent Rewrite Agent. Attendees will engage in multi-turn conversations, test visual search by uploading images, and experience localization features for Southeast Asian users. (3) Supply Chain Multi-Objective Multimodal Large Model with RLVR at AIDC: a technical presentation on a unified VLM architecture addressing "physical perception defocus" and "industry logic gap" in logistics tasks (weight estimation, dimension estimation, HSCode prediction, seasonality classification, logistics attributes, foldability discrimination). We will present the hybrid stratified resampling strategy, SFT + RLVR joint training, the multi-dimensional dynamic weighted reward function, and the GDPO algorithm that severs spurious causal chains. We will share experimental results showing significant improvements over the closed-source Qwen model across all six tasks, and introduce the "Season Identification Agent" combining human-machine collaboration with temporal sales features.
Show Your Work: Real-Time, Per-Turn Requirement Validation in an Open-Source Voice Assistant
We are demonstrating a live voice assistant, built on open IBM Granite 4.1 models, that lets ICML attendees watch a language model check its own work in real time, turn by turn, during a natural spoken conversation. Attendees walk up, speak to the assistant, and watch a panel beside it light up as the system generates several candidate responses in parallel and scores each one against a set of plain-English requirements: how the answer should sound, how long it should be, what it shouldn't say. Passing requirements turn green, failures turn red, and the first candidate that satisfies all of them is spoken back. Failures are shown, not hidden. Attendees can edit the requirements on the fly and hear the assistant's behavior change mid-conversation. The demonstration is hands-on and built for a research audience working on validated and controllable generation. Every piece of it (models, orchestration, frontend) is Apache-2.0 and runs on a single laptop with no external API, so any attendee can reproduce it after the session.
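The selection rule described above (score every candidate against every requirement, speak the first one that passes them all, and still show the failures) can be sketched in a few lines. This is an illustration only: in the demo the requirements are plain-English statements judged by a model, whereas here simple Python predicates stand in for them, and all names and example strings are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Requirement = Callable[[str], bool]
Report = Dict[str, bool]


def validate(candidate: str, requirements: Dict[str, Requirement]) -> Report:
    """Check one candidate against every requirement (True = green, False = red)."""
    return {name: check(candidate) for name, check in requirements.items()}


def pick_first_passing(
    candidates: List[str], requirements: Dict[str, Requirement]
) -> Tuple[Optional[str], List[Tuple[str, Report]]]:
    """Return the first candidate that satisfies all requirements, plus the
    full per-candidate reports so that failures stay visible."""
    chosen: Optional[str] = None
    reports: List[Tuple[str, Report]] = []
    for candidate in candidates:
        report = validate(candidate, requirements)
        reports.append((candidate, report))
        if chosen is None and all(report.values()):
            chosen = candidate
    return chosen, reports


# Stand-in requirements; editing this dict changes behavior on the next turn.
requirements: Dict[str, Requirement] = {
    "at most 12 words": lambda text: len(text.split()) <= 12,
    "does not mention prices": lambda text: "$" not in text,
    "sounds friendly": lambda text: "happy" in text.lower() or "please" in text.lower(),
}

candidates = [
    "It costs $5.",
    "Happy to help! The session starts at noon in Hall B.",
    "Please check the schedule.",
]

chosen, reports = pick_first_passing(candidates, requirements)
```

Keeping the full `reports` list, rather than stopping at the first pass, is what allows a panel to display red and green marks for every candidate.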
Fara1.5: A Family of Computer Use Agents
We present Fara1.5, a family of lightweight Computer Use Agent (CUA) models at three scales (4B, 9B, and 27B), each achieving state-of-the-art results on browser-use benchmarks among models of comparable size (e.g., 56.6% on Online-Mind2Web and 84.9% on WebVoyager for Fara1.5-9B). These models are the next evolution of Fara-7B and bring many advancements, such as more robust user interaction/oversight, more efficient task execution, and overall stronger task completion performance. The key driver behind these models’ capabilities is our synthetic data generation pipeline, FaraGen1.5, which builds synthetic environments and tasks at scale, has agents attempt them to collect demonstrations, and filters the resulting trajectories with a robust verifier. Overall, the Fara1.5 family of models delivers strong performance while being cost-effective and capable of running on-device or on modest hardware.
Pushing the Frontiers of Large-Scale 3D Modeling for Robotics & Beyond
3D understanding of the surrounding world is one of the key prerequisites for building truly foundational robotics models. In this demonstration, we will present to the audience some of the latest technologies that address this challenge. We will discuss a wide spectrum of methods ranging from 3D-aware positional encoding mechanisms in Transformers to Gromov-Wasserstein techniques with geodesic distances. We will present several robotics applications. The presentation will be complemented by a live demo involving the audience, showing some of the discussed techniques in action.
1. Learning the RoPEs: Better 2D & 3D Positional Encodings with STRING
https://arxiv.org/abs/2502.02562
https://sites.google.com/view/string-robotics
In this part of the workshop, we will present a new class of methods, called STRING, that extends popular RoPE mechanisms to 3D-aware positional encodings in Transformers while remaining exactly translation-invariant. We will show their applications, ranging from 3D scene understanding to designing robotic policies operating on data with depth information. The presented concepts will be accompanied by a live demo showing them in action.
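To illustrate the translation-invariance property mentioned above, the sketch below implements ordinary one-dimensional RoPE (not STRING itself) in plain Python and checks that the attention score depends only on the offset between the query and key positions. Feature pairs are represented as complex numbers; the frequencies and vectors are arbitrary illustrative values.

```python
import cmath


def rope(vec, pos, freqs):
    """Rotate each complex feature pair by the angle pos * freq."""
    return [z * cmath.exp(1j * pos * f) for z, f in zip(vec, freqs)]


def score(q, k):
    """Attention logit: the real dot product of the underlying 2-D pairs,
    written as the real part of q * conj(k)."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


freqs = [1.0, 0.1]
q = [complex(0.3, -1.2), complex(0.7, 0.5)]
k = [complex(-0.4, 0.9), complex(1.1, 0.2)]

# Same offset (3) at two different absolute locations.
near = score(rope(q, 5, freqs), rope(k, 2, freqs))
far = score(rope(q, 105, freqs), rope(k, 102, freqs))

# A different offset (2) gives a different score.
other = score(rope(q, 5, freqs), rope(k, 3, freqs))
```

The two scores `near` and `far` agree up to floating-point error because the rotations combine into a single factor that depends only on the position difference, which is the invariance that positional encodings of this family are designed to keep.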
2. RelFlexformer: Efficient Attention 3D Transformers for Integrable Relative Positional Encodings
https://arxiv.org/pdf/2605.10706
https://relflexformer.github.io/
RelFlexformers are a recently introduced, powerful class of 3D Transformers equipped with general additive relative positional encoding (RPE) techniques. These techniques significantly improve downstream performance on tasks ranging from 3D classification to 3D segmentation, for both point-cloud and depth-image modalities (often providing gains compared to regular Transformer models). RelFlexformers are also fully compatible with low-rank linear attention methods (such as Performers) via efficient RPE calculations with the Non-Uniform Fast Fourier Transform.
3. GenusSink: Unlocking Optimal Transport with Geodesic Distances for 3D Robotics
https://arxiv.org/abs/2605.09782
In this part of the workshop, we will focus on a new class of methods for efficiently solving the Wasserstein / optimal transport (OT) problem (or its regularized Sinkhorn version) for geodesic distances, using new tools from structural graph theory and computational geometry. Geodesic distances play a critical role in robotics and are used on a regular basis, in particular for graph representations of manifolds. The Gromov-Wasserstein distance can be used to define the similarity between complex objects and as such is relevant for pose estimation, motion tracking, and template detection techniques. The presented algorithms provide ways to conduct calculations with geodesic distances for the OT setting in near-linear time for the regular Sinkhorn-regularized Wasserstein problem, and in sub-cubic (quadratic or linear) time for the Gromov-Wasserstein lifting (involving two metric spaces).
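For orientation, here is the textbook dense Sinkhorn iteration for entropy-regularized OT, applied to geodesic (hop-count) distances on a small path graph. It is a baseline sketch only: each iteration costs time quadratic in the number of nodes, and it is not the near-linear-time method presented in this session. The graph, the marginals, and the regularization strength `eps` are hypothetical choices.

```python
import math


def sinkhorn(a, b, cost, eps=1.0, iters=200):
    """Entropy-regularized OT via alternating row and column scaling.

    a, b : source and target marginals (positive, each summing to 1)
    cost : matrix of pairwise geodesic distances
    Returns the transport plan as a nested list.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: cheap moves get large weights.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Geodesic distance on a path graph 0 - 1 - 2 - 3 with unit-length edges.
n_nodes = 4
cost = [[float(abs(i - j)) for j in range(n_nodes)] for i in range(n_nodes)]

a = [0.4, 0.4, 0.1, 0.1]  # mass concentrated on the left end
b = [0.1, 0.1, 0.4, 0.4]  # mass concentrated on the right end

plan = sinkhorn(a, b, cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(n_nodes)) for j in range(n_nodes)]
transport_cost = sum(plan[i][j] * cost[i][j]
                     for i in range(n_nodes) for j in range(n_nodes))
```

After convergence the row and column sums of `plan` match `a` and `b`. The dense kernel matrix is the bottleneck that structure-aware methods for geodesic distances aim to avoid.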
4. Graph Random Features
https://arxiv.org/abs/2305.00156
https://arxiv.org/abs/2310.04859
Graph Random Features (GRFs) provide new continuous representations of points in graph metric spaces defined via graph kernels, as well as representations of entire graphs. Furthermore, they come with strong theoretical guarantees via the theory of graph kernels (potentially learnable, e.g. with deep neural networks). They are also efficient to compute, providing a gateway to explicitly modeling graphs of hundreds of thousands of nodes and more. In this part of the presentation, we will provide an introduction to the theory of GRFs, show how we can further scale them up to implicitly defined networks, and discuss several applications, most notably in particle-based dynamics models for robotics and quantum computations.
Xiaomi GUI Agent: Cross-Device Intelligent Task Execution via Natural Language
We demonstrate Xiaomi GUI Agent, a cross-device intelligent assistant system that enables users to control smartphones through natural language commands issued on a PC. A user types a natural language instruction (e.g., "Order a coffee from the Luckin app and send me the receipt") in a PC messaging app (e.g., Feishu/Lark). The instruction is relayed to the smartphone GUI Agent, which visually perceives the screen, reasons about the task, and autonomously operates the phone — tapping buttons, typing text, navigating menus — to complete the task. The result (e.g., a screenshot or text confirmation) is sent back to the user's PC chat. This demonstration showcases the practical deployment of vision-language models for real-world GUI automation, highlighting the agent's ability to handle multi-step, cross-app tasks on real smartphones with real applications.
Scaling Deep Learning in Financial Markets
Hudson River Trading (HRT) is a global trading firm that leverages deep learning to navigate the complexities of the world's financial markets. Every day, our models process petabytes of high-fidelity data, seeking to extract signal from trillions of events across thousands of interconnected products. Operating at this scale requires a unique intersection of frontier machine learning research and high-performance engineering.
In this talk, we will discuss our approach to building and deploying large-scale market models—foundation models designed to capture the latent structures of price discovery. We’ll share insights into the research hurdles of training on massive, non-stationary datasets and the engineering constraints of performing real-time inference at microsecond scale. Beyond prediction, we will touch on the challenges of maintaining robustness amidst highly dynamic market conditions and rapid regime shifts. Join us for a look into how we apply the next generation of deep learning to navigate one of the world’s most competitive and data-rich environments.
Operationalizing Trustworthy AI in Industry
As GenAI and Agentic workflows transition from research benchmarks to core infrastructure in many traditional industries, the mandate for Trustworthy AI has shifted from a conceptual ideal to a rigorous engineering requirement. However, traditional enterprise deployment often treats trustworthiness as a checklist of independent constraints—fairness, privacy, robustness, explainability, and so on—rather than a coupled system. In this talk, we bridge the gap between algorithmic research and high-stakes deployment by examining the inherent trade-offs between aspects of trust. We demonstrate how common technical solutions meant to improve trust in one aspect often undermine trust in others. This presents a major roadblock to operationalizing algorithmic solutions in practice. Finally, we propose directions that researchers can follow to resolve these concerns, and guidance that practitioners can use today to manage these issues while facing strict regulatory and operational constraints.
Scientific benchmarks are saturating faster than the field can replace them. SciCode rose from 4.6% to 59% and HLE from 8% to 50% within a year, yet these gains show limited correlation with improvements in actual scientific workflows. The core issue is structural: current evaluations test isolated subtasks, while the highest-cost steps in scientific work (simulation setup, debugging, literature synthesis, replication) remain largely unmeasured.
This talk presents a framework for frontier scientific evaluation grounded in real-world deployment signals and workflow productivity. It rests on three components: deliverable-based task design, where the unit of evaluation is a work product rather than a final answer; productivity-oriented scoring, measuring time to reviewable draft, iteration efficiency, and salvageability under expert oversight; and substep decomposition, where tasks mirror real workflow stages and each substep is independently gradable for diagnostic signal and scalable verification.
We map O*NET task-importance data against benchmark coverage, showing that the most time-intensive workflow steps have minimal representation. We describe how real workflows in high-stakes enterprise domains such as semiconductor design and manufacturing can be captured and decomposed into structured evaluation tasks through domain collaboration, and present pilot findings comparing scientists completing multi-step tasks manually, with SOTA models, and with models fine-tuned on targeted scientific data.
Attendees will leave with the Signal-to-Value Ladder, six criteria for assessing whether an evaluation predicts real-world scientific and economic productivity. We close with a forward look: as sample-level data delivery approaches its economic ceiling, the field will need to shift from samples and datasets toward modular, composable capability infrastructure, much as software delivery evolved from project-level code to libraries and APIs.
Calibration: From Predictions to Decisions, Collaboration, and Alignment
Adaptive Reasoning in LLMs: From Post-Training to Test-Time Learning
We are seeing more numerical optimization theory papers published than ever before. These papers often make unrealistic assumptions or propose algorithms that never get adopted. So is all this optimization theory largely useless?
In this tutorial I show how some surprisingly simple optimization ideas can explain a wide variety of the implementation choices we make when training modern deep learning models. Some of these ideas might have let us skip some generations of grad-student descent, or have led to state-of-the-art tricks in modern architectures. On the other hand, I will highlight how some important practical ideas are not explained by optimization theory and where we can go from here.
Here is a list of keywords to get you (and your LLM sidekick) interested in attending: Adam and []A[]d[]a[]m[*], Muon and its friends/enemies, critical-ish batch size, the RMSnorm and skip connection love affair, dead ReLUs and living SwiGLU, Schedule-Free and WSD and muP and max_grad_norm = 1.0, variance reduction and shuffle=True, and maybe edge-of-stability/catapults/feature-learning. I may also tell you why your second-order stochastic optimization method did not work.
New Techniques for Sequence Prediction: Spectral Filtering and Preconditioning
Evaluating and Training LLMs for Math Copilots and Theorem Proving
From Digital Agents to Physical Intelligence: The Agentic Harness as a Unifying Architectural Pattern
You've built prompts. You've engineered context windows. But when your agent hallucinates in production, drops tools mid-task, or spirals in a loop—that's not a model problem. That's a harness problem.
In this workshop, we explore harness engineering in depth: the discipline of building the runtime infrastructure that wraps a foundation model and turns it into a reliable, production-grade agent. We also show how the same pattern translates to physical AI.
What we'll cover:
1. What a harness refers to. We trace where the term comes from, how the industry interprets it today, and why it's the layer that actually determines whether your agent works.
2. The Components & Why They Matter. We break down the six core components (context management, tool registry, verification loops, state & memory, safety controls, and observability) and show why getting these right matters more than picking the "best" model.
3. Whiteboarding the Harness and production patterns. We map out real industry patterns (master loop, initializer-worker, handoff mesh, workflow graphs) grounded in real customer case studies.
4. The Harness Effect experiment. What happens when we hold the model fixed and change only the harness? We walk through this experiment and its results to show how the execution layer alone can influence an agent's outcome. You also get to try it out.
5. Physical AI: The Harness Goes Embodied. We'll cover how a robot's control loop (observe → encode → decide → act → repeat) uses the same harness pattern. We break this down into the core pillars of physical intelligence with real project cases:
- Perception & Sequential Data: Handling multi-modal sensory inputs and managing continuous time-series state.
- Robotics Foundation Model: Operating the central model to drive physical decision-making under real-world constraints.
- Simulation & Performance Evaluation: Establishing safe, rigorous testing environments to isolate and benchmark execution.
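The observe → encode → decide → act → repeat loop from item 5 can be sketched as a small harness that owns the loop, a safety clamp, and a log, while the "model" is just a pluggable `decide` callable. Everything here is hypothetical and chosen for illustration: the function names, the one-dimensional toy world, and the proportional controller standing in for a robotics foundation model are assumptions, not the workshop's implementation.

```python
from typing import Callable, Dict, List, Tuple


def run_control_loop(observe: Callable[[], float],
                     encode: Callable[[float], float],
                     decide: Callable[[float], float],
                     act: Callable[[float], None],
                     done: Callable[[float], bool],
                     max_steps: int = 50,
                     action_limit: float = 1.0) -> Tuple[int, List[Tuple[int, float, float]]]:
    """Harness: owns the loop, the safety clamp, and the log (observability)."""
    log: List[Tuple[int, float, float]] = []
    for step in range(max_steps):
        obs = observe()                      # observe
        state = encode(obs)                  # encode
        if done(state):                      # verification: stop when the goal is met
            return step, log
        action = decide(state)               # decide (the "model")
        # Safety control: clamp the action to a fixed range before acting.
        action = max(-action_limit, min(action_limit, action))
        act(action)                          # act
        log.append((step, obs, action))
    return max_steps, log


def demo() -> Tuple[float, int, int]:
    """Toy world: drive a 1-D position toward a target at 0."""
    world: Dict[str, float] = {"pos": 5.0}
    target = 0.0

    def apply_action(action: float) -> None:
        world["pos"] += action

    steps, log = run_control_loop(
        observe=lambda: world["pos"],
        encode=lambda obs: obs - target,           # state = signed error
        decide=lambda err: -0.5 * err,             # proportional controller
        act=apply_action,
        done=lambda err: abs(err) < 0.01,
    )
    return world["pos"], steps, len(log)


if __name__ == "__main__":
    print(demo())
```

Swapping the `decide` callable changes the policy without touching the loop, clamp, or logging, which is the separation between model and harness that the abstract describes.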
Come see what it takes to build an agent that actually works, then watch the same pattern navigate the physical world.
Agentic Forecasting and Multi-Agent Ecosystems: From Predictive Reasoning to Secure Enterprise Deployment
Time-series forecasting is undergoing a fundamental shift as traditional statistical models struggle to incorporate dynamic real-world context and counterfactual logic. This workshop explores the frontier of "Agentic Forecasting," focusing on how multi-agent architectures can augment numerical predictions with qualitative reasoning and dynamic scenario planning. We will deeply examine the systems engineering required to create continuous loops of analyzing multimodal data, simulating outcomes and calibrating predictions in real-time, with a major emphasis on designing robust "What-If" agents capable of parsing complex business interventions to output verifiable and narrative-driven forecasts. Translating these advanced predictive models from architectural research into real-world impact requires orchestrating complex and domain-specific ecosystems. To demonstrate this at scale, the workshop will also highlight how multi-agent frameworks are automating specialized workflows beyond predictive analytics—ranging from autonomous scientific discovery and end-to-end publication pipelines to creative cinematic video orchestration. Finally, to ensure the reliability of these autonomous systems in high-stakes production environments, we will address the practical bottlenecks of deployment through the lens of advanced agent security, automated red-teaming and formal verification, ultimately providing attendees with actionable blueprints for building robust, secure, and highly capable agentic ecosystems.
The Generative Turn in Search and Recommendation: Foundations, Scale, and Frontiers
Search and recommendation systems — the engines powering web search, short-video feeds, music, news, advertising, and e-commerce — are undergoing their most significant paradigm shift in a decade. Multi-stage discriminative pipelines are giving way to unified generative foundation models that directly produce items, queries, and entire user trajectories, and in doing so eliminate cascaded error propagation, improve hardware utilization, and unlock optimization objectives that reach far beyond next-click prediction. Driven by rapid progress in large language models and the discovery of scaling laws for search and recommendation, this transition is quietly reshaping how the field thinks about retrieval, ranking, and personalization at the largest scales.
Realizing the paradigm in production, however, is a different challenge entirely: modern discovery platforms must serve billions of users, reason over catalogs that are approaching billions of candidate items, and respond within hundreds of milliseconds, all while training models that consume orders of magnitude more compute than their discriminative predecessors. This Expo workshop brings together leading researchers and industrial practitioners through invited talks and a moderated panel to confront the questions the community is asking right now:
- Are search and recommendation finally converging into a single foundation model?
- What are the right scaling laws for discovery, and where do they bend?
- How do generative systems remain efficient at trillion-item scale while meeting sub-50-millisecond latency budgets?
- And how should they handle non-stationary catalogs, continuous learning, cold-start items, fairness, and multi-objective trade-offs in real production traffic?
Reliable and Efficient LLM Outputs with Mellea + Granite OSS Libraries
Every LLM application eventually runs into the same wall: the model generates plausible-sounding output that is wrong, off-format, or unsafe, and there is nothing between generation and delivery to catch it. Prompting the model harder sometimes helps, but it is not reliable.
This workshop teaches a systematic approach to the problem using two open-source IBM tools: Mellea, a Python library for structured LLM generation, and Granite Libraries, a collection of lightweight LoRA adapters that score generated output against developer-defined requirements. Together they implement an Instruct-Validate-Repair loop — generate a response, measure it against your requirements, and select or retry before it reaches the user.
Participants start with a plain chatbot that hallucinates citations, ignores formatting rules, and produces uncontrolled output. By the end of the session, the same chatbot validates every response against a set of requirements defined in plain English, generates multiple candidates in parallel, and automatically selects the best one — all running locally, all on open-source models.
No cloud accounts, no audio hardware, no frontend build. A working environment takes under five minutes to set up.
What you will build: a command-line chat application that grows module by module — from a bare Mellea generation call, to single-requirement scoring, to a parallel Best-of-N validation loop with multiple Granite Libraries adapters firing simultaneously.
What you will leave with: a mental model of how to enforce output quality programmatically, hands-on experience writing and tuning natural-language requirements, and a local codebase you can adapt to your own domain.
Technologies covered: Mellea, Granite Libraries (activated LoRA adapters), IBM Granite 4.0, Python, OpenAI-compatible inference backends (LM Studio, Ollama, vLLM).
All tools and models used are Apache 2.0 licensed and available on HuggingFace.
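The Instruct-Validate-Repair loop with parallel Best-of-N selection described above can be sketched in plain Python. This sketch deliberately does not use the Mellea or Granite Libraries APIs: `fake_generate` stands in for a model call, the requirement checkers are simple string tests standing in for adapter scores, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

# A requirement maps a candidate response to a score in [0, 1].
Requirement = Callable[[str], float]

# Hypothetical requirements; a real system would score these with a model.
REQUIREMENTS: Dict[str, Requirement] = {
    "starts with 'Answer:'": lambda text: 1.0 if text.startswith("Answer:") else 0.0,
    "cites a source": lambda text: 1.0 if "[source:" in text else 0.0,
}

_CANNED = [
    "the answer is 42",
    "Answer: 42 [source: doc1]",
    "Answer: forty-two",
    "Answer: 42",
]


def fake_generate(prompt: str, seed: int) -> str:
    """Stand-in for an LLM call: returns a canned response chosen by seed."""
    return _CANNED[seed % len(_CANNED)]


def score(candidate: str, requirements: Dict[str, Requirement]) -> float:
    """Mean requirement score for one candidate."""
    return sum(req(candidate) for req in requirements.values()) / len(requirements)


def best_of_n(generate: Callable[[str, int], str], prompt: str,
              requirements: Dict[str, Requirement], n: int = 4,
              threshold: float = 1.0, max_rounds: int = 3) -> Tuple[str, float]:
    best, best_score = "", -1.0
    for round_idx in range(max_rounds):
        # Instruct: generate n candidates in parallel.
        with ThreadPoolExecutor(max_workers=n) as pool:
            candidates = list(pool.map(
                lambda i: generate(prompt, round_idx * n + i), range(n)))
        # Validate: score every candidate and keep the best so far.
        for cand in candidates:
            s = score(cand, requirements)
            if s > best_score:
                best, best_score = cand, s
        if best_score >= threshold:
            break
        # Repair: name the failed requirements in the next prompt.
        failed = [name for name, req in requirements.items() if req(best) < 1.0]
        prompt = f"{prompt}\nPlease fix: {', '.join(failed)}"
    return best, best_score


if __name__ == "__main__":
    print(best_of_n(fake_generate, "What is 6 x 7?", REQUIREMENTS))
```

Keeping requirements as named, independent checks makes the repair step straightforward: the names of the failed checks can be fed back into the next prompt.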
AI for Science: Foundation Models and Agentic Systems for Closed-Loop Discovery
AI for Science is expanding from model development to full discovery systems that can support scientific workflows. Foundation models provide general representations and generative priors for scientific data, and agentic systems organize reasoning and multi-step decision processes, forming a continuous loop of hypothesis generation, experimental design, observation, and refinement.
This workshop will examine the methodological questions that arise in building such systems across materials science and biomedicine, including superconducting materials, virtual cells, drug discovery, cell and spatial omics, and proteomics. The discussion will focus on shared challenges in multimodal representation learning, integration of scientific constraints and prior knowledge, planning over tools and experiments, coordination between models and experimental workflows, and evaluation in closed-loop discovery settings.
By bringing together researchers working on scientific foundation models and agentic AI, the workshop will provide a forum for discussing technical challenges and emerging research directions in next-generation AI for Science.