A standard approach to representing a video is via a fixed spatiotemporal grid of tokens corresponding to the original 3D structure of the signal. These tokenization approaches, however, result in a fixed-length token sequence that is independent of the underlying input complexity. In addition, this grid structure biases tokens to focus on and capture local information from the original signal. In this work, we develop a tokenizer that learns to represent an input video in a coarse-to-fine manner, where early tokens encode the most salient semantic features of the whole video, while later tokens incrementally refine the representation with more fine-grained details. Additionally, we introduce an autoregressive temporal loss over the learned tokens that serves two purposes: first, it makes the tokens more suitable for subsequent autoregressive video modeling; second, it encourages the learning of higher-level abstractions that are more predictable over time. We study the representations learned through this process and evaluate their usefulness for downstream applications such as video modeling.
Multi-Agent System Design and Evaluation for Quantitative Finance
Quantitative finance imposes constraints that stress-test general-purpose agent architectures: data is non-stationary, latency budgets are tight, and subtle errors in temporal reasoning can invalidate an entire research pipeline. At Jump Trading, we build multi-agent systems that operate under these constraints across thousands of instruments and terabytes of daily market data, searching for structure in a regime characterized by extremely low signal-to-noise ratios and adversarial selection against all but the most rigorously designed strategies.
In this talk, we present results from firmwide and trading-specific benchmarks evaluating multi-agent architectures. Starting from a baseline of frontier single-agent systems in commonly used terminal harnesses, we compare variations across harness design, context management, inter-agent communication, parallel execution, and post-training for task specialization. We characterize the architectural choices in task decomposition, context scoping, and workflow structure that justify the additional complexity of harness design and post-training, highlighting environments where multi-agent systems and domain-focused subagents outperform a single context-rich frontier model. Finally, we discuss methods to improve evaluation quality in complex domains such as quantitative finance where proprietary data, scarce human labels, and heterogeneous composition of both tasks and technical environments preclude reliance on publicly available benchmarks. Additionally, we present critical steps towards deriving information-theoretic bounds as a function of entropy that guide the convergence of agent-based processes.
As large language models (LLMs) evolve from short-burst chatbots into long-horizon autonomous agents, progress is increasingly bottlenecked by verification asymmetry: rapid gains in domains with cheap correctness signals (e.g., math and code) contrast sharply with limited progress in tasks with weak or delayed verification, such as research planning and strategic decision-making. This talk argues that evaluation and reinforcement learning (RL) beyond easily verifiable domains are the next critical frontier for AI capability.
We present results from three new evaluation frameworks. Humanity’s Last Exam (HLE) shows that frontier models are frequently wrong and overconfident at the human-expert level. The Remote Labor Index (RLI) demonstrates that current agents automate only ~2.5% of real, paid freelance work. Visual ToolBench reveals that 70–80% of multimodal agent failures stem from visual perception rather than reasoning.
To close these gaps, we introduce Rubrics as Rewards (RaR) within a Group Relative Policy Optimization (GRPO) framework. We show that Dynamic Rubrics, which adaptively elicit evaluation criteria by contrasting model outputs during training, outperform static human-written rubrics and reduce reward hacking in the high-reward regime. These findings motivate a shift from static benchmarks to high-fidelity RL environments, such as Scale Gymnasium, that train agents through interaction rather than imitation.
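To make the reward signal concrete, here is a minimal sketch of a rubric-derived reward feeding GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the speakers' implementation: the rubric is static (the Dynamic Rubrics procedure is not modeled), the weighted-fraction reward and the names `rubric_reward` and `group_relative_advantages` are hypothetical, and the advantage is the usual within-group standardization of rewards.

```python
from statistics import mean, pstdev


def rubric_reward(checks, weights):
    """Weighted fraction of rubric criteria a response satisfies (0.0 to 1.0)."""
    total = sum(weights)
    return sum(w for ok, w in zip(checks, weights) if ok) / total


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric with three criteria and illustrative weights.
weights = [2.0, 1.0, 1.0]

# One row per sampled response: which criteria a judge marked as satisfied.
group_checks = [
    [True, True, True],
    [True, False, False],
    [False, True, False],
    [False, False, False],
]

rewards = [rubric_reward(checks, weights) for checks in group_checks]
advantages = group_relative_advantages(rewards)
```

Responses that satisfy more of the rubric than their group average receive a positive advantage, the rest a negative one, so no separate value model is needed.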
Strands Robots: Unifying Robot Control, Simulation, and Training Behind Natural Language
Despite rapid advances in vision-language-action (VLA) models, deploying robot intelligence remains fragmented: different SDKs for different robots, different policy frameworks with incompatible interfaces, an unbridged simulation-to-reality gap, and training pipelines that demand specialist expertise. We present Strands Robots, an open-source Python SDK that unifies the complete robot lifecycle (simulation, control, training, and deployment) behind natural language. Our central contribution is a Policy abstraction layer with a plugin registry supporting 18 VLA/WFM providers (50+ aliases) under a single three-method interface, enabling zero-code-change transfer between simulation and real hardware. We scale this abstraction across three axes: robot diversity (35 bundled models from MuJoCo Menagerie spanning arms, humanoids, quadrupeds, and dexterous hands), simulation fidelity (three backends: MuJoCo CPU, Newton GPU-differentiable with 4,096+ parallel environments, and Isaac Sim with RTX rendering), and policy ecosystem breadth (from 42M-parameter ONNX humanoid controllers at 135 Hz to 14B-parameter world action models). Built on the Strands Agents framework, every capability is exposed as a tool callable via natural language, enabling AI agents to autonomously design scenes, run experiments, collect data, train policies, and deploy to hardware. We demonstrate that this unified approach achieves practical results: GEAR-SONIC humanoid whole-body control at 135 Hz on Jetson AGX Thor, Cosmos Predict 2.5 achieving 98.5% on LIBERO-10, and seamless integration with NVIDIA's GR00T N1.5/N1.6, Cosmos Transfer 2.5, DreamGen, and DreamZero pipelines. Strands Robots establishes a practical foundation for autonomous robot development where the barrier between idea and physical action is a single line of Python.
Unlearning Data at Scale
Probabilistic Numerics — Computation is Machine Learning
Unifying Attention and Diffusion with Kan Extension Transformers: Structured Deep Learning with Diagrammatic Backpropagation
Diffusion and Flow-Matching: From Memorization to Generalization & Beyond
Proving Theorems with Lean and Machine Learning
vLLM-Hook: Live Programming of Model Internals on vLLM
vLLM-Hook is a modular plug-in library for vLLM that lets developers and researchers inspect, analyze, and intervene on internal model states during inference. The talk will present the core design of vLLM-Hook, including its configuration-driven hook interface, support for passive programming and active programming, and compatibility with practical deployment workflows. We will show how the system exposes internal signals such as attentions, attention heads, and activations, and how these signals can be used for real-time monitoring and controlled intervention without requiring model retraining. The session will highlight three concrete use cases from the project: prompt-injection detection through in-model monitoring, retrieval enhancement through selective retrieval and reranking signals, and activation steering for controlled generation. The goal of the talk is to give practitioners a clear view of how model-internal programming can become a practical capability in modern LLM serving stacks built on vLLM.
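The difference between passive and active programming can be illustrated with a toy hook registry. This sketch does not use vLLM or the vLLM-Hook API; `ToyModel`, `register_hook`, `record`, and `steer` are hypothetical names, and the "layers" are plain Python functions standing in for model components. It only shows the pattern: a passive hook observes activations, while an active hook returns replacement activations.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
# A hook receives (layer_name, activations); returning None leaves them unchanged.
Hook = Callable[[str, Vector], Optional[Vector]]


class ToyModel:
    """Stand-in for a served model: a fixed stack of named layers."""

    def __init__(self) -> None:
        self.layers: Dict[str, Callable[[Vector], Vector]] = {
            "layer0": lambda x: [2.0 * v for v in x],
            "layer1": lambda x: [v + 1.0 for v in x],
        }
        self.hooks: Dict[str, List[Hook]] = {name: [] for name in self.layers}

    def register_hook(self, layer: str, hook: Hook) -> None:
        self.hooks[layer].append(hook)

    def forward(self, x: Vector) -> Vector:
        for name, fn in self.layers.items():
            x = fn(x)
            for hook in self.hooks[name]:
                out = hook(name, x)
                if out is not None:  # active hook: replace the activations
                    x = out
        return x


seen: Dict[str, Vector] = {}


def record(name: str, acts: Vector) -> None:
    """Passive hook: copy activations for monitoring and change nothing."""
    seen[name] = list(acts)


def steer(name: str, acts: Vector) -> Vector:
    """Active hook: add a fixed, illustrative steering vector."""
    steering = [0.5, -0.5]
    return [a + s for a, s in zip(acts, steering)]


model = ToyModel()
model.register_hook("layer0", record)
model.register_hook("layer0", steer)
output = model.forward([1.0, 2.0])
```

Because hooks run in registration order, `record` sees the unmodified activations of `layer0` and `steer` alters what `layer1` receives, all without touching the layer weights.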
The End-to-End AI Scientist: Automating Discovery and the Research Pipeline
Deploying AI for highly complex, open-ended discovery tasks exposes the limitations of current context windows and reasoning capabilities. This talk explores our latest multi-agent system designed to automate the deep scientific research pipeline. We will detail the "AI Scientist," an end-to-end architecture that autonomously generates novel hypotheses, translates concepts into experimental ML pipelines, and synthesizes verifiable research papers. We will unpack the design of specialized sub-agents, particularly focusing on our expert reviewer agent (ScholarPeer) which uses a novel "historian" approach to verify true novelty against historical literature, and the automated visual generator (PaperBanana) which tackles the complex multi-modal challenge of synthesizing accurate methodology figures. Attendees will gain insights into how we orchestrate and evaluate these workflows to reliably accelerate the pace of AI discovery.
Seoul World Model: Grounding World Simulation Models in a Real-World Metropolis
What if world simulation models could operate not in imagined environments, but directly on real, living cities?
In this talk, we present Seoul World Model (SWM), a city-scale world simulation model grounded in real-world geospatial data. Unlike prior world models that generate plausible yet fictional environments, SWM leverages large-scale street-view imagery and retrieval-augmented generation to produce temporally consistent, spatially faithful simulations of an actual metropolis.
We will briefly introduce the core technical ideas behind SWM -- including cross-temporal pairing, synthetic trajectory augmentation, and the Virtual Lookahead Sink for long-horizon stability -- and demonstrate how these enable controllable, kilometer-scale simulation with realistic geometry, motion, and user-driven scenarios.
Beyond the technical contribution, this session aims to situate SWM within the rapidly evolving landscape of world models and physical AI. We will discuss concurrent efforts from both academia and industry, highlighting key differences between imagined-world generation and real-world grounded simulation.
The session will conclude with a panel discussion focusing on:
* Open research challenges, including dynamic object modeling, data quality, and scaling real-world grounding
* Business opportunities for companies operating large-scale street-view or map platforms (e.g., simulation-as-a-service, digital twins, autonomous driving data generation)
* Strategic implications of geospatial foundation models in the era of sovereign AI
By bringing together perspectives from research and industry, this session explores how world models can evolve from generative media technologies into core infrastructure for understanding and simulating the physical world.
From Thinking to Doing: Design Principles for Scaling Native Agent Capabilities with the Open-Source MiMo-V2.5 Model Family
The competitive frontier of foundation models has shifted from single-turn reasoning to sustained autonomous execution. The central question is no longer how well a model thinks, but whether it can operate as a reliable agent — maintaining coherence across thousands of decision steps, coordinating multimodal perception and action, and doing so within practical cost budgets. This talk distills three transferable design principles from our experience building and deploying the MiMo-V2.5 open-source model family.
Long-horizon stability requires architectural guarantees. A hybrid sliding-window / global attention scheme (6:1 ratio) compresses KV-cache by ~7× and enables native million-token context. MiMo-V2.5-Pro (1.02T / 42B active) sustains coherent trajectories over nearly 2,000 tool calls — autonomously completing a full SysY compiler in Rust (4.3 h) and an 8,192-line video editor (11.5 h), both passing all tests on first submission.
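The ~7× figure follows from simple counting. The sketch below is a back-of-the-envelope check, not the model's actual memory accounting: it assumes every layer stores the same number of KV entries per cached token, that a sliding-window layer caches at most `window` tokens, and that a global layer caches the whole context. The window size of 128 is a hypothetical value chosen for illustration.

```python
def kv_cache_ratio(context_len: int, window: int, swa_per_global: int = 6) -> float:
    """Full-attention KV cache size divided by the hybrid scheme's cache size.

    Counts cached tokens over one group of `swa_per_global` sliding-window
    layers plus one global-attention layer.
    """
    full = (swa_per_global + 1) * context_len
    hybrid = swa_per_global * min(context_len, window) + context_len
    return full / hybrid


# Short contexts fit inside the window, so nothing is saved.
short_ratio = kv_cache_ratio(context_len=100, window=128)

# At million-token context the sliding-window layers are negligible
# and the ratio approaches 6 + 1 = 7.
long_ratio = kv_cache_ratio(context_len=1_000_000, window=128)
```

Under these assumptions the saving grows with context length and saturates just below 7, which is consistent with the quoted compression at long context.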
Token efficiency is the binding constraint on deployability. Architectural compression, 3-layer multi-token prediction, and Multi-Teacher On-Policy Distillation (MOPD) jointly yield 40–60% token savings over frontier models at matched performance (SWE-Bench Verified 78.9, TerminalBench 2.0 68.4).
Omnimodality closes the perception-expression loop. MiMo-V2.5 (310B / 15B active, 48T training tokens) unifies vision, audio, and language in a single sparse MoE. MiMo-V2.5-TTS enables instruction-steered emotion and timbre control with zero-shot voice design and few-second cloning. MiMo-V2.5-ASR achieves state-of-the-art recognition across dialects, code-switching, and noisy conditions via RL-augmented training.
All models are MIT-licensed. We share the trade-offs, failure modes, and scaling lessons behind each principle, and conclude with open questions toward agents capable of narrative-level planning and closed-loop embodied action.
Model Optimization Flywheel: Continuously Self-Improving LLMs in Production
We present Shopify's Model Optimization Flywheel, a practical methodology for turning frontier-quality LLM behavior into faster, cheaper, and continuously improving production systems. The flywheel starts with reliable evaluation: LLM-as-judge evaluators grounded in human-labeled data become the canonical metrics for prompt optimization, distillation, and production regressions.
Using Tangle-powered experimentation workflows, we optimize frontier-model system prompts, collect training data from production A/B traffic and synthetic merchant/user rollouts, and distill smaller models with SFT, on-policy distillation, and GRPO. These models can replicate, and in some cases exceed, frontier-model behavior at much lower serving cost. We then compress prompts with gist tokens to reduce context overhead and improve latency. After deployment, the loop continues by sampling low-scoring production conversations, using stronger reasoning models to critique and "heal" them, folding repaired examples back into training, and re-running distillation. This flywheel has reduced serving cost and latency while improving production quality. We will share concrete recipes, quality-cost-latency trade-offs, and a blueprint for building self-improving LLM systems that get better and cheaper over time.
Xiaomi GUI Agent: Cross-Device Intelligent Task Execution via Natural Language
We demonstrate Xiaomi GUI Agent, a cross-device intelligent assistant system that enables users to control smartphones through natural language commands issued on a PC. A user types a natural language instruction (e.g., "Order a coffee from the Luckin app and send me the receipt") in a PC messaging app (e.g., Feishu/Lark). The instruction is relayed to the smartphone GUI Agent, which visually perceives the screen, reasons about the task, and autonomously operates the phone — tapping buttons, typing text, navigating menus — to complete the task. The result (e.g., a screenshot or text confirmation) is sent back to the user's PC chat. This demonstration showcases the practical deployment of vision-language models for real-world GUI automation, highlighting the agent's ability to handle multi-step, cross-app tasks on real smartphones with real applications.
Show Your Work: Real-Time, Per-Turn Requirement Validation in an Open-Source Voice Assistant
We are demonstrating a live voice assistant, built on open IBM Granite 4.1 models, that lets ICML attendees watch a language model check its own work in real time, turn by turn, during a natural spoken conversation. Attendees walk up, speak to the assistant, and watch a panel beside it light up as the system generates several candidate responses in parallel and scores each one against a set of plain-English requirements: how the answer should sound, how long it should be, what it shouldn't say. Passing requirements turn green, failures turn red, and the first candidate that satisfies all of them is spoken back. Failures are shown, not hidden. Attendees can edit the requirements on the fly and hear the assistant's behavior change mid-conversation. The demonstration is hands-on and built for a research audience working on validated and controllable generation. Every piece of it (models, orchestration, frontend) is Apache-2.0 and runs on a single laptop with no external API, so any attendee can reproduce it after the session.
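The selection step described above can be sketched in a few lines. This is an illustrative stand-in rather than the demo's code: `generate_candidate` returns canned text instead of calling a Granite model, and the requirements are simple Python predicates rather than model-scored plain-English rules. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

Requirement = Callable[[str], bool]


def generate_candidate(seed: int) -> str:
    """Placeholder for a model call; returns canned text per seed."""
    canned = [
        "Sure!!! Here is an extremely long answer that goes on and on and on about everything.",
        "The keynote starts at 9 am in Hall A.",
        "I cannot share the password.",
    ]
    return canned[seed % len(canned)]


# Stand-ins for plain-English requirements (length, tone, forbidden content).
requirements: Dict[str, Requirement] = {
    "at most 12 words": lambda t: len(t.split()) <= 12,
    "no exclamation marks": lambda t: "!" not in t,
    "does not mention passwords": lambda t: "password" not in t.lower(),
}


def score(text: str) -> Dict[str, bool]:
    """Pass/fail per requirement, i.e. the green and red lights on the panel."""
    return {name: check(text) for name, check in requirements.items()}


def first_passing(n: int) -> Optional[str]:
    """Generate n candidates in parallel; return the first that passes everything."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate_candidate, range(n)))
    for text in candidates:
        if all(score(text).values()):
            return text
    return None


chosen = first_passing(3)
```

Keeping the per-requirement results from `score` around, instead of collapsing them into one boolean, is what allows failures to be displayed rather than hidden.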
AI for International E-Commerce: Generative Recommendation, Conversational Shopping Agents, and Multimodal Supply Chain Models
We will present three AI systems for international e-commerce, deployed across AliExpress, Lazada, and the supply chain. (1) Generative Recommendation at AliExpress: a generative recommendation system that reshapes traditional product discovery pipelines with generative AI techniques. It delivers substantial improvements in core business metrics and system compute utilization, provides highly adaptable support for diverse business requirements, and unlocks richer shopping experiences. Attendees can interactively input queries or categories to see real-time AI-generated personalized recommendations. (2) LazzieChat at Lazada: an AI chat assistant and intelligent shopping guide system integrating large models, agents, multimodal understanding, and e-commerce knowledge. Built around "user intent recognition" and "intent fulfillment," it covers product comparison, bundling, alternative recommendations, and list recommendations, and it addresses long-tail queries through a Query Intent Rewrite Agent. Attendees will engage in multi-turn conversations, test visual search by uploading images, and experience localization features for Southeast Asian users. (3) Supply Chain Multi-Objective Multimodal Large Model with RLVR at AIDC: a technical presentation on a unified VLM architecture addressing "physical perception defocus" and "industry logic gap" in logistics tasks (weight estimation, dimension estimation, HSCode prediction, seasonality classification, logistics attributes, foldability discrimination). We will present the hybrid stratified resampling strategy, SFT + RLVR joint training, a multi-dimensional dynamic weighted reward function, and the GDPO algorithm that severs spurious causal chains. We will share experimental results showing significant improvements over the closed-source Qwen model across all six tasks, and introduce the "Season Identification Agent" combining human-machine collaboration with temporal sales features.
Fara1.5: A Family of Computer Use Agents
We present Fara1.5, a family of lightweight Computer Use Agent (CUA) models at three scales (4B, 9B, and 27B), each achieving state-of-the-art results on browser-use benchmarks among models of comparable size (e.g., 56.6% on Online-Mind2Web and 84.9% on WebVoyager for Fara1.5-9B). These models are the next evolution of Fara-7B and bring several advances, including more robust user interaction and oversight, more efficient task execution, and stronger overall task completion performance. The key driver behind these models’ capabilities is our synthetic data generation pipeline, FaraGen1.5, which builds synthetic environments and tasks at scale, has agents attempt them to collect demonstrations, and filters the resulting trajectories with a robust verifier. Overall, the Fara1.5 family of models delivers strong performance while being cost-effective and capable of running on-device or on modest hardware.
Interaction Design for Non-Deterministic Agents: Principles from Human-AI Collaboration Systems
The shift from deterministic software to probabilistic, agentic AI systems breaks the foundational assumptions of interaction design. For decades, we designed and optimized software for human control and predictability. Now non-deterministic systems are designing non-deterministic systems for users whose mental models of software haven't changed in 20 years. This talk examines what changes in practice when both the tool and the artifact it produces are probabilistic. Drawing on experience building and deploying autonomous agents at Amazon, we address four interconnected challenges: (1) how to surface agent behavior so users can calibrate trust without being overwhelmed (observability), (2) how to design for both success and failure when the same input can yield meaningfully different results (variability), (3) how to structure human oversight and agent confidence so both scale toward complete autonomy (human-in-the-loop), and (4) how to verify that the agent understands the user's intent across open-ended, long-horizon tasks (alignment).
Pushing the Frontiers of Large-Scale 3D Modeling for Robotics & Beyond
3D understanding of the surrounding world is one of the key prerequisites for building truly foundational robotics models. In this demonstration, we will present some of the latest technologies that address this challenge. The presentation will take the form of an interactive demo (details in the "Live Action" section below). In particular, we will show how to efficiently process massive point clouds with hundreds of thousands of nodes and how to integrate powerful invariances and equivariances into modeling to enhance 3D understanding.
Tangent: Autonomous Auto-Research Agent for ML Pipelines
Tangent is an autonomous ML research agent built on Tangle, an open-source cross-cloud ML pipeline orchestrator (available on HuggingFace). Unlike hyperparameter search tools, Tangent automates the full iterative research cycle: given a goal — reproduce a result, fine-tune a model, push a metric — it designs a Tangle pipeline, selects components and hyperparameters, launches runs on CPU or GPU clusters, diagnoses its own failures, and iterates, with or without a human in the loop. Findings are captured in a persistent learnings corpus so each experiment builds on the last. In production use at Shopify, Tangent has reduced ML experiment cycles from ~3-7 days to ~1 day. We will demo Tangent live: starting from a research prompt, watching the agent build and submit a pipeline to a real compute cluster, and discussing results as they arrive.
Scaling Deep Learning in Financial Markets
Hudson River Trading (HRT) is a global trading firm that leverages deep learning to navigate the complexities of the world's financial markets. Every day, our models process petabytes of high-fidelity data, seeking to extract signal from trillions of events across thousands of interconnected products. Operating at this scale requires a unique intersection of frontier machine learning research and high-performance engineering.
In this talk, we will discuss our approach to building and deploying large-scale market models—foundation models designed to capture the latent structures of price discovery. We’ll share insights into the research hurdles of training on massive, non-stationary datasets and the engineering constraints of performing real-time inference at microsecond scale. Beyond prediction, we will touch on the challenges of maintaining robustness amidst highly dynamic market conditions and rapid regime shifts. Join us for a look into how we apply the next generation of deep learning to navigate one of the world’s most competitive and data-rich environments.
Operationalizing Trustworthy AI in Industry
As GenAI and Agentic workflows transition from research benchmarks to core infrastructure in many traditional industries, the mandate for Trustworthy AI has shifted from a conceptual ideal to a rigorous engineering requirement. However, traditional enterprise deployment often treats trustworthiness as a checklist of independent constraints—fairness, privacy, robustness, explainability, and so on—rather than a coupled system. In this talk, we bridge the gap between algorithmic research and high-stakes deployment by examining the inherent trade-offs between aspects of trust. We demonstrate how common technical solutions meant to improve trust in one aspect often undermine trust in others. This presents a major roadblock to operationalizing algorithmic solutions in practice. Finally, we propose directions that researchers can follow to resolve these concerns, and guidance that practitioners can use today to manage these issues while facing strict regulatory and operational constraints.
Scientific benchmarks are saturating faster than the field can replace them. SciCode rose from 4.6% to 59% and HLE from 8% to 50% within a year, yet these gains show limited correlation with improvements in actual scientific workflows. The core issue is structural: current evaluations test isolated subtasks, while the highest-cost steps in scientific work (simulation setup, debugging, literature synthesis, replication) remain largely unmeasured.
This talk presents a framework for frontier scientific evaluation grounded in real-world deployment signals and workflow productivity. It rests on three components: deliverable-based task design, where the unit of evaluation is a work product rather than a final answer; productivity-oriented scoring, measuring time to reviewable draft, iteration efficiency, and salvageability under expert oversight; and substep decomposition, where tasks mirror real workflow stages and each substep is independently gradable for diagnostic signal and scalable verification.
We map O*NET task-importance data against benchmark coverage, showing that the most time-intensive workflow steps have minimal representation. We describe how real workflows in high-stakes enterprise domains such as semiconductor design and manufacturing can be captured and decomposed into structured evaluation tasks through domain collaboration, and present pilot findings comparing scientists completing multi-step tasks manually, with SOTA models, and with models fine-tuned on targeted scientific data.
Attendees will leave with the Signal-to-Value Ladder, six criteria for assessing whether an evaluation predicts real-world scientific and economic productivity. We close with a forward look: as sample-level data delivery approaches its economic ceiling, the field will need to shift from samples and datasets toward modular, composable capability infrastructure, much as software delivery evolved from project-level code to libraries and APIs.
New Techniques for Sequence Prediction: Spectral Filtering and Preconditioning
Adaptive Reasoning in LLMs: From Post-Training to Test-Time Learning
Evaluating and Training LLMs for Math Copilots and Theorem Proving
Calibration: From Predictions to Decisions, Collaboration, and Alignment
Agentic Harness: Building Reliable AI Agent Systems
AI agents are moving beyond prompt-response into long-horizon, tool-using autonomy — but the model alone isn't production-ready. The agentic harness is the orchestration and governance layer that wraps an LLM agent to provide what it can't on its own: reliability, safety, observability, memory, and human oversight.
The harness concept is emerging simultaneously across industry (AgentCore, LangGraph, CrewAI, AutoGen) and research (Reflexion, Constitutional AI, ToolEmu, TrustAgent), but these communities aren't yet talking to each other. This workshop bridges that gap and explores the harness as a framework-agnostic architectural pattern, examining its core components and the open research problems in each.
| Session (30m each) | Topic | Description |
| --- | --- | --- |
| 1 | The Agentic Harness Pattern | Why agents fail in production. Six core components, real-world use cases, and clarifying harness vs. scaffolding vs. orchestrator vs. framework. |
| 2 | Guardrails & Safety | Red-team evaluation of guardrail implementations. Domain-specific constitutional safety case study (SafeLab): multi-agent debate + Reflexion vs. generic filtering. |
| 3 | Memory & Observability | Comparing memory architectures (in-context, RAG, vector store, hybrid). OpenTelemetry-based tracing for non-deterministic agent systems. |
| 4 | Live Demo | SafeLab live: attendees submit proposals, watch multi-agent safety debate in real time. |
| 5 | Panel Discussion | Is harness engineering a new discipline? Build vs. buy. Framework portability. What gets commoditized as models improve? |
| 6 | Open Q&A & Wrap-up | Audience discussion, open research priorities, collaboration opportunities. |
Reliable and Efficient LLM Outputs with Mellea + Granite OSS Libraries
Every LLM application eventually runs into the same wall: the model generates plausible-sounding output that is wrong, off-format, or unsafe, and there is nothing between generation and delivery to catch it. Prompting the model harder sometimes helps, but it is not reliable.
This workshop teaches a systematic approach to the problem using two open-source IBM tools: Mellea, a Python library for structured LLM generation, and Granite Libraries, a collection of lightweight LoRA adapters that score generated output against developer-defined requirements. Together they implement an Instruct-Validate-Repair loop — generate a response, measure it against your requirements, and select or retry before it reaches the user.
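As a rough picture of the Instruct-Validate-Repair loop, the sketch below retries generation with the failed requirements appended to the prompt. It deliberately does not use the Mellea or Granite Libraries APIs: `fake_model` is a placeholder generator, the requirements are Python predicates standing in for adapter-based scoring, and `instruct_validate_repair` is a hypothetical helper written for this illustration.

```python
from typing import Callable, Dict, List, Tuple

Requirement = Callable[[str], bool]


def fake_model(prompt: str) -> str:
    """Placeholder generator: obeys the bullet rule only once it is reminded."""
    if "must start with '- '" in prompt:
        return "- Paris is the capital of France."
    return "Paris is the capital of France."


def instruct_validate_repair(
    instruction: str,
    requirements: Dict[str, Requirement],
    generate: Callable[[str], str],
    max_attempts: int = 3,
) -> Tuple[str, List[str]]:
    """Return (last output, names of requirements it still fails)."""
    prompt = instruction
    output = ""
    failed: List[str] = []
    for _ in range(max_attempts):
        output = generate(prompt)  # instruct
        failed = [name for name, check in requirements.items() if not check(output)]  # validate
        if not failed:
            return output, []
        # repair: restate the unmet requirements in the next prompt
        prompt = instruction + "\nYour previous answer failed: " + "; ".join(failed)
    return output, failed


requirements: Dict[str, Requirement] = {
    "must start with '- '": lambda t: t.startswith("- "),
    "must end with a period": lambda t: t.endswith("."),
}

answer, unmet = instruct_validate_repair(
    "Name the capital of France.", requirements, fake_model
)
```

Returning the list of unmet requirements alongside the output lets the caller decide whether to deliver a best-effort answer or withhold it when the attempt budget runs out.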
Participants start with a plain chatbot that hallucinates citations, ignores formatting rules, and produces uncontrolled output. By the end of the session, the same chatbot validates every response against a set of requirements defined in plain English, generates multiple candidates in parallel, and automatically selects the best one — all running locally, all on open-source models.
No cloud accounts, no audio hardware, no frontend build. A working environment takes under five minutes to set up.
What you will build: a command-line chat application that grows module by module — from a bare Mellea generation call, to single-requirement scoring, to a parallel Best-of-N validation loop with multiple Granite Libraries adapters firing simultaneously.
What you will leave with: a mental model of how to enforce output quality programmatically, hands-on experience writing and tuning natural-language requirements, and a local codebase you can adapt to your own domain.
Technologies covered: Mellea, Granite Libraries (activated LoRA adapters), IBM Granite 4.0, Python, OpenAI-compatible inference backends (LM Studio, Ollama, vLLM).
All tools and models used are Apache 2.0 licensed and available on HuggingFace.
Search and recommendation systems — the engines powering web search, short-video feeds, music, news, advertising, and e-commerce — are undergoing their most significant paradigm shift in a decade. Multi-stage discriminative pipelines are giving way to unified generative foundation models that directly produce items, queries, and entire user trajectories, and in doing so eliminate cascaded error propagation, improve hardware utilization, and unlock optimization objectives that reach far beyond next-click prediction. Driven by rapid progress in large language models and the discovery of scaling laws for search and recommendation, this transition is quietly reshaping how the field thinks about retrieval, ranking, and personalization at the largest scales.
Realizing the paradigm in production, however, is a different challenge entirely: modern discovery platforms must serve billions of users, reason over catalogs that are approaching billions of candidate items, and respond within hundreds of milliseconds, all while training models that consume orders of magnitude more compute than their discriminative predecessors. This Expo workshop brings together leading researchers and industrial practitioners through invited talks and a moderated panel to confront the questions the community is asking right now:
- Are search and recommendation finally converging into a single foundation model?
- What are the right scaling laws for discovery, and where do they bend?
- How do generative systems remain efficient at trillion-item scale while meeting sub-50-millisecond latency budgets?
- And how should they handle non-stationary catalogs, continuous learning, cold-start items, fairness, and multi-objective trade-offs in real production traffic?
Agentic Forecasting and Multi-Agent Ecosystems: From Predictive Reasoning to Secure Enterprise Deployment
Time-series forecasting is undergoing a fundamental shift as traditional statistical models struggle to incorporate dynamic real-world context and counterfactual logic. This workshop explores the frontier of "Agentic Forecasting," focusing on how multi-agent architectures can augment numerical predictions with qualitative reasoning and dynamic scenario planning. We will deeply examine the systems engineering required to create continuous loops of analyzing multimodal data, simulating outcomes and calibrating predictions in real-time, with a major emphasis on designing robust "What-If" agents capable of parsing complex business interventions to output verifiable and narrative-driven forecasts. Translating these advanced predictive models from architectural research into real-world impact requires orchestrating complex and domain-specific ecosystems. To demonstrate this at scale, the workshop will also highlight how multi-agent frameworks are automating specialized workflows beyond predictive analytics—ranging from autonomous scientific discovery and end-to-end publication pipelines to creative cinematic video orchestration. Finally, to ensure the reliability of these autonomous systems in high-stakes production environments, we will address the practical bottlenecks of deployment through the lens of advanced agent security, automated red-teaming and formal verification, ultimately providing attendees with actionable blueprints for building robust, secure, and highly capable agentic ecosystems.
AI for Science: Foundation Models and Agentic Systems for Closed-Loop Discovery
AI for Science is expanding from model development to full discovery systems that can support scientific workflows. Foundation models provide general representations and generative priors for scientific data, and agentic systems organize reasoning and multi-step decision processes, forming a continuous loop of hypothesis generation, experimental design, observation, and refinement.
This workshop will examine the methodological questions that arise in building such systems across materials science and biomedicine, including superconducting materials, virtual cells, drug discovery, cell and spatial omics, and proteomics. The discussion will focus on shared challenges in multimodal representation learning, integration of scientific constraints and prior knowledge, planning over tools and experiments, coordination between models and experimental workflows, and evaluation in closed-loop discovery settings.
By bringing together researchers working on scientific foundation models and agentic AI, the workshop will provide a forum for discussing technical challenges and emerging research directions in next-generation AI for Science.