Opening Remarks by Organizers
Invited talk Session 1
BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs
Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.
Real-Time Visual Attribution Streaming in Thinking Model
We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps offer instant access but lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.
Multi-turn Evaluation of Deep Research Agents Under Process-Level Feedback
Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 11–15 points and yielding a roughly 35–37% incorporation rate for both models; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate.
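As a rough illustration of how per-turn incorporation and regression could be tallied from rubric outcomes, the sketch below compares the sets of satisfied criteria before and after one revision. The function name, the set-based representation, and the choice of denominators are assumptions made for illustration; the abstract does not give the authors' exact definitions.

```python
def turn_rates(all_criteria, satisfied_before, satisfied_after):
    """Return (incorporation_rate, regression_rate) for one revision turn.

    Assumed definitions (hypothetical, not taken from the paper):
      incorporation = newly satisfied criteria / previously unsatisfied criteria
      regression    = newly unsatisfied criteria / previously satisfied criteria
    """
    all_criteria = set(all_criteria)
    before = set(satisfied_before) & all_criteria
    after = set(satisfied_after) & all_criteria

    unsatisfied_before = all_criteria - before
    gained = after - before   # criteria the revision newly satisfies
    lost = before - after     # criteria the revision breaks

    inc = len(gained) / len(unsatisfied_before) if unsatisfied_before else 0.0
    reg = len(lost) / len(before) if before else 0.0
    return inc, reg


# Toy rubric: the revision gains c5 but loses c4, so the net count is unchanged.
criteria = {"c1", "c2", "c3", "c4", "c5", "c6"}
before = {"c1", "c2", "c3", "c4"}
after = {"c1", "c2", "c3", "c5"}
inc, reg = turn_rates(criteria, before, after)
print(inc, reg)
```

In this toy case one criterion is gained and one is lost, which mirrors the pattern described in finding (i): incorporation and regression can offset each other even when both rates are non-zero.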
LongMemEval-V2: Benchmarking Agent Memory for Experienced Colleagues
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, workflows, state dynamics, and recurring failure modes. We introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can accumulate environment-specific experience from multimodal web agent trajectories. LME-V2 contains 451 manually curated questions from customized shopping, forum, admin, and ServiceNow-style environments, with histories ranging from 25M to 115M tokens. Frontier LLMs reach at most 14.1% without trajectory evidence, confirming that LME-V2 requires learned experience beyond parametric knowledge. We evaluate memory under a context-gathering formulation and propose AgentRunbook: AgentRunbook-R is an efficient RAG pipeline over raw states, transitions, and notes, while AgentRunbook-C uses a scaffolded coding agent to gather evidence from trajectory files. AgentRunbook-C achieves the best overall accuracy, reaching 74.9% on LME-V2-Small and 70.1% on LME-V2-Medium, while improving the accuracy and latency trade-off over an off-the-shelf coding agent. We will release the benchmark and memory implementations.
Calibrate Once, Choose the Beam: A Pre-Deployment Compute-Allocation Rule for Same-LM Multimodal Search
Detection-guided Attention Steering for Vision Language Models
Modern Vision Language Models (VLMs) have demonstrated remarkable versatility in a wide range of applications, including multimodal reasoning, captioning, and analysis. However, VLMs still struggle with the fundamental tasks of object detection and localization, especially when compared to traditional Convolutional Neural Network (CNN) models. Contemporary VLMs tend to allocate very little attention to non-textual inputs such as image tokens and lack the inductive biases present in CNNs, making it difficult for them to identify challenging objects in complex scenes. To address these limitations, we propose a detection-guided dynamic attention steering system that leverages the locality insight from CNNs to efficiently steer a VLM's attention toward more relevant sections of an image. The steering intensity during each inference is dynamically scaled according to the confidence of CNN-generated bounding boxes. Extensive evaluations across multiple state-of-the-art VLMs show substantial improvements on object localization-oriented benchmarks, achieving up to a 4% accuracy gain. Results demonstrate the effectiveness of combining different model architectures to harness their respective strengths for advancing VLM capabilities.
Tool-Augmented VLM Agents for Zero-Shot 3D Visual Grounding
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Multimodal agents offer a compelling path to automating complex document-intensive workflows, yet a critical question remains: do these architectures demonstrate genuine strategic reasoning, or simply conduct stochastic trial-and-error search? To address this, we introduce Agentic Document VQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power and reliably differentiate between varying levels of agent capability. To rigorously assess agentic behavior, we introduce a novel evaluation protocol for measuring the accuracy-effort trade-off. Using this framework, we find that humans show strong metacognitive calibration, adapting or abandoning failed strategies, whereas frontier agents often persist in unproductive loops with diminishing returns. We release the dataset, evaluation harness, and leaderboard to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
WISE: Weighted Iterative Society-of-Experts for Multimodal Multi-Agent Debate with Probabilistic Consensus
Multi-agent debate (MAD) is a powerful paradigm for combining multiple large language models (LLMs) to achieve robust reasoning, but prior work has largely focused on language-only settings, leaving its multimodal potential underexplored. We present Weighted Iterative Society-of-Experts (WISE), a generalized MAD framework that systematically integrates heterogeneous multimodal LLMs to address challenging vision-and-language tasks in a zero-shot setting. Our key idea is to factor agents into three roles based on their multimodal capabilities: Solvers, which process multimodal inputs and generate candidate solutions; Reflectors, which may or may not access multimodal inputs but evaluate solutions, provide feedback, and assign weights; and an Orchestrator, which operates unimodally to reason over solutions and feedback and produce directives that guide subsequent reasoning. To account for varying agent reliability, we introduce an unsupervised probabilistic aggregation method, termed WISE–Dawid–Skene, which leverages the weighting scheme in WISE-MAD to adaptively combine agent outputs. We evaluate WISE on several challenging mathematical reasoning datasets and show that it consistently outperforms state-of-the-art methods across diverse LLM configurations, demonstrating its effectiveness as a general and scalable multimodal reasoning framework.
ALPS: Adaptive Lineage-Aware Parallel Search for LLM-Driven Optimization
Language-model agents are increasingly used as black-box optimization policies: they propose a code or configuration edit, evaluate it with a fixed harness, observe a scalar metric, and iterate. Serial agent loops use feedback efficiently but under-utilize parallel hardware, whereas naive parallel loops improve utilization but evaluate many candidates against stale or non-composable baselines. We formulate this tension as a scheduling problem over an asynchronous experiment lineage. We propose ALPS (Adaptive Lineage-Aware Parallel Search), a scheduler that combines a lineage selector, an operator-level bandit, a commit-hazard controller for adaptive parallelism, and a noise-aware promotion gate. ALPS treats the LLM as a candidate proposer and assigns all stateful decisions (dispatch, validation, commit, and rebase) to the scheduler. We evaluate ALPS on two tasks: a small-GPT autoresearch benchmark and a Qwen3 supervised fine-tuning data-mixture search. Across both tasks, we compare three policies (serial, naive parallel, and ALPS) under matched wall-clock budgets. Preliminary results suggest that lineage-aware scheduling can recover cumulative-improvement behavior while retaining the throughput advantages of parallel execution.
ASH: Agents that Self-Hone via Embodied Learning
Mastering long-horizon embodied tasks remains a fundamental challenge for AI, as current methods often fail due to noisy data or intractable reward engineering. We introduce ASH, a fully autonomous agentic system that overcomes these limitations without any human involvement: no reward shaping, no expert annotation, and no domain-specific data curation. When encountering an impasse, ASH uses its own experience to retrieve and learn from relevant internet video. Evaluated in Pokémon Emerald, a complex RPG spanning dozens of hours, ASH dramatically outperforms baselines: while behavioral cloning and general-purpose foundation models (Qwen Team, 2026) collapse to near-zero milestone completion within the first few minutes, ASH sustains robust progression across multi-hour gameplay by continuously and autonomously acquiring new skills. This demonstrates that fully autonomous, self-improving agents are a scalable path for open-ended, long-horizon embodied learning.
Dyserve: Dynamic Strategy Generation for Agent Serving
Replication as Learning: Scalable Knowledge Distillation for Multimodal Enterprise Agents
Enterprise environments differ fundamentally from the clean settings assumed in LLM research: knowledge is distributed across heterogeneous sources, often incomplete or inconsistent, and key procedural logic is implicitly encoded in artifacts rather than explicitly documented. In such settings, retrieval-based approaches are insufficient, as no single source contains the full workflow. We propose a replication-driven knowledge distillation framework for scalable learning in multimodal agents. The agent learns by reverse-engineering validated artifacts (e.g., Excel workbooks), reconstructing the underlying data pipeline, and distilling the inferred logic into structured knowledge (claims, procedures, and domain patterns). This enables synthesis and validation across noisy sources and supports reuse in future tasks. We evaluate on 120 simulated enterprise environments with multimodal inputs (SQL, spreadsheets, documentation, messaging app, emails, images, PDFs, CSV) and controlled noise. Our method consistently outperforms retrieval-based baselines on both task execution and conceptual understanding, and remains robust under environmental drift.
The Orchestrator Bottleneck: Formal Coordination Strategies for Cost-Optimal Multi-Agent Enterprise Workflows
Multi-agent LLM systems increasingly depend on orchestrator agents to coordinate specialist agents, yet orchestration strategies remain informal, chosen by convention rather than analysis. We formalize the orchestrator as a constrained optimization problem over a typed agent registry with capability declarations, cost profiles, and trust boundaries. We define three coordination strategies, sequential delegation, parallel fan-out, and hierarchical decomposition, and derive threshold conditions under which each minimizes a combined cost-latency objective for a given task dependency structure. We implement OrchestRAte, a strategy-switching orchestrator that selects coordination mode dynamically via dependency graph analysis, and evaluate it on three enterprise workflow benchmarks: document processing, compliance review, and incident triage. On these benchmarks, OrchestRAte reduces end-to-end cost by 28–42% relative to static sequential baselines while trading at most 0.7 percentage points of task-completion quality. The cost gains come with a caveat: on highly parallelizable tasks, a static parallel baseline still achieves the lowest raw latency because it avoids the overhead of strategy selection entirely. We identify an empirical orchestrator overhead threshold around 15% of total token budget, beyond which the coordination cost of hierarchical strategies erodes their planning advantage.
Compositional text-to-image generation requires satisfying multiple attribute constraints simultaneously, a task that single-pass generation routinely fails on. Two recent lines of work address this from opposite ends: agentic iterative refinement trains only the language planner while freezing the image generator, and unified policy optimization trains language and visual generation components but in a single generation round. We present NEXUS, a work-in-progress framework combining both properties: iterative, multi-round refinement with joint RL training of both the LLM planner (Qwen2.5-7B) and the reference-conditioned diffusion editor (FLUX.1-Kontext-dev) via shared GRPO reward. A dual-channel Bridge routes continuous LLM hidden states and discrete text instructions to the editor's conditioning mechanism. Empirically, using three-seed means on the full 553-prompt GenEval prompt set scored with our Qwen-VQA compositional score rather than the official GenEval evaluator: (1) zero-shot Gemini with the same iterative refinement loop reaches 0.740 and serves as the matched baseline; (2) NEXUS full reaches 0.734 at step 50 and 0.780 at step 100, surpassing this matched Gemini baseline by +4.0 pp at step 100; and (3) planner-only and no-Bridge variants do not exceed the matched Gemini baseline at step 100, highlighting the importance of the full co-adaptive system.
Zero-Shot Utility and Efficient Adaptation in Vision-Language Multi-Agent Control
Vision-language models (VLMs) provide a promising starting point for multi-agent control because they can act before environment-specific training. However, it remains unclear how far this zero-shot advantage extends in partially observable cooperative settings, and how adapted VLM agents compare with specialized multi-agent reinforcement learning (MARL) policies. This work studies this question in a cooperative pursuit benchmark with controlled distribution shifts spanning visual appearance, semantic remapping, observation layout, agent counts, and environment scale. We compare zero-shot VLMs, LoRA-adapted VLMs, cold-start MARL, and fully trained IPPO/MAPPO baselines under matched local-observation constraints. Results show that zero-shot VLMs provide useful behavior where cold-start RL is nearly non-functional, and that supervised adaptation with 1k–25k expert demonstrations rapidly produces competitive long-horizon controllers. Adapted VLMs are especially robust to visual and semantic shifts, while trained MARL remains strongest on the in-distribution task and some coordination-heavy variants. Overall, the results reveal complementary scaling behavior: VLMs adapt rapidly from limited expert data and generalize better across visual-semantic shifts, while MARL remains stronger when extensive task-specific interaction is available.
F-TIS: Harnessing Diverse Models in Collaborative GRPO
Reinforcement learning methods such as GRPO have seen great popularity in LLM post-training. In GRPO, models produce completions to a set of prompts, which are rewarded, and the policy is updated towards the relatively high-reward completions. Due to the auto-regressive nature of models, the generation phase of this style of training can be extremely time-consuming. As a solution, prior work has sought to distribute the inference step across many nodes working in parallel. These works primarily assume homogeneous models during training in order to keep samples as close to on-policy as possible. This assumption may be impractical in decentralized systems, where parties with differing compute resources and preferences may wish to collaborate on the same task. Thus, decentralized training requires an approach that can handle heterogeneous models, that is, different models collaborating on the same task. However, heterogeneity produces highly off-policy samples during training, which prior work has shown can hurt GRPO convergence. To enable heterogeneity, we propose Filtered Truncated Importance Sampling (F-TIS), a GRPO-style training paradigm that can use off-policy samples to improve a local model's learning. Our framework allows various models to collaborate in the same RL training run while remaining communication-efficient. We extensively evaluate F-TIS in various heterogeneous setups and show that it exhibits identical final model convergence to purely on-policy training. Furthermore, in some setups we observe better generalization on out-of-distribution tasks than with on-policy training, increasing model performance by up to 12%.
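To make the idea of reweighting off-policy completions concrete, here is a minimal sketch of generic truncated importance sampling with a simple filter, paired with GRPO-style group-relative advantages. It is not the F-TIS algorithm itself: the filter rule, the `clip` and `min_ratio` values, and all function names are assumptions chosen for illustration.

```python
import math


def filtered_truncated_weights(logp_local, logp_behavior, clip=2.0, min_ratio=0.1):
    """Per-sample importance weights for completions generated by another model.

    ratio = pi_local(y|x) / pi_behavior(y|x), computed from sequence log-probs.
    Truncation caps the ratio at `clip`. The filter (an assumed rule, not taken
    from the paper) gives weight 0 to samples whose ratio is below `min_ratio`.
    """
    weights = []
    for lp_local, lp_behavior in zip(logp_local, logp_behavior):
        ratio = math.exp(lp_local - lp_behavior)
        if ratio < min_ratio:
            weights.append(0.0)               # filtered out: too unlikely locally
        else:
            weights.append(min(ratio, clip))  # truncated importance weight
    return weights


def group_relative_advantages(rewards):
    """GRPO-style advantages: rewards standardized within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


# Three completions for one prompt, sampled by a different (behavior) model.
logp_local = [-1.0, -5.0, -1.0]
logp_behavior = [-1.0, -1.0, -3.0]
rewards = [1.0, 0.0, 1.0]

weights = filtered_truncated_weights(logp_local, logp_behavior)
advantages = group_relative_advantages(rewards)
weighted = [w * a for w, a in zip(weights, advantages)]
print(weights)
```

The second sample has a ratio of about 0.018 and is dropped, while the third has a ratio of about 7.4 and is capped at 2.0, so no single off-policy completion can dominate the update.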
Don’t Blind Your VLA: Aligning Visual Representations for OOD Generalization
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps. Furthermore, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We also evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities.
LOGIV: Logic-Graph Inference with VAL-Verification for Long-Horizon Robotic Manipulation
Long-term robotic operation has long been plagued by temporal failures during execution, as static task instructions (i.e., language conditions) fail to provide dynamic guidance for complex stages. This results in stage confusion and goal deviation in multi-stage scenarios, despite the strong short-term reaction capabilities of the underlying actuators. In this paper, we demonstrate that achieving robust long-term autonomy does not require retraining the actuators or complex hierarchical architectures, but rather temporal reparameterization of the task-instruction interface. We propose a training-free dual-system reasoning framework, LOGIV (LOgic-Graph Inference with VAL-verification), which decomposes global instructions into dynamic, multi-stage natural language priors. To ensure logical consistency, we introduce a graph-based self-correction mechanism that utilizes formal verification and standardized repair operators to autonomously correct plans generated by large language models. Experimental results on datasets such as DROID, AgiBot, EgoDex, and RoboTwin 2.0 show that our framework significantly improves task success rates without modifying the underlying model weights, with more pronounced effects on more complex tasks. Due to its architecture-agnostic and lightweight nature, LOGIV offers an efficient and plug-and-play solution for bridging the long-term temporal gap in existing world action models.
Persistent memory makes multimodal agents more capable, but it also creates a new attack surface: once unsupported content is written into memory, later retrieval and consolidation can reuse it as if it were reliable state. We study write-time defense for multimodal agent memory. Our system, SAGE-Mem, separates transient evidence from durable belief: observations may be stored as evidence, but they are promoted to belief only when they are sufficiently supported, independent, and non-conflicting. This targets a gap left by retrieval-time defenses, which act only after poisoned content has already entered memory. We evaluate on LoCoMo-Adv, an adversarial multimodal extension of LoCoMo-10, and on MM-BrowseComp-Adv, a multimodal browsing benchmark covering answer-overwrite, OCR, vision-caption, and visual-prompt attacks. On LoCoMo-Adv, at a conservative operating point, SAGE-Mem eliminates observed write admission and retrieval contamination relative to a retrieval-time baseline, but reduces benign completion under attack (0.460 vs. 0.642). On the canonical browsing overwrite setting, BrowseGuard, a browsing-specific write policy built on the same principle, blocks all 388 direct and paraphrased overwrite attempts while keeping attacked utility near its clean level (0.155 vs. 0.160). On the broader five-attack browsing suite, extending the same guarded write policy across browser, OCR, and caption channels reduces Write ASR from 0.2552 to 0.0369 and Retrieval ASR from 0.5636 to 0.3694. Overall, the results suggest that for memory-bearing agents, robustness should be evaluated not only at retrieval, but also at the point where observations become persistent state.
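The evidence-versus-belief separation can be pictured with a small write gate. The sketch below is an assumption-laden toy, not SAGE-Mem's actual policy: the `Observation` fields, the `promote_to_belief` function, and the rule "all values agree and at least two distinct sources" are hypothetical stand-ins for "sufficiently supported, independent, and non-conflicting".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    claim_key: str  # what the observation is about, e.g. "user.city"
    value: str      # the claimed value
    source: str     # channel it arrived through, e.g. "ocr", "caption", "user"


def promote_to_belief(evidence, min_independent_sources=2):
    """Return {claim_key: value} for claims that pass the write gate.

    Assumed rule: a claim becomes durable belief only if every stored
    observation for that key agrees on the value (non-conflicting) and the
    observations come from at least `min_independent_sources` distinct sources.
    Everything else stays in the evidence store and is not promoted.
    """
    by_key = {}
    for obs in evidence:
        by_key.setdefault(obs.claim_key, []).append(obs)

    beliefs = {}
    for key, observations in by_key.items():
        values = {o.value for o in observations}
        sources = {o.source for o in observations}
        if len(values) == 1 and len(sources) >= min_independent_sources:
            beliefs[key] = next(iter(values))
    return beliefs


evidence = [
    Observation("user.city", "Oslo", "user"),
    Observation("user.city", "Oslo", "caption"),
    Observation("account.owner", "attacker", "ocr"),  # single, unsupported source
    Observation("user.language", "en", "user"),
    Observation("user.language", "fr", "ocr"),        # conflicts with the above
]
beliefs = promote_to_belief(evidence)
print(beliefs)
```

Only the corroborated, conflict-free claim is promoted; the injected single-source claim and the conflicting pair remain as evidence, which is the write-time behavior the abstract contrasts with retrieval-time filtering.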
DEI: Diversity in Evolutionary Inference for Quality-Diversity Search
We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large language models (LLMs) as mutation operators across peer nodes communicating with non-blocking collective operations. Unlike homogeneous parallel search, which replicates a single model’s inductive biases across all workers, DEI treats each LLM’s distinct creative prior as a complementary source of behavioral novelty. Extending the Digital Red Queen framework with DEI, nodes share local optimal solutions at the end of each round to seed the next round’s population. This creates cross-model adversarial pressure that drives robustness beyond intra-model self-play. Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5) achieves +124% higher merged-archive QD-Score (45.90 vs. 20.46) and +28% higher coverage (80.6% vs. 63.0% of cells) than a single-node baseline at equal total LLM-call budget. The heterogeneous ensemble also outperforms an equally budgeted homogeneous ensemble on QD-Score, coverage, and held-out solution generality across all four model families. These results provide the first empirical evidence that model diversity, not merely parallelism, is the key driver of gain in distributed LLM-based QD search.
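For readers unfamiliar with the two reported metrics, the sketch below uses the common definitions from the Quality-Diversity literature: QD-Score as the sum of fitness over filled archive cells, and coverage as the fraction of cells that are filled. Whether the paper's merged-archive QD-Score is computed exactly this way is an assumption, as are the dictionary-based archive and the `merge_archives` helper. The last lines simply recompute the relative gains from the figures quoted in the abstract.

```python
def qd_score(archive):
    """Sum of fitness over filled cells (empty cells are stored as None)."""
    return sum(f for f in archive.values() if f is not None)


def coverage(archive):
    """Fraction of archive cells that hold a solution."""
    filled = sum(1 for f in archive.values() if f is not None)
    return filled / len(archive)


def merge_archives(archives):
    """Keep the best fitness seen for each cell across all node archives."""
    merged = {}
    for archive in archives:
        for cell, fitness in archive.items():
            merged.setdefault(cell, None)
            if fitness is not None and (merged[cell] is None or fitness > merged[cell]):
                merged[cell] = fitness
    return merged


# Two toy 2x2 archives from different nodes (fitness values are made up).
node_a = {(0, 0): 0.5, (0, 1): None, (1, 0): 0.25, (1, 1): None}
node_b = {(0, 0): 0.75, (0, 1): 0.5, (1, 0): None, (1, 1): None}

merged = merge_archives([node_a, node_b])
print(qd_score(merged), coverage(merged))

# Relative gains recomputed from the numbers quoted in the abstract.
qd_gain = (45.90 - 20.46) / 20.46
coverage_gain = (80.6 - 63.0) / 63.0
print(round(qd_gain * 100), round(coverage_gain * 100))
```

The toy merge fills cells that either node alone left empty, which is the mechanism by which nodes with different behavioral tendencies can raise both coverage and QD-Score.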
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
PRISM: Structured Decomposition for Multimodal Physics and Mathematical Reasoning
Multimodal STEM reasoning remains challenging because visual grounding and computation are often entangled in a single inference pass, leading to deterministic failures. We study whether explicitly decomposing these steps at test time can improve performance on multimodal physics and mathematics problems. To this end, we propose PRISM, a multi-agent framework that separates visual grounding, textual enrichment, and program-aided reasoning. Our results on the SeePhys dataset indicate that structured decomposition is most beneficial when visual dependency and reasoning complexity are high. These findings suggest that separating perception from reasoning can be a practical alternative to inference-time scaling for multimodal STEM tasks. Furthermore, we evaluate its generalization on the MATH-Vision benchmark for mathematical reasoning, demonstrating the robustness of our method.
Think in English, Answer in Korean: Efficient Adaptation of Multilingual Tool-Using Agents
We present LS-111B, a 111B-parameter hybrid reasoning model for Korean-English enterprise agents under practical memory and serving constraints. The model trains from a fully post-trained enterprise language model rather than a new pretraining run, and uses preamble conditioning to switch between concise non-reasoning behavior and longer tool-oriented reasoning. We study four choices for scaling tool-using agents efficiently: multilingual supervised fine-tuning, reinforcement learning with verifiable rewards for multi-step tool-use tasks, language-consistency rewards for Korean user-facing responses, and 4-bit quantization for single-GPU serving. The adapted model improves mathematical reasoning, function calling, and agentic natural-language-to-SQL (NL2SQL) performance while preserving general Korean and English instruction-following quality. These results provide a practical recipe and failure-mode analysis for adapting post-trained multilingual models to verifiable agentic workflows under memory-constrained deployment.
Specialist VLA: Planner-Routed LoRA Specialization for Long-Horizon Robotic Manipulation
Long-horizon robotic manipulation requires chaining semantically distinct primitives (reaching, grasping, moving, and placing), where per-primitive failures compound into end-to-end task breakdown. Existing hierarchical approaches use VLM planners to decompose instructions but route all primitives through a single generalist low-level controller, creating a planner-executor mismatch that limits reliability. We propose Specialist VLA, which retains a shared frozen TinyVLA-1.3B backbone while activating primitive-specific LoRA adapters selected by a Gemini-based planner, with dynamic re-querying every 30 control steps for mid-execution recovery. On a pick-and-place benchmark in Robosuite, Specialist VLA achieves 90% full-task success with re-querying versus 62% for a generalist baseline, demonstrating that primitive-level specialization and adaptive planning are complementary bottlenecks for long-horizon manipulation.
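The routing pattern described above (a planner picks a primitive, the matching adapter acts, and the planner is consulted again every 30 control steps) can be sketched as a plain control loop. Everything here is a hypothetical stand-in: `toy_planner` replaces the Gemini-based planner, the lambda "adapters" replace LoRA-adapted policies, and only the 30-step interval and the four primitive names come from the abstract.

```python
PRIMITIVES = ("reach", "grasp", "move", "place")
REQUERY_INTERVAL = 30  # control steps between planner queries (from the abstract)


def run_episode(planner, adapters, total_steps):
    """Route each control step through the adapter the planner last selected.

    `planner(step)` returns a primitive name and `adapters[name](step)` returns
    an action; both are hypothetical callables standing in for real components.
    """
    active = None
    log = []
    for step in range(total_steps):
        if step % REQUERY_INTERVAL == 0:
            active = planner(step)  # re-query allows mid-execution recovery
            if active not in adapters:
                raise KeyError(f"no adapter registered for primitive {active!r}")
        action = adapters[active](step)
        log.append((step, active, action))
    return log


def toy_planner(step):
    """Advance to the next primitive at each re-query, staying on the last one."""
    index = min(step // REQUERY_INTERVAL, len(PRIMITIVES) - 1)
    return PRIMITIVES[index]


# One trivial "specialist" per primitive; a real system would swap LoRA weights.
toy_adapters = {
    name: (lambda step, name=name: f"{name}-action") for name in PRIMITIVES
}

log = run_episode(toy_planner, toy_adapters, total_steps=120)
print(log[0], log[30], log[119])
```

A real planner would choose the primitive from the current observation rather than from the step count, so a failed grasp could be retried at the next re-query instead of advancing blindly.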
DPMI: A Principled Index for Neural Polysemanticity via Dirichlet Process Mixture Modeling
Reasoning Phases Are Continuous, Not Discrete: Evidence from Switching Linear Dynamical Systems Applied to Chain-of-Thought Residual Streams
Schema Discoverability, Not Locality, Drives MCP Cost Savings: A Controlled Decomposition
Announcing an MCP tool in the system prompt cuts agent cost by 56% on Claude Sonnet 4.6 (N=30, cold cache); the default Claude Code deployment omits that announcement and captures only 19% of the gap (-10.9% vs. -56.3%). Same server, same execution, same cache regime; only the system-prompt addition differs. A pure-announcement variant (announcement text, no use-instruction) reaches -45.8%, 77% of the gap, attributing the rest to instruction priming. Practitioner benchmarks reporting 32-100x MCP savings (OnlyCLI 2025; Speakeasy 2025) measure the announced regime; users running default tooling do not. Prior benchmarks do not separate this from primitive type. To isolate it, we decompose six agent memory primitives along three axes (schema availability, schema discoverability, execution locality) and run controlled contrasts. CLI-vs-script holds locality fixed and varies only schema availability; eager-vs-lazy MCP holds the server fixed and varies only announcement. Across three task families (file report, multi-file Python refactor, code Q&A) at N=30 with per-task UUID prefix-cache defeat, CLI and eager MCP overlap at -56% on report and within 9pp on the other families: locality contributes no detectable savings on report or refactor and at most ~9pp on Code Q&A. A UserPromptSubmit hook reaches -80% by pre-executing the work entirely. Task quality is preserved across all primitives (oracle pass rates 1.00/1.00/0.99 on report/refactor/Code Q&A). An exact additive re-parameterization (three axes plus a pre-execution scaling term that captures the hook ceiling) decomposes per-primitive means with moderately stable axis coefficients across families; schema discoverability alone accounts for ~40% of baseline cost. Per-seed cost correlates with agent turn counts (Pearson r=0.92 / 0.88 / 0.71 on report / refactor / Code Q&A), consistent with the three axes acting through agent reasoning volume. The ordering replicates under warm cache on Claude Sonnet 4.6 and Opus 4.7, with one notable exception: Claude Haiku 4.5 discovers the lazy-MCP schema unaided (lazy collapses to -71%), so the discoverability axis is mediated by the model's tool-discovery behavior, not a fixed structural property. Benchmark and reference implementations released as supplementary material.
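The two "share of the gap" figures in this abstract can be checked with a few lines of arithmetic. The abstract does not state the formulas, so the sketch below is only one reading that happens to reproduce the quoted 19% and 77%: the first is taken as default savings divided by announced savings, the second as the pure-announcement variant's position between default and announced savings. The function and constant names are hypothetical.

```python
# Savings figures quoted in the abstract (percent cost reduction vs. baseline).
DEFAULT_SAVINGS = 10.9            # default deployment, no announcement
ANNOUNCED_SAVINGS = 56.3          # tool announced in the system prompt
PURE_ANNOUNCEMENT_SAVINGS = 45.8  # announcement text only, no use-instruction


def share_of_announced(savings: float) -> float:
    """Savings as a percentage of the announced savings (assumed reading)."""
    return 100.0 * savings / ANNOUNCED_SAVINGS


def share_of_default_to_announced(savings: float) -> float:
    """Position between default and announced savings, in percent (assumed reading)."""
    span = ANNOUNCED_SAVINGS - DEFAULT_SAVINGS
    return 100.0 * (savings - DEFAULT_SAVINGS) / span


default_share = share_of_announced(DEFAULT_SAVINGS)
pure_share = share_of_default_to_announced(PURE_ANNOUNCEMENT_SAVINGS)

print(f"default captures {default_share:.0f}% of the announced savings")
print(f"pure announcement closes {pure_share:.0f}% of the default-to-announced gap")
```

Under these assumed definitions the two percentages use different denominators, so they should not be added or compared directly.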
Beyond the Moment: Conditioning Frozen VLAs on Memory for Long-Horizon Manipulation Tasks
Foundation robotics models, most popularly Vision Language Action (VLA) models, struggle to perform well over long-horizon tasks due to their reliance on immediate sensory input. This induces compounding errors over inference timesteps, further exacerbated by non-robust backbones. To address this, we introduce Training-Free Memory Conditioned Action Generation, a non-parametric retrieval-augmented framework that conditions a frozen VLA on historical expert trajectories. Our approach constructs a memory of expert demonstrations and uses a state-centric retrieval mechanism to guide action generation without any fine-tuning. In extensive evaluation on 5 datasets with SOTA models, we show relative gains of up to 27% in task completion success. As an additional contribution, we extend the popular CALVIN benchmark to task horizons of 6 and beyond, showing relative gains of up to 30% while also demonstrating robustness to corrupted observations. Real-world experiments on complex tasks further demonstrate performance gains of up to 2×.
Gym-Anything: Turn Any Software into an Agent Environment
Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.
Invited talk Session 2
LongMemEval-V2: Benchmarking Agent Memory for Experienced Colleagues
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, workflows, state dynamics, and recurring failure modes. We introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can accumulate environment-specific experience from multimodal web agent trajectories. LME-V2 contains 451 manually curated questions from customized shopping, forum, admin, and ServiceNow-style environments, with histories ranging from 25M to 115M tokens. Frontier LLMs reach at most 14.1% without trajectory evidence, confirming that LME-V2 requires learned experience beyond parametric knowledge. We evaluate memory under a context-gathering formulation and propose AgentRunbook: AgentRunbook-R is an efficient RAG pipeline over raw states, transitions, and notes, while AgentRunbook-C uses a scaffolded coding agent to gather evidence from trajectory files. AgentRunbook-C achieves the best overall accuracy, reaching 74.9% on LME-V2-Small and 70.1% on LME-V2-Medium, while improving the accuracy and latency trade-off over an off-the-shelf coding agent. We will release the benchmark and memory implementations.
Exploratory and Assimilating Reflection: Reflective Recall Cycle for Long-term Memory
LLM-based autonomous agents require external memory to overcome their statelessness and limited context window for long-term interaction and dynamic knowledge reasoning. However, existing memory retrieval methods often lack adaptability and sample efficiency, and struggle to retrieve the right mixture of memories from heterogeneous stores. We propose Exploratory-Assimilating Reflection (EAR), a framework for high initial retrieval performance and sample-efficient adaptation. EAR combines two mechanisms: Exploratory Reflection, which performs iterative search to bootstrap retrieval and collect useful experiences for each query, and Assimilating Reflection, which replays these experiences from an Experience Buffer to refine a global reranker more efficiently than methods relying only on immediate rewards. Experiments show that EAR improves retrieval by up to 17.9% over the baseline retriever on two long-term dialogue benchmarks. We also show that EAR is highly sample-efficient and robust to noisy feedback.