

Timezone: Asia/Seoul

Opening Remark by Organizers

Jaehong Yoon ⋅ Souvik Kundu ⋅ Digbalay Bose
8:00 AM - 8:15 AM

Invited talk Session 1

James Zou ⋅ Mengdi Wang ⋅ Mohit Bansal
8:15 AM - 9:45 AM

BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Sangyeon Yoon ⋅ Sunkyoung Kim ⋅ Hyesoo Hong ⋅ Wonje Jeung ⋅ Yongil Kim ⋅ Wooseok Seo ⋅ Heuiyeen Yeen ⋅ Albert No
9:45 AM - 10:15 AM

Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.


Real-Time Visual Attribution Streaming in Thinking Model

Seil Kang ⋅ Woojung Han ⋅ Junhyeok Kim ⋅ Jinyeong Kim ⋅ Youngeun Kim ⋅ Seong Jae Hwang
9:45 AM - 10:15 AM

We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps offer instant access but lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.


Multi-turn Evaluation of Deep Research Agents Under Process-Level Feedback

Rishabh Sabharwal ⋅ Hongru WANG ⋅ Amos Storkey ⋅ Jeff Pan
9:45 AM - 10:15 AM

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 11–15 points and yielding a roughly 35–37% incorporation rate for both models; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate.
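The incorporation and regression rates mentioned in this abstract can be illustrated with a small sketch. The definitions used here are assumptions, since the abstract does not spell them out: incorporation is taken as the share of previously unsatisfied rubric criteria that become satisfied after a revision, and regression as the share of previously satisfied criteria that become unsatisfied. The function name `turn_rates` and the criterion labels are hypothetical.

```python
def turn_rates(before: dict, after: dict) -> tuple:
    """Return (incorporation_rate, regression_rate) between two report versions.

    `before` and `after` map rubric criterion ids to True (satisfied) or
    False (unsatisfied). Both definitions are assumptions for illustration.
    """
    unmet = [c for c, ok in before.items() if not ok]
    met = [c for c, ok in before.items() if ok]
    incorporated = sum(1 for c in unmet if after[c])
    regressed = sum(1 for c in met if not after[c])
    inc_rate = incorporated / len(unmet) if unmet else 0.0
    reg_rate = regressed / len(met) if met else 0.0
    return inc_rate, reg_rate


# Hypothetical rubric with four criteria: one gap is closed, one criterion is lost.
before = {"a": True, "b": True, "c": False, "d": False}
after = {"a": True, "b": False, "c": True, "d": False}
inc, reg = turn_rates(before, after)
print(inc, reg)  # 0.5 0.5
```

In this toy case the number of satisfied criteria is unchanged (two before, two after), which mirrors the pattern described for self-reflection: equal incorporation and regression yield no net improvement.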

9:45 AM - 10:15 AM
10:15 AM - 10:30 AM

LongMemEval-V2: Benchmarking Agent Memory for Experienced Colleagues

Di Wu ⋅ Zixiang Ji ⋅ Asmi Kawatkar ⋅ Bryan Kwan ⋅ Jia-Chen Gu ⋅ Nanyun Peng ⋅ Kai-Wei Chang
10:30 AM - 11:30 AM

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, workflows, state dynamics, and recurring failure modes. We introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can accumulate environment-specific experience from multimodal web agent trajectories. LME-V2 contains 451 manually curated questions from customized shopping, forum, admin, and ServiceNow-style environments, with histories ranging from 25M to 115M tokens. Frontier LLMs reach at most 14.1% without trajectory evidence, confirming that LME-V2 requires learned experience beyond parametric knowledge. We evaluate memory under a context-gathering formulation and propose AgentRunbook: AgentRunbook-R is an efficient RAG pipeline over raw states, transitions, and notes, while AgentRunbook-C uses a scaffolded coding agent to gather evidence from trajectory files. AgentRunbook-C achieves the best overall accuracy, reaching 74.9% on LME-V2-Small and 70.1% on LME-V2-Medium, while improving the accuracy and latency trade-off over an off-the-shelf coding agent. We will release the benchmark and memory implementations.

10:30 AM - 11:30 AM
Agentic planners must decide whether a fixed inference budget should buy more complete samples or a managed search frontier. We give a pre-deployment rule for this choice. The deployed LM, used as its own pruning verifier, is calibrated once to a precision $p$ on locally true-improving partial states. Combining $p$ with the mean horizon $\bar{L}$ yields a frontier-survival score $A_k \approx [1 - (1-p)^k]^{\bar{L}}$, and the smallest beam $k$ whose score clears a target success rate is the recommended width. The oracle labels used to measure $p$ appear only on a small calibration split, never inside the test-time solver. On our 100-maze MazeBench (4×4 to 6×6, generator-controlled), the rule predicts the useful regime at $k=3$ from a single calibration scalar $p=0.816$. Same-LM guided search then solves 98 of 100 mazes in 14.4 seconds per maze on one A100, while SC-10 solves 9 in 17.7 seconds. A Qwen2.5-3B holdout, a Qwen2.5-VL-7B run on rendered visual mazes (40/50 vs 1/50 for SC-10), and a same-LM two-ply Gomoku tournament (100/100 wins) all land on the safe side of the calibrated boundary, yielding a design map from $(p, \bar{L})$ to the minimum admissible beam across model size, modality, and task.
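As a rough sketch of the selection rule in this abstract, the score $A_k \approx [1 - (1-p)^k]^{\bar{L}}$ can be evaluated for increasing $k$ until it clears a target. Only $p = 0.816$ and $k = 3$ come from the abstract; the mean horizon of 10 steps, the target of 0.9, and the function names are assumptions chosen for illustration.

```python
def frontier_survival(p: float, k: int, mean_horizon: float) -> float:
    """Frontier-survival score A_k ~= [1 - (1 - p)^k]^L from the abstract."""
    return (1.0 - (1.0 - p) ** k) ** mean_horizon


def min_beam_width(p: float, mean_horizon: float, target: float, k_max: int = 64):
    """Smallest beam width k whose score clears `target`, or None if none does."""
    for k in range(1, k_max + 1):
        if frontier_survival(p, k, mean_horizon) >= target:
            return k
    return None


# p is the calibrated precision reported in the abstract; the horizon and
# the target success rate are hypothetical values, not from the paper.
print(min_beam_width(p=0.816, mean_horizon=10, target=0.9))  # 3
```

Under these assumed values, $k = 1$ and $k = 2$ score about 0.13 and 0.71, while $k = 3$ scores about 0.94, so the rule returns a width of 3.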

Detection-guided Attention Steering for Vision Language Models

Alan W Zhang ⋅ Rui Pan ⋅ Mike Wong ⋅ Ravi Netravali
10:30 AM - 11:30 AM

Modern Vision Language Models (VLMs) have demonstrated remarkable versatility in a wide range of applications, including multimodal reasoning, captioning, and analysis. However, VLMs still struggle with the fundamental tasks of object detection and localization, especially when compared to traditional Convolutional Neural Network (CNN) models. Contemporary VLMs tend to allocate very little attention to non-textual inputs such as image tokens and lack the inductive biases present in CNNs, making it difficult for them to identify challenging objects in complex scenes. To address these limitations, we propose a detection-guided dynamic attention steering system that leverages the locality insight from CNNs to efficiently steer a VLM's attention toward more relevant sections of an image. The steering intensity during each inference is dynamically scaled according to the confidence of CNN-generated bounding boxes. Extensive evaluations across multiple state-of-the-art VLMs show substantial improvements on object localization-oriented benchmarks, achieving up to a 4% accuracy gain. Results demonstrate the effectiveness of combining different model architectures to harness their respective strengths for advancing VLM capabilities.


Tool-Augmented VLM Agents for Zero-Shot 3D Visual Grounding

Cuong Huynh ⋅ Maxim Popov ⋅ Denis Gridusov ⋅ Sergey Kolyubin
10:30 AM - 11:30 AM
3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and avoids prompts filled with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +1.5% on Nr3D, with a notable +7.6% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding.

Real-Time Visual Attribution Streaming in Thinking Model

Seil Kang ⋅ Woojung Han ⋅ Junhyeok Kim ⋅ Jinyeong Kim ⋅ Youngeun Kim ⋅ Seong Jae Hwang
10:30 AM - 11:30 AM



Multi-turn Evaluation of Deep Research Agents Under Process-Level Feedback

Rishabh Sabharwal ⋅ Hongru WANG ⋅ Amos Storkey ⋅ Jeff Pan
10:30 AM - 11:30 AM



Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Lukasz Borchmann ⋅ Jordy Van Landeghem ⋅ Michał Turski ⋅ Shreyansh Padarha ⋅ Ryan Kearns ⋅ Adam Mahdi ⋅ Niels Rogge ⋅ Clémentine Fourrier ⋅ Siwei Han ⋅ Huaxiu Yao ⋅ Artemis Llabrés ⋅ Yiming Xu ⋅ Dimosthenis Karatzas ⋅ Hao Zhang ⋅ Anupam Datta
10:30 AM - 11:30 AM

Multimodal agents offer a compelling path to automating complex document-intensive workflows, yet a critical question remains: do these architectures demonstrate genuine strategic reasoning, or simply conduct stochastic trial-and-error search? To address this, we introduce Agentic Document VQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power and reliably differentiate between varying levels of agent capability. To rigorously assess agentic behavior, we introduce a novel evaluation protocol for measuring the accuracy-effort trade-off. Using this framework, we find that humans show strong metacognitive calibration, adapting or abandoning failed strategies, whereas frontier agents often persist in unproductive loops with diminishing returns. We release the dataset, evaluation harness, and leaderboard to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

10:30 AM - 11:30 AM

Multi-agent debate (MAD) is a powerful paradigm for combining multiple large language models (LLMs) to achieve robust reasoning, but prior work has largely focused on language-only settings, leaving its multimodal potential underexplored. We present Weighted Iterative Society-of-Experts (WISE), a generalized MAD framework that systematically integrates heterogeneous multimodal LLMs to address challenging vision-and-language tasks in a zero-shot setting. Our key idea is to factor agents into three roles based on their multimodal capabilities: Solvers, which process multimodal inputs and generate candidate solutions; Reflectors, which may or may not access multimodal inputs but evaluate solutions, provide feedback, and assign weights; and an Orchestrator, which operates unimodally to reason over solutions and feedback and produce directives that guide subsequent reasoning. To account for varying agent reliability, we introduce an unsupervised probabilistic aggregation method, termed WISE–Dawid–Skene, which leverages the weighting scheme in WISE-MAD to adaptively combine agent outputs. We evaluate WISE on several challenging mathematical reasoning datasets and show that it consistently outperforms state-of-the-art methods across diverse LLM configurations, demonstrating its effectiveness as a general and scalable multimodal reasoning framework.


ALPS: Adaptive Lineage-Aware Parallel Search for LLM-Driven Optimization

Junlin Chen ⋅ Haolong Jia ⋅ Daize Dong ⋅ Jiawei WU ⋅ Hongyi Wang
10:30 AM - 11:30 AM

Language-model agents are increasingly used as black-box optimization policies: they propose a code or configuration edit, evaluate it with a fixed harness, observe a scalar metric, and iterate. Serial agent loops use feedback efficiently but under-utilize parallel hardware, whereas naive parallel loops improve utilization but evaluate many candidates against stale or non-composable baselines. We formulate this tension as a scheduling problem over an asynchronous experiment lineage. We propose ALPS (Adaptive Lineage-Aware Parallel Search), a scheduler that combines a lineage selector, an operator-level bandit, a commit-hazard controller for adaptive parallelism, and a noise-aware promotion gate. ALPS treats the LLM as a candidate proposer and assigns all stateful decisions (dispatch, validation, commit, and rebase) to the scheduler. We evaluate ALPS on two tasks: a small-GPT autoresearch benchmark and a Qwen3 supervised fine-tuning data-mixture search. Across both tasks, we compare three policies (serial, naive parallel, and ALPS) under matched wall-clock budgets. Preliminary results suggest that lineage-aware scheduling can recover cumulative-improvement behavior while retaining the throughput advantages of parallel execution.


ASH: Agents that Self-Hone via Embodied Learning

Benjamin Schneider ⋅ Xavier Schneider ⋅ Victor Zhong ⋅ Sun Sun
10:30 AM - 11:30 AM

Mastering long-horizon embodied tasks remains a fundamental challenge for AI, as current methods often fail due to noisy data or intractable reward engineering. We introduce ASH, a fully autonomous agentic system that overcomes these limitations without any human involvement: no reward shaping, no expert annotation, and no domain-specific data curation. When encountering an impasse, ASH uses its own experience to retrieve and learn from relevant internet video. Evaluated in Pokémon Emerald, a complex RPG spanning dozens of hours, ASH dramatically outperforms baselines: while behavioral cloning and general-purpose foundation models (Qwen Team, 2026) collapse to near-zero milestone completion within the first few minutes, ASH sustains robust progression across multi-hour gameplay by continuously and autonomously acquiring new skills. This demonstrates that fully autonomous, self-improving agents are a scalable path for open-ended, long-horizon embodied learning.


Dyserve: Dynamic Strategy Generation for Agent Serving

Jiayi Qian ⋅ Zishen Wan ⋅ Hanchen Yang ⋅ Souvik Kundu ⋅ Tushar Krishna
10:30 AM - 11:30 AM
Agentic workflows orchestrate optional operators (verifiers, routers, retries, escalations) but commit to a fixed pipeline per task type. This is structurally suboptimal: different requests, models, and runtime conditions favor different operator subsets. Additionally, no single hand-written workflow is uniformly Pareto-optimal. We present Dyserve, a framework that generates the serving strategy as a structured subgraph over a workflow abstraction by solving an integer linear program (ILP). The ILP jointly selects per-node execution options (model, verification, speculative execution, retry) weighted by offline-profiled coefficients for service quality and latency performance. On LiveCodeBench, the generated strategies match the best-verify accuracy with a roughly 10× latency reduction. Additionally, we describe an event-driven suffix-repair extension that resolves a smaller ILP over the residual workflow when high-impact runtime events fire.
10:30 AM - 11:30 AM

Enterprise environments differ fundamentally from the clean settings assumed in LLM research: knowledge is distributed across heterogeneous sources, often incomplete or inconsistent, and key procedural logic is implicitly encoded in artifacts rather than explicitly documented. In such settings, retrieval-based approaches are insufficient, as no single source contains the full workflow. We propose a replication-driven knowledge distillation framework for scalable learning in multimodal agents. The agent learns by reverse-engineering validated artifacts (e.g., Excel workbooks), reconstructing the underlying data pipeline, and distilling the inferred logic into structured knowledge (claims, procedures, and domain patterns). This enables synthesis and validation across noisy sources and supports reuse in future tasks. We evaluate on 120 simulated enterprise environments with multimodal inputs (SQL, spreadsheets, documentation, messaging app, emails, images, PDFs, CSV) and controlled noise. Our method consistently outperforms retrieval-based baselines on both task execution and conceptual understanding, and remains robust under environmental drift.


Multi-agent LLM systems increasingly depend on orchestrator agents to coordinate specialist agents, yet orchestration strategies remain informal, chosen by convention rather than analysis. We formalize the orchestrator as a constrained optimization problem over a typed agent registry with capability declarations, cost profiles, and trust boundaries. We define three coordination strategies, sequential delegation, parallel fan-out, and hierarchical decomposition, and derive threshold conditions under which each minimizes a combined cost-latency objective for a given task dependency structure. We implement OrchestRAte, a strategy-switching orchestrator that selects coordination mode dynamically via dependency graph analysis, and evaluate it on three enterprise workflow benchmarks: document processing, compliance review, and incident triage. On these benchmarks, OrchestRAte reduces end-to-end cost by 28–42% relative to static sequential baselines while trading at most 0.7 percentage points of task-completion quality. The cost gains come with a caveat: on highly parallelizable tasks, a static parallel baseline still achieves the lowest raw latency because it avoids the overhead of strategy selection entirely. We identify an empirical orchestrator overhead threshold around 15% of total token budget, beyond which the coordination cost of hierarchical strategies erodes their planning advantage.


Compositional text-to-image generation requires satisfying multiple attribute constraints simultaneously, a task that single-pass generation routinely fails on. Two recent lines of work address this from opposite ends: agentic iterative refinement trains only the language planner while freezing the image generator, and unified policy optimization trains language and visual generation components but in a single generation round. We present NEXUS, a work-in-progress framework combining both properties: iterative, multi-round refinement with joint RL training of both the LLM planner (Qwen2.5-7B) and the reference-conditioned diffusion editor (FLUX.1-Kontext-dev) via shared GRPO reward. A dual-channel Bridge routes continuous LLM hidden states and discrete text instructions to the editor's conditioning mechanism. Empirically, using three-seed means on the full 553-prompt GenEval prompt set scored with our Qwen-VQA compositional score rather than the official GenEval evaluator: (1) zero-shot Gemini with the same iterative refinement loop reaches 0.740 and serves as the matched baseline; (2) NEXUS full reaches 0.734 at step 50 and 0.780 at step 100, surpassing this matched Gemini baseline by +4.0 pp at step 100; and (3) planner-only and no-Bridge variants do not exceed the matched Gemini baseline at step 100, highlighting the importance of the full co-adaptive system.


Zero-Shot Utility and Efficient Adaptation in Vision-Language Multi-Agent Control

Daniel Masamba ⋅ Christian I Narcia-Macias ⋅ Erik Enriquez ⋅ Dongchul Kim
10:30 AM - 11:30 AM

Vision-language models (VLMs) provide a promising starting point for multi-agent control because they can act before environment-specific training. However, it remains unclear how far this zero-shot advantage extends in partially observable cooperative settings, and how adapted VLM agents compare with specialized multi-agent reinforcement learning (MARL) policies. This work studies this question in a cooperative pursuit benchmark with controlled distribution shifts spanning visual appearance, semantic remapping, observation layout, agent counts, and environment scale. We compare zero-shot VLMs, LoRA-adapted VLMs, cold-start MARL, and fully trained IPPO/MAPPO baselines under matched local-observation constraints. Results show that zero-shot VLMs provide useful behavior where cold-start RL is nearly non-functional, and that supervised adaptation with 1k–25k expert demonstrations rapidly produces competitive long-horizon controllers. Adapted VLMs are especially robust to visual and semantic shifts, while trained MARL remains strongest on the in-distribution task and some coordination-heavy variants. Overall, the results reveal complementary scaling behavior: VLMs adapt rapidly from limited expert data and generalize better across visual-semantic shifts, while MARL remains stronger when extensive task-specific interaction is available.


F-TIS: Harnessing Diverse Models in Collaborative GRPO

Nikolay Blagoev ⋅ Oguzhan Ersoy ⋅ Wendelin Boehmer ⋅ Lydia Y. Chen
10:30 AM - 11:30 AM

Reinforcement learning methods such as GRPO have become popular for LLM post-training. In GRPO, models produce completions for a set of prompts, the completions are rewarded, and the policy is updated toward the completions with relatively high reward. Because models are auto-regressive, the generation phase of this style of training can be extremely time-consuming. As a solution, prior work has sought to distribute the inference step across many nodes working in parallel. These works primarily assume homogeneous models in training in order to keep samples as close to on-policy as possible. This assumption may be impractical in decentralized systems, where parties with varying compute and preferences may wish to collaborate on the same task. Thus, decentralized training requires an approach that can handle heterogeneous models: different models collaborating on the same tasks. However, this leads to highly off-policy samples during training, and prior work has found that off-policy samples can hurt GRPO convergence. To enable heterogeneity, we propose Filtered Truncated Importance Sampling (F-TIS), a GRPO-style training paradigm that can use off-policy samples to improve a local model's learning. Our framework allows various models to collaborate in the same RL training run while being communication-efficient. We extensively evaluate F-TIS in various heterogeneous setups and show that it exhibits identical final model convergence to purely on-sample training. Furthermore, in some setups we observe better generalization on out-of-distribution tasks than with on-policy training, increasing the model's performance by up to 12%.
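For readers unfamiliar with truncated importance sampling, the generic idea can be sketched as follows. This is not the F-TIS algorithm itself, whose filtering rule and weighting are not described in the abstract: the cap value, the `min_ratio` filter, and the function name are all hypothetical, chosen only to show how an off-policy sample could be reweighted or discarded.

```python
import math


def filter_and_weight(samples, cap=1.5, min_ratio=0.1):
    """Reweight off-policy samples with a truncated importance ratio.

    Each sample is (logp_local, logp_behavior, advantage), where
    logp_local is the log-probability of the completion under the model
    being trained and logp_behavior is its log-probability under the
    model that generated it. The `min_ratio` filter is a hypothetical
    stand-in for whatever filtering F-TIS actually applies.
    """
    weighted = []
    for logp_local, logp_behavior, advantage in samples:
        ratio = math.exp(logp_local - logp_behavior)
        if ratio < min_ratio:
            continue  # too unlikely under the local policy: drop the sample
        weighted.append(min(ratio, cap) * advantage)  # truncate large ratios
    return weighted


samples = [
    (math.log(0.5), math.log(0.25), 1.0),   # ratio about 2.0, truncated to the cap
    (math.log(0.01), math.log(0.5), 1.0),   # ratio 0.02, filtered out
    (math.log(0.2), math.log(0.2), -1.0),   # ratio 1.0, kept unchanged
]
print(filter_and_weight(samples))  # [1.5, -1.0]
```

Truncating the ratio bounds the variance contributed by any single off-policy sample, at the cost of some bias; that trade-off is the usual motivation for truncated importance sampling in general.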


Don’t Blind Your VLA: Aligning Visual Representations for OOD Generalization

Nikita Kachaev ⋅ Mikhail Kolosov ⋅ Daniil Zelezetsky ⋅ Alexey Kovalev ⋅ Aleksandr Panov
10:30 AM - 11:30 AM

The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe the VLA's hidden representations and analyze attention maps. We also design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities.


LOGIV: Logic-Graph Inference with VAL-Verification for Long-Horizon Robotic Manipulation

LU Enqiao ⋅ Zhenglin Wan ⋅ Xingrui Yu ⋅ Pengfei Zhou ⋅ Zhao ⋅ Yang You ⋅ Ivor Tsang
10:30 AM - 11:30 AM

Long-term robotic operation has long been plagued by temporal failures during execution, as static task instructions (i.e., language conditions) fail to provide dynamic guidance for complex stages. This results in stage confusion and goal deviation in multi-stage scenarios, despite the strong short-term reaction capabilities of the underlying actuators. In this paper, we demonstrate that achieving robust long-term autonomy does not require retraining the actuators or complex hierarchical architectures, but rather temporal reparameterization of the task-instruction interface. We propose a training-free dual-system reasoning framework, LOGIV (LOgic-Graph Inference with VAL-verification), which decomposes global instructions into dynamic, multi-stage natural language priors. To ensure logical consistency, we introduce a graph-based self-correction mechanism that utilizes formal verification and standardized repair operators to autonomously correct plans generated by large language models. Experimental results on datasets such as DROID, AgiBot, EgoDex, and RoboTwin 2.0 show that our framework significantly improves task success rates without modifying the underlying model weights, with more pronounced effects on more complex tasks. Due to its architecture-agnostic and lightweight nature, LOGIV offers an efficient and plug-and-play solution for bridging the long-term temporal gap in existing world action models.


Persistent memory makes multimodal agents more capable, but it also creates a new attack surface: once unsupported content is written into memory, later retrieval and consolidation can reuse it as if it were reliable state. We study write-time defense for multimodal agent memory. Our system, SAGE-Mem, separates transient evidence from durable belief: observations may be stored as evidence, but they are promoted to belief only when they are sufficiently supported, independent, and non-conflicting. This targets a gap left by retrieval-time defenses, which act only after poisoned content has already entered memory. We evaluate on LoCoMo-Adv, an adversarial multimodal extension of LoCoMo-10, and on MM-BrowseComp-Adv, a multimodal browsing benchmark covering answer-overwrite, OCR, vision-caption, and visual-prompt attacks. On LoCoMo-Adv, at a conservative operating point, SAGE-Mem eliminates observed write admission and retrieval contamination relative to a retrieval-time baseline, but reduces benign completion under attack (0.460 vs. 0.642). On the canonical browsing overwrite setting, BrowseGuard, a browsing-specific write policy built on the same principle, blocks all 388 direct and paraphrased overwrite attempts while keeping attacked utility near its clean level (0.155 vs. 0.160). On the broader five-attack browsing suite, extending the same guarded write policy across browser, OCR, and caption channels reduces Write ASR from 0.2552 to 0.0369 and Retrieval ASR from 0.5636 to 0.3694. Overall, the results suggest that for memory-bearing agents, robustness should be evaluated not only at retrieval, but also at the point where observations become persistent state.


BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Sangyeon Yoon ⋅ Sunkyoung Kim ⋅ Hyesoo Hong ⋅ Wonje Jeung ⋅ Yongil Kim ⋅ Wooseok Seo ⋅ Heuiyeen Yeen ⋅ Albert No
10:30 AM - 11:30 AM


10:30 AM - 11:30 AM

We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large language models (LLMs) as mutation operators across peer nodes communicating with non-blocking collective operations. Unlike homogeneous parallel search, which replicates a single model’s inductive biases across all workers, DEI treats each LLM’s distinct creative prior as a complementary source of behavioral novelty. Extending the Digital Red Queen framework with DEI, nodes share local optimal solutions at the end of each round to seed the next round’s population. This creates cross-model adversarial pressure that drives robustness beyond intra-model self-play. Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5) achieves +124% higher merged-archive QD-Score (45.90 vs. 20.46) and +28% higher coverage (80.6% vs. 63.0% of cells) than a single-node baseline at equal total LLM-call budget. The heterogeneous ensemble also outperforms an equally-budgeted homogeneous ensemble on QD-Score, coverage, and held-out solution generality across all four model families. These results provide the first empirical evidence that model diversity, not merely parallelism, is the key driver of gain in distributed LLM-based QD search.
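
For readers unfamiliar with the reported metrics, the sketch below shows how a merged archive, QD-Score, and coverage are conventionally computed in Quality-Diversity search, and checks the relative gains quoted above. The function names and the dictionary representation of an archive are assumptions, and the paper's exact normalization of QD-Score may differ.

```python
def merge_archives(archives):
    """Merge per-node archives (cell -> fitness), keeping the best fitness per cell."""
    merged = {}
    for archive in archives:
        for cell, fitness in archive.items():
            if cell not in merged or fitness > merged[cell]:
                merged[cell] = fitness
    return merged


def qd_score(archive):
    """Conventional QD-Score: sum of fitness over all filled cells."""
    return sum(archive.values())


def coverage(archive, total_cells):
    """Fraction of behavior-space cells that hold at least one solution."""
    return len(archive) / total_cells


def relative_gain(new, old):
    """Relative improvement of new over old, as a fraction."""
    return (new - old) / old


# Toy archives from two hypothetical nodes over a 2x2 behavior grid.
node_a = {(0, 0): 0.5, (0, 1): 0.25}
node_b = {(0, 1): 0.75, (1, 1): 0.25}
merged = merge_archives([node_a, node_b])

# Figures quoted in the abstract.
qd_gain_pct = round(100 * relative_gain(45.90, 20.46))
coverage_gain_pct = round(100 * relative_gain(80.6, 63.0))
```

Note that the coverage gain is relative (80.6 / 63.0), not the 17.6 percentage-point difference.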

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$--$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$--$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
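
As a rough illustration of the second diagnostic, the sketch below builds a split conformal prediction set over 1-5 Likert scores. It uses the common nonconformity score of one minus the probability assigned to a label; that choice, the function names, and the toy numbers are assumptions, not details taken from the paper.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label is admitted.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, qhat):
    """Keep each Likert label whose nonconformity score (1 - p) is at most qhat."""
    return [label for label, p in sorted(probs.items()) if 1.0 - p <= qhat]


# Hypothetical calibration scores: 1 - p(true label) on held-out judged items.
cal_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70]
qhat = conformal_quantile(cal_scores, alpha=0.25)

# Hypothetical judge distribution over Likert scores for one new summary.
probs = {1: 0.03, 2: 0.05, 3: 0.45, 4: 0.42, 5: 0.05}
likert_set = prediction_set(probs, qhat)
set_width = len(likert_set)
```

A wider set signals an input on which the judge is less reliable, which is the per-instance indicator the abstract describes.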
10:30 AM - 11:30 AM

Multimodal STEM reasoning remains challenging because visual grounding and computation are often entangled in a single inference pass, leading to deterministic failures. We study whether explicitly decomposing these steps at test time can improve performance on multimodal physics and mathematics problems. To this end, we propose PRISM, a multi-agent framework that separates visual grounding, textual enrichment, and program-aided reasoning. Our results on the SeePhys dataset indicate that structured decomposition is most beneficial when visual dependency and reasoning complexity are high. These findings suggest that separating perception from reasoning can be a practical alternative to inference-time scaling for multimodal STEM tasks. Furthermore, we evaluate its generalization on the MATH-Vision benchmark for mathematical reasoning, demonstrating the robustness of our method.

10:30 AM - 11:30 AM

We present LS-111B, a 111B-parameter hybrid reasoning model for Korean-English enterprise agents under practical memory and serving constraints. The model trains from a fully post-trained enterprise language model rather than a new pretraining run, and uses preamble conditioning to switch between concise non-reasoning behavior and longer tool-oriented reasoning. We study four choices for scaling tool-using agents efficiently: multilingual supervised fine-tuning, reinforcement learning with verifiable rewards for multi-step tool-use tasks, language-consistency rewards for Korean user-facing responses, and 4-bit quantization for single-GPU serving. The adapted model improves mathematical reasoning, function calling, and agentic natural-language-to-SQL (NL2SQL) performance while preserving general Korean and English instruction-following quality. These results provide a practical recipe and failure-mode analysis for adapting post-trained multilingual models to verifiable agentic workflows under memory-constrained deployment.


Specialist VLA: Planner-Routed LoRA Specialization for Long-Horizon Robotic Manipulation

Christian I Narcia-Macias ⋅ Daniel Masamba ⋅ Erik Enriquez ⋅ Dongchul Kim
10:30 AM - 11:30 AM

Long-horizon robotic manipulation requires chaining semantically distinct primitives (reaching, grasping, moving, and placing), where per-primitive failures compound into end-to-end task breakdown. Existing hierarchical approaches use VLM planners to decompose instructions but route all primitives through a single generalist low-level controller, creating a planner-executor mismatch that limits reliability. We propose Specialist VLA, which retains a shared frozen TinyVLA-1.3B backbone while activating primitive-specific LoRA adapters selected by a Gemini-based planner, with dynamic re-querying every 30 control steps for mid-execution recovery. On a pick-and-place benchmark in Robosuite, Specialist VLA achieves 90% full-task success with re-querying versus 62% for a generalist baseline, demonstrating that primitive-level specialization and adaptive planning are complementary bottlenecks for long-horizon manipulation.
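
The routing loop described above can be sketched as follows. Only the 30-step re-query interval comes from the abstract; `run_episode`, the toy planner, and the toy adapters are hypothetical stand-ins for the Gemini-based planner and the LoRA-specialized policies.

```python
REQUERY_EVERY = 30  # control steps between planner queries (figure from the abstract)


def run_episode(planner, adapters, observe, max_steps):
    """Route each control step through the adapter for the planner's current primitive."""
    trace = []
    primitive = None
    for step in range(max_steps):
        obs = observe(step)
        if step % REQUERY_EVERY == 0:
            # Re-query the planner so a failed primitive can be retried or replaced.
            primitive = planner(obs)
            if primitive == "done":
                break
        action = adapters[primitive](obs)
        trace.append((primitive, action))
    return trace


def toy_planner(obs):
    """Hypothetical planner: picks a primitive from the step index used as observation."""
    if obs < 30:
        return "reach"
    if obs < 60:
        return "grasp"
    return "done"


# Hypothetical adapters: each stands in for the frozen backbone plus one LoRA adapter.
toy_adapters = {
    "reach": lambda obs: f"reach@{obs}",
    "grasp": lambda obs: f"grasp@{obs}",
}

trace = run_episode(toy_planner, toy_adapters, observe=lambda step: step, max_steps=100)
```

Because the planner is consulted only at re-query boundaries, the active adapter stays fixed for 30 steps at a time, which keeps planner calls infrequent relative to control steps.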

Polysemanticity, where a single neuron responds to multiple unrelated concepts, is a central obstacle in mechanistic interpretability, yet the field lacks a principled continuous scalar for it. We introduce the $\textbf{Dirichlet Process Polysemanticity Index}$ (DPMI), a per-neuron score that combines inferred component count and component separation by fitting a non-parametric DPGMM and weighting by mean pairwise Jensen-Shannon divergence. On a controlled toy benchmark, DPMI achieves Spearman $\rho = 0.755$ and AUROC $= 0.877$, outperforming seven baselines. Against an independent Fourier-analytic ground truth from a modular-arithmetic transformer, DPMI remains significant with $\rho = 0.255$ ($p < 10^{-8}$). Across six architectures, we find a robust cross-modal law: language models are significantly more polysemantic than vision models ($d = 0.803$, $p < 10^{-129}$). Ablations show the non-parametric prior is essential (removing it drops $\rho$ by $0.040$--$0.045$), and DPMI-guided SAE budget allocation improves reconstruction $R^2$ for the most polysemantic quartile by $+0.010$ at fixed compute.
A widespread assumption in mechanistic interpretability holds that chain-of-thought (CoT) reasoning unfolds through discrete, recoverable cognitive phases, a prediction that would enable phase-specific circuit analysis and steering interventions. We test this using Switching Linear Dynamical Systems (SLDS) applied to residual-stream activations of DeepSeek-R1-Distill-Llama-8B across 997 MATH-benchmark traces at layer 16, complemented by a boundary diagnostic and a variance-discrimination analysis. Phase boundaries produce statistically significant but metrically weak distributional shifts (PC2: Cohen's $d = -0.293$, $p = 8.5\times10^{-6}$), and PCA directions are statistically independent of phase-discriminative directions (Spearman $\rho = -0.025$, $p = 0.78$), explaining why standard dimensionality reduction systematically discards the phase signal. Across all three experimental conditions and hyperparameter regimes, SLDS fails categorically to recover phase sequences (NMI $\leq 0.005$); inferred states instead capture positional structure ($\chi^2 = 2343$, $p \approx 0$) and syntactic token-type patterns ($\chi^2 = 293$, $p < 10^{-44}$). We conclude that CoT reasoning is a \emph{continuous dynamical process}: discrete-phase interpretability frameworks will systematically underfit residual-stream dynamics, and continuous-trajectory approaches are necessary.

Announcing an MCP tool in the system prompt cuts agent cost by 56% on Claude Sonnet 4.6 (N=30, cold cache); the default Claude Code deployment omits that announcement and captures only 19% of the gap (-10.9% vs. -56.3%). Same server, same execution, same cache regime; only the system-prompt addition differs. A pure-announcement variant (announcement text, no use-instruction) reaches -45.8%, 77% of the gap, attributing the rest to instruction priming. Practitioner benchmarks reporting 32-100x MCP savings (OnlyCLI 2025; Speakeasy 2025) measure the announced regime; users running default tooling do not. Prior benchmarks do not separate this from primitive type. To isolate it, we decompose six agent memory primitives along three axes (schema availability, schema discoverability, execution locality) and run controlled contrasts. CLI-vs-script holds locality fixed and varies only schema availability; eager-vs-lazy MCP holds the server fixed and varies only announcement. Across three task families (file report, multi-file Python refactor, code Q&A) at N=30 with per-task UUID prefix-cache defeat, CLI and eager MCP overlap at -56% on report and within 9pp on the other families: locality contributes no detectable savings on report or refactor and at most ~9pp on Code Q&A. A UserPromptSubmit hook reaches -80% by pre-executing the work entirely. Task quality is preserved across all primitives (oracle pass rates 1.00/1.00/0.99 on report/refactor/Code Q&A). An exact additive re-parameterization (three axes plus a pre-execution scaling term that captures the hook ceiling) decomposes per-primitive means with moderately stable axis coefficients across families; schema discoverability alone accounts for ~40% of baseline cost. Per-seed cost correlates with agent turn counts (Pearson r=0.92 / 0.88 / 0.71 on report / refactor / Code Q&A), consistent with the three axes acting through agent reasoning volume. 
The ordering replicates under warm cache on Claude Sonnet 4.6 and Opus 4.7, with one notable exception: Claude Haiku 4.5 discovers the lazy-MCP schema unaided (lazy collapses to -71%), so the discoverability axis is mediated by the model's tool-discovery behavior, not a fixed structural property. Benchmark and reference implementations released as supplementary material.


Beyond the Moment: Conditioning Frozen VLAs on Memory for Long-Horizon Manipulation Tasks

Aditya Ramesh ⋅ Jatin Chauhan ⋅ Akshay G Srinivasan ⋅ Shivam Bhardwaj ⋅ Manohar Kaul
10:30 AM - 11:30 AM

Foundation robotics models, most popularly Vision Language Action (VLA) models, struggle to perform well over long-horizon tasks due to their reliance on immediate sensory input. This induces compounding errors over inference timesteps, further exacerbated by non-robust backbones. To address this, we introduce \textit{Training-Free Memory Conditioned Action Generation}, a non-parametric retrieval-augmented framework that conditions a frozen VLA on historical expert trajectories. Our approach constructs a memory of expert demonstrations and utilizes a state-centric retrieval mechanism to guide action generation without any fine-tuning whatsoever. Through extensive evaluation on 5 datasets over SOTA models, we show relative gains of up to 27% on task completion success. As an additional contribution, we extend the popular CALVIN benchmark to task horizons of 6 and beyond, showcasing relative gains of up to 30%, while also demonstrating robustness to corrupted observations. Real-world experiments on complex tasks further demonstrate performance gains of up to 2x.
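
A state-centric retrieval step of the kind mentioned above might look like the following sketch. The abstract does not specify the state representation, the distance metric, or how retrieved trajectories condition the frozen VLA, so the Euclidean nearest-neighbour lookup and all names here are assumptions.

```python
import math


def retrieve(memory, state, k=2):
    """Return the k memory entries whose stored state is closest to the query state.

    memory: list of (state_vector, action_label) pairs from expert demonstrations.
    """
    ranked = sorted(memory, key=lambda item: math.dist(item[0], state))
    return ranked[:k]


# Hypothetical expert memory with 2-D states and symbolic action labels.
expert_memory = [
    ((0.0, 0.0), "open_gripper"),
    ((1.0, 0.0), "move_left"),
    ((5.0, 5.0), "lift"),
]

neighbours = retrieve(expert_memory, state=(0.9, 0.1), k=2)
retrieved_actions = [action for _, action in neighbours]
```

In a retrieval-augmented setup, the retrieved entries would then be supplied to the frozen policy as additional context; how that conditioning is done is left open here.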

10:30 AM - 11:30 AM

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.

10:30 AM - 11:30 AM
11:30 AM - 12:30 PM

Invited talk Session 2

Jiao Sun ⋅ Mike Zheng Shou ⋅ Le Song
12:30 PM - 2:00 PM
2:00 PM - 2:30 PM
2:30 PM - 3:00 PM

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.


LongMemEval-V2: Benchmarking Agent Memory for Experienced Colleagues

Di Wu ⋅ Zixiang Ji ⋅ Asmi Kawatkar ⋅ Bryan Kwan ⋅ Jia-Chen Gu ⋅ Nanyun Peng ⋅ Kai-Wei Chang
2:30 PM - 3:00 PM

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, workflows, state dynamics, and recurring failure modes. We introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can accumulate environment-specific experience from multimodal web agent trajectories. LME-V2 contains 451 manually curated questions from customized shopping, forum, admin, and ServiceNow-style environments, with histories ranging from 25M to 115M tokens. Frontier LLMs reach at most 14.1% without trajectory evidence, confirming that LME-V2 requires learned experience beyond parametric knowledge. We evaluate memory under a context-gathering formulation and propose AgentRunbook: AgentRunbook-R is an efficient RAG pipeline over raw states, transitions, and notes, while AgentRunbook-C uses a scaffolded coding agent to gather evidence from trajectory files. AgentRunbook-C achieves the best overall accuracy, reaching 74.9% on LME-V2-Small and 70.1% on LME-V2-Medium, while improving the accuracy and latency trade-off over an off-the-shelf coding agent. We will release the benchmark and memory implementations.


Exploratory and Assimilating Reflection: Reflective Recall Cycle for Long-term Memory

Ganesh S ⋅ Moyuru Yamada ⋅ Ishan Jindal ⋅ Kiran Purohit
2:30 PM - 3:00 PM

LLM-based autonomous agents require external memory to overcome their statelessness and limited context window for long-term interaction and dynamic knowledge reasoning. However, existing memory retrieval methods often lack adaptability and sample efficiency, and struggle to retrieve the right mixture of memories from heterogeneous stores. We propose \textit{Exploratory-Assimilating Reflection (EAR)}, a framework for high initial retrieval performance and sample-efficient adaptation. EAR combines two mechanisms: Exploratory Reflection, which performs iterative search to bootstrap retrieval and collect useful experiences for each query, and Assimilating Reflection, which replays these experiences from an Experience Buffer to refine a global reranker more efficiently than methods relying only on immediate rewards. Experiments show that EAR improves retrieval by up to 17.9% over the baseline retriever on two long-term dialogue benchmarks. We also show that EAR is highly sample-efficient and robust to noisy feedback.

2:30 PM - 3:00 PM
3:00 PM - 3:15 PM

Invited talk Session 3

Chelsea Finn ⋅ Minhyuk Sung
3:15 PM - 3:45 PM

Closing Remark

Souvik Kundu ⋅ Digbalay Bose ⋅ Jaehong Yoon
4:15 PM - 4:20 PM
4:20 PM - 5:00 PM