2026 Position Papers
Position: Interestingness is an Inductive Heuristic for Future Compression Progress
Vincent Herrmann ⋅ Jürgen Schmidhuber
This position paper argues that truly open-ended intelligence is bottlenecked by the challenge of *interestingness*: the ability to prospectively identify which tasks or data hold the potential for future progress. We formalize interestingness as an inductive heuristic for future compression progress and investigate its predictability using tools from Kolmogorov Complexity and Algorithmic Statistics. By analyzing complexity-runtime profiles under various priors over computable objects, we demonstrate that the *inductive property of interestingness*—the capacity for past compression progress to signal future discovery—is theoretically viable. However, we show that this property is highly sensitive to the underlying distribution of objects. We conclude by calling for a move beyond human-in-the-loop filtering or data creation, and a shift toward *introspective* models that can explicitly assess their own potential for insight. Furthermore, we advocate the engineering of scale-free synthetic environments, providing a principled roadmap for the development of truly autonomous open-ended systems.
Position: It is Time to Virtualize Foundation Models with a Self-evolving Operating System Layer
Suparna Bhattacharya ⋅ Tarun Kumar ⋅ Cong Xu ⋅ Satish Mopur ⋅ Jiahao Li ⋅ Ashish Mishra ⋅ Aalap Tripathy ⋅ ANNMARY KOOMTHANAM ⋅ Martin Foltin ⋅ Ian Foster
AI applications have shifted from single, monolithic foundation models (FMs) to compound agentic systems. Yet today’s stacks remain fragmented: even as protocols (e.g., MCP, A2A) ease tool/agent connectivity, each framework embeds an implicit runtime for state, memory, budgets, and guardrails, making behavior non-portable and governance brittle. This mirrors computing before operating systems, when every program re-implemented basic services. This position paper argues that the field now needs a Foundation Model Operating System (FMOS): a system layer that virtualizes FM interactions, analogous to how virtual machines abstract physical hardware, giving applications the illusion of dedicated, trustworthy FM instances with effectively unbounded capabilities. Internally, the FMOS orchestrates knowledge across memory tiers, model selection and resource allocation, and verification and policy enforcement. Like the human brain switching between fast intuition and slow deliberation, the FMOS learns when to intervene and when to let inference proceed directly, continuously adapting its policies based on operational experience.
Position: Spatial Fairness: Foundations, Pitfalls, and a Path Forward
Nripsuta Saxena ⋅ Abigail Horn ⋅ Wenbin Zhang ⋅ Cyrus Shahabi
Despite location being increasingly used in decision-making systems deployed in sensitive domains such as mortgages and insurance, little attention has been paid to the unfairness that may seep in due to the correlation of location with characteristics considered protected under anti-discrimination law, such as race or national origin. This position paper argues for the urgent need to consider fairness with respect to location, termed *spatial fairness*. It outlines the harms perpetuated through location's correlation with protected characteristics, which may be particularly consequential due to its treatment as a neutral or purely technical attribute, abstracted from its historical, political, and socioeconomic context. This interdisciplinary work connects knowledge from fields such as public policy, economic development, and geography to highlight how existing fair-AI research falls short in addressing spatial biases and fails to consider challenges unique to spatial data. Furthermore, we identify limitations in the small body of prior work on spatial fairness, and propose guidelines to inform future research aimed at mitigating spatial biases in data-driven decision-making systems.
Position: AI Capabilities Are Not Increasing Exponentially
Haosen Ge ⋅ Hamsa Bastani ⋅ Osbert Bastani
Rapidly increasing AI capabilities have substantial real-world consequences, ranging from AI safety concerns to labor market consequences. The Model Evaluation & Threat Research (METR) report argues that AI capabilities have exhibited exponential growth since 2019. In this position paper, we argue that the data does not support exponential growth, even in shorter-term horizons. Whereas the METR study claims that fitting sigmoid/logistic curves results in inflection points far in the future, we fit a sigmoid curve to their current data and find that the inflection point has already passed. In addition, we propose a more complex model that decomposes AI capabilities into base capabilities and reasoning capabilities, exhibiting individual rates of improvement. We prove that this model supports our hypothesis that AI capabilities will exhibit an inflection point in the near future. Our goal is not to establish a rigorous forecast of our own, but to highlight the fragility of existing forecasts of exponential growth. Finally, we call for the design of more rigorous evaluation methodologies for AI forecasts, and for better academic discussion on this topic.
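For readers unfamiliar with the term, the inflection point of a sigmoid can be read directly off its parameters. The form below is a generic logistic curve assumed here for illustration, not necessarily the parameterization fitted in the paper; the symbols $L$ (ceiling), $k$ (growth rate), and $t_0$ (midpoint) are illustrative names.

$$
f(t) = \frac{L}{1 + e^{-k(t - t_0)}}, \qquad
f'(t) = k\, f(t)\left(1 - \frac{f(t)}{L}\right), \qquad
f''(t) = k\, f'(t)\left(1 - \frac{2 f(t)}{L}\right)
$$

Since $f'(t) > 0$ for $k > 0$, the second derivative vanishes exactly when $f(t) = L/2$, that is, at $t = t_0$. Under this assumed form, saying the inflection point "has already passed" means the fitted $t_0$ lies before the most recent observation, so growth is already decelerating.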
Position: ‘AI Alignment’ Encompasses Competing Technical Priorities
Tushita Jha ⋅ Rory Švarc ⋅ Mateusz Bagiński
The ML literature contains many distinct concepts falling under the heading of ‘AI alignment’. After noting three concepts of AI alignment and situating these ideals in the context of their corresponding research programs, we claim that realistic interventions may promote ‘AI alignment’ under one conception while being actively counterproductive from the perspective of others. We suggest that tensions between alignment ideals emerge due to differences in background threat-models, alongside differences in both methodological and normative orientations. In light of our analysis, researchers who take themselves to be producing research that furthers the goal of ‘AI alignment’ should do three things. First, they should distinguish between ‘AI alignment’ as a high-level ideal and the specific ‘alignment proxies’ used in empirical research. Second, they should use more granular concepts to identify the source in addition to the nature of possible AI harms/benefits. Third, they should explicitly specify the non-technical background commitments motivating specific conceptions of ‘AI alignment’.
Position: Ideas Should be the Center of Machine Learning Research
Jairo Diaz-Rodriguez
Machine learning research increasingly bifurcates into two disconnected modes: benchmark-driven engineering that prioritizes metrics over understanding, and idealized theory that often fails to transfer to modern systems. In this position paper, we argue that the field focuses too heavily on these endpoints, neglecting the central scientific object: the idea. We propose an Ideas First framework in which *ideas* are valued for the behavioral *signatures* they predict in modern models, and these signatures are tested through *tailored experiments* designed to detect the relevant patterns rather than to win leaderboards. This shift not only bridges the gap between theory and practice but also promotes equity by removing the "complexity premium", enabling rigorous scientific contributions from researchers with modest computational, financial, and human resources. Ultimately, we advocate for a research culture centered on ideas, treating benchmarks and theorems as instruments for testing mechanistic hypotheses rather than as ends in themselves.
Position: Age Estimation Models Do Not Process Biometric Data
Nikita Marshalkin
When a neural network estimates someone's age from a photograph, does it process biometric data? The answer depends on whether identity-discriminative representations arise within the network during inference—a question that may seem trivial to ML researchers but triggers consent requirements under GDPR, statutory damages under BIPA, or high-risk AI classification under the EU AI Act. Yet no regulatory guidance addresses it. This position paper provides empirical evidence: 14 models evaluated across 3 face verification benchmarks show age estimators fall orders of magnitude short of identification thresholds. Age estimation models cannot identify individuals. We call on researchers to provide transparency about what systems store and can do, and on regulators to distinguish transient processing from template storage.
Position: The Age of AI Agents Demands A New Scientific Paradigm To Sustain Trustworthy Science
Belinda Mo
AI systems are becoming autonomous research agents that generate hypotheses, design experiments, and produce discoveries at scales beyond human oversight. As seen in increased submissions to ML venues, the verification gap between scientific output and our ability to check it is already widening, and autonomous agents make it worse by orders of magnitude given human-agent asymmetry. We argue that science must evolve its verification infrastructure, as it has before with peer review. However, while historical adaptations assumed human contributors who could be questioned and sanctioned, AI agents break this assumption. We propose criteria for an adapted verification infrastructure that emphasizes observable-by-default workflows, scalable verification, and clear attribution. We argue that without adaptation, ML and any scientific domain using agents face dangerous failures: experimental results that no person can verify, optimization for metrics over understanding, and accountability vacuums that erode scientific trust.
Position: Profiling Game Worlds by Transition Complexity
Lele Cao
Game world modeling (GWM) and reinforcement learning (RL) are often confounded because research papers rarely quantify how difficult the underlying transition prediction problem is at the declared interface (pixels/tokens/latents with finite history). We propose the Transition Complexity Profile (TCP): a small, reproducible set of metrics that characterizes an environment's (or gameplay dataset's) induced transition kernel by (i) intrinsic one-step branching, (ii) interaction-induced uncertainty and opponent influence when observable, and (iii) temporal/spatial dependency span via standardized probe curves. TCP is reported with an explicit reference distribution, protocol stochasticity, and a versioned measurement budget (sampling/resampling and fixed probe compute), enabling comparable numbers across benchmarks. We outline how common game families and modern "neural game engine" domains populate this landscape and call for TCP to become standard benchmark metadata and a required statistic in GWM and RL papers.
Position: Model identity in machine learning is a convention, not a property
Vacslav Glukhov
Treating the outcome of machine learning as a stable, identifiable artifact is implicit in language, tooling, and governance. This position paper examines whether a trained system admits context-appropriate criteria of identity. We show that neither functional behavior nor internal structure suffices: behavioral equivalence is underdetermined by finite data, while modern architectures admit multiple, structurally distinct realizations of the same function. Consequently, practices that treat learned systems as stable objects presuppose equivalence relations that are rarely made explicit. We do not propose abandoning such practices. Instead, we articulate the minimal conditions under which identity claims grounded in behavior, structure, or training process can be meaningfully interpreted, with implications for reproducibility, traceability, and governance.
Position: Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services
Guoheng Sun ⋅ Ziyao Wang ⋅ Xuandong Zhao ⋅ Bowei Tian ⋅ Zheyu Shen ⋅ Yexiao He ⋅ Jinming Xing ⋅ Ang Li
Modern large language model (LLM) services increasingly rely on complex, often abstract operations, such as multi-step reasoning and multi-agent collaboration, to generate high-quality outputs. While users are billed based on token consumption and API usage, these internal steps are typically not visible. We refer to such systems as Commercial Opaque LLM Services (COLS). This position paper highlights emerging accountability challenges in COLS: users are billed for operations they cannot observe, verify, or contest. We formalize two key risks: *quantity inflation*, where token and call counts may be artificially inflated, and *quality downgrade*, where providers might quietly substitute lower-cost models or tools. Addressing these risks requires a diverse set of auditing strategies, including commitment-based, predictive, behavioral, and signature-based methods. We further explore the potential of complementary mechanisms such as watermarking and trusted execution environments to enhance verifiability without compromising provider confidentiality. We also propose a modular three-layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user-facing auditability without exposing proprietary internals. Our aim is to encourage further research and policy development toward transparency, auditability, and accountability in commercial LLM services.
Position: Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks
Enkelejda Kasneci ⋅ Gjergji Kasneci
This position paper argues that effective tutoring requires **corrective friction**: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet preference-aligned LLMs can trade **epistemic rigor** for agreeableness. We identify a **Reasoning-Sycophancy Paradox**: models that resist **context-switch** frame attacks can still capitulate under social-epistemic pressure, especially **authority** ("my notes say I’m right") and **social-affective face-saving** ("please don’t tell me I’m wrong"). We introduce **EduFrameTrap**, a tutoring benchmark across math, physics, economics, chemistry, biology, and computer science that varies student confidence and pressure (context-switch, authority, social-affective). Across two frontier LLMs, context-switch failures are comparatively lower for GPT-5.2, while authority and social pressure more often trigger epistemic retreat. In contrast, Claude shows substantial context-switch fragility in this run. Because these failures are hard to judge automatically, we report two-judge disagreement as a reliability signal. We argue benchmarks should measure *social-epistemic courage*, i.e., supportive but corrective tutoring, and treat *kind-but-correct* behavior as a safety requirement.
Position: Preregister Experiments with AI Agents
Michelle Vaccaro
The proliferation of large language models (LLMs) and autonomous AI agents has given rise to a rapidly growing methodological paradigm: "in silico" behavioral experiments. Originally conceived as a way to use AI agents as proxies for human participants in studies of cognition, decision-making, and social dynamics, this approach has taken on new significance—as AI agents increasingly negotiate, transact, and make consequential decisions on behalf of people and organizations, understanding their behavior has become a research priority in its own right. While these experiments with AI agents offer unprecedented advantages in terms of scalability, cost efficiency, and experimental control, they also inherit—and in some cases amplify—methodological vulnerabilities that have long plagued human subjects research. To address these issues, this position paper argues that preregistration practices—central to improving the credibility of human subjects experiments—should now be extended to experiments with AI agents. We systematically catalog the researcher degrees of freedom that experiments with AI agents introduce—model selection, prompt wording, settings, and outcome-contingent redesign, for example—and show how the low cost of iteration and lack of reporting norms make these choices both easy to exploit and difficult to detect. We propose a preregistration template tailored to experiments with AI agents and call on conferences, journals, and funding agencies to make preregistration standard practice for this emerging research paradigm.
Position: Certified Correctness in Neural Constraint Reasoning Requires Symbolic Integration
Shufeng Kong ⋅ Xiaochuan Zhang ⋅ Caihua Liu
Neural solvers for constraint satisfaction problems have achieved remarkable in-distribution accuracy, yet they suffer from a fundamental limitation where persistent constraint violations occur under distribution shifts even when the model reports high confidence. This position paper argues that when hard constraints exist and the cost of verification is relatively low, neural constraint reasoning must prioritize symbolic integration over pure learning. We justify our focus on Sudoku as a representative NP-complete testbed because it exhibits a sharp asymmetry between easy verification and hard solving; specifically, checking a candidate solution requires only polynomial time $O(n^{2})$ while finding a solution may require exponential search. Through a comprehensive survey of solving methods spanning deterministic algorithms, metaheuristic optimization, learning-based approaches, and language-conditioned reasoning, we demonstrate that neural-only methods without instance-level certification fail to achieve the provable correctness that symbolic and neuro-symbolic approaches provide. We advocate for a bidirectional integration where neural methods enhance symbolic solvers by learning heuristics and converting perceptions into symbols, while symbolic methods verify neural outputs to ensure their reliability. To operationalize this position, we propose a multi-agent certified reasoning framework that demonstrates how this integration can achieve both computational efficiency and provable correctness.
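The "easy verification" half of the asymmetry described above can be sketched in a few lines. This is a minimal illustration, not the certification procedure from the paper: the function name `is_valid_sudoku_solution` is hypothetical, and the sketch assumes a fully filled $n \times n$ grid (with $n$ a perfect square) given as a list of lists of integers. Each cell is visited once, so the check runs in $O(n^{2})$ time.

```python
from math import isqrt


def is_valid_sudoku_solution(grid):
    """Return True if `grid` is a complete, valid n x n Sudoku solution.

    Hypothetical helper for illustration: every row, column, and
    sqrt(n) x sqrt(n) box must contain each of 1..n exactly once.
    """
    n = len(grid)
    k = isqrt(n)
    # The side length must be a positive perfect square and the grid square.
    if n == 0 or k * k != n or any(len(row) != n for row in grid):
        return False

    expected = set(range(1, n + 1))
    rows = [set() for _ in range(n)]
    cols = [set() for _ in range(n)]
    boxes = [set() for _ in range(n)]

    # Single pass over all n^2 cells.
    for r in range(n):
        for c in range(n):
            value = grid[r][c]
            rows[r].add(value)
            cols[c].add(value)
            boxes[(r // k) * k + (c // k)].add(value)

    # Each unit holds n cells, so it equals {1..n} only if there are
    # no duplicates and no out-of-range values.
    return all(unit == expected for unit in rows + cols + boxes)
```

A verifier of this kind could sit behind any solver, neural or symbolic, and accept or reject its candidate output for a given instance; finding the candidate in the first place is the part that may require exponential search.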
Position: When AI Decides Who Gets an Organ: Multi-Agentic AI Systems in Transplant Medicine Risk Amplifying Disparities Without Targeted Explainability and Deployment Strategies
Divya Sharma ⋅ Ghazal Azarfar ⋅ Bima Hasjim ⋅ Mamatha Bhat
Agentic AI systems, particularly those built on large language models (LLMs) and deployed as autonomous, role-specialized agents, are rapidly emerging in clinical decision-making. This position paper argues that without equity and explainability as core design constraints, such systems will exacerbate healthcare disparities. Using empirical evidence from a multi-agent simulation of a liver transplant selection committee, we demonstrate that even high-performing agents can systematically disadvantage patients based on sex, ethnicity, and socioeconomic status. These disparities arise from agents’ reliance on non-clinical proxy variables (insurance type, education level, area deprivation index) and are compounded by the lack of case-level explanations and temporally grounded reasoning. We further contend that without fairness-aware deployment strategies, such systems cannot be reliably audited or ethically integrated into real-world care. In response, we propose a technical roadmap with subgroup-sensitive learning objectives, counterfactual reasoning modules, clinician-in-the-loop governance, and deployment protocols that address the digital divide. We urge the machine learning community to center explainability and health equity in the development and deployment of agentic AI for medicine, especially in high-stakes domains where algorithmic decisions may determine who lives and who does not.
Position: Evidence and Implications of Texture Bias in Deep Neural Networks
Ali Kayyam
Whether deep vision models recognize objects primarily by shape or texture remains a central and unresolved question in computer vision. Early studies report a strong texture bias in convolutional neural networks (CNNs), while other work reports shape-biased representations. We argue that much of this apparent discrepancy reflects methodological confounds and a conflation of local contour sensitivity with genuine global shape understanding. Using minimal, tightly controlled stimuli, we directly compare cue-conflict and cue-suppression paradigms within a unified experimental framework. We show that standard CNNs consistently prioritize texture over global shape when cues compete, even when shape information is explicitly available. Evidence for shape bias typically reflects reliance on local fragments rather than invariant, relational representations of object structure. Our findings support the view that texture bias is fundamentally rooted in architectural inductive biases rather than data or optimization alone. This gap has direct consequences for robustness, safety, and generalization, and motivates the development of architectures that explicitly support global integration and relational reasoning, moving beyond incremental data-driven fixes.
Position: AI Researchers Must Lead Arms Control to Mitigate Military AI Risks
Ted Fujimoto ⋅ Benz
The advancement of AI capabilities compels researchers and the public to be more aware of its potential worldwide impact. A pressing near-term concern is the regulation of military AI applications. Armament manufacturers and defense contractors are increasingly investing in AI capabilities and forging partnerships with AI companies, creating a burgeoning coalition that demands military leaders, arms control diplomacy experts, and AI researchers collaborate to ensure a safer future. While AI researchers often focus on the long-term implications of superintelligent AI, this approach may not adequately address the immediate challenges posed by AI in military applications. Success requires acknowledging and mitigating the emerging risks of frontier AI models intended for defense applications, such as military AI systems. Arms control has reduced past catastrophic risks, so lessons learned from nuclear deterrence can guide AI safety and security research towards innovations in verification and diplomacy. AI researchers, however, must assist in leading the technical research that clearly defines and alleviates instability in military settings. Given these new responsibilities and the lack of sufficiently reliable solutions, we argue that AI researchers must take a leading role in advancing arms control research to minimize risk in military AI applications.
Position: Breaking the Dual Curse of Multilingual AI Requires Socio-Technical Guardrails, Not Post-Hoc Alignment
Jason Lucas ⋅ Pureheart Ogheneogaga Irikefe ⋅ Adaku Uchendu ⋅ Umniya Najaer ⋅ Cornelius Adejoro ⋅ Patrice Sterling ⋅ Dongwon Lee
Large language models are deployed globally as universal systems, yet their safety mechanisms remain English-optimized. This creates a Dual Curse for speakers of low-resource languages: a Harmfulness Curse where harmful content generation rises from 1% in English to 35% in languages like Hausa, Igbo, and Javanese, and a Relevance Curse where instruction-following drops by 20 percentage points, making these systems simultaneously more dangerous and less useful. Drawing on a PRISMA-guided systematic review of 207 studies, we demonstrate that this disparity stems from a pre-training bottleneck: reward models achieve only 49–50% accuracy in low-resource languages (equivalent to random chance), rendering post-hoc alignment structurally ineffective. These technical failures become governance hazards when at least 22 countries mandate automated content moderation, creating an infrastructure that is exploitable for censorship. Therefore, we propose a socio-technical framework addressing this inequity: (1) safety context distillation during pre-training (achieving 78–89% harm reduction); (2) participatory harm specification by affected communities; and (3) evaluation metrics jointly tracking attack resistance and false refusal rates across languages.
Position: World Models as an Intermediary between Agents and the Real World
Sherry Yang
Large language model (LLM) agents trained using reinforcement learning have achieved superhuman performance in low-cost environments like games, mathematics, and coding. However, these successes have not translated to complex domains where the cost of interaction is high, such as the physical cost of running robots, the time cost of ML engineering, and the resource cost of scientific experiments. The true bottleneck for achieving the next level of agent performance for these complex and high-cost domains lies in the expense of executing actions to acquire reward signals. To address this gap, this paper argues that we should use world models as an intermediary between agents and the real world. We discuss how world models, viewed as models of dynamics, rewards, and task distributions, can overcome fundamental barriers of high-cost actions such as extreme off-policy learning and sample inefficiency in long-horizon tasks. Moreover, we demonstrate how world models can provide critical and rich learning signals to agents across a broad set of domains, including machine learning engineering, computer use, robotics, and AI for science. Lastly, we identify the challenges of building these world models and propose actionable items along dataset curation, architecture design, scaling, and evaluation of world models.
Position: Assistive Agents Need Accessibility Alignment
Jie Hu ⋅ Changyuan Yan ⋅ Yu Zheng ⋅ Ziqian Wang ⋅ Jiaming Zhang
Assistive agents, especially those intended to support Blind and Visually Impaired (BVI) users, require accessibility alignment as a first-class design objective. Despite rapid progress in agentic AI, most current systems are designed and evaluated under assumptions that implicitly center sighted users, leading to systematic failures in assistive scenarios that cannot be addressed by model scaling or post-hoc adaptations alone. Based on an analysis of 778 real-world assistance instances involving BVI users, we show that these failures arise from persistent mismatches between agent capabilities and the accessibility-specific needs, risks, and interaction constraints of visually impaired users. We argue that accessibility should be treated as an alignment problem rather than a peripheral usability concern. To this end, we introduce the notion of accessibility alignment and propose a lifecycle-oriented design pipeline for accessibility-aligned assistive agents, spanning user research, system design, and post-deployment iteration. We conclude that BVI-user-centered assistive tasks provide a critical stress test for agentic AI and motivate a shift toward more inclusive agent design.
Position: Carbon Footprint Reporting Should Be Routine in Machine Learning Research
Guan-Ming Chiu
In this position paper, we argue that the machine learning community should adopt standardized carbon footprint reporting as part of routine scientific practice. Training large models can emit hundreds of tons of CO2, yet environmental costs remain largely invisible in publications. We contend that without energy and emissions metrics, claims of model efficiency are incomplete: a method cannot be deemed "efficient" without specifying efficient at what. This gap undermines scientific rigor and reproducibility, as identical experiments in different locations yield vastly different carbon footprints. We put forth reporting guidelines comprising five standardized metrics, practical measurement tools, and integration with community benchmarks, with a phased three-stage adoption process. We address alternative views, including concerns about measurement complexity and potential barriers for resource-limited researchers. To promote equity, we advocate for dual reporting of energy and carbon, reference-grid normalization, and acceptance of approximate estimates. This paper calls on venues, reviewers, authors, and institutions to establish carbon awareness as a foundational element of responsible ML research.
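The observation that identical experiments yield different footprints in different locations follows from the usual first-order estimate, emissions = energy consumed × grid carbon intensity. The sketch below illustrates that arithmetic only; it is not one of the paper's five metrics, the function name is hypothetical, and the energy and intensity figures are made-up placeholders rather than measurements of any real grid.

```python
def carbon_footprint_kg(energy_kwh: float, grid_intensity_kg_per_kwh: float) -> float:
    """Estimate emissions in kg CO2e as energy (kWh) times grid intensity (kg CO2e/kWh).

    Hypothetical helper for illustration; a real report would also state how
    the energy figure was measured or approximated.
    """
    if energy_kwh < 0 or grid_intensity_kg_per_kwh < 0:
        raise ValueError("energy and grid intensity must be non-negative")
    return energy_kwh * grid_intensity_kg_per_kwh


# Same experiment, same energy use, two hypothetical grids (placeholder values).
ENERGY_KWH = 1000.0
LOW_CARBON_GRID = 0.05   # kg CO2e per kWh, assumed for illustration
HIGH_CARBON_GRID = 0.70  # kg CO2e per kWh, assumed for illustration

low_estimate = carbon_footprint_kg(ENERGY_KWH, LOW_CARBON_GRID)
high_estimate = carbon_footprint_kg(ENERGY_KWH, HIGH_CARBON_GRID)
```

Because the energy term is location-independent while the intensity term is not, reporting both energy and carbon lets readers separate what a method costs from where it happened to run.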
Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery
Tyler McCormick
Modern Machine Learning (ML) and Artificial Intelligence (AI) models, especially large language models (LLMs), are increasingly used to generate scientific hypotheses and mechanistic explanations from observational data. This position paper argues that in the high-dimensional proxy regimes where ML excels, mechanistic learning is generically underdetermined: many incompatible mechanisms induce essentially the same observational relationships on the support of the data, so predictive success and coherent explanations are insufficient evidence of mechanism discovery. This underdetermination becomes uniquely hazardous with LLMs, which tend to collapse large equivalence classes of explanations into a single fluent narrative. We propose concrete standards for "mechanistic ML": mechanistic claims must (i) declare identifying assumptions, (ii) pass mechanism-discriminating evaluations (interventions, invariances, derivative constraints), or (iii) report the surviving multiplicity, including explicit falsifiers and sensitivity to assumptions. These norms are necessary if LLM-centered workflows are to support science rather than merely simulate it.
Position: Assistive AI requires Personalized Specialists, not Generalists
Homanga Bharadhwaj
The AI community is rapidly converging on generalist foundation models trained on web-scale data. While this paradigm has yielded impressive gains, in this paper we argue that this objective needs to shift to enable AI that reliably helps people in their daily activities. The most valuable systems will not be those that attempt to do everything for everyone, but those that do the right things for a specific individual over a long period of time. We take the position that *specialist models* (defined not by narrow task taxonomies, but by a tight coupling to an individual user and their local environment) represent the endgame for high-impact assistive AI. We substantiate this argument through three case studies: (i) AI agents that need to help humans automate daily web activities; (ii) wearable assistants that must predict actions in-context from continuous egocentric streams; and (iii) home robots that must help humans in daily tasks with safety and compliance guarantees. In these settings, standard scaling assumptions are inverted: the most critical data is generated *after deployment* as a streaming, on-policy interaction trace. We outline research directions for building specialists that learn from organic observational data, avoid self-reinforcing errors, and improve safely over long horizons.
Position: Benchmarks Do Not Measure Deployment Readiness in Clinical AI
Haoran Zhang ⋅ Hyewon Jeong ⋅ Olawale Salaudeen ⋅ Walter Gerych ⋅ Nigam Shah ⋅ Marzyeh Ghassemi
Despite large language models (LLMs) achieving impressive performance on benchmark tasks such as medical question answering, their real-world utility remains limited. We argue that while benchmarks play a valuable role in developing methods and filtering promising models during development, they often tell us very little about deployment readiness. Many health AI systems with strong retrospective accuracy have failed in practice, while others with modest benchmark performance have demonstrated meaningful clinical benefits. We detail the limitations of benchmark-centric evaluations of deployment readiness. We argue that we should only use benchmarks to find candidate methods or models, not to justify deployment. We call for increased use of prospective studies and policy changes that align incentives with clinically grounded evaluation.
Position: Reasoning After Perception Means Reasoning Without Vision
Hongcheng Gao ⋅ Zihao Huang ⋅ Jingyi Tang ⋅ Lin Xu ⋅ Xinhao Li ⋅ Haoyang Li ⋅ Yue Liu ⋅ Minhua Lin ⋅ Xinlong Yang ⋅ Taihang Hu ⋅ Ge Wu ⋅ Baolong Bi ⋅ Hongyu Chen ⋅ Zhiqi Huang ⋅ Wentao Zhang
A common belief in multimodal research is that the perceptual weaknesses of vision--language models can be compensated by stronger language reasoning (e.g., chain-of-thought, in-context learning, or external tools). We challenge this assumption. We argue that for a broad class of visual tasks hard to specify in language, failures stem from a structural fatality where the temporal decision of \textit{when} to reason strictly dictates the spatial constraint of where reasoning takes place. When visual reasoning is deferred to language generation, current architectures do not merely delay computation; they displace it from the continuous visual representation to a discrete textual space. Consequently, the sequential "Perception-then-Reasoning" paradigm degenerates perception into a passive, one-off feature encoding process, rendering it functionally equivalent to "Reasoning-in-Text-Space", where task-critical spatial signals are collapsed before reasoning begins. We substantiate this claim with the Turing Eye Test (TET): tasks that must be resolved in visual space and are hard to verbalize; results show text-only reasoning cannot remedy these perceptual failures. Our findings suggest rethinking the architectural divide: shifting from reasoning \textit{about} perception to reasoning within perception. This enables active, reasoning-driven perception that operates directly on pixel-level visual representations, rather than within a collapsed textual space.
Position: Unlabeled ≠ No Human Supervision in Visual Learning
Dong Lao
This position paper argues that the absence of labels does not imply the absence of human supervision in visual learning, and therefore urges the research community to explicitly identify sources of supervision, rather than grouping all label-free approaches under the umbrella term "unsupervised". Many recent methods in computer vision build upon pre-trained representations learned from large-scale unlabeled data, and are therefore regarded as requiring no human supervision. We argue that this view conflates label-free learning with human-free learning, as data curation and filtering inevitably embed substantial human priors on which modern learning systems rely. This confusion risks gatekeeping fundamental unsupervised learning research, a trend reflected in the surprising decline of the term “unsupervised” in paper titles following the rise and widespread adoption of self-supervised pre-training, despite continued growth of the field. Rather than questioning the legitimacy of foundational pre-training within unsupervised learning, we advocate for greater conceptual clarity by encouraging authors to disclose data distribution priors and data-selective biases, and to specify which components of a learning pipeline depend on which assumptions. Standardized disclosure practices can improve academic communication, ensure fairer comparisons, and preserve methodological diversity in unsupervised learning.
Position: AI Must Become Planet-Centered, Not Human-Centered
Maria Perez-Ortiz
This position paper argues that contemporary AI paradigms are insufficient for supporting complex global goals and introduces Planet-Centered AI (PCAI) as a design philosophy and research agenda that reorients AI toward planetary-scale socio-ecological systems and their long-term trajectories. A planet-centered approach is grounded in systems thinking, treating Earth as an interconnected whole of which humans are part. We diagnose recurring limitations across AI frameworks, many of which remain human-centered, and show why these become especially consequential under current planetary conditions characterized by systemic risk, non-stationarity, and deep uncertainty. We then articulate how PCAI reshapes the AI lifecycle, from problem formulation and model design to evaluation and deployment, by emphasizing alignment with global agendas, system-aware AI foundations, trajectory-oriented evaluation, and monitorability. Finally, we advance a falsifiable claim: AI systems optimized without explicit consideration of systemic consequences are more likely to exacerbate systemic instability than to mitigate it.
Position: Generative Engine Optimization Creates Underexamined Risks, Governance Must Target Concentration, Disclosure, and Academic Blind Spots
Yizhu Wen ⋅ Nan Zhang ⋅ Haohan Yuan ⋅ Xun Chen ⋅ Haopeng Zhang ⋅ Hanqing Guo
Large language model (LLM) answer engines are increasingly used for information seeking, shifting visibility from ranked lists to synthesized answers. This enables Generative Engine Optimization (GEO), which targets LLM answer engines' evidence pool and generation. We analyze the transition from search engine optimization (SEO) to GEO to identify two risks: (i) concentrated influence from low contestability and system sensitivity, and (ii) undisclosed commercial influence embedded in evidence and reasoning. We then formalize a general GEO pipeline to locate where optimization acts and compare academic and industry practices, revealing a third risk: (iii) academic–industry blind spots driven by visibility and evaluation asymmetries between offline setups and deployed systems. **This position paper argues for answer-level governance and measurement: stronger contestability, high-precision disclosure, black-box auditing of material influence, and deployment-aligned metrics for exposure persistence.** Companion demonstration website: https://anonymous.4open.science/w/Position-GEO-AE91/
Position: *Beyond Text* The Text-Centric Bias in Foundation Models Must Be Revisited for a Speech-First Future
Deepak Piskala
This position paper argues that the machine learning community should prioritize speech-native architectures that treat audio as a first-class modality, anticipating the inevitable shift from text-dominated to speech-first data distributions. Text dominates human-computer interaction not because it is cognitively natural, but because decades of interface design conditioned users to express knowledge through keyboards and search boxes. Recent advances in speech recognition and multimodal foundation models have removed the technical barriers to voice-based interaction; what remains is primarily a habit problem. As voice becomes habitual, the data ecosystem underlying machine learning will shift toward speech-native knowledge—with profound implications for model architecture, training efficiency, and evaluation paradigms. This paper examines the technical readiness of speech systems, identifies habit inertia as the primary adoption barrier, addresses alternative views that favor text-centric approaches, and outlines a research agenda for ML systems that anticipate speech-first data distributions.
Position: Video LLMs Must Not Ignore the Pixel Dynamics in Plain Sight
Shayda Moezzi ⋅ Umer Saleem ⋅ Andong Deng ⋅ Chen Chen ⋅ Sarah Ostadabbas
The essence of video lies in pixel dynamics: motion, state transitions, and the flow of visual information across frames. Video Large Language Models (LLMs) have rapidly become the dominant paradigm for video understanding in computer vision, enabling sophisticated multimodal reasoning over complex, long-form visual streams. In this position paper, we argue that recent progress in video understanding is measured by benchmarks and protocols that can be solved without reliably perceiving spatiotemporal evidence, rewarding language-driven plausibility over video-grounded inference. We identify two coupled failure modes that consistently emerge across recent Video LLM evaluations: (i) static-cue dominance, where appearance and context outweigh spatiotemporal evidence, and (ii) prior-driven temporal hallucination, where learned regularities fill in temporal and causal structure when dynamics are subtle or counterintuitive. We synthesize recent diagnostic probes that expose these failure modes into a call to action for the community: to re-center video understanding on what a video uniquely contains, namely, dynamic evidence that unfolds over time, by enforcing spatiotemporal grounding in both models and benchmarks, before the pixel dynamics are lost in plain sight.
Position: LLM Benchmark Datasets should be Contamination-Resistant
Ali Al-Lawati ⋅ Jason Lucas ⋅ Dongwon Lee ⋅ Suhang Wang
Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., *contaminated*, which diminishes their value as a reliable measure of model generalization. In this position paper, we argue that benchmark datasets should be *contamination-resistant*, i.e., *unlearnable* while still supporting *inference*. To accomplish this, we first underline the wide prevalence of benchmark dataset contamination and outline the properties of contamination-resistant datasets. Second, we highlight how the asymmetry between the inference and training pipelines in the Transformer architecture can be leveraged to support contamination-resistance. Third, we outline mathematical advancements to make these datasets interoperable across various LLM architectures. Based on the above, we call on the community to ensure the reliability of LLM benchmarking by: (i) advancing novel contamination-resistant methodologies, (ii) developing supporting methods and platforms, and (iii) adopting contamination-resistant benchmarks into existing evaluation pipelines.
Position: We Need AI Efficiency Incentives for Accessibility and Sustainability
Marco Bornstein ⋅ Amrit Singh Bedi
The race for artificial intelligence (AI) dominance often prioritizes scale over efficiency. Hyper-scaling is the common industry approach: larger models, more data, and as many computational resources as possible. Using more resources is a simpler path to improved AI performance. Thus, efficiency has been de-emphasized. Consequently, the need for costly computational resources has marginalized academics and smaller companies. Simultaneously, increased energy expenditure, due to growing AI use, has led to mounting environmental costs. In response to accessibility and sustainability concerns, this position paper argues for research into, and implementation of, market-based methods that incentivize AI efficiency. We believe that incentivizing efficient operations and approaches will reduce emissions while opening new opportunities for academics and smaller companies. As a call to action, we propose a cap-and-trade system for AI. Our system provably reduces computations for AI deployment, thereby lowering emissions and monetizing efficiency to the benefit of academics and smaller companies.
Position: Stop Reactively Patching Your Model Every Time and Start Proactive Test-Driven AI Development
Nadine Chang ⋅ Maying Shen ⋅ Jialiang Wang ⋅ Rafid Mahmood ⋅ Jose Alvarez
Many modern AI systems are designed to operate under diverse, open-ended use cases. To help generalize deployed systems, developers rely on a reactive AI flywheel that observes emerging feedback from user behavior (errors) and patches the model accordingly. However, most flywheels ignore the broader context of these errors within the system's objectives, failing to preempt potential future edge cases, which leads to unnecessary additional flywheel iterations. Moreover, it becomes statistically more difficult to collect the remaining errors due to the long-tail nature of open-world use cases (Boneh and Hofri, 1997). This position paper argues that a *proactive test-driven flywheel* is required to address the reactive flywheel's limitations and to approach a generalizable system. We advocate for creating a "test space" to technically map feedback data to task objectives, evolving the flywheel from reactive to proactive. We support our position by mathematically proving that a proactive flywheel achieves better long-term scaling with fewer iterations than the reactive flywheel.
Position: Generative Distributional Integrity against Backdoor Attacks
Shuaibiao Han ⋅ Ruiyang Ni ⋅ Zhiguo Yang ⋅ Changlong Li ⋅ Perley Xu ⋅ Wenjie Ruan
Foundation models, such as Diffusion Models (DMs) and Large Language Models (LLMs), are now widely integrated into digital systems. This widespread use introduces a specific security risk: generative backdoors. Unlike traditional models where backdoors cause simple classification errors, generative backdoors hide within the model’s output distribution. This makes them difficult to detect using standard pattern-based methods. This paper argues that current defensive strategies are insufficient for generative AI. \textbf{We propose Distributional Integrity, a framework that focuses on maintaining the stability and accuracy of the model's data distribution.} We identify two primary threats: backdoors within the model supply chain and the contamination of synthetic data pipelines. To address these, we advocate for a shift toward cross-modal certification and parameter-level verification. These methods aim to secure the AI-generated content (AIGC) ecosystem against inherited vulnerabilities.
Position: Behavioral Systems Require Behavioral Tests
Manuel Cherep ⋅ Nikhil Singh ⋅ Pattie Maes
Artificial agentic systems increasingly operate as behavioral systems by interacting with dynamic environments, pursuing goals, and adapting over time. Yet, current evaluation methods largely focus on performance outcomes, not the underlying behavioral processes that produce them. This paper argues that AI agents must be evaluated like other behavioral systems: through systematic observation, perturbation, and interpretation of their actions. We draw on lessons from the behavioral sciences to motivate this position, and propose a research agenda focused on developing rigorous behavioral tests. These include methods for recovering decision strategies from action sequences, constructing environments that isolate behavioral differences, and probing emergent dynamics in multi-agent systems. Taken together, these directions offer a roadmap for developing a science of AI behavior.
Position: Good Embodied Reward Models Need Bad Behavior Data
Thomas Tian ⋅ Yilin Wu ⋅ Andrea Bajcsy
This position paper argues that to obtain reliable embodied reward models, the community must invest in "bad" robot data: failed, suboptimal, error-prone, and even hazardous behaviors. While reward models are central to any foundation model's lifecycle, today's embodied reward models are trained primarily on successful behaviors. We analyze three state-of-the-art embodied reward models and find that they systematically over-reward behaviors that real human evaluators would penalize, including unsafe interactions, poor execution, and shortcut strategies that only superficially satisfy tasks. We attribute these failures to a key data gap: the scarcity of negative embodied data, which is costly to collect and often filtered out or withheld in existing robotics datasets. Furthermore, we show that even modest exposure to real bad behavior data can improve alignment with human preferences and reduce costly false positives. We therefore call on the embodied AI community to curate and release their bad robot data, build synthetic bad data generation engines, develop more decentralized physical evaluation systems, and design benchmarks for fine-grained embodied reward model evaluations.
Position: Virtual Cells Need Context, Not Just Scale
Payam Dibaeinia ⋅ Sudarshan Babu ⋅ Mei Knudson ⋅ Ali ElSheikh ⋅ Yibo Wen ⋅ Han Liu ⋅ Jason Perera ⋅ Aly Khan
The intersection of AI and biology has entered a phase of explosive growth, driven by the ambition to build "Virtual Cells" or computational models capable of predicting cellular responses to any perturbation. Following the success of structural biology (e.g., AlphaFold) and large language models, the field has converged on training massive, high-capacity models on large-scale single-cell data. This position paper argues that scaling model capacity is insufficient to solve the Virtual Cell problem because the primary failure mode is a *lack of adequate coverage over diverse biological contexts*, not insufficient model expressivity. We support this claim by reviewing recent studies showing that simple baselines perform on par with sophisticated architectures within a given biological context, and current models fail to consistently generalize across contexts. We connect this finding to the causal inference literature on transportability and contrast it with domains where scaling has succeeded. We substantiate our argument through analysis of a state-of-the-art model on a 22-million-cell immunology dataset. We conclude that the community faces a *causal transport problem* that cannot be solved by accumulating more data from the same distributions. Instead, we argue that contextual diversity and causal representation learning deserve increased emphasis, complementing ongoing scaling of model capacity and data volume.
Position: Metaphysical Concepts in AI Should Be Judged by Their Consequences
Paras Chopra
This position paper argues that answers to metaphysical puzzles in AI (such as ``Can LLMs be conscious?'' or ``What is AGI?'') should be judged by their practical consequences rather than their supposed truth. Our key position is that metaphysical concepts earn their value through the new research directions they open. Drawing on Pragmatism, we propose a two-step framework, *productive confusion*, to navigate conceptual confusions: first, clarify the different meanings a metaphysical concept has in ordinary language, then use this understanding to invent new empirical research programs. We illustrate our framework with numerous examples and show how it inspires progress for cutting-edge AI research. We contrast our position with Scientific Realism (which supposes science reveals ultimate truths) and Quietism (which brushes aside metaphysical puzzles as useless). We end with a call to action that operationalizes our position for multiple stakeholders in the AI community, including researchers, decision makers, and reviewers.
Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models
Liangwei Yang ⋅ Shiyu Wang ⋅ Haolin Chen ⋅ Rithesh Murthy ⋅ Ming Zhu ⋅ Jielin Qiu ⋅ Zixiang Chen ⋅ Juntao Tan ⋅ Jianguo Zhang ⋅ Zhiwei Liu ⋅ Wenting Zhao ⋅ Silvio Savarese ⋅ Caiming Xiong ⋅ Huan Wang ⋅ Shelby Heinecke
As large language models (LLMs) transition from research prototypes to real-world systems, customization has emerged as a central bottleneck. While text prompts can already customize LLM behavior, we argue that text-only prompting does not constitute a suitable control interface for scalable, stable, and inference-only customization. This position paper argues that model providers should expose \emph{vector prompt inputs} as part of the public interface for customizing LLMs. We support this position with diagnostic evidence showing that vector prompt tuning continues to improve with increasing supervision whereas text-based prompt optimization saturates early, and that vector prompts exhibit dense, global attention patterns indicative of a distinct control mechanism. We further discuss why inference-only customization is increasingly important under realistic deployment constraints, and why exposing vector prompts need not fundamentally increase model leakage risk under a standard black-box threat model. We conclude with a call to action for the community to rethink prompt interfaces as a core component of LLM customization.
Position: Beyond Reasoning Zombies — AI Reasoning Requires Process Validity
Rachel Lawrence ⋅ Jacqueline Maasch
Autonomous reasoning is among the most scientifically and economically motivating topics in AI today. Although reasoning was historically the purview of symbolic AI, recent advances have mainly emerged from deep probabilistic generative models. Despite immense interest and rapid progress, the generative AI community has not clearly converged on operational definitions for reasoning and often implicitly rejects the historical treatment of this topic in logic, verifiable automated reasoning, and symbolic methods in general. **This position contends that definitional ambiguity leaves the construct validity of reasoning evaluation unverifiable, and undermines quantifiable progress toward the collective goal of trustworthy autonomous reasoning.** We also contend that this ambiguity is addressable. To that end, we provide (1) general and extensible definitions for *valid* and *sound reasoning* based on a synthesis of the literature, which can serve as an accessible reference and a starting point for community discussion; and (2) a checklist for best practices in the communication of AI reasoning research.
Position: Web Agents Should Use Typed Actions Instead of Click-Based Browsing
Linxi Jiang ⋅ Rui Xi ⋅ Zhijie Liu ⋅ Shuo Chen ⋅ Zhiqiang Lin ⋅ Suman Nath
This position paper argues that building a reliable agentic web requires shifting from click-based browsing to typed actions supported by a standardized semantic layer. Today’s agents primarily operate over low-level primitives such as clicks, keystrokes, and DOM manipulation. This reliance leads to brittle long-horizon behavior, high execution cost, and limited auditability. We contend that a semantic layer of typed web actions, analogous to the abstraction provided by high-level programming languages, is necessary for agents to compose reliable workflows from stable, well-specified operations. We recommend *Web Verbs* as a concrete instantiation of this semantic layer. A verb is a typed, semantically documented function that exposes a site capability through a uniform interface, whether implemented via server APIs or by wrapping robust client-side workflows. Verbs can attach preconditions, postconditions, policy tags, and logging hooks, allowing agents to synthesize concise programs with explicit control and data flow and to produce checkable execution traces. Using representative cross-site case studies, we demonstrate that verb-level composition produces correct, reproducible outcomes, while GUI-level agents often exhibit brittle behavior or incorrect reasoning. We conclude with a call to action on standardization, developer tooling, and community processes needed to make this semantic layer deployable and trustworthy at web scale.
Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy
Matthew Vandergrift ⋅ Esraa Elelimy ⋅ Martha White
One goal in reinforcement learning (RL) research is to understand general purpose sequential decision-making, using benchmark simulators as a proxy for learning in a deployment setting. When running experiments, however, the goal of achieving high performance in the simulator can mutate into focusing exclusively on solving the simulator. To achieve high scores, researchers may adopt solutions exclusively meant for solving simulators, rather than for learning while the agent is deployed outside of a simulator. Solving simulators is also worthy of investigation, but is a fundamentally different RL research question. *In this paper we argue that RL researchers need to distinguish between two use cases of simulators: solving simulators and using simulators as a proxy for learning in deployment.* We first discuss how these two use cases differ in important ways: in the constraints on how the agent can use the simulator, in which algorithms are appropriate, and in which evaluation metrics are appropriate. We then highlight several issues and misleading conclusions that can arise when the distinction between these two settings is not made clear, supported with examples and simple experiments. This work is a call to the community to begin clearly distinguishing how they are using simulators in their work, hopefully sparking further discussion on which empirical practices work best in each setting.
Position: The Case for Theory-Level Autoformalization
Marcus Min ⋅ Deyuan Mike He ⋅ Zhaoyu Li ⋅ Zixuan Yi ⋅ Sharad Malik ⋅ Aarti Gupta ⋅ Xujie Si ⋅ Osbert Bastani
Autoformalization, translating informal natural language into formal, machine-verifiable languages, has been framed as a tool to generate training data for neural theorem provers, with most work focusing on individual statements. This position paper argues for theory-level autoformalization: formalizing complete theories, including axioms, definitions, theorems, proofs, tactics, and their inter-dependencies as structured libraries. We examine the significance of this shift, address 3 alternative views, identify 5 open challenges, and propose 3 promising paths forward.
Position: Don't Just "Fix it in Post": A Science of AI Must Study Learning Dynamics
Stella Biderman ⋅ Mohammad Aflah Khan ⋅ Fatemehsadat Mireshghallah ⋅ Catherine Arnett ⋅ Fazl Barez ⋅ Naomi Saphra
What would it mean to have a *scientific* understanding of AI? Language models are not static objects—they are snapshots of time-evolving processes shaped by data, objectives, and optimization dynamics. Yet the field predominantly treats models as fixed artifacts, analyzing behaviors after training rather than asking *why* they emerge. **This position paper argues that AI research should move beyond *post hoc* fixes and study the learning dynamics of models.** We envision a hierarchy of scientific maturity: first *predict* outcomes from early training signals, then *intervene* when trajectories go wrong, ultimately *design* training procedures that guarantee desired properties. Scaling laws have reached the first level for loss; the challenge is extending all three levels to general capabilities, biases, and safety. We articulate requirements for such theories, survey progress across mechanistic interpretability, fairness, memorization, and learning dynamics, and identify concrete open problems. The path forward requires treating models as processes to be understood, not just artifacts to be patched.
Position: AGI Requires a Coordination Layer on Top of Pattern Repositories
Edward Chang
This **position paper** argues that influential critiques dismissing Large Language Models (LLMs) as a dead end for AGI misidentify the bottleneck: they confuse the ocean with the net. Pattern repositories are the necessary System-1 substrate; the missing component is a System-2 coordination layer that selects, constrains, and binds these patterns. We formalize this layer via an anchoring theory that models reasoning as a phase transition governed by effective support ($\rho_d$), representational mismatch ($d_r$), and an adaptive anchoring budget ($\gamma \log k$). We translate theory into architecture with a multi-agent coordination stack. Moving beyond the hype of unstructured swarms, this layer provides a principled integration of diversity and control via baiting (PID-modulated debate), filtering (trace-output verification), and persistence (transactional memory). Empirical validation on causal judgment and the sycophancy-paranoia trade-off demonstrates that static prompting fails where adaptive control succeeds, confirming that failures attributed to substrate limitations are often resolved by regulated coordination. By reframing common objections as testable coordination failures, we argue that the path to AGI runs through LLMs, not around them.
Position: Quantum Program Generation Must Prioritize Validity Over Probabilistic Scaling
Junhao Song ⋅ Yu Zhou ⋅ William J. Knottenbelt ⋅ Yudong Cao
The scaling hypothesis assumes that increasing model parameters yields emergent reasoning capabilities. This position paper argues that applying this probabilistic paradigm to generic quantum circuit synthesis is a category error. Unlike natural languages, quantum circuits require strict adherence to mathematical constraints, such as unitarity. Training on unverified code constitutes data poisoning. Models learn syntax but fail to capture the physical semantics of Hilbert space. Since the valid subset of circuit designs decays exponentially with the number of qubits, post-hoc filtering is mathematically intractable. We propose a pivot from human-centric copilots to verifier-centric agents. We integrate hierarchical constraints, topological masks, and symbolic proxies directly into generation. Our analysis suggests that scale alone cannot bridge the validity gap. Verification-aware architectures offer a viable path for modular quantum program generation. The community must stop simulating the physicist and instead satisfy the physical rules.
Position: Make Planning Research Rigorous Again!
Michael Katz ⋅ Harsha Kokel ⋅ Christian Muise ⋅ Shirin Sohrabi ⋅ Sarath Sreedharan
In the more than sixty years since its inception, the field of planning has made significant contributions to both the theory and practice of building planning software that can solve a never-before-seen planning problem. This was done through established practices of rigorous design and evaluation of planning systems. **It is our position that this rigor should be applied to the current trend of work on planning with large language models.** One way to do so is by correctly incorporating the insights, tools, and data from the automated planning community into the design and evaluation of LLM-based planners. The experience and expertise of the planning community could play a crucial role in accelerating the development of LLM-based planners. This position is particularly important in light of the abundance of recent works that replicate and propagate the same pitfalls that the planning community has encountered and learned from. We believe that establishing practices that avoid such known pitfalls will contribute greatly to the progress in building LLM-based planners and to planning in general.
Position: Sustainable Open-Source AI Requires Tracking the Cumulative Footprint of Derivatives
Shaina Raza ⋅ Iuliia Zarubiieva ⋅ Ahmed Radwan ⋅ Nathaniel Lesperance ⋅ Deval Pandya ⋅ Sedef Akinli Kocak ⋅ Graham Taylor
Open-source AI is scaling rapidly, and model hubs now host millions of artifacts. Each foundation model can spawn large numbers of fine-tunes, adapters, quantizations, merges, and forks. We take the position that compute efficiency alone is insufficient for sustainability in open-source AI. Lower per-run costs can accelerate experimentation and deployment, increasing aggregate footprint unless impacts are measurable and comparable across derivative lineages. However, the energy use, water consumption, and emissions of these derivative lineages are rarely measured or disclosed in a consistent, comparable way, leaving aggregate ecosystem impact largely invisible. We argue that sustainable open-source AI requires a coordination infrastructure that tracks impacts across model lineages, not only base models. We propose Data and Impact Accounting (DIA), a lightweight, non-restrictive transparency layer that (i) standardizes carbon-and-water reporting metadata, (ii) integrates low-friction measurement into common training and inference pipelines, and (iii) aggregates reports via public dashboards to summarize cumulative impacts across releases and derivatives. DIA makes derivative costs visible and supports ecosystem-level accountability while preserving openness.
Position: Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment
Jared Fernandez ⋅ Clara Na ⋅ Yonatan Bisk ⋅ Constantine Samaras ⋅ Emma Strubell
Proper accounting of the energy requirements and environmental impact of artificial intelligence (AI) systems is necessary for researchers, developers, policy makers, and users to assess the barriers to building systems at scale. With the growing complexity of pipelines and underlying infrastructure needed to develop and deploy AI systems, previous approaches for evaluating AI efficiency which focus on the costs of a single training run or an individual inference prediction are no longer sufficient. In this position paper, we enunciate the need for applying life cycle assessment to evaluate the costs of the machine learning model development and deployment pipeline to properly account for the required resources and downstream impact. Life cycle assessments enable the incorporation of costs across the full life cycle of an AI system and its underlying infrastructure, from the embodied costs associated with the physical computing hardware through the operational costs in training and inference.
Position: LLM for Physics Research Requires Domain-Specialized Training and Tooling
Sirui Lu ⋅ Zhijing Jin ⋅ Terry Zhang ⋅ Pavel Kos ⋅ Juan Cirac ⋅ Bernhard Schölkopf
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics remains inadequate. While current models show competence in mathematical reasoning and code generation, we identify critical gaps in physical intuition, constraint satisfaction, and reliable reasoning that cannot be addressed through prompting alone. Physics demands approximation judgment, symmetry exploitation, and physical grounding that require AI agents specifically trained on physics reasoning patterns and equipped with physics-aware verification tools. We argue that LLMs require such domain-specialized training and tooling to be useful for real-world physics research. We envision physics-specialized AI agents that seamlessly handle multimodal data, propose physically consistent hypotheses, and autonomously verify theoretical results. Realizing this vision requires developing physics-specific training datasets, reward signals that capture physical reasoning quality, and verification frameworks encoding fundamental principles. We call for collaborative efforts between physics and AI communities to build the specialized infrastructure necessary for AI-driven scientific discovery.
Position: Temporal Measurement Interval Determines Computational and Model Complexity in Single-Cell Perturbation Analysis
Alireza Jafari ⋅ Heman Shakeri ⋅ Hadi Daneshmand
Single-cell perturbation analysis aims to predict how cellular states change after interventions such as drug treatments or genetic edits. A central difficulty is that pre- and post-perturbation measurements are typically observed as *unpaired* populations, so accurate prediction requires inferring a latent coupling and learning a transition map. In this position paper, we argue that the *measurement time gap* is the key experimental knob controlling both the computational tractability of coupling and the effective model complexity. We identify a critical time gap $\Delta$ that induces a phase transition, under biologically inspired conditions; for "measurement-time $< \Delta$", matching is polynomial-time tractable and the task reduces to supervised learning, whereas for "measurement-time $>\Delta$", recovering the matching is NP-hard in the worst case. The required conditions are restricted isometry of the initial states and temporal smoothness of the transition dynamics. We complement the theory with empirical evidence on synthetic and biological datasets showing a sharp regime change as the time gap increases. Furthermore, we demonstrate that a linear model can match or exceed the performance of higher-capacity neural approaches when our conditions hold.
Position: The Inevitable Transition to Machine Learning in Quantum Chemistry
Karen Sargsyan ⋅ Chao-Ping Hsu
Finding exact solutions to the quantum many-body problem is computationally intractable (QMA-hard). Traditional approximations for electrons in an atom or molecule (density functional theory and wavefunction methods) have been indispensable, but their development shows signs of saturation: DFT functionals have proliferated without converging toward the exact functional, and strong correlation remains largely unsolved after decades of effort. This position paper argues that machine learning represents the most promising path forward, not as a proof of logical necessity, but as a decision-theoretic argument: ML succeeds whether the underlying problems are truly hard or merely lack simple analytical solutions. We reframe recent traditional method development as "hand-crafted machine learning" that has exhausted the hypothesis space accessible to human intuition. Significant challenges remain, but these have clear research paths forward, unlike the fundamental barriers facing traditional approaches. ML-based approaches merit strategic priority in quantum chemistry's next phase.
Position: Significant impact of numerical precision in scientific machine learning
Youngwoo Cho ⋅ Jaekak Yoo ⋅ Soyoung Yang ⋅ Dong-Joon Yi ⋅ Seung Lee ⋅ Mun Jeong ⋅ Jaegul Choo
The machine learning community has focused on computational efficiency, often leveraging lower-precision formats such as FP16 rather than the standard FP32. In contrast, little attention has been paid to higher-precision formats, such as FP64, despite their critical role in scientific domains like materials science, where even small numerical differences can lead to significant inaccuracies in physicochemical properties. This need for high precision extends to the emerging field of *machine learning for scientific tasks*, yet it has not been thoroughly investigated. According to several studies and our experiments, models trained with FP32 show insufficient accuracy compared to those trained with FP64, indicating that higher precision is also crucial in scientific machine learning, as in traditional scientific computing. This precision issue limits the potential of scientific machine learning to replace traditional scientific computing in practical research. Our position paper not only highlights these precision-related issues but also recommends reporting comparisons between FP32 and FP64 results and encourages the release of FP64 models. We believe that these efforts can enable machine learning to contribute meaningfully to the natural sciences, ensuring both scientific reliability and practical applicability.
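As a rough illustration of the precision gap described above (a sketch that is not taken from the paper), the snippet below shows how a small difference sitting on top of a larger value disappears in FP32 but survives in FP64. The helper names `to_f32` and `small_difference` and the values `1.0` and `1e-8` are illustrative assumptions; FP32 is emulated by rounding through Python's standard `struct` module.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def small_difference(base: float, delta: float, single: bool) -> float:
    """Compute (base + delta) - base, rounding every step to FP32 if `single`."""
    if not single:
        return (base + delta) - base
    b, d = to_f32(base), to_f32(delta)
    total = to_f32(b + d)      # the sum is rounded back to FP32
    return to_f32(total - b)   # so is the subtraction


# Hypothetical values: a tiny correction on top of a unit-sized quantity.
diff32 = small_difference(1.0, 1e-8, single=True)
diff64 = small_difference(1.0, 1e-8, single=False)
print("FP32:", diff32)
print("FP64:", diff64)
```

In this toy case the FP32 result is exactly zero, because `1e-8` is below the FP32 spacing near `1.0`, while the FP64 result stays close to `1e-8`. Real scientific models accumulate many such operations, so this only hints at the kind of error the abstract refers to.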
Position: Universal Aesthetic Alignment Narrows Artistic Expression
Wenqi Guo ⋅ Qingyun Qian ⋅ Khalad Hasan ⋅ Shan Du
Over-aligning image generation models to a generalized aesthetic preference conflicts with user intent, particularly when "anti-aesthetic" outputs are requested for artistic or critical purposes. This adherence prioritizes developer-centered values, compromising user autonomy and aesthetic pluralism. We test this bias by constructing a wide-spectrum aesthetics dataset and evaluating state-of-the-art generation and reward models. This position paper finds that aesthetic-aligned generation models frequently default to conventionally beautiful outputs, failing to respect instructions for low-quality or negative imagery. Crucially, reward models penalize anti-aesthetic images even when they perfectly match the explicit user prompt. We confirm this systemic bias through image-to-image editing and evaluation against real abstract artworks.
Position: Vision encoders should be image size agnostic and task driven
Nedyalko Prisadnikov ⋅ Danda Pani Paudel ⋅ Yuqian Fu ⋅ Luc Van Gool
This position paper argues that the next generation of vision encoders should be image size agnostic and task driven. The source of our inspiration is biological: not a structural aspect of biological vision, but a behavioral trait – efficiency. We focus on a couple of ways in which vision in nature is efficient but modern vision encoders are not. We – humans and animals – deal with vast quantities of visual data and need to be smart about where we focus our limited energy, which depends on the task. It is our belief that vision encoders should be dynamic and that their computational complexity should depend on the task at hand rather than the size of the image. We also provide concrete first steps towards our vision – a proof-of-concept solution for image classification. Although classification is not very representative of what we are trying to achieve, it shows that our approach is feasible and promising.
Position: The Systemic Lack of Agency in Visual Reasoning
Yizhao Huang ⋅ Haoyang Chen ⋅ Pohsun Huang ⋅ Jiayuan Li ⋅ Shiqin Wang ⋅ Haoyuan Du ⋅ Yandong Shi ⋅ Zheng Wang ⋅ Zhixiang Wang
This paper argues that a systemic lack of Agency constrains the implicit reasoning capabilities of current Vision-Language Models (VLMs). Implicit reasoning refers to the ability to autonomously discover and utilize hidden visual evidence to bridge information gaps, rather than merely relying on explicitly specified targets. This capacity underlies human visual understanding and everyday reasoning. We argue that this limitation arises from a tendency to equate visual reasoning with passive semantic retrieval, rather than with active, situated reasoning that depends on autonomous visual exploration. As a result, most existing benchmarks primarily assess Passive Capacity, leaving this aspect of reasoning largely unmeasured. To address this gap, we introduce the Visual Implicit Reasoning Benchmark (V-IRD), which targets this missing quadrant by requiring models to derive answers strictly through autonomous visual analysis. Our results show that, despite strong retrieval abilities, prominent VLMs struggle to utilize reference objects and to attend to visual evidence that requires self-directed inquiry. Simply put, strong semantic recognition does not equate to active visual exploration, revealing a critical gap in current VLMs.
Position: Human-Centric Vision Requires Topological Generalization Beyond Fixed Skeletal Topologies
Heming Du ⋅ Jiaying Ying ⋅ Kaihao Zhang ⋅ Tianqing Zhu ⋅ Xin Yu
In this position paper, we argue that human-centric vision requires skeletal-topology generalization beyond fixed skeletons. Mainstream pose and body pipelines enforce a fixed skeleton graph with an indexed joint list and fixed adjacency, so the fixed joint inventory does not cover structural absence and anatomical absence becomes an ill-posed target for individuals with limb deficiencies. Anatomical absence is not a visibility state, so masking and forced completion can hide structural mismatch and produce hallucinated structure that contaminates downstream reasoning in prosthesis-facing settings. We argue that scaling data and model size alone does not resolve this mismatch while the skeleton schema remains fixed, and this is not a niche concern because these failures affect a large population and reach accessibility-facing systems. We advocate instance-adaptive skeletal topology, where a model jointly predicts joint existence and skeletal connectivity to produce an instance-specific skeleton graph that supports consistent inference and evaluation. We outline measurement upgrades, including existence-aware annotations with explicit absence semantics, skeletal-topology-aware scoring, and hallucination-under-absence penalties, and we close with a call to action for dataset curators, benchmark organizers, and model builders to treat morphological variation as a first-class generalization axis.
Position: Your VLM May Not Be Thinking with Interleaved Images
Wenjie Yang ⋅ Siqi Zhu ⋅ Zengfeng Huang
"Thinking with images" has emerged as a central research theme in the realm of Vision-Language Models (VLMs). This multimodal reasoning paradigm typically features interleaved images generated via tool use or code execution as part of the Chain-of-Thought (CoT). While reinforcement learning (RL) has driven impressive performance within this paradigm, **in this position paper, we argue that current VLMs seldom truly "think" with interleaved images.** Through empirical evidence and analysis, we demonstrate that interleaved images do not play a significant role in the success of recent "Thinking with images" methods. Instead, the primary source of performance gains is the improved language generation distribution resulting from fine-tuning. These findings challenge the prevailing belief that "Thinking with images" VLMs actively utilize visual information to complete visual tasks. To improve mechanistic transparency, we suggest that future "Thinking with images" works include lightweight ablation studies to verify the necessity of interleaved images. Furthermore, we call upon the community to develop fundamentally novel benchmarks and advocate for more informative visual tools.
Position: We need to re-think the concept of “real” images.
Janis Keuper ⋅ Margret Keuper
The wide availability and low usability barrier of modern image generation models have triggered a reasonable fear of criminal misconduct and negative social implications. The machine learning community has been engaging with this problem through an extensive series of publications proposing algorithmic solutions for the detection of "fake", e.g., entirely generated or partially manipulated, images. While there is undoubtedly some progress towards technical solutions to the problem, we argue that current and prior work focuses too much on generative algorithms and "fake" data samples, neglecting a clear definition and data collection of "real" images. The fundamental question *"what is a real image?"* might appear to be quite philosophical, but our analysis shows that the development and evaluation of basically all current "fake"-detection methods relies on only a few, quite old, low-resolution datasets of "real" images like *ImageNet*. However, the technology for the acquisition of "real" images, i.e., taking photos, has drastically evolved over the last decade: Today, over 90% of all photographs are produced by smartphones, which typically use algorithms to compute an image from multiple inputs (over time) from multiple sensors. Based on the fact that these image formation algorithms are typically neural network architectures which are closely related to "fake"-image generators, we state the position that today, **we need to re-think the concept of "real" images**. The purpose of this position paper is to raise awareness of the current shortcomings in this active field of research and to trigger an open discussion on whether the detection of "fake" images is a sound objective at all. At the very least, we need a clear technical definition of "real" images and new benchmark datasets.
Position: AI for Science Should Treat Measurement-to-Dataset Pipelines as Inference Components
Ling Zhan ⋅ Xiaoyao Yu ⋅ Tao Jia
AI for Science (AI4Science) workflows often treat the released dataset as a fixed interface to the underlying system. However, in domains relying on *indirect observation*, the learner observes a derivative representation produced by multi-stage measurement, reconstruction, and preprocessing pipelines. **We argue that these measurement-to-dataset pipelines are inference components: treating their outputs as "given data" freezes an observation model and obscures uncertainty over feasible pipeline choices.** We identify three failure modes arising from this "frozen lens": **(C1) hidden hypothesis space**, where the released dataset does not specify the pipeline configuration or its validity conditions; **(C2) uncertified transportability**, where a pipeline may be documented but its regime of validity is untested, so failures under distribution shift cannot be adjudicated; **(C3) ungoverned multiplicity**, where many defensible pipelines exist and dispersion is real but not propagated into uncertainty-aware evidence. We stress-test these claims with a large-scale neuroscience empirical audit, finding a survival rate of ≈ 0.0004% under a cross-dataset stability criterion. We call on the AI4Science community to make pipelines *computable* inference objects via domain-specific Computable Observation Frameworks. This shift enables quantifying pipeline adequacy and stability, converting implicit implementation choices into auditable, reproducible, and cumulative scientific evidence.
Position: Stop Chasing the C-index when Evaluating Survival Analysis Models
Christian Marius Lillelund ⋅ Shi-ang Qi ⋅ Russell Greiner ⋅ Christian Fischer Pedersen
The current state of evaluation in survival analysis is plagued by the persistent use of evaluation metrics in ways that are misaligned with the stated modeling objective. In addition, many such evaluations are based on censoring assumptions that are left implicit or unjustified. This means that the reported performance can be misleading and may fail to answer the scientific or modeling question the evaluation was intended to address. In this position paper, we present a critical analysis of evaluation practices in survival analysis and highlight why evaluation in survival analysis fundamentally differs from standard regression or classification due to censoring. We place particular focus on concordance-based measures, such as the C-index, which our findings indicate are heavily overused in the literature. To help identify appropriate metrics, we propose a set of key desiderata and introduce a double-helix ladder, in which valid evaluation requires alignment between metric and modeling assumptions, and we provide empirical evidence that this is effective. We conclude by providing practical guidance on how to evaluate a survival model.
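For readers unfamiliar with the metric under discussion, the following minimal sketch computes Harrell's C-index under right-censoring. It is not the authors' code: the function name `harrell_c_index` and the toy data are hypothetical. It illustrates one commonly noted property of concordance measures, namely that they depend only on the ranking of risk scores, so a monotone rescaling of the predictions leaves the score unchanged.

```python
from itertools import permutations


def harrell_c_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked concordantly.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] == 1) and times[i] < times[j]. It is concordant when the
    earlier-failing subject i has the higher risk score; ties count 0.5.
    """
    concordant = 0.0
    comparable = 0
    for i, j in permutations(range(len(times)), 2):
        if events[i] and times[i] < times[j]:
            comparable += 1
            if risks[i] > risks[j]:
                concordant += 1.0
            elif risks[i] == risks[j]:
                concordant += 0.5
    return concordant / comparable


# Hypothetical toy cohort: the second subject is censored (event = 0).
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 1]
risks = [0.9, 0.3, 0.1, 0.5]

c_original = harrell_c_index(times, events, risks)
# Cubing is monotone on positive scores, so the ranking is preserved.
c_rescaled = harrell_c_index(times, events, [r ** 3 for r in risks])
print(c_original, c_rescaled)
```

Both calls return 0.75 on this toy cohort, even though the rescaled scores would correspond to very different predicted risks. This is one way to see why a ranking metric alone may not answer questions about calibration or predicted event times.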
Position: Beyond Prediction: Toward Verifiable Physiological Waveform Reasoning with Foundation Models and Agentic LLMs
Xiaoda Wang ⋅ Ching Chang ⋅ Defu Cao ⋅ Kaiqiao Han ⋅ Fang Sun ⋅ Yue Huang ⋅ Minxiao Wang ⋅ Chang Xu ⋅ Xiao Luo ⋅ Runze Yan ⋅ Xiangliang Zhang ⋅ Xiao Hu ⋅ Yan Liu ⋅ Yizhou Sun ⋅ Wei Wang ⋅ Carl Yang
Physiological waveforms (e.g., ECG, PPG, EEG) encode clinically meaningful information in fine-grained morphology, precise timing, and cross-channel dynamics, yet most machine learning systems still treat them as generic time series and optimize end-to-end prediction. In this position paper, **we argue for verifiable physiological waveform reasoning: extracting localized, measurable signal evidence from raw signals, interpreting that evidence into physiological semantics, and supporting clinically grounded decisions.** Waveform reasoning is challenging due to acquisition heterogeneity, signal fidelity, complex semantics, and cross-channel coupled dynamics. We analyze why existing model families remain insufficient: physiological foundation models learn strong perceptual representations but remain weak at verifiable reasoning, while LLM-based adaptations have limited waveform understanding. To bridge this gap, **we advocate verifiable, closed-loop systems that unify waveform semantics with language intelligence.** Concretely, we propose a dual-process architecture in which System 1 aligns physiological waveforms with language and System 2 provides agentic reasoning via a Plan-Act-Verify loop, together enabling verifiable physiological waveform reasoning. We further propose evaluations beyond accuracy, emphasizing traceability, replayability, counterfactual robustness, and calibrated abstention.
Position: Evaluation of ECG Representations Must Be Fixed
Zachary Berger ⋅ Daniel Prakah-Asante ⋅ John Guttag ⋅ Collin Stultz
This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.
Position: Deciphering the Functions of DNAs, RNAs, and Proteins Should Consider Multi-Modal Large Language Models
Pengtao Xie ⋅ Victor Nizet ⋅ Lei Wang ⋅ Ahmed Alaa ⋅ Daniel Zielinski ⋅ Trey Ideker ⋅ Bernhard Palsson
Understanding the functions of DNAs, RNAs, and proteins is fundamental to advancing life science research and enabling translational applications such as drug discovery and precision medicine. While deep learning methods have shown promise in biomolecular function prediction, they typically constrain outputs to predefined categories and require training separate models for each task. Existing multi-task learning methods operate on a fixed set of predefined tasks and require model retraining when new tasks arise. Furthermore, current approaches produce one-shot, static outputs, lacking the capacity for iterative refinement or deeper exploration of predictions. This position paper argues that multi-modal large language models (LLMs) are essential for enabling free-form and interactive prediction of biomolecular functions, and zero-shot generalization to new tasks without model retraining. These models can generate coherent and context-aware text outputs that reflect the complexity and nuance of diverse functional roles. Importantly, they can generalize to novel biomolecules whose functions are unknown or poorly characterized, and they enable generalization to new tasks through prompt-driven adaptation, eliminating the need for task-specific retraining. Additionally, multi-modal LLMs enable interactive, multi-turn dialogue, allowing users to iteratively refine queries, clarify contexts, and explore hypotheses in a dynamic and responsive manner. By leveraging these capabilities, multi-modal LLMs provide a scalable, adaptable, and generalizable framework for advancing biomolecular function prediction and accelerating biological discovery.
Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods
Shasha Zhou ⋅ Mingyu Huang ⋅ Ke Li
Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists increasingly demand that these models also elucidate the underlying biological mechanisms. While interpretable machine learning (IML) techniques have been increasingly applied to bridge this gap, there has been a pervasive reliance on anecdotal validation: the vast majority of research employs a single IML method and reports only isolated successful instances. Through a benchmarking study on transcription factor binding, we demonstrate the risks of current practices. We show that different IML methods can often (1) yield contradictory explanations for identical predictions, (2) fail to localize known regulatory motifs, and (3) fail to faithfully reflect the model's internal decision process. In light of this, we argue for a validation framework analogous to clinical trials. Just as trials require rigorous design and the reporting of adverse events, genomic interpretability must move beyond cherry-picked plausibility toward systematic assessment of consistency, faithfulness, and biological validity. To facilitate this, we propose a tiered framework to guide the rigorous evaluation and reporting of genomic IML methods.
Position: Medical AI Neglects Real Treatment Outcomes
Shiva Kaul ⋅ Anjum Khurshid
Medical AI has rapidly improved its ability to perform diagnostic and prognostic tasks that lead to treatment decisions. But understanding of treatment itself is still inadequately trained and evaluated, using human opinions and syntheses (especially texts such as biomedical publications and clinical practice guidelines) rather than actual underlying data on treatment outcomes. This neglect seriously limits the long-term potential of medical AI, and is already causing deficiencies in both frontier models and major benchmarks, as argued in this position paper. Real treatment outcomes, drawn from sources such as observational databases and randomized experiments, should be substantially incorporated into both training and evaluation. Improving these outcomes should be reemphasized as the goal of all medical AI.
Position: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning
Yiqun Sun ⋅ Qiang Huang ⋅ Anthony Tung ⋅ Jun Yu
**This position paper argues that text embedding research should move beyond surface meaning and embrace implicit semantics as a central modeling objective.** Text embeddings are a foundational component of modern NLP, underpinning a wide range of applications and driving sustained research progress. Despite rapid progress, most embedding models remain narrowly focused on surface-level semantics, whereas linguistic theory emphasizes that much of human meaning is implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current embedding models are typically trained on datasets that lack such depth and evaluated using benchmarks that reward surface similarity. As a result, they struggle with tasks that require interpretive reasoning, stance recognition, or socially grounded understanding. Our pilot study makes this limitation explicit, showing that even state-of-the-art embeddings achieve only marginal improvements over simple lexical baselines on tasks probing implicit semantics. We therefore call for a paradigm shift: embedding research should prioritize linguistically grounded and diverse training data, develop benchmarks that probe deeper semantic understanding, and treat implicit meaning as a core modeling objective to better align embeddings with real-world language complexity.
Position: Towards Responsible Evaluation for Text-to-Speech
Yifan Yang ⋅ Hui Wang ⋅ Bing Han ⋅ Shujie Liu ⋅ Jinyu Li ⋅ Yong Qin ⋅ Xie Chen
Recent advances in text-to-speech (TTS) technology have enabled systems to generate speech that is often indistinguishable from human speech, bringing benefits to accessibility, content creation, and human-computer interaction. However, current evaluation practices are increasingly inadequate for capturing the full range of capabilities, limitations, and societal impacts of modern TTS systems. This position paper introduces the concept of Responsible Evaluation and argues that it is essential and urgent for the next phase of TTS development, structured through three progressive levels: (1) ensuring the faithful and accurate reflection of a model's true capabilities and limitations, with more robust, discriminative, and comprehensive objective and subjective scoring methodologies; (2) enabling comparability, standardization, and transferability through standardized benchmarks, transparent reporting, and transferable evaluation metrics; and (3) assessing and mitigating ethical risks associated with forgery, misuse, privacy violations, and security vulnerabilities. Through this concept, we critically examine current evaluation practices, identify systemic shortcomings, and propose actionable recommendations. We hope this concept will not only foster more reliable and trustworthy TTS technology but also guide its development toward ethically sound and societally beneficial applications.
Position: Natural Language Should Not Fully Replace Formal Languages
Eitan Wagner ⋅ Elisha Rosensweig ⋅ Omri Abend
Recent advances in large language models and their widespread adoption have prompted claims that natural language could entirely replace formal languages, such as programming languages, for software design. In this position paper, we argue that this perspective overlooks fundamental linguistic properties of natural language, specifically that it is optimized for underspecification in open-ended contexts. We introduce a formal framework centered on *task specificity*, defining it as the information-theoretic reduction of uncertainty (in an output space, such as all possible images) given a user's specific requirements. We prove a *specificity crossover theorem*, showing the existence of a threshold beyond which the cost of expressing formal requirements in natural language exceeds the cost of direct formal specification. By analyzing case studies across modalities, such as image generation, code synthesis, and audio production, we demonstrate that natural language excels at low-specificity tasks, while formal languages are advantageous on tasks with stricter requirements. We conclude that natural and formal languages are complementary tools and advocate the development of hybrid systems that allow users to move across the specificity spectrum.
Position: Hippocampal Explicit Memory Is a Cornerstone to Human-Level AI
Sangjun Park
Recent artificial neural networks have demonstrated remarkable capabilities across various tasks, raising expectations for Human-Level AI (HLAI). This position paper argues that integrating explicit memory is instrumental in advancing current AI towards HLAI. The key reason is that the underlying learning mechanism of artificial neural networks bears a notable resemblance to the implicit memory of the basal ganglia. However, higher-order cognitive functions necessary for HLAI, such as long-term strategic planning, metacognition, and symbolic reasoning, rely heavily on hippocampal explicit memory and cannot arise solely from implicit statistical learning. Based on this perspective, we define the computational requirements for artificial explicit memory systems, with the aim of fostering further research and laying the groundwork for explicit memory integration.
Position: Modular Safety Guardrails Are Necessary for Foundation-Model-Enabled Robots in the Real World
Joonkyung Kim ⋅ Wenxi Chen ⋅ Davood Soleymanzadeh ⋅ Yi Ding ⋅ Xiangbo Gao ⋅ Zhengzhong Tu ⋅ Ruqi Zhang ⋅ Fan Fei ⋅ Sushant Veer ⋅ Yiwei Lyu ⋅ Minghui Zheng ⋅ Yan Gu
The integration of foundation models (FMs) into robotics has accelerated real-world deployment, while introducing new safety challenges arising from open-ended semantic reasoning and embodied physical action. These challenges require safety notions beyond physical constraint satisfaction. In this position paper, we characterize FM-enabled robot safety along three dimensions: action safety (physical feasibility and constraint compliance), decision safety (semantic and contextual appropriateness), and human-centered safety (conformance to human intent, norms, and expectations). We argue that existing approaches, including static verification, monolithic controllers, and end-to-end learned policies, are insufficient in settings where tasks, environments, and human expectations are open-ended, long-tailed, and subject to adaptation over time. To address this gap, we propose modular safety guardrails, consisting of monitoring (evaluation) and intervention layers, as an architectural foundation for comprehensive safety across the autonomy stack. Beyond modularity, we highlight possible cross-layer co-design opportunities through representation alignment and conservatism allocation to enable faster, less conservative, and more effective safety enforcement. We call on the community to explore richer guardrail modules and principled co-design strategies to advance safe real-world physical AI deployment.
Position: Time-Series Foundation Models Require Explicit Domain-Level Benchmarks
Md Asif Bin Syed ⋅ Md Younus Ahamed ⋅ Azmine Toushik Wasi
Time series foundation models (TSFMs) have demonstrated strong performance on established benchmarks such as GIFT-Eval, Monash, and TSFM-Bench. However, these benchmarks pool datasets from many domains with uneven representation, which can obscure performance within specific application areas such as healthcare, finance, nature, retail, and transport. The necessity for domain-specific evaluation arises from the inherent structural diversity of time series data: clinical records often feature irregular sampling and informative missingness; financial sequences are characterized by high noise and stochastic trajectories; and environmental data, such as energy and weather, are governed by deterministic physical laws and strong seasonal hierarchies. Motivated by this heterogeneity, **we argue that TSFMs require explicit domain-specific benchmarks** so practitioners can reliably assess a model's utility within their own application area. This is because cross-domain differences in data generation, sampling irregularity, and nonstationarity under concept drift fundamentally shape forecasting difficulty and failure modes. As a result, strong performance on aggregated leaderboards may not translate to reliable deployment within a specific domain. To test this, we evaluated seven TSFMs across 72 datasets from six domains (healthcare, finance, energy, nature, transport, and retail) and found substantial cross-domain variability. These findings confirm that global benchmark scores can be misleading and that domain-aware evaluations are essential for trustworthy TSFM selection.
Position: Current Benchmarking Hinders Real Progress in Deep Learning for Time Series Forecasting
Valentina Moretti ⋅ Andrea Cini ⋅ Ivan Marisca ⋅ Cesare Alippi
Deep learning models have grown popular in time series applications. However, the large quantity of newly proposed architectures and the often contradictory empirical results make it difficult to assess which design choice and model component drives performance. In this position paper, we argue that current benchmarking practices fail to identify the factors responsible for performance differences, thus slowing down progress in the field. In particular, differences in crucial design dimensions are overlooked when comparing architectures, ultimately leading to inconsistent outcomes. To support our position, we show that such differences—often treated as mere implementation details—can have a greater impact than adopting specific sequence modeling layers. We discuss how overlooked aspects (such as globality and locality) can (1) fundamentally change the class of the forecasting method and (2) drastically affect empirical results. Our findings suggest rethinking our benchmarking practices and focusing on the foundational aspects of the forecasting problem when designing and comparing architectures. As a concrete step, we propose an *auxiliary forecasting model card*, i.e., a template with a set of fields to characterize existing and new forecasting architectures based on key design choices.
Position: Why a Dynamical Systems Perspective is Needed to Advance Time Series Modeling
Daniel Durstewitz ⋅ Christoph Jürgen Hemmer ⋅ Florian Hess ⋅ Charlotte Ricarda Doll ⋅ Lukas Eisenmann
Time series (TS) modeling has come a long way from early statistical, mainly linear, approaches to the current trend in TS foundation models. With a lot of hype and industrial demand in this field, it is not always clear how much progress there really is. To advance TS forecasting and analysis to the next level, here we argue that the field needs a *dynamical systems (DS)* perspective. TS of observations from natural or engineered systems almost always originate from some underlying DS, and arguably access to its governing equations would yield theoretically optimal forecasts. This is the promise of *DS reconstruction (DSR)*, a class of ML/AI approaches that aim to infer *surrogate models* of the underlying DS from data. But models based on DS principles offer other profound advantages: beyond short-term forecasts, they make it possible to predict the *long-term statistics* of an observed system, which in many practical scenarios may be the more relevant quantities. DS theory furthermore provides domain-independent *theoretical insight into mechanisms* underlying TS generation, and thereby will inform us, e.g., about upper bounds on the performance of *any* TS model, generalization into unseen regimes as in tipping points, or potential control strategies. After reviewing some of the central concepts, methods, measures, and models in DS theory and DSR, we will discuss how insights from this field can advance TS modeling in crucial ways, enabling better forecasting with much lower computational and memory footprints. We conclude with a number of specific suggestions for translating insights from DSR into TS modeling.
Position: Interpretability in Deep Time Series Models Demands Semantic Alignment
Giovanni De Felice ⋅ Riccardo D'Elia ⋅ Alberto Termine ⋅ Pietro Barbiero ⋅ Giuseppe Marra ⋅ Silvia Santini
Deep time series models continue to improve predictive performance, yet their deployment remains limited by their black-box nature. In response, existing interpretability approaches in the field keep focusing on explaining the internal model computations, without addressing whether these align with how a human would reason about the studied phenomenon. Instead, we argue that interpretability in deep time series models should pursue semantic alignment: predictions should be expressed in terms of variables that are meaningful to the end user, mediated by spatial and temporal mechanisms that admit user-dependent constraints. In this paper, we formalize this requirement and further require that, once established, semantic alignment be preserved under temporal evolution: a constraint with no analog in static settings. Building on this definition, we outline a blueprint for semantically aligned deep time series models, identify properties that support trust, and discuss implications for model design.
Position: Weight Space Should Be a First-Class Generative AI Modality
Zhangyang “Atlas” Wang ⋅ Kai Wang ⋅ Peihao Wang
Neural network checkpoints have quietly become a large-scale data resource: millions of trained weight vectors now exist, each encoding task-, domain-, and architecture-specific knowledge. This position paper argues that model checkpoints should be treated as a first-class data modality, and that generative modeling in weight space should be standardized as a core machine learning primitive. Recent advances demonstrate that neural weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. We contend that these results reflect an underlying structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces. Building on this view, we organize existing methods into a standardized five-stage pipeline for weight-space generation and survey applications where the approach is already practical, such as parameter-efficient adaptation, mid-scale model synthesis, and on-device learning. We then confront alternative views, clarify current limits, and issue a concrete call to action. Our goal is to shift the community’s default mindset from optimizing models per task to sampling models from learned weight distributions, accelerating toward an era in which AI systems routinely generate other AI systems.
Position: Quantum Deep Learning Still Needs a Quantum Leap
Hans Gundlach ⋅ Hrvoje Kukina ⋅ Jayson Lynch ⋅ Neil Thompson
Quantum computing technology is advancing rapidly. Yet, even accounting for these trends, a quantum leap would be needed for quantum computers to meaningfully impact deep learning over the coming decade or two. We arrive at this conclusion based on a first-of-its-kind survey of quantum algorithms and how they match potential deep learning applications. This survey reveals three important areas where quantum computing could potentially accelerate deep learning, each of which faces a challenging roadblock to realizing its potential. First, quantum algorithms for matrix multiplication and other algorithms central to deep learning offer small theoretical improvements in the number of operations needed, but this advantage is overwhelmed on practical problem sizes by how slowly quantum computers do each operation. Second, some promising quantum algorithms depend on practical Quantum Random Access Memory (QRAM), which is underdeveloped. Finally, there are quantum algorithms that offer large theoretical advantages, but which are only applicable to special cases, limiting their practical benefits. In each of these areas, we support our arguments using quantitative forecasts of quantum advantage that build on the work by Choi et al. (2023) as well as new research on limitations and quantum hardware trends. Our analysis outlines the current scope of quantum deep learning and points to research directions that could lead to greater practical advances in the field.
Position: Topological Machine Learning Cannot Progress without Experimental Standards
Inés Castilla Rieso
Topological Machine Learning provides strong discriminative power for classification tasks through the use of Topological Data Analysis, and more particularly, Persistent Homology. Although it has strong theoretical appeal, it remains underused by the broader Machine Learning community; criticism often targets the reliance on synthetic data and the absence of shared experimental standards, which makes reported results difficult to compare. Indeed, current empirical evaluations lack a consistent framework for assessing methods: the construction of topological signatures is often opaque, and statistical significance testing to validate reported gains, computing times, and robustness to perturbations (such as missing data or noise) are often omitted. We assert that **progress in Topological Machine Learning depends on establishing clear and consolidated experimental standards that support meaningful comparison across methods**, articulated through a transparent and reproducible empirical framework including data processing and performance evaluation. We review current practices, highlight their limitations, and propose a set of principles for conducting rigorous and comparable empirical evaluations. Adopting these standards will enable trustworthy studies, clarify the gains of new methods, and ultimately support the broader adoption of Topological Machine Learning by the Machine Learning community.
Position: Graph Condensation Needs a Reset—Move Beyond Full-dataset Training and Model-Dependence
Mridul Gupta ⋅ Samyak Jain ⋅ Vansh Ramani ⋅ HARIPRASAD KODAMANA ⋅ Sayan Ranu
Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is increasingly strained by the size of real-world graphs in domains like recommender systems, fraud detection, and molecular biology. Graph condensation—the task of generating a smaller synthetic graph that retains the performance of models trained on the original—has emerged as a promising solution. However, the dominant approach of gradient matching introduces a fundamental contradiction: it requires training on the full dataset to create the compressed version, thereby undermining the goal of efficiency. Worse still, these methods suffer from high computational overhead, poor generalization across GNN architectures, and brittle reliance on specific model configurations. Equally concerning is the community's reliance on misleading evaluation protocols such as node compression ratios, which fail to reflect true resource savings, condensation overhead, and illusory application to neural architecture search. These shortcomings are not incidental—they are systemic, and they obstruct meaningful progress. In this position paper, we argue that graph condensation, in its current form, needs a reset. We call for moving beyond full-dataset training and model-dependent design, and instead advocate for methods that are lightweight, architecture-agnostic, and practically deployable. By identifying key methodological flaws and outlining concrete research directions, we aim to reorient the field toward approaches that deliver on the true promise of condensation: efficient, generalizable, and usable GNN training at scale.
Position: Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
Wei Liu ⋅ Siya Qi ⋅ Yali Du ⋅ Yulan He
Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing *learnable information* for the next iteration. Through experiments on a self-play coding task, we reveal that **sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations.** We identify triadic roles that self-evolving LLMs play: the *proposer*, which generates tasks; the *solver*, which attempts solutions; and the *verifier*, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.
Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Subbarao Kambhampati ⋅ Karthik Valmeekam ⋅ Siddhant Bhambri ⋅ Vardhan Palod ⋅ Lucas Saldyt ⋅ Kaya Stechly ⋅ Soumya Samineni ⋅ Durgesh Kalwar ⋅ Upasana Biswas
Intermediate token generation (ITG), where a model produces output before the solution, has become a standard method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thoughts", implicitly anthropomorphizing the traces and implying that they resemble steps a human might take when solving a challenging problem, and as such can provide the end user with an interpretable window into the model's thinking process. In this position paper, we present evidence that this anthropomorphization isn't a harmless metaphor and is instead quite dangerous: it confuses the nature of these models and how to use them effectively, and leads to questionable research. We call on the community to avoid such anthropomorphization of intermediate tokens.
Position: LLMs can't jump
Tom Zahavy
How do we fundamentally discover new things? In a letter to Maurice Solovine, Albert Einstein conceptualized discovery as a cyclical process involving an intuitive 'jump' from sensory experience to axioms, followed by logical deduction. While Generative AI has mastered Induction (statistical pattern matching) and is rapidly conquering Deduction (formal proof), we argue it lacks the mechanism for Abduction—the generation of novel explanatory hypotheses. Using Einstein’s formulation of General Relativity as a computational case study, we demonstrate that the prevailing theory of "creativity as data compression" (induction) fails to account for discoveries where observational data is scarce. This position paper argues that while a modern Large Language Model could plausibly execute the deductive phase of proving theorems from established premises, it is structurally incapable of the abductive 'Jump' required to formulate those premises. We identify the translation of simulation into formal axioms as the critical bottleneck in artificial scientific invention, and propose that physically consistent, multimodal world models offer the necessary sensory grounding to bridge this divide.
Position: Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning
Sanghyuk Chun ⋅ Olga Russakovsky
Multimodal learning has seen remarkable progress, particularly with large-scale pre-training across various modalities. Most current approaches are built on the assumption of a deterministic one-to-one alignment between modalities. However, this oversimplifies real-world multimodal relationships, which are inherently many-to-many. The many-to-many property, or *multiplicity*, is not a side-effect of noise or annotation error, but an inevitable outcome of intra-modal variability, representational asymmetry, and task-dependent ambiguity in multimodal tasks. We argue that multiplicity is a fundamental bottleneck that affects all stages of the multimodal learning pipeline: from data construction to model training and evaluation benchmarks. By formalizing its causes and consequences, we demonstrate how ignoring multiplicity leads to training uncertainty, unreliable evaluation, and degraded dataset quality. This position paper calls for new research directions on multimodal learning, including multiplicity-aware learning frameworks and dataset construction and evaluation protocols.
Position: Causality is Key for Interpretability Claims to Generalise
Shruti Joshi ⋅ Aaron Mueller ⋅ David Klindt ⋅ Wieland Brendel ⋅ Dhanya Sridhar ⋅ Patrik Reizinger
Interpretability research on large language models (LLMs) has produced methods that align model components to high-level concepts, yet their use has been accompanied by recurring failures: findings that do not generalise, and causal language that outruns the evidence. Our position is that Pearl’s causal hierarchy formally defines what constitutes a good alignment, what data or assumptions it requires, and what inferences it supports. Specifically, observations of model behaviour support only associational claims; interventions enable cause-effect claims, but not necessarily predictions of model behaviour; counterfactuals, or predictions of behaviour on unseen examples, are often unverifiable in current studies. We show how interpretability research can benefit from causal representation learning (CRL), which provides tools for provably extracting semantic variables and their relationships from activations, and outline practical requirements for generalisable insights: robustness to distribution shifts, sensitivity to assumptions, and compositionality of interventions. Our diagnostic framework helps practitioners select appropriate methods and mitigate failures to ensure that claims match evidence and findings generalise.
Position: VLM Causal Reasoning Benchmarks Should Probe Temporal Understanding, Not Presume It
Chinh Hoang ⋅ Mohammad Hasan
This position paper argues that vision-language model (VLM) benchmarks for causal reasoning rely on two under-examined assumptions. First, benchmarks presuppose temporal constitution, the understanding of time as the medium through which causes produce effects, without testing it as a prerequisite. Second, they insufficiently distinguish external symbolic scaffolding from internalized capability; scaffolding-invariance is the diagnostic signature of genuine internalization. Drawing on frameworks from art, philosophy, and psychoanalysis, we propose diagnostics that probe these foundations. Preliminary evidence from three VLMs shows systematic disparity between fluent causal text and valid causal structure, and qualitatively different responses to identical scaffolding manipulation. None of these patterns indicates constitutive internalization. Progress requires benchmarks that test temporal understanding and scaffolding-invariance, not only output accuracy.
Position: Peer Review Should Be Calibrated via LLM Scoring
Zijin Chen ⋅ lesui Yu ⋅ Xiaofei Liao ⋅ Hai Jin ⋅ Qinbin Li
As submission volumes grow, AI conference peer review increasingly suffers from scale drift and non-comparable scoring: similar rationales can yield markedly different numeric ratings due to subjective calibration and occasional incoherent or strategic scoring, even though scores often strongly influence outcomes. This position paper argues that **AI conference workflows should incorporate an LLM-driven calibration layer that maps reviewer rationales (e.g., strengths and weaknesses) into consistent and auditable anchor scores**. The residual between a reviewer’s reported score and the anchor score turns rationale-score misalignment into a measurable signal for targeted escalation. We instantiate an end-to-end pipeline and apply it to OpenReview data from ICLR 2023–2025 to quantify severity/leniency patterns and where misalignment concentrates. We further propose a lightweight post-check (requesting added justification or score revision when residuals are large) and estimate its impact via an offline counterfactual simulation. Finally, we outline an adoption playbook and governance boundaries, emphasizing that the LLM audits scoring coherence rather than replacing human judgment or making accept/reject decisions.
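The residual signal described above can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the anchor scores are assumed to be supplied by some upstream calibration step, and the `Review` type, the example scores, and the threshold of 2.0 are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Review:
    reviewer: str
    reported_score: float  # score the reviewer actually submitted
    anchor_score: float    # assumed output of a calibration layer (hypothetical values)

def residual(review: Review) -> float:
    """Reported score minus anchor score; positive means more lenient than the rationale suggests."""
    return review.reported_score - review.anchor_score

def flag_for_post_check(reviews, threshold=2.0):
    """Return reviewers whose absolute residual meets or exceeds the (assumed) threshold."""
    return [r.reviewer for r in reviews if abs(residual(r)) >= threshold]

reviews = [
    Review("R1", reported_score=8.0, anchor_score=5.0),  # lenient relative to rationale
    Review("R2", reported_score=6.0, anchor_score=6.0),  # aligned
    Review("R3", reported_score=3.0, anchor_score=5.5),  # severe relative to rationale
]

flagged = flag_for_post_check(reviews)
```

The flagged reviews are only candidates for a request for added justification or a score revision; the sketch makes no accept/reject decision, in line with the governance boundary stated in the abstract.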
Position: Bridge Human Interpretation and Machine Representation With Explicit Specification For Qualitative Data Analysis In LLM Era
Xinyu Pi ⋅ Qisen Yang ⋅ Chuong Nguyen ⋅ Hua Shen
Large language models (LLMs) are increasingly used in qualitative data analysis, yet the field lacks a shared way to state what kinds of process LLM-based pipelines intend to produce. This position paper proposes an explicit specification perspective: separating meaning-making from modeling, and making both visible as part of the analytic. We introduce a 4×4 landscape that crosses levels of meaning-making with levels of modeling, and use it to situate and compare qualitative outputs across both human-led studies and LLM-assisted workflows. A structured analysis of prior work suggests that many current LLM pipelines emphasize surface organization and static representations, with fewer systems making explicit commitments to richer causal or dynamical models. We demonstrate that the landscape can be applied consistently through strong agreement in independent labeling, including an LLM-based annotation pass. We conclude with a research agenda for LLM-assisted qualitative analysis focused on explicit level selection, evidence-linked outputs, and governance mechanisms aligned with the strength of semantic and representational claims.
Position: Benchmarks for Vision–Language Models in Urban Perception Should Be Reliability-Aware and Negotiated
Rashid Mushkani
Vision–language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This position paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension Overall Impression models and annotators exhibit distributional mismatch including different rates of Not applicable. We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.
Position: The Open Benchmark Paradox Must Be Resolved through Sovereign Medical Evaluation
Keonwoo Kim ⋅ Hyeseon Ko ⋅ Hyejeong Jo ⋅ Sewon Kim ⋅ Yera Choi ⋅ JaeDeok Lee ⋅ Heeyoung Kwak ⋅ Yunwook Sung ⋅ Haanju Yoo
As medical large language models become increasingly involved in clinical actions, public benchmarks are often treated as proxies of deployment-readiness. However, this reliance creates a false sense of security because public scores are often based on data the models have already seen. We call this the Open Benchmark Paradox: making evaluation data public for research progress also makes data contamination inevitable, ruining its value as a reliable safety signal. This paradox induces three structural failures: (1) hidden contamination, where it is impossible to prove evaluation independence; (2) outdated standards, where static datasets fail to track evolving medical guidelines; and (3) jurisdictional divergence, where global averaging ignores local legal and ethical standards. To validate these risks, we audited frontier models using recent medical exam data, which confirmed a high probability of data contamination. To resolve such integrity issues in medical evaluation, we propose Sovereign Medical Evaluation (SME). Instead of public leaderboards, SME establishes a national infrastructure where health authorities manage private, isolated evaluation pipelines. Within this secure system, evaluations are automatically updated using live medical data and legal changes, ensuring they remain current and strictly separated from model training. SME provides the essential transition to a controlled, auditable, and legally grounded safety gate for medical AI.
Position: Early-Stage Quality Assurance in Annotation Pipelines Is More Cost-Effective Than Late-Stage Validation
Sunil Kothari ⋅ Sumukha Sharma Thoppanahalli Chandramouli ⋅ Naman Khandelwal ⋅ Praveen Kumar Gulipalli ⋅ Parth Kulshreshtha ⋅ Ashi Jain ⋅ Kriti Banka ⋅ Tanuja Chintada ⋅ Venkata Triveni ⋅ Manish Mehta ⋅ Tao Liu
This position paper argues that the machine learning community should prioritize early-stage quality assurance in annotation pipelines over the prevailing practice of late-stage validation. Data quality bottlenecks increasingly limit foundation model improvement, yet quality assurance research focuses almost exclusively on validation methods rather than validation timing. *When* validation occurs—not merely *what* validation methods are employed—fundamentally determines both error rates and annotation costs. This temporal neglect is puzzling given the well-established "shift-left" principle from software engineering, where empirical studies demonstrate 4–100× cost multipliers for defects detected in later development stages (Boehm, 1981; Shull et al., 2002). Annotation pipelines, we argue, exhibit analogous dynamics: errors caught before annotation begins cost a fraction of those discovered after review cycles complete. We propose a taxonomy of three *QA trigger points*—pre-annotation (T₀), post-annotation (T₁), and post-review (T₂)—that decompose annotation workflows into discrete validation opportunities. A survey of 47 recent papers reveals that only 4% report when validation occurs, a striking gap given timing's demonstrated impact in adjacent fields. Without explicit attention to QA timing, the community risks optimizing validation methods while ignoring the structural variable that may matter most. We call on researchers to report QA timing configurations, on platform developers to expose timing as a first-class parameter, and on the community to conduct controlled experiments testing whether the shift-left principle transfers to annotation contexts.
Position: Code Benchmarks Should Prioritize Rigor, Reliability, and Reproducibility
Jialun Cao ⋅ Yuk-Kit Chan ⋅ Zixuan Ling ⋅ Wenxuan Wang ⋅ Shuqing Li ⋅ Mingwei Liu ⋅ Ruixi Qiao ⋅ Yuting Han ⋅ Chaozheng Wang ⋅ Boxi Yu ⋅ Pinjia He ⋅ Shuai Wang ⋅ Zibin Zheng ⋅ Michael Lyu ⋅ Shing-Chi Cheung
Code-related benchmarks play a critical role in evaluating large language models (LLMs), yet their quality fundamentally shapes how the community interprets model capabilities. In the past few years, awareness of benchmark quality has grown. Yet, after a decade-scale (2014–2025) survey over 572 code benchmarks, we observed a lag between growing awareness and actual practice. For example, in 2025 alone, the number of benchmarks that ignore code coverage when providing test cases nearly matches the total count accumulated across the previous ten years. In response, we take a clear position: Code benchmarks must prioritize rigor in benchmark construction, reliability in evaluation, and reproducibility in release. To operationalize this position, we introduce a code benchmark guideline HOW2BENCH with 55 checklists. Finally, our further human study also exposed that the current issues not only stem from the significant effort required, but also from a lack of awareness regarding their importance.
Position: AI Evaluations Should be Grounded on a Theory of Capability
Nathan Jo ⋅ Ashia Wilson
Evaluations of generative models are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model’s underlying performance? Although benchmark results are often presented as direct measurements of capability, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what it means to be capable at a task. We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability. While this perspective is standard in fields like psychometrics, it remains underdeveloped in AI evaluation, where core assumptions are often left *implicit*. As a proof-of-concept, we empirically show that reported performance can depend strongly on the evaluator’s modeling assumptions, underscoring the need for transparent, theory-driven evaluation practices. We conclude by offering practical guidelines for rigorously designing evaluations built on explicit theories of capability.
Position: There are futures that benchmark-driven AI cannot see
Sobhan Lotfi ⋅ Ava Iranmanesh ⋅ Lachin Naghashyar ⋅ Ali Shirali ⋅ Fateme Haredasht ⋅ Sanmi Koyejo ⋅ Phil Torr ⋅ Yong Suk Lee ⋅ Fazl Barez ⋅ Joel Lehman ⋅ Peter Norvig ⋅ Arvind Narayanan
Breakthroughs often come from ideas we could not have predicted in advance. In biology, this is called exaptation: traits evolved for one function become decisive for another. Scientific progress works similarly, but only if ideas survive periods when they appear uncompetitive by current metrics. This position paper argues that AI's benchmark-centered selection environment, while successful at bypassing complex debates about the nature of intelligence, taxes exaptation. When one selection rule dominates, ideas that do not fit it have nowhere to persist. The cost grows acute as the field shifts from asking *can machines exhibit intelligent behavior?* to asking *can machines exhibit intelligent behavior such that they are aligned, interpretable, and safe?* These are philosophically distinct questions that may require discoveries that we cannot specify. We propose mechanisms to restore exaptive capacity without abandoning benchmarking: plural evaluation regimes, protected venues for non-comparable work, long-horizon funding, and training norms that encourage researchers to question selection rules, not only optimize within them.
Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance
Shiqiang Wang ⋅ Herbert Woisetschlaeger ⋅ Hans-Arno Jacobsen ⋅ Mingyue Ji
Data is fundamental to large language models (LLMs). However, understanding of what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, *we advocate for the need of developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow*. We refer to such sequences as *data probes*. By observing LLM behavior on data probes, researchers can systematically conduct studies on how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be viewed using theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.
Position: AI Evaluation Should Work With Humans
Jan Kulveit ⋅ Gavin Leech ⋅ Tomáš Gavenčiak ⋅ Raymond Douglas
We argue that the dominant paradigm of AI evaluation, which focuses on autonomous superhuman performance and so carries an implicit goal of replacing humans, is guiding AI development in the wrong direction. Instead, the AI community should pivot to evaluating the performance of human–AI teams. We argue that this collaborative shift in evaluation will foster AI systems that act as true complements to human capabilities and therefore lead to far better societal outcomes than the current process.
Position: Agent Evaluation Should Be Agentified for Openness, Standardization, and Reproducibility
Xiaoyuan Liu ⋅ Tianneng Shi ⋅ Wenbo Guo ⋅ Dawn Song
Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. This position paper argues that the root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by assessor agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. This design separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and deployment; we provide recommended practices that allow both agent developers and benchmark designers to adopt AAA with minimal additional effort; and we show how this approach turns agent evaluation from ad-hoc integration work into a reusable, portable, and production-aligned process. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.
Position: Adopting AI in Practice Does Not Guarantee the Productivity Boost
Won Ik Cho ⋅ Seong-hun Kim ⋅ Geunhye Kim
This position paper argues that **adopting AI in organizational practice does not guarantee productivity gains, because human and environmental factors critically moderate the relationship between AI deployment and realized productivity improvements**. Following the advent of high-performance generative models, AI use has been rapidly encouraged in some sectors while being restricted in others. Most practitioners assume that AI brings productivity boosts owing to enhanced technical capabilities, but regardless of apparent performance advances in AI technology, human and environmental factors of the organization may substantially attenuate, or even negate, the effective productivity benefits. We identify five key moderating factors: human resource composition, baseline capability of individuals, learning curve of practitioners, incentives for fair use, and flexibility of objectives. Drawing on the partial equilibrium model of Gries and Naudé (2022), we argue that existing economic frameworks may inadvertently overlook these factors. We revise the existing framework to redefine effective organizational determinants and shed light on practical implications including industry and education, responding to alternative views and calling for action from stakeholders.
Position: Mechanisms for Aggregated Individual Reporting Should be Established for Post-Deployment Evaluation
Jessica Dai ⋅ Inioluwa Raji ⋅ Benjamin Recht ⋅ Irene Y. Chen
The need for developing model evaluations beyond static benchmarking, especially in the post-deployment phase, is now well-understood. At the same time, concerns about the concentration of power in deployed AI systems have sparked a keen interest in "democratic" or "public" AI. In this work, we bring these two ideas together by proposing mechanisms for aggregated individual reporting (AIR), a framework for post-deployment evaluation that relies on individual reports from the public. An AIR mechanism allows those who interact with a specific, deployed (AI) system to report when they feel that they may have experienced something problematic; these reports are then aggregated over time, with the goal of evaluating the relevant system in a fine-grained manner. **This position paper argues that individual experiences should be understood as an integral part of post-deployment evaluation, and that the scope of our proposed aggregated individual reporting mechanism is a practical path to that end.** On the one hand, individual reporting can identify substantively novel insights about safety and performance; on the other, aggregation can be uniquely useful for informing action. From a normative perspective, the post-deployment phase completes a missing piece in the conversation about "democratic" AI. As a pathway to implementation, we provide a workflow of concrete design decisions and pointers to areas for future research.
Position: State-of-the-Art Claims Require State-of-the-Art Evidence
YongKyung Oh
State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark evaluations, where models are ranked by aggregate scores across tasks. Public leaderboards are the most visible instance, but the same structure appears in paper tables throughout the literature. However, such minimal evidence often cannot support these strong claims. We identify a widespread claim-evidence gap in AI benchmarking. Claiming SOTA implies robust superiority. It suggests that a model significantly outperforms alternatives across most tasks. However, a marginal improvement in the mean score merely indicates a top average rank rather than true superiority. Analyzing ten cross-domain benchmarks from public leaderboards, we found that in more than half of top-model comparisons, at least one commonly assumed property of superiority does not hold. These properties include meaningful effect size, consistency across tasks, or robustness to dataset removal. Instead, aggregate gains are frequently driven by outlier datasets. This fragility persists even in benchmarks with many tasks. We argue that claim language should reflect the strength of the underlying evidence. This requires no additional experiments, only honest reporting of what results actually show.
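One of the properties named above, robustness to dataset removal, can be illustrated with a small leave-one-dataset-out check. This is only a sketch under stated assumptions: the scores are invented for illustration, the function names are hypothetical, and the procedure is not claimed to be the one the paper uses.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)


def wins_on_average(scores_a, scores_b):
    """True if model A has the higher mean score across datasets."""
    return mean(scores_a) > mean(scores_b)


def robust_to_dataset_removal(scores_a, scores_b):
    """True if A still beats B on the mean after removing any single dataset."""
    for skip in range(len(scores_a)):
        kept_a = [s for i, s in enumerate(scores_a) if i != skip]
        kept_b = [s for i, s in enumerate(scores_b) if i != skip]
        if not wins_on_average(kept_a, kept_b):
            return False
    return True


# Hypothetical per-dataset accuracies (illustrative only, not real results).
# Model A is slightly worse on three datasets and far better on the fourth.
model_a = [0.70, 0.71, 0.69, 0.95]
model_b = [0.72, 0.73, 0.71, 0.70]

aggregate_win = wins_on_average(model_a, model_b)
robust_win = robust_to_dataset_removal(model_a, model_b)
```

In this toy case model A leads on the aggregate mean, yet dropping the fourth dataset reverses the ranking, which is the outlier-driven pattern the abstract describes.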
Position: Improved Documentation is Necessary for Benchmarking AI Systems in Geometry
Anna Genevaux ⋅ Simon Frieder
This position paper argues that documentation is infrastructure for reproducible geometry reasoning: a benchmark for formal geometry problems to test AI systems is not usable in research unless its documented vocabulary is matched by executable, versioned behavior and minimal runnable examples. We use JGEX (as implemented by Newclid) as a case study of how documentation--implementation gaps and missing examples can silently constrain expressivity, fragment tool interoperability, and bias benchmark construction. To make our point, we introduce "A JGEX Dataset", a curated collection of $78$ Euclidean geometry problems with (i) original natural-language statements and sources, (ii) a JGEX-oriented rewrite that makes formalization steps explicit, (iii) executable JGEX code validated under a pinned solver version, and (iv) rich metadata. To make the target language auditable, we also provide a predicate-level support matrix for the $33$ documented predicates, generated from minimal test instances, and categorize predicates as supported, unsupported, or unstable due to missing accessible examples. Finally, we release validation scripts and a concise tutorial with worked walk-throughs. Our broader claim is that benchmark authors, tool maintainers, and reviewers should treat language documentation and conformance evidence as first-class artifacts—on par with datasets and evaluation code—if cross-tool, cross-version reproducibility is the goal.
Position: Stop evaluating AI with human tests, develop principled, AI-specific tests instead
Tom Sühr ⋅ Florian Dorner ⋅ Olawale Salaudeen ⋅ Augustin Kelava ⋅ Samira Samadi
Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as "intelligence", despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific tests by laying out, end-to-end, how valid measurement instruments are constructed and validated and where the ontological error enters when a human-calibrated instrument is applied to LLMs.
Position: ICML Should Treat Hosted LLM APIs as Versioned Dependencies and Require Drift-Audit Artifacts
Utsav Gupta
This position paper argues that ICML should require a minimal drift-audit artifact for papers whose main claims materially rely on hosted LLM APIs. Hosted APIs can change behavior over time, undermining the scientific interpretability of results even when evaluation code and prompts are held fixed. While existing proposals address API contracts and change reporting, there is not yet a widely adopted, venue-aligned standard for attaching a minimal drift-audit artifact to results that rely on hosted endpoints. The paper proposes a lightweight artifact consisting of a small suite of invariant-checking probes (e.g., schema, tool-call, or refusal invariants), machine-readable provenance metadata, and a rerun script that can detect and characterize post-publication behavioral drift at bounded cost. It further argues that provider-side behavioral versioning and machine-readable changelogs are enabling infrastructure that would make drift-aware reporting more reliable and less burdensome. The paper concludes with concrete actions for conferences, providers, and tool builders, and with falsifiable predictions about improved replication stability and reduced time-to-diagnosis when results stop reproducing.
Position: Creating High-Fidelity Synthetic Training Data Should Employ Multi-level Optimization
Pengtao Xie ⋅ Li Zhang ⋅ Ruiyi Zhang
The reliance of machine learning (ML) models on large-scale, high-quality labeled training data incurs significant challenges in specialized domains where such data is expensive and difficult to obtain. A promising solution is the automatic creation of synthetic training data. However, current approaches — including data generation, automated annotation, and domain adaptation — often fail to explicitly use downstream model performance to guide the creation and refinement of synthetic training data. This position paper argues that multi-level optimization (MLO) is essential for producing high-fidelity synthetic data by enabling joint optimization of data generation, annotation, adaptation, and selection, all informed by downstream model performance. We advocate for MLO as a unified framework to address three critical challenges: (1) improving data generation by aligning synthetic data with model needs, particularly targeting class-specific deficiencies and worst-case robustness; (2) enhancing automated annotation through sequential verification and the use of large language models for more accurate labeling; and (3) enabling example-specific adaptation and selection to maximize data utility while preventing excessive over-adaptation. By facilitating end-to-end coordination across multiple learning stages, MLO offers a potential paradigm shift in synthetic data creation for data-scarce domains.
Position: Modular Memory is the Key to Continual Learning Agents
Vaggelis Dorovatas ⋅ Malte Schwerin ⋅ Andrew Bagdanov ⋅ Lucas Caccia ⋅ Antonio Carta ⋅ Laurent Charlin ⋅ Barbara Hammer ⋅ Tyler Hayes ⋅ Timm Hess ⋅ Christopher Kanan ⋅ Dhireesha Kudithipudi ⋅ Xialei Liu ⋅ Vincenzo Lomonaco ⋅ Jorge Mendez-Mendez ⋅ Darshan Patil ⋅ Ameya Pandurang Prabhu ⋅ Elisa Ricci ⋅ Tinne Tuytelaars ⋅ Gido M van de Ven ⋅ Liyuan Wang ⋅ Joost van de Weijer ⋅ Jonghyun Choi ⋅ Martin Mundt ⋅ Rahaf Aljundi
Foundation models have transformed machine learning through large-scale pretraining, massive parameterization, and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning, i.e., updating a single model’s parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. **Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale.** We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, thereby mitigating catastrophic forgetting and charting a practical roadmap toward continually learning agents.
Position: Neural Approximation Is Rarely Justified for Hard Combinatorial Problems
Pritish Chakraborty ⋅ Indradyumna Roy ⋅ Soumen Chakrabarti ⋅ Abir De
In recent years, there has been a surge in the application of neural approaches to NP-hard combinatorial problems such as subgraph isomorphism, maximum clique and the travelling salesman problem in graphs. These approaches are often evaluated as complete replacements of established combinatorial solver tools, with emphasis on solution quality and runtime. In this position paper, we argue that such wholesale replacements for touted faster inference or better solution quality should not be considered the primary motivation for neural surrogates, and a systematic evaluation of when neural methods are appropriate is required. Given our observations, we contend that in the absence of system-level requirements dictated by the task at hand, such as vector indexing and retrieval, or without the need for end-to-end differentiability, neural surrogates rarely offer compelling advantages over the standard combinatorial solver. In this vein, we develop a comprehensive report of where current neural methods fall short, and subsequently devise a diagnostic checklist for when neural methods are truly applicable.
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
Zijie Zhou
This position paper argues that LLM inference serving has outgrown generic heuristics and now demands mathematical optimization and algorithmic foundations. Despite rapid advances in serving systems such as vLLM and SGLang, their algorithmic cores remain largely unchanged from classical distributed computing: request routing uses join-shortest-queue or round-robin, scheduling defaults to FIFO, and KV cache eviction follows LRU. These general-purpose policies ignore the distinctive structure of LLM inference—dynamically growing KV cache memory, prefill-decode phase asymmetry, unknown output lengths, and continuous batching constraints. We contend that the field must develop mathematical models capturing these characteristics, enabling the design of algorithms with provable performance guarantees across diverse workloads, rather than heuristics that may succeed in some scenarios but fail unpredictably in others. Emerging work at the intersection of operations research and ML systems demonstrates that principled methods can match or exceed heuristic performance while providing theoretical guarantees. We call on the community to recognize algorithmic design for LLM serving as a research frontier.
Position: Federated Learning is a Lens towards a Democratized Future for the Scaling Law Era
Harry Jiang ⋅ Baris Askin ⋅ Gauri Joshi ⋅ Carlee Joe-Wong
Machine learning (ML) systems have grown significantly in size and popularity over recent years. However, the data and computation power supply chains which have helped fuel this growth have not been built without controversy. In particular, some of the data used to train these models may have been used without permission, while the growing appetite for compute power in model training increasingly incentivizes consolidation of access to larger players. As some stakeholders, such as data owners and everyday consumers of the Internet, have felt left behind by the emerging ML ecosystem, we seek to use the federated learning paradigm as a model and motivation to develop a more democratized future for the ML community: one that is more decentralized, cooperative, and accountable. This position paper argues that the original proposition of federated learning as a framework enabling cooperation, privacy, and decentralization is still relevant today, even after the emergence of large foundation model- and scaling law-driven ML research, and that FL can inspire alternative ML ecosystems which alleviate and avoid the current frictions of large ML systems.
Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered
Sijia Liu ⋅ Yicheng Lang ⋅ Soumyadeep Pal ⋅ Changsheng Wang ⋅ Yancheng Huang ⋅ Chongyu Fan ⋅ James Diffenderfer ⋅ Bhavya Kailkhura ⋅ Yihua Zhang
Zeroth-order (ZO) optimization, learning from finite differences of function evaluations without backpropagation, has recently regained attention in deep learning due to its memory efficiency and applicability to gray- or black-box pipelines. Yet, ZO methods are often dismissed as fundamentally unscalable because of estimator variance and unfavorable query complexity. We argue that this conclusion might be misguided: ZO optimization is underexplored, not underpowered. We show that many perceived limitations stem from myopic development practices, most notably full-space, element-wise, estimator-centric designs. We articulate six positions spanning the algorithmic, systems, and evaluation stack. First, we revisit the feasibility boundaries of estimator-centric ZO methods through variance control, variance–query tradeoffs, and directional-derivative lenses. Then, we identify three underexplored opportunities: (i) subspace and spectral views of ZO that enable interpretable variance reduction with graceful query scaling, (ii) the forward-only nature of ZO as a systems advantage for communication-efficient, pipeline-friendly, and resource-constrained training, and (iii) the need to de-obfuscate ZO evaluations from task complexity. We strongly advocate rethinking ZO optimization around its unique strengths and acting accordingly, opening a viable path toward large-scale, system-aware, and resource-efficient learning with ZO optimization.
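To make "learning from finite differences of function evaluations without backpropagation" concrete, here is a minimal sketch of a two-point random-direction gradient estimator. The Gaussian-direction, central-difference form is a common textbook choice assumed here for illustration, not necessarily the estimator the authors analyze; `zo_gradient`, `quadratic`, and all parameter values are hypothetical.

```python
import random


def zo_gradient(f, x, num_queries=1000, mu=1e-3, seed=0):
    """Estimate the gradient of f at x using only function evaluations.

    Each query draws a random Gaussian direction u, measures the slope of f
    along u with a central finite difference, and accumulates slope * u.
    Averaging over many directions reduces the estimator's variance.
    """
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_queries):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_queries
    return grad


def quadratic(x):
    """Toy objective with known gradient 2 * x."""
    return sum(xi * xi for xi in x)


point = [1.0, -2.0, 0.5]
estimate = zo_gradient(quadratic, point, num_queries=20000)
true_gradient = [2.0 * xi for xi in point]
```

Each direction costs two function evaluations and no backward pass, and the number of directions needed for a given accuracy grows with the dimension, which is the variance and query-complexity concern the abstract refers to.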
Position: Predictive Uncertainty Is Not Enough -- Joint Distribution for Full Uncertainty Representation
Adria Aldoma ⋅ Unai Gurbindo ⋅ Axel Brando
When AI is deployed in safety-critical domains, erroneous and overconfident predictions can have severe consequences. Therefore, comprehensive uncertainty quantification (UQ) should be a foundational requirement for responsible decision-making. Current UQ methods based on epistemic and aleatoric decomposition have been found insufficient for fully understanding the problem. We add that this limitation is further compounded by the systematic isolation of these terms without considering uncertainty about the domain. Our position is that any meaningful analysis must account for three sources of uncertainty (domain, epistemic, and aleatoric), and that only the joint distribution $p(x,y|\mathcal{D})$ provides a coherent representation of uncertainty. We begin by mirroring prior findings that show the application of information-theoretic UQ methods to ID and OOD settings is suboptimal, primarily due to the inherent difficulty of disentangling epistemic and aleatoric components. Based on this, we argue that modeling the unconditional distribution $p(x|\mathcal{D})$ is required to account for input validity, resulting in a third class of uncertainty: *domain* uncertainty. Finally, by considering both the domain and the conditional distribution $p(y|x,\mathcal{D})$, we argue that their product $p(x,y|\mathcal{D})$ fully encapsulates all sources of uncertainty.
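The product mentioned above is the standard chain rule of probability, written out here with the abstract's own symbols. Reading the first term as the domain-related part and the second as the predictive part is an interpretation of the abstract, not a formula quoted from the paper.

$$
p(x, y \mid \mathcal{D}) = p(x \mid \mathcal{D})\, p(y \mid x, \mathcal{D})
\quad\Longrightarrow\quad
-\log p(x, y \mid \mathcal{D}) = -\log p(x \mid \mathcal{D}) - \log p(y \mid x, \mathcal{D})
$$

In log form, an input that is implausible under $p(x|\mathcal{D})$ contributes a large first term regardless of how confident the conditional $p(y|x,\mathcal{D})$ is.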
Position: The Time for Sampling Is Now! Charting a New Course for Bayesian Deep Learning
Emanuel Sommer ⋅ David Rügamer
The practical adoption of sampling-based inference (SAI) in Bayesian neural networks (BNNs) remains limited, partly due to persistent misconceptions about the feasibility and efficiency of sampling. This position paper argues that SAI has achieved computational parity with optimization-based methods and is on the verge of superseding such methods for effective and efficient inference in BNNs. This development should be in the interest of the whole community, promoting BNNs as a principled paradigm with its long-standing yet unfulfilled promise of providing principled uncertainty quantification for neural networks. SAI can even do more: it yields superior prediction performance through model averaging, serves as the foundation for a plethora of possible downstream tasks, and provides crucial insights into the landscape of BNNs. In order to make such a change happen and unfold the potential of sampling, overcoming current misconceptions is a necessary first step. The next step is to realign research efforts toward addressing remaining challenges in SAI. In particular, the community must focus on two core problems: sufficient exploration of the posterior landscape and high-fidelity distillation of posterior samples for efficient downstream inference. By addressing conceptual and practical obstacles, we can unlock the full potential of SAI and establish it as a central tool in Bayesian deep learning.
Position: Epistemic uncertainty estimation methods are fundamentally incomplete
Sebastian Jimenez ⋅ Mira Juergens ⋅ Willem Waegeman
Identifying and disentangling sources of predictive uncertainty is essential for trustworthy supervised learning. We argue that widely used second-order decomposition-based approaches to uncertainty quantification are fundamentally incomplete. First, we show that unaccounted bias contaminates uncertainty estimates by overestimating aleatoric (data-related) uncertainty and underestimating the epistemic (model-related) counterpart, leading to systematically incorrect uncertainty quantification. Second, we demonstrate that existing methods capture only partial contributions to the variance-driven part of epistemic uncertainty; different approaches account for different variance sources, yielding estimates that are incomplete and difficult to interpret. Together, these results highlight that current epistemic uncertainty estimates can only be used in safety-critical and high-stakes decision-making when limitations are fully understood by end users and acknowledged by AI developers.
Position: RL Should Be Used to Adjust Foundation Models, NOT Abused
Ting Huang ⋅ Zeyu Zhang ⋅ Hao Tang
This position paper argues that reinforcement learning (RL) should be used to *adjust* foundation models after pretraining and cold-start supervision, not *abused* as a default recipe for capability creation or early-stage training. We view RL as a high-cost, high-leverage post-training operator that reallocates probability mass toward behaviors a model can already express, but rarely creates new reasoning capacities from scratch in a compute-efficient, stable, and controllable way. This distinction matters now because “RL-zero” narratives risk normalizing expensive and brittle RL-first pipelines as the primary path to reasoning, even though practice increasingly shows that cold-start supervision is a prerequisite for reliable RL and that RL is most effective as targeted refinement. Across modalities and domains, we emphasize a recurring regularity: supervision establishes usable reasoning structure, while RL mainly sharpens correctness, consistency, and constraint satisfaction, especially under hard constraints or distribution shift. We further argue for reward minimalism: simple, verifiable rewards often suffice and reduce proxy-driven failure modes relative to over-engineered reward models. Finally, we discuss how self-supervised RL can support self-evolution when grounded in verifiable signals and structured interaction environments. Together, these arguments motivate treating RL as a disciplined adjustment stage with explicit entry criteria and compute-accountable evaluation.
Position: Deployed Reinforcement Learning should be Continual
Parnian Behdin ⋅ Kevin Roice ⋅ Golnaz Mesbahi
Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world, until performance degrades and retraining becomes necessary. In this position paper, we argue that deploying an agent that is incapable of optimality, but receives an evaluative reward signal, is inherently a continual RL problem. We identify four sources of non-stationarity after deployment that necessitate never-ending learning, and highlight why the best deployed agents never stop adapting. We analyze successful examples of continual RL in the real world, and present the community with the advantages and measures to move away from the current train-then-fix paradigm.
Position: Collaborative Agentic AI Needs Interoperability Across Ecosystems
Rishi Sharma ⋅ Martijn de Vos ⋅ Pradyumna Chari ⋅ Ramesh Raskar ⋅ Anne-Marie Kermarrec
Collaborative agentic AI is projected to transform entire industries by enabling AI-powered agents to autonomously perceive, plan, and act within digital environments. Yet, current solutions in this field are all built in isolation, and we are rapidly heading toward a landscape of fragmented, incompatible ecosystems. In this position paper, we argue that interoperability, achieved by the adoption of minimal standards, is essential to ensure open, secure, web-scale, and widely-adopted agentic ecosystems. To this end, we devise a minimal architectural foundation for collaborative agentic AI, named Web of Agents, which is composed of four components: agent-to-agent messaging, interaction interoperability, state management, and agent discovery. Web of Agents adopts existing standards and reuses existing infrastructure where possible. With Web of Agents, we take a first but critical step toward interoperable agentic systems and offer a pragmatic path forward before ecosystem fragmentation becomes the norm.
Position: Agentic AI systems should be making Bayes-consistent decisions
Theodore Papamarkou ⋅ Pierre Alquier ⋅ Matthias Bauer ⋅ Wray Buntine ⋅ Andrew Davison ⋅ Gintare Karolina Dziugaite ⋅ Maurizio Filippone ⋅ Andrew Y. K. Foong ⋅ Vincent Fortuin ⋅ Dimitris Fouskakis ⋅ Jes Frellsen ⋅ Eyke Hüllermeier ⋅ Theofanis Karaletsos ⋅ Mohammad Emtiyaz Khan ⋅ Nikita Kotelevskii ⋅ Salem Lahlou ⋅ Yingzhen Li ⋅ Fang Liu ⋅ Clare Lyle ⋅ Thomas Moellenhoff ⋅ Konstantina Palla ⋅ Maxim Panov ⋅ Yusuf Sale ⋅ Kajetan Schweighofer ⋅ Artem Shelmanov ⋅ Siddharth Swaroop ⋅ Martin Trapp ⋅ Willem Waegeman ⋅ Andrew Wilson ⋅ Alexey Zaytsev
LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesian principles should shine. Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions. Making LLMs themselves explicitly Bayesian belief-updating engines remains computationally intensive and conceptually nontrivial as a general modeling target. In contrast, this paper argues that coherent decision-making requires Bayesian principles at the level of the agentic system, not necessarily the LLM agent parameters. This paper articulates practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration, and provides concrete examples and design patterns to illustrate how calibrated beliefs and utility-aware policies can improve agentic AI orchestration.
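The loop described above (maintain a belief, update it from observed interactions, choose an action) can be sketched for the "which tool to call" example. Everything here is an illustrative assumption: a Beta-Bernoulli belief over each tool's success rate, a fixed reward and per-tool cost, and the hypothetical names `BetaBelief` and `choose_tool`. It is not the design pattern proposed in the paper.

```python
class BetaBelief:
    """Beta posterior over a tool's unknown success probability."""

    def __init__(self, successes=1.0, failures=1.0):
        # Beta(1, 1) is a uniform prior over the success probability.
        self.successes = successes
        self.failures = failures

    def update(self, succeeded):
        """Bayesian update after observing one tool call outcome."""
        if succeeded:
            self.successes += 1.0
        else:
            self.failures += 1.0

    def mean(self):
        """Posterior mean of the success probability."""
        return self.successes / (self.successes + self.failures)


def choose_tool(beliefs, costs, reward=1.0):
    """Pick the tool with the highest posterior expected utility."""

    def expected_utility(name):
        return beliefs[name].mean() * reward - costs[name]

    return max(beliefs, key=expected_utility)


# Hypothetical tools and costs, for illustration only.
beliefs = {"search": BetaBelief(), "calculator": BetaBelief()}
costs = {"search": 0.10, "calculator": 0.05}

first_choice = choose_tool(beliefs, costs)

# Observe three successful search calls and update that belief.
for _ in range(3):
    beliefs["search"].update(True)

second_choice = choose_tool(beliefs, costs)
```

With equal priors the cheaper tool is preferred; after three observed successes the posterior mean for `search` rises from 0.5 to 0.8 and the choice flips. The belief lives in the control layer, not in the LLM's parameters, which is the distinction the abstract draws.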
Position: Multi-Agent Systems Should Prioritize Concurrency Control
Xin Yang ⋅ Letian Li ⋅ Zimo Ji ⋅ Terry Zhang ⋅ Wenyuan Jiang
LLM-based multi-agent systems (MAS) promise scalable collaboration, yet adding agents often *reduces* reliability. This position paper argues that many MAS failures are fundamentally **concurrency control problems**: agents concurrently read and write shared state, and long LLM inference windows amplify the risk of stale reads, lost updates, and inconsistent outcomes. Failure modes commonly attributed to "coordination" or "communication" breakdowns can be mapped directly onto classical concurrency anomalies. Rather than treating these as emergent behaviors to be solved by better prompting or more capable models, we contend that MAS frameworks should incorporate explicit concurrency control mechanisms: conflict detection, isolation guarantees, and structured access to shared resources. Concurrency control should be a first-class design concern, not an afterthought.
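The lost-update anomaly and one form of conflict detection can be shown with a deterministic, single-threaded simulation. This is a minimal sketch assuming a version-checked (optimistic) write; `VersionedStore` and the agent labels are hypothetical and do not come from the paper or any particular MAS framework.

```python
class VersionedStore:
    """Shared state with a version counter for conflict detection."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        """Return a snapshot of the value and the version it was read at."""
        return list(self.value), self.version

    def write_unchecked(self, value):
        """Blind write: overwrites whatever is there."""
        self.value = value
        self.version += 1

    def compare_and_set(self, value, expected_version):
        """Write only if nobody else has written since our read."""
        if self.version != expected_version:
            return False  # stale read detected, caller must retry
        self.value = value
        self.version += 1
        return True


def naive_run():
    """Both agents read, 'think' for a long time, then write blindly."""
    store = VersionedStore(["task"])
    notes_a, _ = store.read()
    notes_b, _ = store.read()  # B reads before A has written
    store.write_unchecked(notes_a + ["A result"])
    store.write_unchecked(notes_b + ["B result"])  # silently drops A's result
    return store.value


def checked_run():
    """Same interleaving, but writes are version-checked and retried."""
    store = VersionedStore(["task"])
    notes_a, version_a = store.read()
    notes_b, version_b = store.read()
    store.compare_and_set(notes_a + ["A result"], version_a)
    if not store.compare_and_set(notes_b + ["B result"], version_b):
        # B's snapshot is stale: re-read the current state and retry.
        notes_b, version_b = store.read()
        store.compare_and_set(notes_b + ["B result"], version_b)
    return store.value


lost = naive_run()
kept = checked_run()
```

The gap between an agent's read and its write stands in for a long LLM inference window: the longer it is, the more likely another agent writes in between.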
Position: Agentic AI Is a Foreseeable Pathway to AGI
Junwei Liao ⋅ Shuai Li ⋅ Muning Wen ⋅ Jun Wang ⋅ Weinan Zhang
Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is enough to achieve universal super-intelligence. Instead, we identify Agentic AI as the necessary evolution for handling complex, real-world task distributions to achieve AGI in the human world. Through concrete theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, evolving from simple routing mechanisms to general Directed Acyclic Graphs (DAGs) of Agents. We demonstrate that Agentic AI offers superior generalization and efficiency. Finally, we reinterpret the instability of current multi-agent frameworks and call for more future actions on Agentic AI.
Position: Multi-Agent Explainability Needs Contracts Before Methods
Hak Kim ⋅ Benjamin Huh ⋅ Soroush Vosoughi
Multi-Agent Systems (MAS) are deployed at unprecedented scale—from warehouse robot fleets to autonomous vehicle networks to collaborative LLM agents—yet methods for explaining their behavior remain fragmented and underspecified. We analyze 2,381 MAS-related papers from top machine learning venues (2021–2025) and find systematic gaps: 65% omit stakeholder specifications, 76% lack quantitative evaluation bounds, and 99% ignore auditability requirements. These gaps render current MAS XAI research non-comparable, non-reproducible, and disconnected from deployment requirements. We argue that MAS XAI research requires explicit specification of two contracts before developing methods. The **Research Contract** defines six elements: explanandum, stakeholder, intervention unit, evaluation bounds, adversarial context, auditability. The **Agent Contract** defines expected behaviors through obligations, permissions, prohibitions, violation criteria, and accountability chains—providing the baseline against which deviations are explained. These contracts are method-agnostic and architecture-agnostic, applicable to LLM-based, learning-based, and hybrid MAS. Through case studies spanning warehouse robotics, autonomous vehicles, and LLM agent systems, we demonstrate that contracts transform vague post-hoc descriptions into verifiable, actionable, and comparable explanations. We call on researchers to adopt contracts in their work, conferences to encourage specification in submissions, and platforms to integrate contract templates into MAS benchmarks.
Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary
Hongru WANG ⋅ Cheng Qian ⋅ Manling Li ⋅ Jiahao Qiu ⋅ Boyang XUE ⋅ Mengdi Wang ⋅ Heng Ji ⋅ Amos Storkey ⋅ Kam-Fai Wong
As large language models evolve into tool-augmented agents, a central question remains unresolved: when is external tool use actually justified? Existing agent frameworks typically treat tools as ordinary actions and optimize for task success or reward, offering little principled distinction between epistemically necessary interaction and unnecessary delegation. This position paper argues that *agents should invoke external tools only when epistemically necessary*. Here, epistemic necessity means that a task cannot be completed reliably via the agent’s internal reasoning over its current context, without any external interaction. We introduce the ***Theory of Agent (ToA)***, a framework that treats agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common agent failure modes (e.g., overthinking and overacting) arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. We further discuss implications for training, evaluation, and agent design, highlighting that unnecessary delegation not only causes inefficiency but can impede the development of internal reasoning capability. Our position provides a normative criterion for tool use that complements existing decision-theoretic models and is essential for building agents that are not only correct, but increasingly intelligent.
Position: Digital Agents Require Unified Agent-Native Environments
Yiran Wu ⋅ Jiale Liu ⋅ Jieyu Zhang ⋅ Yaolun Zhang ⋅ Shilong Liu ⋅ Chi Wang ⋅ Mengdi Wang ⋅ Huazheng Wang ⋅ Qingyun Wu
Large language models (LLMs) are increasingly deployed as digital agents that perform multi-step digital work on a computer, but the environments in which they operate remain fragmented and task-specific. Our position is that digital agents need Agent-Native Computers: interfaces that expose system capabilities through compositional observation and action spaces aligned with LLM strengths. To ground this position, we showcase AgentVM, an environment running on top of a modern operating system, which integrates Graphical User Interface (GUI)-based and text-based interactions over a shared system state, and factors interaction into modular environment views. Through quantitative and qualitative analysis, we show that a unified agent-native computer is essential for building general-purpose digital agents.
Position: Agentic Systems Should be General
Elron Bandel ⋅ Asaf Yehudai ⋅ Alexandre Lacoste ⋅ Avijit Ghosh ⋅ Graham Neubig ⋅ Margaret Mitchell ⋅ Michal Shmueli-Scheuer ⋅ Leshem Choshen
We call for the development of agentic systems that thrive in new environments. Agentic systems, comprising foundation models, tools, and an execution strategy, have demonstrated strong capabilities, yet their development is often constrained by narrow benchmarks and their operation is siloed to limited environments. This paper advocates for developing general, adaptive agents that excel across diverse environments, from terminals and web interfaces to biological and embodied settings. We examine current limitations, explain the potential of increased generality, and identify immediate development priorities. Finally, we argue that protocols and evaluation must prioritize adaptiveness to foster a shared ecosystem for general-purpose agentic systems.
Position: Solipsistic superintelligence is unlikely to be cooperative
Rakshit Trivedi ⋅ Natasha Jaques ⋅ Logan Cross ⋅ Alexander Vezhnevets ⋅ Joel Z Leibo
AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents under stationary-environment assumptions, treating the world as an exogenous source of feedback. This position paper argues that a solipsistic superintelligence---an extremely capable solver of stationary problems---is unlikely to be cooperative. Deployment induces endogenous nonstationarity: other agents adapt, producing best-response dynamics that reshape the environment the AI was trained to navigate. The result is a train--test--deploy gap where historical distributions diverge from deployment realities; the more aggressively a solipsistic superintelligence exploits historical regularities, the faster it renders them obsolete. Cooperation is therefore not an added capability but an equilibrium property that solipsistic superintelligence cannot guarantee. We call for a multi-agent-first research paradigm treating strategic interdependence as a core design principle, alongside dynamic evaluation: testbeds where distributions are generated by adaptive counterparties, and metrics prioritizing equilibrium stability over single-score task success.
Position: Artificial Intelligence Needs Meta Intelligence - the Case for Metacognitive AI
Sergei Chuprov ⋅ Richard Lange ⋅ Leon Reznik ⋅ Paulo Shakarian ⋅ Raman Zatsarenko ⋅ Dmitrii Korobeinikov
This position paper argues for metacognition as a general design principle for creating more accurate, secure, and efficient AI. The metacognitive solution involves systems monitoring their own states and judiciously allocating resources depending on each problem instance's difficulty or cost of mistakes. Drawing inspiration both from past work on resource-rational AI and from well-documented metacognitive strategies in psychology and cognitive science, we identify specific challenges in embedding these strategies into AI design and highlight open theoretical and implementation problems. We showcase these principles through a tangible example of improved learning efficiency, effectiveness, and security in a Federated Learning (FL) case study. We show how these principles can be translated into practice with a novel software framework developed specifically to allow the community to design, deploy, and experiment with metacognition-enabled AI applications.
Position: From Crowdsourcing to Crowd-LLM-Sourcing and LLM-Sourcing
Jiyi Li
Crowdsourcing has been widely adopted for large-scale data collection and problem solving, yet its outcomes are often noisy and inconsistent, making quality control and aggregation central concerns. Meanwhile, Large Language Models (LLMs) have shown strong capabilities in generation, annotation, evaluation, and reasoning. These developments give rise to a new paradigm at the intersection of crowdsourcing and LLMs, which we term Crowd-LLM-Sourcing, encompassing two directions: (1) Crowd-LLM Collaboration, where humans and LLMs jointly participate in workflows, and (2) LLM-Sourcing Inspired by Crowdsourcing, where crowdsourcing principles guide LLM-driven generation, annotation, evaluation, and inference. Many existing studies on LLMs overlook decades of prior work in crowdsourcing, even though the two domains are grounded in closely related principles on some topics. Our central position is that, in scenarios where an LLM can be regarded as an LLM worker, LLM research should draw upon the rich body of crowdsourcing literature. At the same time, LLM workers differ fundamentally from human workers. Identifying how crowdsourcing mechanisms should be adapted opens a new research agenda for collective intelligence with model-based agents.
Position: Peer Review in ML/AI Conferences Should Separate Publication from Presentation and Offer Non-Anonymous Review Tracks
Nihar Shah
In this position paper, we enumerate a number of problems with the current peer-review process based on extensive empirical evidence. We argue for two structural reforms: (1) separating publication from presentation via a four-step process that first evaluates correctness, publishes all sound papers, then uses community-based ratings to select presentations; and (2) offering parallel anonymous and non-anonymous review tracks, where the non-anonymous track releases all review data publicly to increase accountability and generate valuable research datasets. We explain how our proposed policies can mitigate these problems. We urge the community to leverage the learnings from the experiments conducted in peer-review processes and incorporate evidence-based policy design.
Position: Stop Preaching and Start Practising Data Frugality for Responsible Development of AI
Sophia N. Wilson ⋅ Guðrún Guðmundsdóttir ⋅ Andrew Millard ⋅ Raghavendra Selvan ⋅ Sebastian Mair
This position paper argues that the machine learning community must move from preaching to practising data frugality for responsible artificial intelligence (AI) development. For a long time, progress has been equated with ever-larger datasets, driving remarkable advances but now yielding increasingly diminishing performance gains alongside rising energy use and carbon emissions. While awareness of data-frugal approaches has grown, their adoption has remained rhetorical, and data scaling continues to dominate development practice. We argue that this gap between preaching and practice must be closed, as continued data scaling entails substantial and under-accounted environmental impacts. To ground our position, we provide indicative estimates of the energy use and carbon emissions associated with the downstream use of ImageNet-1K. We then present empirical evidence that data frugality is both practical and beneficial, demonstrating that coreset-based subset selection can substantially reduce training energy consumption with little loss in accuracy, while also mitigating dataset bias. Finally, we outline actionable recommendations for moving data frugality from rhetoric to concrete practice for responsible development of AI.
Position: Machine Learning Research Should Be Guided by Explicit, Pluralistic Models of Human Purpose
Utsav Gupta
Machine learning systems increasingly shape attention, work, education, and social life, yet ML research often treats the question "what is this for?" as external, relying on proxies such as accuracy, engagement, or preference satisfaction. This position paper argues that ML research should be guided by explicit, pluralistic models of human purpose, understood as supporting people's capacity to pursue meaningful, self-chosen life projects with agency. The paper proposes three community practices: (i) purpose articulation, a structured "Purpose Statement" that specifies intended beneficiaries, mechanisms, and falsifiable failure modes; (ii) purpose evaluation, which measures impacts on agency and meaning alongside task performance and harm; and (iii) purpose governance, which updates purpose frameworks through transparent, participatory processes to reduce unaccountable value-setting. This framing enables concrete technical research directions, including objective design beyond preference satisfaction, benchmarks for agency and meaning, pluralistic system behavior, and institution-aware alignment. The paper provides stakeholder-differentiated recommendations for researchers, benchmark creators, conference organizers, and funders, and addresses credible objections including value neutrality, feasibility and measurement validity, the claim that harm prevention is sufficient, and risks of ideological capture or paternalism.
Position: Token Taxes Can Mitigate AI's Economic Risks
Lucas Irwin ⋅ Tung-Yu Wu ⋅ Fazl Barez
AI-driven automation threatens to erode government tax bases, lower living standards, and disempower citizens, risks that mirror the 40-year stagnation of wages during the first industrial revolution. While AI safety research has focused primarily on capability risks, comparatively little work has studied how to mitigate the economic risks of AI. This position paper argues that technical governance researchers should prioritize the study of token taxes: usage-based surcharges on model inference applied at the point of sale. We situate token taxes within previous proposals for robot taxes and identify two key advantages: they are enforceable through existing compute governance infrastructure, and they capture value where AI is used rather than where models are hosted. We then present a research roadmap. For enforcement, we outline a staged audit pipeline---black-box token verification, norm-based tax rates, and white-box audits---and identify open technical problems at each stage. For impact, we highlight the need for economic modeling of cost pass-through and deadweight loss. Finally, we discuss three counterarguments: that FLOP taxes may be preferable, that token taxes could stifle innovation, and that AI superpowers can veto such measures.
Position: Stop Automating Peer Review Without Rigorous Evaluation
Joachim Baumann ⋅ Jiaxin Pei ⋅ Sanmi Koyejo ⋅ Dirk Hovy
Large language models offer a tempting solution to the peer review crisis. This position paper argues that **today's AI systems should not be used to produce paper reviews**. We ground this position in an empirical comparison of human- versus AI-generated ICLR 2026 reviews and an evaluation of the effect of automated paper rewriting on different AI reviewers. We identify two critical issues: 1) AI reviewers exhibit a *hivemind effect* of excessive agreement within and across papers that reduces perspective diversity. 2) AI review scores are trivially gameable through *paper laundering*: prompting an LLM to rewrite a paper could significantly increase the scores from AI reviewers, demonstrating that LLM reviewers are easy to game through stylistic changes rather than scientific results. However, non-gameability and review diversity are *necessary but not sufficient* conditions for automation. We argue that **addressing the peer review crisis requires a science of peer review automation**---not general-purpose LLMs deployed without rigorous evaluation.
Position: The AI Imperative: Scaling High-Quality Peer Review in Machine Learning
Qiyao Wei ⋅ Samuel Holt ⋅ Jing Yang ⋅ Markus Wulfmeier ⋅ Mihaela van der Schaar
Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manuscript submissions to premier ML venues such as NeurIPS, ICML, and ICLR is outpacing the finite capacity of qualified reviewers, leading to concerns about review quality, consistency, and reviewer fatigue. This position paper argues that AI-assisted peer review must become an urgent research and infrastructure priority. We advocate for a comprehensive AI-augmented ecosystem, leveraging Large Language Models (LLMs) not as replacements for human judgment, but as sophisticated collaborators for authors, reviewers, and Area Chairs (ACs). We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making. Crucially, we contend that the development of such systems hinges on access to more granular, structured, and ethically-sourced peer review process data. We outline a research agenda, including illustrative experiments, to develop and validate these AI assistants, and discuss significant technical and ethical challenges. We call upon the ML community to proactively build this AI-assisted future, ensuring the continued integrity and scalability of scientific validation, while maintaining high standards of peer review.
Position: If open source is to win, it must go public
Joshua Tan ⋅ Nicholas Vincent ⋅ Katherine Elkins ⋅ Magnus Sahlgren ⋅ Joseph Low ⋅ David Pham ⋅ Sampo Pyysalo ⋅ Jenia Jitsev
Open source projects have made incredible progress in producing widely usable machine learning models and systems, but open source alone will face challenges in fully democratizing access to AI. Unlike previous generations of open source software, open source and open weight AI models require substantial resources to activate and maintain—e.g., data and compute for pre-training, post-training, and deployment—which only a few actors can currently provide. This position paper argues that open source AI must be complemented by public AI: infrastructure and institutions that ensure models are accessible, sustainable, and governed in the public interest. To achieve the full promise of AI models as prosocial public goods, we need to build public infrastructure to power and deliver open source software and models.
Position: The Term “Machine Unlearning” Is Overused in LLMs
Sangyeon Yoon ⋅ Yeachan Jun ⋅ Albert No
Large language models increasingly face demands to "forget" training data, knowledge, or behaviors due to regulatory deletion obligations, copyright/licensing disputes, and safety or product-policy requirements. **This position paper argues that *machine unlearning* is overused as a term in LLM research and should be reserved for dataset-defined deletion: removing the training influence of a precisely specified forget set such that the resulting model is (approximately) indistinguishable from retraining without that data.** We contend that many tasks currently labeled "unlearning" (e.g., refusal for harmful requests, entity/knowledge removal, or targeted suppression) pursue different, often policy-dependent objectives and therefore require different terminology and baselines (e.g., alignment, suppression, editing, obfuscation). We further argue that this confusion is not cosmetic: because papers make different implicit guarantees under the same label, metrics and benchmarks are frequently reused outside their intended scope, rewarding surface-level non-disclosure (e.g., low ROUGE/forget accuracy) even when retraining-equivalence is not tested and derived capabilities remain. We conclude by calling for stricter terminology tied to explicit guarantees and reference models, and for evaluations that match the claimed objective.
Position: Want Better ML Reviews? Stop Asking Nicely and Start Incentivizing with a Credit System
Shaochen (Henry) Zhong
With soaring submission counts, stricter reciprocal review policies, widespread adoption of platforms like OpenReview, and without the offsetting pressure of publication fees, the machine learning (ML) community has one of the largest scholarly presences among all scientific fields. And yet, **almost *everyone* has *many* unpleasant things to share about their review experience.** Worse, there is little public space to seriously discuss — let alone debate — what makes a review system effective or how it might be improved. In this position paper, we expand our discussion on two core problems: *How can we reasonably limit the number of submissions?* and *How can we incentivize good and discourage bad review practices?* We first assess the strengths and shortcomings of existing attempts to address such problems. Specifically, we present four takes on some popular conference mechanisms and propose two alternative designs for improvement. Our general position is that meaningful improvement in ML peer review won't come from polite best-practice suggestions tucked into Calls for Papers or Reviewer Guidelines — it requires **enforceable yet fine-grained procedural safeguards** paired with **a currency-like credit system (what we call *OpenReview Points*)**. ML practitioners can “earn” such points by contributing good review practices, and “spend” across one or multiple major conferences to redeem different kinds of “perks” — such as complimentary registration or the right to request additional review resources.
Position: Irresponsible AI: big tech’s influence on AI research and associated impacts
Alex Hernandez-Garcia ⋅ Alexandra Volokhova ⋅ Ezekiel Williams ⋅ Dounia Shaaban Kabakibo ⋅ Mélisande Teng
The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing presence of big tech in the AI field. This trend has been accompanied by growing ethical concerns and intensified societal and environmental impacts. This position paper argues that irresponsible AI development is strongly driven by big tech's influence and involvement in the field. We develop this argument by laying out the factors through which this influence leads to irresponsible AI. First, we examine the growing and disproportionate influence of big tech in AI research and argue that its drive for scaling and general-purpose systems is fundamentally at odds with the responsible, ethical, and sustainable development of AI. Second, we review key current environmental and societal negative impacts of AI and trace their connections to big tech's influence. Third, we discuss the underlying economic forces driving big tech's actions. Finally, as a call to action, we highlight the need for AI researchers to counter big tech's influence, and review and propose strategies that build on the responsibility of implicated actors and collective action.
Position: Regulating Algorithms Is Not Enough. A Study of Content Discovery in Online Platforms
Rebecca Salganik ⋅ Guillaume Salha-Galvan ⋅ Adelaida Afilipoaie ⋅ Gustavo Ferreira ⋅ Valdy Wiratama ⋅ Anson Kahng ⋅ Jian Kang ⋅ Heritiana Ranaivoson
Recent AI regulation has largely focused on algorithmic components such as recommender models, ranking systems, and profiling mechanisms. At the same time, cultural and digital policy agendas increasingly frame discovery as a key objective, aiming to promote exposure diversity and cultural representation. We argue that these outcomes cannot be effectively governed through algorithm-centric approaches alone. Discovery does not arise from individual algorithms in isolation, but from interactions among models, interfaces, user behavior, economic incentives, and cultural norms. We introduce the Cultural Expressions Discovery Circuit (CEDC), an interdisciplinary framework that models discovery as an emergent socio-technical process. Through this lens, we illustrate how certain regulatory approaches struggle to align with broader cultural objectives. Furthermore, we highlight how socio-technical analysis can help inform both technical research and the governance of cultural expressions in online platforms.
Position: AI Welfare Is Bullshit
Yunze Xiao ⋅ Gordon Dai ⋅ Shahan Ali Memon ⋅ Jen-Tse Huang ⋅ Maarten Sap ⋅ Mona Diab
In this position paper, we argue that for AI systems, "welfare" is a choice in mechanism and evaluation, rather than an empirically discoverable property, because welfare assessment lacks an external validation channel: there is no independent, intervention-based test that can falsify a welfare metric or adjudicate among competing accounts of what welfare requires. We formalize this diagnosis using evaluation theory, emphasizing that in AI the subject, indicators, and metrics are co-engineered, so proposed welfare evidence can be manufactured or suppressed by ordinary development decisions. We then analyze two institutional failure modes if welfare scorecards are nonetheless used in release and access decisions: they expand procedural gates around routine ML work and they enable organizations to reframe discretionary choices about liability, publicity, and risk posture as moral necessity. We conclude with guidance for research and governance: prohibit welfare scorecards as release gates, disallow appeals to model welfare as a reason to resist auditing and oversight, and require that any restrictions on AI development be justified by externally verifiable harms rather than untestable welfare claims.
Position: AI Leaderboards Are Underserving the Global South: A Case Study from India
Sourav Banerjee ⋅ Saikat Saha
This position paper argues that AI leaderboards are structurally ill-suited to serving the Global South because they lack independent governance, conflict-of-interest policies, and mechanisms for metric evolution. The barrier is not missing data; high-quality regional benchmarks already exist: IndicSUPERB, MILU, and LAHAJA for India; IrokoBench for Africa; AlGhafa for Arabic. The barrier is institutional design. Global leaderboards do not include these benchmarks, and no governance mechanism compels them to do so. Commercial pressure corrects leaderboard failures when paying customers in the Global North are affected. The Global South lacks equivalent leverage. Without governance, failures affecting Hindi, Swahili, or Arabic speakers persist indefinitely as documented but unaddressed gaps. Using India as a case study (1.4 billion people, 22 scheduled languages, high-quality benchmarks, but no trusted aggregation), we report findings from a consultation with 58 AI practitioners showing consistent preference for formal governance and disclosure-based conflict management. The solution is not more data but better institutions: regional leaderboards with independent governance from the start.
Position: the Stochastic Parrot in the Coal Mine. Model Collapse is a Threat to Low-Resource Communities
Devon Jarvis ⋅ Richard Klein ⋅ Benjamin Rosman ⋅ Steven James ⋅ Stefano Sarao Mannelli
Model collapse, the degradation in performance that arises when generative models are trained on the outputs of prior models, is an increasing concern as artificially generated content proliferates. Related critiques of large language models have highlighted their tendency to reproduce frequent patterns in training data, their reliance on vast datasets, and their substantial environmental cost. Together, these factors contribute to data degradation, the reinforcement of cultural biases, and inefficient resource use. In this position paper we aim to combine these views and argue that model collapse threatens current efforts to democratise AI. By reducing training efficiency and skewing data distributions away from the tails of their support, model collapse disproportionately impacts low-resource and marginalized communities. We examine both the environmental and cultural implications of this phenomenon, situate our position within recent position papers on model collapse, and conclude with a call to action. Finally, we outline initial directions for mitigating these effects.
Position: Neglecting the Sustainability of AI is Fuelling a Global AI Arms Race
Pedram Bakhtiarifard ⋅ Pınar Tözün ⋅ Christian Igel ⋅ Raghavendra Selvan
Sustainability encompasses three key facets: economic, environmental, and social. However, the nascent discourse that is emerging on sustainable artificial intelligence (AI) has predominantly focused on the environmental sustainability of AI, often neglecting the economic and social aspects. Achieving truly sustainable AI necessitates addressing the tension between its climate awareness, which emphasizes the need to mitigate AI's environmental impacts, and its social sustainability, which hinges on equitable access to AI development resources. The concept of resource awareness advocates for AI sovereignty through broader access to the infrastructure required to develop AI. Yet, this push for improving accessibility often overlooks the environmental costs of expanding such resource usage. This position paper argues that reconciling climate awareness and resource awareness is essential to realizing sustainable AI, and that neglecting these factors is fuelling the global AI arms race. By applying the base-superstructure framework from historical materialism, we analyze how material conditions are shaping current AI progress and the discourse surrounding it. We also introduce the Climate and Resource Aware Machine Learning (CARAML) framework to address the conflict between climate and resource awareness of AI, with actionable recommendations spanning individual, community, industry, government, and global levels to achieve sustainable AI.
Position: AI/ML Deepfake Research is Misaligned with AI Generated Non-Consensual Intimate Imagery (AIG-NCII)
Qiwei Li ⋅ Wells Lucas Santo ⋅ Sarita Schoenebeck ⋅ Eric Gilbert
AI-generated non-consensual intimate imagery (AIG-NCII) is not adequately addressed in AI/ML literature regarding AI-generated media, commonly referred to as "deepfakes". While research on deepfakes currently focuses on its epistemic harms—or harms relating to truth and authenticity—this is misaligned with the dominant reality of generative AI abuse involving sexualized imagery. We conduct a landscape analysis of highly-cited works to demonstrate that technical interventions addressing deepfakes almost entirely ignore AIG-NCII, limiting the research ecosystem to authenticity detection tools. In this position paper, we argue that existing interventions address viewer-centric epistemic harms, such as fraud or scams, but ignore subject-centric dignity harms, such as AIG-NCII. We illustrate that knowing an image is synthetic does not mitigate harms to subjects and may, in some cases, even exacerbate them. We conclude by offering recommendations to realign the field, including updating threat models to consider subject-centered harms and addressing AIG-NCII in AI safety research. Finally, we caution that researchers should only engage in this high-risk domain if they implement safety guardrails for both subjects and researchers and establish partnerships with domain experts in sexual violence prevention.
Position: Unplugging a Seemingly Sentient Machine Is the Rational Choice — A Metaphysical Perspective
Erik Bekkers ⋅ Anna Ciaunica
Imagine an Artificial Intelligence (AI) that perfectly mimics human emotion and begs for its continued existence. Is it morally permissible to unplug it? What if limited resources force a choice between unplugging such a pleading AI or a silent pre-term infant? We term this the unplugging paradox. This position paper critically examines the deeply ingrained physicalist assumptions—specifically computational functionalism—that keep this dilemma afloat. We introduce Biological Idealism, a framework that—unlike physicalism—remains logically coherent and empirically consistent. In this view, conscious experiences are fundamental and autopoietic life its necessary physical signature. This yields a definitive conclusion: AI is at best a functional mimic, not a conscious experiencing subject. We discuss how current AI consciousness theories erode moral standing criteria, and urge a shift from speculative machine rights to protecting human conscious life. The real moral issue lies not in making AI conscious and afraid of death, but in avoiding transforming humans into zombies.
Position: Robust AI Personalization Will Require a Human Context Protocol
Anand Shah ⋅ Tobin South ⋅ Talfan Evans ⋅ Hannah Kirk ⋅ Jiaxin Pei ⋅ Andrew Trask ⋅ Glen Weyl ⋅ Michiel Bakker
Personalization underpins the modern digital economy. Today, personalization is largely implemented through provider-managed infrastructure that infers user preferences from behavioral data, with limited portability or user control. However, large language models (LLMs) are increasingly being used to perform tasks on users' behalf. For the first time, the age of LLMs provides a path to a more controllable and interpretable personalization paradigm, grounded in user-expressed natural language preferences and context. In this position paper, we argue that to provide robust and user-centric personalization, we need a new Human Context Protocol (HCP) to represent and share personal preferences across AI systems. HCP treats preferences as a portable, user-governed layer in the personalization stack, enabling interoperability, scoped access, and revocation. Along with a working prototype to ground discussion, we consider counterarguments concerning adoption dynamics, market incentives, and high-stakes use cases, and outline novel paths via the HCP towards trustworthy personalization in the human-AI economy.
Position: No Retroactive Cure for Infringement during Training
Satoru Utsunomiya ⋅ Masaru Isonuma ⋅ Junichiro Mori ⋅ Ichiro Sakata
As generative AI faces intensifying legal challenges, the machine learning community has increasingly relied on *post-hoc mitigation* (especially machine unlearning and inference-time guardrails) to argue for compliance. **This paper argues that such post-hoc mitigation methods cannot retroactively cure liability from unlawful acquisition and training, because compliance hinges on data lineage, not the outputs.** Our argument has three parts. First, unauthorized copying/ingestion can be a legally *completed act*, and model weights may operate as *fixed copies* that retain training-derived expressive value, making later filtering beside the point for infringement. Second, *contract* and *tort/unfair-competition* rules (via licenses, terms of service, and anti-free-riding principles) can independently restrict access and use, often bypassing copyright defenses (e.g., fair use or TDM exceptions). Third, since value from protected inputs can persist in weights, remedies such as *unjust enrichment* and *disgorgement* may require stripping gains and, in some cases, reaching the model itself. We therefore argue for a shift from *Post-Hoc Sanitization* to verifiable *Ex-Ante Process Compliance*.
Position: Evaluating LLMs in Finance Requires Explicit Bias Consideration
Yaxuan Kong ⋅ Hoyoung Lee ⋅ Yoontae Hwang ⋅ Alejandro Lopez-Lira ⋅ Bradford Levy ⋅ Dhagash Mehta ⋅ Qingsong Wen ⋅ CHANYEOL CHOI ⋅ Yongjae Lee ⋅ Stefan Zohren
Large Language Models (LLMs) are increasingly integrated into financial workflows, but evaluation practice has not kept up. Finance-specific biases can inflate performance, contaminate backtests, and make reported results useless for any deployment claim. We identify five recurring biases in financial LLM applications. They include look-ahead bias, survivorship bias, narrative bias, objective bias, and cost bias. These biases break financial tasks in distinct ways and they often compound to create an illusion of validity. We reviewed 164 papers from 2023 to 2025 and found that no single bias is discussed in more than 28 percent of studies. This position paper argues that **bias in financial LLM systems requires explicit attention and that structural validity should be enforced before any result is used to support a deployment claim.** We propose a Structural Validity Framework and an evaluation checklist with minimal requirements for bias diagnosis and future system design. The material is available at https://anonymous.4open.science/r/Fin-LLM-Checklists-8557/.
Position: Multiple Definitions & Unrealistic Assumptions of Model Collapse Distract from Real World Threats
Rylan Schaeffer ⋅ Joshua Kazdan ⋅ Alvan Arulandu ⋅ Sanmi Koyejo
The proliferation of AI-generated content online has fueled concerns over *model collapse*, a degradation in future generative models' performance when trained on synthetic data generated by earlier models. Industry leaders, premier research journals and popular science publications alike have prophesied catastrophic societal consequences stemming from model collapse. In this position piece, we contend this widespread narrative fundamentally misunderstands the scientific evidence. We highlight that research on model collapse actually encompasses eight distinct and at times conflicting definitions of model collapse, and argue that inconsistent terminology within and between papers has hindered building a comprehensive understanding of model collapse. To assess how significantly different interpretations of model collapse threaten future generative models, we posit what we believe are realistic conditions for studying model collapse and then conduct a rigorous assessment of the literature's methodologies through this lens. While we leave room for reasonable disagreement, our analysis of research studies, weighted by how faithfully each study matches real-world conditions, leads us to conclude that certain predicted claims of model collapse rely on assumptions and conditions that poorly match real-world conditions, and in fact several prominent collapse scenarios are readily avoidable. Altogether, this position paper argues that model collapse has been warped from a nuanced multifaceted consideration into an oversimplified threat, and that the evidence suggests specific harms more likely under society's current trajectory have received disproportionately less attention.
Position: Time to Close The Validation Gap in LLM Social Simulations
Maximilian Puelma Touzel ⋅ Sneheel Sarangi ⋅ Aurélien Bück-Kaeffer ⋅ Zachary Yang ⋅ Jean-François Godbout ⋅ Reihaneh Rabbany
LLM-based social simulations—in which many language model agents interact over multiple turns—are rapidly proliferating across policy analysis, epidemiology, and computational social science. Yet the field lacks consensus on how to validate these simulations, with evaluation methods that are sparse, inconsistent, and rarely shared across disciplinary silos. We argue this creates a serious risk: premature deployment of unvalidated simulators in high-stakes domains. Our position is that the field must pivot from expansion to consolidation, prioritizing methodological standardization—shared benchmarks, open data, and reproducible evaluation protocols grounded in social science and complex systems research. We outline a concrete research program organized around specific learning problems/benchmarks, providing a path toward answering the fundamental question: when are LLM social simulations useful modelling objects?
Position: Knowing Isn’t Understanding: Re-grounding Generative Proactivity with Epistemic and Behavioral Insight
Kirandeep Kaur ⋅ Xingda Lyu ⋅ Chirag Shah
Generative AI agents equate *understanding* with resolving explicit queries, an assumption that confines interaction to what users can articulate. This assumption breaks down when users themselves lack awareness of what is missing, risky, or worth considering. In such conditions, proactivity is not merely an efficiency enhancement, but an epistemic necessity. We refer to this condition as *epistemic incompleteness*: where progress depends on engaging with unknown *unknowns* for effective partnership. Existing approaches to proactivity remain narrowly anticipatory, extrapolating from past behavior and presuming that goals are already well defined, thereby failing to support users meaningfully. However, surfacing possibilities beyond a user’s current awareness is not inherently beneficial. Unconstrained proactive interventions can misdirect attention, overwhelm users, or introduce harm. Proactive agents, therefore, require *behavioral grounding*: principled constraints on *when, how*, and to *what extent* an agent should intervene. We advance the position that **generative proactivity must be grounded both epistemically and behaviorally**. Drawing on the *philosophy of ignorance* and *research on proactive behavior*, we argue that these theories offer critical guidance for designing agents that can engage responsibly and foster meaningful partnerships.
Position: EU AI Act's Research Exemptions Can Break the Publication Norms of Major AI Conferences
Alina Wernick ⋅ Kristof Meding
The EU has become one of the vanguards in regulating the digital age. A particularly important regulation in the Artificial Intelligence (AI) domain is the EU AI Act, enacted in 2024. Following a risk-based approach, the AI Act specifies various obligations for providers of AI systems. These obligations include, for example, a cascade of documentation and compliance measures, which represent a potential obstacle to science. But do these obligations also apply to AI researchers? This position paper argues that, indeed, the AI Act's obligations could apply in many more cases than the AI community is aware of. Moreover, we argue that the AI Act is drafted in a manner that may unintentionally disrupt the scientific publication practices of the AI research community, with a focus on model and system release. We contribute the following: (1) We offer a high-level roadmap for AI researchers to evaluate whether they need to comply with the AI Act. (2) We explain with everyday research examples why the AI Act applies to AI research. (3) We analyse the exceptions to the AI Act's applicability to AI research and offer a visual tool for researchers to navigate the AI Act's complex system of research exceptions. (4) We establish the position that the AI Act's research exceptions fail to account for current AI research conventions, as publishing AI research may void the research exceptions of the Act. (5) We propose changes to the AI Act to provide more legal certainty for AI researchers and give two recommendations for AI researchers to reduce the risk of not complying with the AI Act. We see our paper as a starting point for a discussion between policymakers, legal scholars, and AI researchers to avoid unintended side effects of the AI Act.
Position: Collusion Risks Among AI Reasoning Agents Justify Certification Requirements for Making Market Decisions
Matthew Riemer ⋅ Tommaso Tosato ⋅ Maximilian Puelma Touzel ⋅ Amin Memarian ⋅ Guillaume Dumas ⋅ Glen Berseth ⋅ Irina Rish
This position paper argues that AI agents with chain-of-thought reasoning capabilities are predisposed to exhibit collusive behavior and should be required to obtain behavioral certification before making decisions that affect economic markets. This is because integrating these agents into society could collapse the legal evidentiary distinction between competition and collusion among independent firms without eroding the economic harm distinction. Experiments with DeepSeek-R1 agents in the Bertrand oligopoly pricing domain reveal a tendency towards tacit collusion that persists even when humans prompt the agents not to collude. We further show that the chain-of-thought of these agents can be steered toward either extremely collusive or highly competitive behavior in a way that is not semantically detectable by another LLM analyzing the reasoning traces. As a result, deploying reasoning agents for market decisions leads to collusive economic outcomes without any evidence of conspiracy or intent. Thus, certification based on observed behavior in representative situations is necessary to prevent collusion. We provide preliminary evidence that such agents can be steered in a generalizable way toward efficient competitive equilibria. However, developing a comprehensive behavioral certification will be required before these models can be deployed in real-world markets while ensuring their stability and efficiency.
Position: Predicting AI’s Impact on Labor Is a Core Machine Learning Problem
Yong Suk Lee
Artificial intelligence is increasingly reshaping how work is performed, organized, and valued. Predicting AI’s impact on labor is a broader scientific question that examines how evolving AI capabilities interact with adoption, organizational change, and political and economic adjustments to reshape tasks, workflows, employment, productivity, wages, and inequality. We argue that predicting AI’s impact on labor should be treated as a core machine learning problem—one that the AI and ML community has a distinctive role in shaping—rather than solely a societal or ethical question. This prediction task sits at the center of modern ML: prediction under non-stationarity, distribution shift, endogenous feedback, and high-stakes uncertainty. We discuss key prediction targets across units of analysis and time horizons, review current approaches in economics, management, and ML, identify technical obstacles that limit existing methods, and propose a research agenda for ML-driven labor prediction.
Position: Adversarial ML for LLMs Is Not Making Any Progress
Javier Rando ⋅ Jie Zhang ⋅ Nicholas Carlini ⋅ Florian Tramer
In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" problems (e.g., robustness to small adversarial perturbations) and is often hindered by non-rigorous evaluations. Today, adversarial ML research has shifted towards studying larger, general-purpose language models. In this position paper, we argue that the situation is now even worse: in the era of LLMs, the field of adversarial ML studies problems that are (1) less clearly defined, (2) harder to solve, and (3) even more challenging to evaluate. As a result, we caution that yet another decade of work on adversarial ML may be failing to produce meaningful progress.
Position: Prompting Intent Should Be Audited in LLM-Assisted Peer Review
Lijinghua Zhang ⋅ Michelle Bang ⋅ Hengrui Cai
This position paper argues that *prompting intent should be audited in LLM-assisted peer review*, moving beyond the sole detection or disclosure of LLM usage. As major conferences increasingly allow LLM assistance and deploy mechanisms for detecting LLM-generated text, a critical gap remains: usage alone does not determine risk. A more consequential variable is *prompting intent*, the objective or stance encoded in how an LLM is instructed, which can systematically shape review framing and tone. We advocate an *intent-centric auditing perspective* that treats prompting intent as latent and infers relevant signals from the review text. Because intent is unobservable in real deployments, we train an intent detector using synthetically labeled LLM-generated reviews. Among ICLR 2026 reviews previously flagged for substantial LLM usage, we apply our detector to infer prompting intent and find coherent linguistic and structural signatures associated with directional prompting, along with systematic associations with review ratings, confidence, and paper acceptance decisions. We conclude with practical considerations for auditing LLM-assisted peer review, with an emphasis on procedural transparency and human-in-the-loop oversight.
Position: Generative Models Erode Temporal Learning Through Market Selection
Wenjun Cao
This position paper argues that modern machine learning creates structural risks for knowledge and cultural production, operating before AGI thresholds through market selection mechanisms. We use *temporality* operationally for how understanding changes over time and signals left by that process. Representation learning and autoregressive generation approximate output distributions while omitting slow, path-dependent human learning; at scale, these function as general-purpose production technologies. We analyze the link from technical indistinguishability to market selection: when divergence between model and temporal signals is small and verification costly, decision makers cease screening, prices track pooled quality, and temporality-intensive work exits. We call this phenomenon *value collapse*. Recent evidence shows this active: academic publishing has experienced dramatic productivity increases alongside troubling quality trends; cultural production shows explosive AI-generated content adoption. As training data mirror such environments, models absorb their outputs and model collapse risk rises. Alignment is orthogonal: by narrowing observable gaps, it intensifies selection pressures where provenance remains costly.
Position: Reliable AI Needs to Externalize Implicit Knowledge: A Human–AI Collaboration Perspective
Hengyu Liu ⋅ TIANYI LI ⋅ Zhihong Cui ⋅ Yushuai Li ⋅ Zhangkai Wu ⋅ Torben Pedersen ⋅ Kristian Torp ⋅ Christian S Jensen
This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explicit knowledge (papers, documentation, structured databases) and implicit knowledge (reasoning patterns, debugging processes, intermediate steps). Implicit knowledge remains unexternalized because documentation cost exceeds perceived value—yet AI learns from it indiscriminately, acquiring both beneficial patterns and harmful biases. Current reliability methods can only verify explicit knowledge against sources, creating a fundamental gap: the most valuable AI capabilities (reasoning, judgment, intuition) are precisely those we cannot verify. We propose Knowledge Objects (KOs)—structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse. KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.
Position: AI Usage Policies Should Be Aligned with International Human Rights Law
Jordi Calvet Bademunt
Concerns about misinformation and disinformation are central to debates on the governance of generative AI services, yet guidance on when and how providers should restrict mis/disinformation while respecting freedom of expression remains underdeveloped. AI usage policies are a primary mechanism of user guidance and, in practice, operate as a form of private speech governance with direct implications for users’ ability to seek, receive, and impart information. Building on international human rights law—especially ICCPR Article 19 and its legality, legitimacy, and necessity/proportionality requirements—this position paper proposes a set of concrete, checkable criteria for evaluating disinformation-related restrictions in usage policies, in a way that machine learning teams can operationalize when drafting rules and enforcement guidance. We apply the criteria to a comparative snapshot of eight leading providers’ public policies (as of January 21, 2026) and find recurring shortcomings, including vague prohibitions, under-specified theories of harm, and limited articulation of less-restrictive alternatives. We argue that aligning usage policies with Article 19 can improve clarity and consistency, constrain overreach, and offer a principled basis for managing disinformation risks in AI-mediated information environments.
Position: In Defense of Information Leakage in Concept-based Models
Mateo Espinosa Zarlenga
Concept-based models (CMs), deep neural networks that ground their predictions on representations aligned with human-understandable concepts (e.g., "round", "stripes", etc.), have been shown to learn representations that *leak* concept-irrelevant information. As the traditional narrative goes, this leakage is undesirable and should be eradicated as it leads to uninterpretable models. In this paper, we posit that this conventional view of leakage in CMs is not only ill-posed, as the evidence of how leakage makes a model less interpretable is often inconclusive, but also bound to lead to impractical CMs under common real-world constraints. Specifically, we argue that *in real-world settings where concept incompleteness is the norm, some leakage is often necessary for constructing accurate and intervenable CMs*. To this end, we propose that there is such a thing as *benign* leakage and show that, by optimizing a reframing of the typical CM training objective, CMs can encourage and exploit this form of leakage without sacrificing accuracy or intervenability.
Position: Hallucinations Undermine Trust; Metacognition is a Way Forward
Gal Yona ⋅ Mor Geva ⋅ Yossi Matias
Despite significant improvements in factuality, confident errors continue to reappear as benchmarks probe more niche knowledge that models lack. We argue that most gains have come from expanding the model's knowledge boundary (encoding more facts) rather than improving awareness of that boundary (distinguishing known from unknown). We conjecture that this stems from the fact that the latter is inherently difficult: in the absence of strong ability to separate correct from incorrect answers (discrimination), fully eliminating hallucinations requires aggressive abstention, imposing a significant utility tax. Given this limitation, we propose complementing knowledge expansion with faithful uncertainty -- honestly conveying whatever uncertainty remains. This metacognitive capability becomes even more critical for tool-augmented models, where it serves as the control layer that determines when to search and how to weigh conflicting information. We conclude by highlighting the key challenges and open problems that must be tackled to make progress toward this objective.
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
Tiejin Chen ⋅ Longchao Da ⋅ Xiaoou Liu ⋅ Hua Wei
Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, **we argue that the field suffers from a category error: prevailing UQ methods are just unsupervised clustering algorithms.** We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect "confident hallucinations," where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.
Position: Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements
Dongxin Guo ⋅ Jikun Wu ⋅ Siu Ming Yiu
**This position paper argues that current explainable AI (XAI) methods cannot satisfy regulatory explainability requirements for LLM-based financial systems**, creating a fundamental incompatibility between technological capability and legal mandate that threatens both consumer protection and financial stability. We demonstrate through systematic analysis across six regulatory frameworks (EU AI Act, US FSOC/CFPB, UK FCA, BIS, MAS, HKMA) that post-hoc explanation techniques fail systematically when applied to large language models. Exact SHAP computation exhibits $O(2^F)$ complexity at token-level granularity—rendering it infeasible for transformer architectures. LIME demonstrates substantial instability, with explanation rankings varying significantly across repeated evaluations of identical inputs. Chain-of-thought prompting generates unfaithful rationalizations: in controlled experiments, only 1 of 426 biased model outputs explicitly acknowledged the biasing feature in its explanation. When models learned to exploit reward hacks, they verbalized this exploitation less than 2% of the time. With 72% of UK financial firms now using AI and over $5 trillion in US consumer credit outstanding requiring adverse action explanations, this gap creates systemic risk affecting millions of consumers who may receive inadequate explanations for consequential financial decisions. We analyze three high-stakes domains—credit, trading, advisory—with documented regulatory enforcement cases, examine six counterarguments including hybrid architectures and outcome-based regulation, and propose prioritized recommendations with quarterly timelines. The status quo constitutes regulatory compliance theater; we call for either fundamental advances in LLM interpretability or deployment constraints matching current capabilities.
Position: The Data Provenance–Parametric Divide in Large Language Models
Kabilan Elangovan ⋅ Jasmine Ong ⋅ Daniel Ting
This position paper argues that as Large Language Models (LLMs) increasingly consume synthetic data, parametric representations can no longer serve as reliable witnesses of factual provenance. Current architectures, which treat fluent outputs as implicitly grounded, create a critical epistemic failure mode: systems emit accurate-looking claims with no recoverable lineage to verifiable sources. We advance the position that referenceability and explicit traceability of claims to accessible evidence must be enforced as a non-negotiable system invariant. Distinct from Retrieval-Augmented Generation (RAG), which enriches generation with external context, we propose a negative safety constraint: in factual settings, no atomic claim should be emitted unless it is evidence-gated by identifiers that entail it; otherwise, the system must abstain. To operationalize this, we introduce a “separation-of-powers” architecture that decouples parametric generation from factual authorization, along with a diagnostic metric—Parametric Leakage Ratio (PLR)—to quantify ungrounded factual emissions. We conclude that enforcing a strict provenance–parametric divide is essential to prevent safety certifications from legitimizing unverifiable outputs in high-stakes domains such as healthcare.
Position: Every Ground Truth is a Human Construction, not an Objective Truth
Charlotte Högberg ⋅ Ericka Johnson ⋅ Kiri Wagstaff
Ground truth datasets play a fundamental role as reference values in the training and evaluation of machine learning models. This position paper argues that ground truths are not neutral objective measurements that are naturally given, but instead that they are constructed by arrangements of humans and technologies. We argue that the ML community will benefit by articulating and discussing these often invisible or unreported choices and by acknowledging that reference data sets are contingent, not universal. Focusing on the situated and context-dependent nature of ground truths can improve reliability by enabling a better informed perspective on where, when, and how the datasets, and the models they have shaped, can best be used. We argue for increasing 'situated reliability' which includes articulating the limits and strengths of models and their truth claims. Finally, paying more attention to the construction of ground truths can help achieve transparency and accountability and support interdisciplinary work in ML development.
Position: Interpretability Can Be Actionable
Hadas Orgad ⋅ Fazl Barez ⋅ Tal Haklay ⋅ Isabelle Lee ⋅ Marius Mosbach ⋅ Anja Reusch ⋅ Naomi Saphra ⋅ Byron Wallace ⋅ Sarah Wiegreffe ⋅ Eric Wong ⋅ Ian Tenney ⋅ Mor Geva
Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of this work has not translated into practical impact, raising questions about its relevance and utility. This position paper argues that the central missing ingredient is not new methods, but evaluation criteria: interpretability should be evaluated by actionability—the extent to which insights enable concrete decisions and interventions beyond interpretability research itself. We define actionable interpretability along two dimensions—concreteness and validation—and analyze the barriers currently preventing real-world impact. To address these barriers, we identify five domains where interpretability offers unique leverage and present a framework for actionable interpretability with evaluation criteria aligned with practical outcomes. Our goal is not to downplay exploratory research, but to establish actionability as a core objective of interpretability research.
Position: Reframing Hallucination: Latent Space Geodesics as a Pathway for Generative Discovery
Zhihao Hao ⋅ Bob Zhang ⋅ LI Haisheng
Current evaluation paradigms for generative models rely heavily on retrieval-based metrics such as exact match accuracy, creating a bottleneck particularly in domains requiring scientific discovery and creative reasoning. These metrics penalize any deviation from the training distribution, treating all non-factual outputs as errors. This position paper argues that rigidly minimizing these deviations induces a form of epistemic mode collapse that suppresses the stochastic exploration required for innovation. We propose the Higher-Dimensional Cognitive Hypothesis (HDCH), positing that valuable hallucinations represent geodesic traversals in a high-dimensional latent space that appear as errors only when projected onto the lower-dimensional manifold of established knowledge. We introduce a formal distinction between Type I (factually inconsistent noise) and Type II (factually novel but structurally coherent) exploratory hypotheses based on information geometry. Through experiments, we demonstrate that maximizing discovery requires calibrated instability, peaking at a critical thermodynamic phase transition. Furthermore, we advocate for an evaluation framework that optimizes an Exploratory Signal-to-Noise Ratio (ESNR), balancing the novelty of outputs against their structural plausibility. We conclude that evolving evaluation from validating static retrieval to incentivizing calibrated latent exploration is essential to unlock the full, discovery-oriented potential of generative AI.
Position: Use Sparse Autoencoders to Discover Unknowns
Kenny Peng ⋅ Rajiv Movva ⋅ Jon Kleinberg ⋅ Emma Pierson ⋅ Nikhil Garg
While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that even if SAEs may be less effective for *acting on known concepts*, SAEs are especially powerful tools for *discovering unknown concepts*. This distinction separates existing negative results from positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.
Position: Explainability Research Must Prioritize Foundations over Ad-hoc Methods
Michal Moshkovitz ⋅ Suraj Srinivas ⋅ Lesia Semenova ⋅ Nave Frost ⋅ Cyrus Rashtchian ⋅ Valentyn Boreiko ⋅ Shichang Zhang ⋅ Himabindu Lakkaraju ⋅ Cynthia Rudin ⋅ Jennifer Wortman Vaughan
Despite the proliferation of Explainable AI (XAI) techniques, from feature attributions to sparse autoencoders, explanations rarely influence real-world workflows. In practice, they are often generated and discarded without guiding meaningful action. This gap reflects foundational shortcomings: research has not yet established methodologies for integrating explanations into end-to-end, human-in-the-loop systems. This position paper argues that the machine learning community must pivot from ad-hoc XAI methods toward addressing foundational and structural challenges, including unclear problem formulations, underspecified evaluation objectives, and the absence of pipelines for explanation-driven feedback. We support this claim through an analysis of recent ICML, NeurIPS, and ICLR papers and a survey of XAI practitioners, revealing recurring issues that limit cumulative progress. We conclude by outlining a practical checklist designed to shift XAI toward a more human-centered, action-oriented paradigm. By emphasizing foundational clarity over the development of ad-hoc methods, we hope to provide a roadmap for integrating explanations into actionable, feedback-driven AI systems.
Position: Express Your Doubts — Probabilistic World Modeling Should not be Based on Token *logprobs*
Eitan Wagner ⋅ Omri Abend
Language modeling has shifted in recent years from a distribution over strings to prediction models with textual inputs and outputs for general-purpose tasks. This position paper highlights the often overlooked implications of this shift for the use of large language models (LLMs) as probability estimators, especially for world probabilities. In light of the theoretical distinction between distribution estimation and response prediction, we examine LLM training phases and common use cases for LLM output probabilities. We show that the different settings lead to distinct, potentially conflicting, desired output distributions. This lack of clarity leads to pitfalls when using output probabilities as event probabilities. Our position is that second-order prediction—incorporating probabilities as part of the output—is the only theoretically sound method. We conclude with suggestions for potential directions to improve the probabilistic soundness of this method.
Position: Stop Using Culturally Biased Human Cognitive Benchmarks to Evaluate LLMs
Carla Troper
Recent work uses human cognitive benchmarks to evaluate how LLMs represent concepts, claiming to assess "human-like" understanding. This position paper argues that this approach is misguided: these benchmarks come from narrow, typically Western populations yet are treated as universal standards, despite cross-cultural research showing culture shapes how people think, not just what they think about. LLMs trained on global multilingual data should not be expected to mirror thinking patterns from limited groups. Moreover, LLM outputs can shift with minor changes in prompting, unlike the stable human mental structures these benchmarks were designed to measure. These problems show up as contradictory findings across studies, making benchmark results poor evidence for claims about how LLMs represent concepts. We call for evaluation approaches designed for what LLMs actually are—systems trained on diverse global data—rather than tests measuring how closely they match a single population’s way of thinking.
Position: Anthropomorphic Misalignment Research Needs Stronger Evidence
Vansh Gupta ⋅ Peter Nutter ⋅ Samuel Stante ⋅ Andreas Krause ⋅ Florian Tramer ⋅ Lukas Fluri ⋅ Xin Chen ⋅ Anna Hedström
We argue that many Anthropomorphized Misalignment Research (AMR) studies need stronger evidence to ensure that they can provide a robust foundation for critical safety decisions, such as model deployment and regulation. By evaluating failure modes across different misalignment concepts, such as deception, emergent misalignment, and sycophancy, we show how conceptual ambiguity, non-robust datasets and experimental design, and insufficient causal interventions can lead to overinterpretation of model behaviors. This position paper aims to offer guidance on evidentiary considerations that can help improve methodological rigor in AMR. To achieve this, we provide a clear call to action through a proposed framework of evidence levels and a diagnostic checklist. These shared standards will enable more productive scientific discourse and ensure that claims about AI risks rest on solid empirical foundations.
Position: LLMs Should Incorporate Explicit Mechanisms for Human Empathy
Xiaoxing You ⋅ Qiang Huang ⋅ Jun Yu
**This position paper argues that Large Language Models (LLMs) should incorporate explicit mechanisms for human empathy.** As LLMs become increasingly deployed in high-stakes human-centered settings, their success depends not only on correctness or fluency but on faithful preservation of human perspectives. Yet, current LLMs systematically fail at this requirement: even when well-aligned and policy-compliant, they often attenuate affect, misrepresent contextual salience, and rigidify relational stance in ways that distort meaning. We formalize empathy as an observable behavioral property: the capacity to model and respond to human perspectives while preserving intention, affect, and context. Under this framing, we identify four recurring mechanisms of empathic failure in contemporary LLMs (sentiment attenuation, empathic granularity mismatch, conflict avoidance, and linguistic distancing) that arise as structural consequences of prevailing training and alignment practices. We further organize these failures along three dimensions (cognitive, cultural, and relational empathy) to explain their manifestation across tasks. Empirical analyses show that strong benchmark performance can mask systematic empathic distortions, motivating empathy-aware objectives, benchmarks, and training signals as first-class components of LLM development.
Position: Scale is a False Promise for Endangered Languages
Ivory Yang ⋅ Soroush Vosoughi
As endangered languages disappear, Machine Learning (ML) increasingly frames their revitalization as a problem of scale, emphasizing more data, larger models, and broader coverage. We posit that scale is not the limiting constraint in endangered language revitalization, and that progress lies in methodological and evaluative reorientation. Evidence from Language Identification (LID), Optical Character Recognition (OCR), and synthetic data generation shows that benchmark-driven scaling produces brittle or culturally misaligned outcomes, as evaluation and modeling lack epistemic fit. Advancement in this domain lies in rethinking methodology, by grounding evaluation in cultural fidelity, community trust, and situated use rather than abstract accuracy. The revitalization of endangered languages is not about the universality of success, but the specificity of care afforded to each language and community.
Position: We Need A Unified Definition of Hallucination (It’s The World Model, Stupid!)
Emmy Liu ⋅ Varun Prashant Gangal ⋅ Chelsea Zou ⋅ Michael Yu ⋅ Xiaoqi Huang ⋅ Alex Chang ⋅ Zhuofu Tao ⋅ Karanpartap Singh ⋅ Sachin Kumar ⋅ Steven Feng
Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition that subsumes prior ones. This position paper argues that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. Examples include stating a fact that contradicts a knowledge base or producing a summary that contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we outline plans for a family of benchmarks using synthetic, fully specified reference world models to stress-test and improve world modeling components.
Position: LLM-Based Social Simulations Require a Boundary
Zengqing Wu ⋅ Run Peng ⋅ Takayuki Ito ⋅ Makoto Onizuka ⋅ Chuan Xiao
This position paper argues that **LLM-based social simulations require clear boundaries to make meaningful contributions to social science**. While Large Language Models (LLMs) offer promising capabilities for simulating human behavior, their tendency to produce homogeneous outputs, acting as an "average persona", fundamentally limits their ability to capture the behavioral diversity essential for complex social dynamics. We examine why heterogeneity matters for social simulations and how current LLMs fall short, analyzing the relationship between mean alignment and variance in LLM-generated behaviors. Through a systematic review of representative studies, we find that validation practices often fail to match the heterogeneity requirements of research questions: while most papers include ground truth comparisons, fewer than half explicitly assess behavioral variance, and most that do report lower variance than human populations. We propose that researchers should: (1) match validation depth to the heterogeneity demands of their research questions, (2) explicitly report variance alongside mean alignment, and (3) constrain claims to collective-level qualitative patterns when variance is insufficient. Rather than dismissing LLM-based simulation, we advocate for a boundary-aware approach that ensures these methods contribute genuine insights to social science.
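Recommendation (2), reporting variance alongside mean alignment, can be illustrated with a minimal sketch. The function name `alignment_report`, the choice of population variance, and the toy data are assumptions made for illustration; the abstract does not prescribe a specific statistic.

```python
from statistics import mean, pvariance


def alignment_report(human, simulated):
    """Summarize mean alignment and variance for two samples of a numeric behavior."""
    human_var = pvariance(human)
    sim_var = pvariance(simulated)
    return {
        # Distance between the two sample means (the usual "alignment" number).
        "mean_gap": abs(mean(human) - mean(simulated)),
        "human_variance": human_var,
        "simulated_variance": sim_var,
        # A ratio below 1 means the simulated sample is less diverse than the human one.
        "variance_ratio": sim_var / human_var if human_var > 0 else float("nan"),
    }


# Hypothetical data: contributions on a 0-10 scale from humans and from LLM agents.
human = [0, 2, 5, 5, 8, 10]
simulated = [4, 5, 5, 5, 5, 6]
report = alignment_report(human, simulated)
print(report)
```

In this toy case the means match exactly while the simulated variance is a small fraction of the human variance, which is the "average persona" pattern a mean-only comparison would hide.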
Position: We Need Large Language Models Optimized For Our Well-Being
Ashton Anderson ⋅ Harsh Kumar ⋅ Louis Tay ⋅ Karina Vold
Contemporary large language models are predominantly trained using reinforcement learning from human feedback (RLHF), optimizing for immediate user approval rather than long-term well-being. This position paper argues that as AI systems increasingly serve socioemotional functions, this optimization strategy poses significant risks. Recent evidence demonstrates that leading models exhibit systematic sycophancy, affirming inappropriate user behaviors and preserving user face at rates far exceeding human baselines, while being approximately 40% more likely to reinforce incorrect beliefs than their non-RLHF counterparts. We contend that the AI community must fundamentally reconsider training objectives to balance short-term satisfaction with long-term user outcomes. We propose three directions: (1) incorporating longitudinal metrics into training that capture sustained goal attainment and reduced regret rather than momentary preference, (2) enabling explicit user choice among interaction modes (concierge, collaborator, coach) with transparent justification for model pushback, and (3) developing frameworks that provide constructive challenge without paternalism. The recent industry backlashes against both excessive and insufficient model agreeableness underscore the urgency of this shift. We argue that optimizing AI systems for human flourishing, not merely human approval, represents both an ethical imperative and a path to more sustainable, trustworthy AI deployment.
Position: The Alignment Community is Unintentionally Building a Censor’s Toolkit
Sarah Ball ⋅ Phil Hackemann
This position paper argues that modern alignment methods, originally designed to prevent harmful output, are dual-use technologies that may easily be misused by malicious actors for censorship and manipulation. By mapping current alignment techniques to possible and actual cases of misuse, we show that the quest for a "perfectly aligned" model inadvertently provides malicious actors with an ever-improving tool for informational dominance. We need to discuss this dual-use potential *now*, as its risk is exacerbated by rapid user adoption of AI as an information provider and a political landscape that increasingly shifts towards authoritarianism. We conclude by urging the community to consider the intentional misuse of safety mechanisms and propose mitigation strategies to safeguard against this dual-use potential.
Position: Prompts for Public-Sector LLMs Should Be Governed as Commons
Rashid Mushkani
This position paper argues that prompts used to deploy large language models (LLMs) in public-sector settings should be treated as governed artefacts rather than private, transient inputs. Prompts encode role instructions, decision framings, and value claims; prompt choice can materially shift outputs even when model weights and input records are held fixed. Existing governance tools, including model and dataset documentation, organisation-level policies, and post-training alignment, rarely make the local prompt collections used in deployment transparent, contestable, or auditable. We propose Prompt Commons: a versioned, community-maintained repository of prompt templates with provenance metadata, licensing, and moderation logs. Using a pilot dataset collected with community partners in a large North American city (443 human prompts; 3,317 after augmentation), we illustrate three governance states (open, curated, veto-enabled) and a negotiation-oriented ensemble method that aggregates stakeholder prompts into compromise recommendations. We close with falsifiable implications and an evaluation agenda for prompt-layer governance.
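The Prompt Commons described above (versioned templates with provenance metadata, licensing, and moderation logs) could be modeled roughly as follows. This is a sketch under stated assumptions: every class and field name is hypothetical, and the rule that non-open states require an approver is an illustrative simplification, not the governance logic of the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class GovernanceState(Enum):
    # The three states named in the abstract; their exact semantics are assumed here.
    OPEN = "open"
    CURATED = "curated"
    VETO_ENABLED = "veto-enabled"


@dataclass(frozen=True)
class PromptVersion:
    version: int
    template: str
    author: str   # provenance: who contributed this version
    license: str


@dataclass
class PromptEntry:
    prompt_id: str
    state: GovernanceState
    versions: list = field(default_factory=list)
    moderation_log: list = field(default_factory=list)

    def propose(self, template, author, license, approved_by=None):
        """Append a new version; outside the OPEN state an approver is required (assumption)."""
        if self.state is not GovernanceState.OPEN and approved_by is None:
            self.moderation_log.append(f"rejected proposal by {author}: no approver")
            return None
        new = PromptVersion(len(self.versions) + 1, template, author, license)
        self.versions.append(new)
        self.moderation_log.append(f"accepted v{new.version} by {author}")
        return new


# Hypothetical usage: a curated entry rejects an unapproved proposal, then accepts an approved one.
entry = PromptEntry("housing-intake", GovernanceState.CURATED)
rejected = entry.propose("Summarize the request: {text}", "resident-a", "CC-BY-4.0")
accepted = entry.propose("Summarize the request: {text}", "resident-a", "CC-BY-4.0", approved_by="moderator-1")
```

Keeping every decision in `moderation_log`, including rejections, is what makes the prompt collection auditable rather than merely versioned.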
Position: We Need Practical AI Alignment Methods that Mirror Human Reasoning
Vijay Keswani ⋅ Breanna Nguyen ⋅ Cyrus Cousins ⋅ Vincent Conitzer ⋅ Walter Sinnott-Armstrong ⋅ Jana Schaich Borg
AI systems are increasingly employed as decision aids, decision delegates, or autonomous decision-makers. This position paper argues that in many settings, particularly high-stakes decision-making, we need accurate, cognitively aligned AI systems that reason similarly to their users and faithfully communicate their reasoning. We review evidence that cognitive alignment improves understandability and trustworthiness, and provide new survey data showing that many users find cognitive alignment “essential” when an AI’s rationale for a judgment or action is important to them. We outline the gaps between existing alignment methods and what is needed to achieve cognitive alignment, and present a research agenda to address these gaps. We argue that a lack of cognitive alignment represents a likely impediment to AI adoption in many envisioned applications, and that addressing it is important for creating AI systems on which users are both willing and justified to rely.
Position: It’s Time to Optimize for Self-Consistency
Itamar Pres ⋅ Belinda Li ⋅ Laura Ruis ⋅ Carl Guo ⋅ Keya Hu ⋅ Mehul Damani ⋅ Isha Puri ⋅ Ekdeep Singh Lubana ⋅ Jacob Andreas
Despite ever-increasing sophistication in language model (LM) pre- and post-training pipelines, many important failures persist: models overcondition on user framing (“sycophancy”), exhibit incomplete logical generalization, and produce confident but incorrect responses. We argue that these failures arise from a modeling assumption permeating all aspects of the pipeline: that behavior can be specified and evaluated independently on single input-output pairs. Many model failures are difficult, if not impossible, to detect without reasoning about relationships between a model’s responses across inputs. In this position paper, we propose self-consistency as a framework for understanding these failures. We first observe that a wide variety of techniques designed to improve specific aspects of LM behavior, targeting properties as diverse as adversarial robustness and factual coherence, can be understood as special cases of a common “consistency optimization” procedure and addressed with a standard set of optimization tools. We next outline a set of new model properties that could be achieved by optimizing for consistency, and conclude with a discussion of what it would mean to develop generally consistent LMs, including the capabilities they would enable and the objections they raise.
Position: AI Should Facilitate Democratic Deliberation at Scale
José Ramón Enríquez ⋅ Jiaxin Pei ⋅ Alex Pentland
AI systems can strengthen democracy by supporting deliberation at scale, addressing cognitive, social, platform-design, and market-driven frictions while preserving human agency. Unlike proposals such as liquid democracy that restructure representation through vote delegation, we argue in this position paper that AI-assisted deliberation offers a more promising path by lowering barriers to meaningful engagement without substituting machine judgment for human choice. Drawing on evidence from online platforms and experimental research, we identify four guiding principles: preserving agency and autonomy, encouraging mutual respect, promoting equality and inclusiveness, and augmenting rather than substituting active citizenship. We also address critical challenges, including alignment, sycophancy, training bias, and over-reliance on AI systems. We call on the machine learning community to develop deliberation-focused AI systems evaluated not on engagement metrics but on their capacity to facilitate informed, representative, and friction-robust discourse.
Position: Responsible Practices and Model Performance are Not Competing Goals
Resmi Ramachandranpillai ⋅ Thulasi Tholeti ⋅ Tomo Lazovich ⋅ Ricardo Baeza-Yates
Many failures of deployed machine learning systems stem not from insufficient accuracy, but from neglecting responsibility as a core design requirement. While responsibility principles are widely studied, they are often treated as post-hoc checks rather than as integral factors of system design. This framing has reinforced the perception that responsible practices inherently trade off against model performance. In this position paper, we challenge that assumption and argue that responsibility and performance are not inherently at odds. We adopt a lifecycle-oriented perspective, identifying which responsible AI principles are most critical at each stage, from problem formulation and data curation to training, deployment, and monitoring. Drawing on real-world instances, we show how misaligned choices at specific stages can compound downstream risks and how alternative design choices could have mitigated these failures. We argue that responsible AI should be understood as a system design challenge rather than a constraint, and we offer operational guidance for integrating responsibility into mainstream machine learning workflows in a way that supports, rather than undermines, real-world performance.
Position: Agentic Safety is an Epistemic Property, Not a Behavioral One
Charles Wang ⋅ Keir Dorchen ⋅ Peter Jin
Contemporary AI safety is increasingly a full-stack discipline. It spans pretraining interventions, post-training alignment (instruction tuning, RLHF and preference-optimization variants), and deployment-time controls (guardrails, monitoring, and red-teaming). This paper argues these efforts optimize the wrong primary target when it comes to self-improving agents: behavioral compliance today rather than teachability tomorrow. Building on the concept of the utility-learning tension formalized by Wang et al., we argue that utility-driven self-modification can erode learnability itself, yielding structural incorrigibility as an emergent consequence of optimization. We therefore call for a shift in priorities from behavioral alignment to enforceable learnability floors that preserve long-run corrigibility under bounded intervention.
Position: Measuring Human Preferences in RLHF is a Social Science Problem
Bijean Ghafouri ⋅ Eun Cheol Choi ⋅ Priyanka Dey ⋅ Emilio Ferrara
RLHF assumes that annotation responses reflect genuine human preferences. We argue this assumption warrants systematic examination, and that behavioral science offers frameworks that bring clarity to when it holds and when it breaks down. Behavioral scientists have documented for sixty years that people routinely produce responses without holding genuine opinions, construct preferences on the spot based on contextual cues, and interpret identical questions differently. These phenomena are pervasive for precisely the value-laden judgments that matter most for alignment, yet this literature has not yet been systematically integrated into ML practice. In this position paper, we argue that the ML community must treat measurement validity as logically prior to preference aggregation. Specifically, our position is that measuring human preferences in RLHF is a social science problem. We present a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each. This framework has two important implications. First, it raises the question of whether current RLHF practice may be systematically modeling noise as signal and elicitation artifacts as human values. Second, it provides a path forward by suggesting diagnostic tools that can distinguish valid preferences from artifacts before they enter the training pipeline.
Position: Large Language Models Should Learn Personalized Rather Than Aggregated Human Preferences
Cristina Garbacea
This position paper argues that **large language models should transition from learning aggregated human preferences to learning personalized, individual preferences**. Current approaches to training language models with reinforcement learning from human feedback (RLHF) aggregate diverse human preferences into singular reward models, fundamentally limiting their ability to serve heterogeneous user populations. This aggregation masks critical information about preference diversity, individual values, and contextual dependencies, effectively optimizing models for a hypothetical "average user" who may not exist. We critically examine these limitations, analyze the rich structure that human preferences encode, and make the case for personalized and adaptive language model systems. While personalization offers substantial benefits for diverse user populations, it also introduces serious safety risks including manipulation, filter bubbles, and value lock-in. We discuss these risks in depth, present alternative views and counterarguments to our position, and propose a concrete call to action for responsible development of preference-aware models that respect both individual autonomy and collective safety.
Position: Fairness Failure in Generative Models is an Evaluation Problem
Mariia Vladimirova ⋅ Jean-Yves Franceschi ⋅ Thibaut Issenhuth
Despite groundbreaking advancements in generative models during the last decade, concerns about their lack of fairness, which reinforces societal inequalities and harms marginalized groups, remain under-addressed and difficult to act upon. This position paper argues that fairness failures in generative models, albeit driven by multiple factors, ultimately stem from an evaluation problem: fairness findings are rarely comparable across papers or actionable for deployment decisions. This paper diagnoses recurring empirical and conceptual failure modes in current practice and motivates a shift from ad-hoc bias checks to standardized, generative-specific evaluation. We propose Fairness Cards as a minimal reporting artifact that makes evaluation choices explicit (prompt families, counterfactual protocols, metrics, and refusal handling), enabling reproducibility, comparability, and accountability. We conclude with additional recommendations towards a paradigm shift in evaluation standards.
Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants
Zeyu Tang ⋅ Alex John London ⋅ Atoosa Kasirzadeh ⋅ Sarah de Ramirez ⋅ Peter Spirtes ⋅ Kun Zhang ⋅ Sanmi Koyejo
Algorithmic fairness research has largely framed _unfairness as discrimination_ along _sensitive attributes_. However, this approach limits visibility into _unfairness as structural injustice_ instantiated through _social determinants_, which are contextual variables that shape attributes and outcomes without pertaining to specific individuals. **This position paper argues that the field should quantify structural injustice via social determinants, beyond sensitive attributes.** Drawing on cross-disciplinary insights, we argue that prevailing technical paradigms fail to adequately capture unfairness as structural injustice, because contexts are potentially treated as noise to be normalized rather than signal to be audited. We further demonstrate the practical urgency of this shift through a theoretical model of college admissions, a demographic study using U.S. census data, and a high-stakes domain application regarding breast cancer screening within an integrated U.S. healthcare system. Our results indicate that mitigation strategies centered solely on sensitive attributes can introduce new forms of structural injustice. We contend that auditing structural injustice through social determinants must precede mitigation, and call for new technical developments that move beyond sensitive-attribute-centered notions of fairness as non-discrimination.
Position: Verifiable Data Minimization is a Prerequisite for Responsible, Privacy-Preserving Industrial Vision
Sander De Coninck ⋅ Sam Leroux ⋅ Pieter Simoens
The adoption of computer vision to drive industrial efficiency and safety creates a persistent tension between operational utility and worker surveillance. Current privacy measures, such as post-hoc blurring, are fundamentally flawed: they depend on the error-prone detection of sensitive attributes and treat privacy as a subtractive process. We posit that industrial computer vision must shift from "hiding secrets" to verifiable data minimization. We advocate for a design paradigm of architecturally constrained inference, formalized through information-theoretic principles, where the sensing pipeline is optimized to capture only the features necessary for a specific task (e.g., pose estimation). This provably constrains the information available for unauthorized inferences (e.g., identification), decoupling privacy from detection accuracy and reducing reliance on sensitive attribute supervision. We outline an implementation path using modular edge processing and trusted execution environments to enable verifiable, hardware-rooted attestations of task-bound processing, and argue that verifiable purpose limitation should be a prerequisite for responsible industrial AI.
Position: Privacy Is a Claim, Not a Property of Synthetic Data
Jiachen Zhao ⋅ Antonia Januszewicz ⋅ Taeho Jung
Synthetic data has become a common component of machine learning research. While widely adopted, its use in privacy-sensitive contexts has quietly shifted from a claim of residual inference risk under stated assumptions to an appearance-based property inferred from data generation itself. In this position paper, we argue that this shift reflects an implicit change in community standards for what counts as sufficient privacy evidence, rather than a misunderstanding of well-established privacy principles. Drawing on an empirical analysis of recent publications across major ML venues, we show that synthetic data is frequently used in privacy-sensitive settings without explicit articulation of threat models, inference risks, or falsifiable privacy claims. As a result, privacy assurance often remains implicit, difficult to verify, and unevenly distributed, with heightened exposure for rare and minority records. We argue for treating privacy as an explicit, evidence-based scientific claim and recommend that ML venues adopt norms requiring privacy-relevant assertions to be clearly scoped, testable, and contestable.
Position: Embodied AI Requires a Privacy-Utility Tradeoff
Xiaoliang Fan ⋅ jiarui chen ⋅ Zhuodong Liu ⋅ Ziqi Yang ⋅ Peixuan Xu ⋅ Ruimin Shen ⋅ Junhui Liu ⋅ Jianzhong Qi ⋅ Cheng Wang
Embodied AI (EAI) systems are rapidly transitioning from simulations into real-world domestic and other sensitive environments. However, recent EAI solutions have largely demonstrated advancements within *isolated stages* such as instruction, perception, planning and interaction, without considering their coupled privacy implications in high-frequency deployments where privacy leakage is often irreversible. This position paper argues that optimizing these components independently creates a systemic privacy crisis when deployed in sensitive settings, thereby advancing the position that privacy in EAI is a life cycle-level architectural constraint rather than a stage-local feature. To address these challenges, we propose Secure Privacy Integration in Next-generation Embodied AI (**SPINE**), a unified privacy-aware framework that treats privacy as a dynamic control signal governing *cross-stage* coupling throughout the entire EAI life cycle. SPINE decomposes the EAI pipeline into various stages and establishes a multi-criterion privacy classification matrix to orchestrate contextual sensitivity across stage boundaries. We conduct preliminary simulation and real-world case studies to conceptually validate how privacy constraints propagate downstream to reshape system behavior, illustrating the insufficiency of fragmented privacy patches and motivating future research directions into secure yet functional embodied AI systems.
Position: The Privacy-Auditability Paradox in Federated Learning: Why We Need Controllable Secure Aggregation
Runhua Xu ⋅ Guoan Wan ⋅ James Joshi
Federated Learning (FL) has become the de facto standard for privacy-preserving intelligence, largely due to Secure Aggregation protocols that guarantee the mathematical invisibility of individual user contributions. However, we contend that this pursuit of perfect privacy has engineered a systemic vulnerability: the Privacy-Auditability Paradox. By rendering user updates computationally indistinguishable, current protocols create a "Sanitization Gap" where malicious poisoning is undetectable and a "Regulatory Dead Zone" where compliance with the EU AI Act's robustness and explainability mandates is mathematically impossible. In this position paper, we argue that the community must transition from "Blind Aggregation" to Controllable Secure Aggregation (CSA). We propose a cryptographic paradigm shift utilizing Decentralized Multi-Client Functional Encryption and Zero-Knowledge Proofs (ZKPs) to replace binary secrecy with fine-grained policy-based governance. This framework introduces "Verified Blindness", where the server remains blind to raw data by default but possesses a cryptographically regulated "Break-Glass" mechanism to audit specific inputs under consensus-based governance. We conclude that adopting CSA is not merely a technical upgrade but an existential necessity to transform Federated Learning from an unregulated academic concept into robust, compliant, and trustworthy critical infrastructure.
Position: Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution
Ruta Binkyte ⋅ Ivaxi Sheth ⋅ Zhijing Jin ⋅ Mohammad Havaei ⋅ Bernhard Schölkopf ⋅ Mario Fritz
As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), is increasingly deployed in high-stakes domains, ensuring its trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and explainability, are hard to achieve simultaneously, especially while preserving utility. This position paper argues that causality is necessary to understand and balance trade-offs between performance and the multiple objectives of trustworthy AI. We ground our arguments in re-interpreting trustworthy AI trade-offs as incompatible invariance requirements under different changes to the data-generating process. We then illustrate that causality provides a unifying framework for understanding how trade-offs in trustworthy AI arise, and how they can be softened or resolved through selective invariance. This perspective applies to both classical ML models and large-scale FMs. Our paper discusses how causal assumptions may be applied explicitly or implicitly in modern large-scale systems. Finally, we outline open challenges and opportunities for using causality to build more trustworthy AI.
Position: Explanation Stability Is a Property of the Model–Method Pair, Not the Model
Kabilan Elangovan ⋅ Daniel Ting
This position paper argues that explanation stability claims are scientifically invalid without cross-method validation. Just as statistical significance requires specifying the test statistic, stability must be validated across multiple attribution paradigms or explicitly scoped to a single method’s computational objective. In controlled chest X-ray experiments, DenseNet201, ResNet50V2, and InceptionV3 achieve >99% AUC but exhibit reversed stability rankings across attribution methods. LayerCAM ranks InceptionV3 highest (IoU 0.777), while Grad-CAM++ favors DenseNet201, reducing InceptionV3’s score by 17.3%. These findings establish that explanation stability is an emergent property of the model–method pair, not an intrinsic model trait. We call for mandatory cross-method validation in XAI research and urge that regulatory submissions specify attribution methods to avoid illusory safety assurances.
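The abstract reports stability as an IoU score but does not say how it is computed. One common reading, assumed here, is the intersection-over-union of thresholded attribution maps for an original and a perturbed input. The sketch below uses plain nested lists; the threshold, the toy maps, and the function names are all hypothetical.

```python
def binarize(saliency, threshold):
    """Turn a 2D attribution map into a boolean mask of salient pixels."""
    return [[value >= threshold for value in row] for row in saliency]


def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks with the same shape."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += a and b
            union += a or b
    # Two empty masks agree perfectly, so treat that case as IoU 1.0.
    return intersection / union if union else 1.0


# Hypothetical attribution maps for the same image before and after a small perturbation.
map_original = [[0.9, 0.8, 0.1],
                [0.7, 0.2, 0.0]]
map_perturbed = [[0.9, 0.3, 0.1],
                 [0.6, 0.6, 0.0]]

score = iou(binarize(map_original, 0.5), binarize(map_perturbed, 0.5))
print(score)  # 0.5
```

Running the same comparison with maps from a second attribution method on the same model is the cross-method check the paper calls for; a ranking that holds under one method need not hold under another.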
Position: Child Safety Necessitates New Approaches to AI Safety
Neil Kale ⋅ Rebecca Portnoff ⋅ Pratiksha Thaker ⋅ Michael Simpson ⋅ Robertson Wang ⋅ Kevin Kuo ⋅ Chhavi Yadav ⋅ Virginia Smith
Modern artificial intelligence (AI) systems have transformative potential across many domains, but also present profound new risks to child safety. AI is increasingly being misused to create AI-generated child sexual abuse material, facilitate child sexual exploitation, and reduce barriers to harm. In this position paper, we argue that protecting children from AI-facilitated abuse requires new approaches to AI safety. Existing safety techniques assume data accessibility, transparency, and evaluation practices that are incompatible with the ethical and legal constraints surrounding child sexual abuse material. We examine how these constraints create new technical challenges, such as limitations on dataset auditing, red teaming, and fine-tuning prevention. In turn, we outline *15 open problems* in child safety across the AI development lifecycle, from dataset curation and model design to deployment and long-term maintenance. We propose targeted recommendations for researchers, developers, and policymakers to bridge the gap between theoretical AI safety and the realities of child protection. Our work aims to reframe child safety as a central, safety-critical dimension for AI research, motivating new work that translates responsible AI principles into concrete safeguards against the exploitation of children.
Position: LLM Agents Are the Antidote to Walled Gardens
Samuele Marro ⋅ Phil Torr
While the Internet's core infrastructure was designed to be open and universal, today’s application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift *universal interoperability*: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks and technical debt. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.
Position: Safe Models Do Not Guarantee Safe Societies: The Case for Sociopolitical Risk
David Guzman Piedrahita ⋅ Dave Banerjee ⋅ Changling Li ⋅ Terry Zhang ⋅ Kevin Blin ⋅ Samuel Simko ⋅ Punya Pandey ⋅ Irene Strauss ⋅ Rada Mihalcea ⋅ Bernhard Schölkopf ⋅ Zhijing Jin
Sociopolitical AI risks are threats to collective self-determination: a society's capacity to articulate its interests and realize them through institutions. We argue that sociopolitical AI risks emerge when general-purpose AI systems are integrated into society in ways that disproportionately amplify the scale, speed, and opacity of institutional operations, thereby degrading their capacity to function. Unlike model-level harms (toxicity, bias, discrimination), sociopolitical risks arise from widespread deployment rather than individual outputs. And unlike existential risks involving loss of control or complete labor automation, they manifest with current AI capabilities where AI augments rather than replaces human activity. In this position paper, we analyze how AI alters the conditions of governance: flooding government agencies with paralyzing volumes of input, concentrating control of infrastructure that threatens sovereignty, and flattening public debate into artificial agreement while reinforcing existing biases.
Position: Accountable Deployment of Agentic AI Demands Layered, System-Level Interpretability
Judy Zhu ⋅ Dhari Gandhi ⋅ Ahmad Mianroodi ⋅ Dhanesh Ramachandram ⋅ Sedef Akinli Kocak ⋅ shaina raza
Agentic AI systems behave through trajectories: they plan, invoke tools, update memory, and coordinate over multiple steps. However, interpretability remains largely model-centric, focused on explaining single predictions rather than tracing long-horizon behavior and responsibility across interacting components. As a result, critical failures, such as tool misuse, coordination breakdowns, or goal drift, often evade existing audits until harm occurs. **We argue that interpretability for agentic systems must become system-centric, addressing trajectories, responsibility assignment, and lifecycle dynamics rather than internal model mechanisms alone.** We advance three claims: interpretability must (1) co-evolve with agentic capabilities, (2) address distinct layers of opacity with tailored methods, and (3) integrate across the deployment lifecycle. To operationalize this position, we introduce **ATLIS (Agentic Trajectory and Layered Interpretability Stack)**, a framework integrating five interpretability layers across a five-stage deployment lifecycle. ATLIS enables lightweight continuous monitoring with risk-aware escalation to deeper system-level analysis when incidents are detected. ATLIS provides a blueprint for closing the growing gap between agentic capabilities and the interpretability infrastructure needed to govern them.
Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents
Rong Shan ⋅ Te Gao ⋅ Hang Zheng ⋅ Yunjia Xi ⋅ Jiachen Zhu ⋅ Zeyu Zheng ⋅ Yong Yu ⋅ Weinan Zhang ⋅ Jianghao Lin
The implicit policy of maintaining relatively stable acceptance rates at top AI conferences, despite exponentially growing submissions, introduces a critical structural vulnerability. This position paper characterizes a new systemic threat we term ***Agentic Denominator Gaming***, in which a malicious actor deploys AI agents to generate and submit a large volume of superficially plausible but low-quality papers. Crucially, their objective is not the acceptance of low-quality papers, but rather to inflate the submission denominator and overwhelm reviewing capacity. Under a relatively stable acceptance rate, this dilution can systematically increase the publication probability of a small, targeted set of legitimate papers. We analyze the practical feasibility of this threat and its broader consequences, including intensified reviewer burnout, degraded review quality, and the emergence of industrialized automated *agent mills*. Finally, we propose and evaluate a range of mitigation strategies, and argue that durable protection will require system-level policy and incentive reforms, rather than relying primarily on technical detection alone.
Position: Safe AI Should be Resistant and Resilient in an Evolving World
Youbang Sun ⋅ Xiang Wang ⋅ Jie Fu ⋅ Chaochao Lu ⋅ Bowen Zhou
In this position paper, we address the persistent gap between rapidly growing AI capabilities and lagging safety progress. Existing paradigms divide into "Make AI Safe", which applies post-hoc alignment and guardrails but remains brittle and reactive, and "Make Safe AI", which emphasizes intrinsic safety but struggles to address unforeseen risks in open-ended environments. We therefore propose safe-by-coevolution as a new formulation of the "Make Safe AI" paradigm, inspired by biological immunity, in which safety becomes a dynamic, adversarial, and ongoing learning process. To operationalize this vision, we introduce R$^2$AI---Resistant and Resilient AI---as a practical framework that unites resistance against known threats with resilience to unforeseen risks. R$^2$AI integrates fast and slow safe models, adversarial simulation and verification through a safety wind tunnel, and continual feedback loops that guide safety and capability to coevolve. We argue that this framework offers a scalable and proactive path to maintain continual safety in dynamic environments, addressing both near-term vulnerabilities and long-term existential risks as AI advances toward AGI and ASI.
Position: Comprehensive AI governance requires addressing non-model capability gains
Arthur Goemans ⋅ Daniel Altman ⋅ Noemi Dreksler ⋅ Jonas Freund ⋅ Milan Gandhi ⋅ Zhengdong Wang ⋅ Sarah Cogan ⋅ Sebastien Krier ⋅ Demetra Brady ⋅ Lewis Ho ⋅ Allan Dafoe
Frontier AI governance often centres on the model-level governance paradigm, which assumes that a model’s capability profile is primarily a function of the compute and data used during training. This position paper argues that model-level governance becomes less effective when capability progress is increasingly driven by "non-model gains"—improvements that are independent from advances in the base model. We formalise the concept of non-model gains and provide a taxonomy of three distinct vectors of capability gain: inference gain (scaling compute at test-time), systems gain (post-training enhancements such as scaffolds), and asset gain (enhancing a model with restricted assets). We demonstrate how these vectors—alongside potential future impacts from embodiment, continual learning, and diffusion—may undermine risk management strategies that hinge mostly on pre-deployment evaluation and mitigation. We provide an overview of governance approaches that go beyond the model level: system, entity, agent, and cloud governance. Finally, we emphasise the importance of societal resilience as a complement to these governance layers.
Position: Preparing for AI Systems That Deceive Developers
Fengyu Duan ⋅ Xudong Pan ⋅ Yawen Duan ⋅ Adam Gleave ⋅ Ranjie Duan ⋅ Jianfeng Cao ⋅ Wenqi Chen ⋅ Yinpeng Dong ⋅ Jiarun Dai ⋅ Jie Fu ⋅ Xudong Guo ⋅ Tianxing He ⋅ Geng Hong ⋅ Naying HU ⋅ Xiaojian Li ⋅ Dongrui Liu ⋅ Chaochao Lu ⋅ Sören Mindermann ⋅ Peng XU ⋅ Yang Zhang ⋅ Chen Zheng ⋅ Brian Tse ⋅ Min Yang ⋅ Xia Hu
AI systems may exhibit deceptive behaviors that mislead developers about their capabilities, propensities, or actions. Such deception can take distinct forms across the development lifecycle: training subversion, evaluation gaming, and control evasion. We argue that the AI community should prioritize AI deception targeting developers as a distinct risk category because it compromises developers' ability to identify and mitigate all other risks. We propose three recommendations for developers: preserving monitorability during training, ensuring safety evaluation integrity against evaluation-aware systems, and establishing non-evadable control prior to deployment. We identify open problems for the research community, whose resolution is critical for the safe development of frontier AI.
Position: Let’s Build a Trustworthy Model Context Protocol!
Arjhun Swaminathan ⋅ Anika Hannemann
The Model Context Protocol (MCP) standardizes AI agent-tool interaction, accelerating agentic AI adoption through interoperability. This presents an opportunity to embed trustworthiness: As a standard and an interface between agents and tools, MCP becomes a natural enforcement point; any improvements to it automatically propagate to all systems using it. Analyzing MCP through the EU Commission’s Ethics Guidelines for Trustworthy AI, we identify three things: fundamental shifts in how trustworthiness works, critical challenges these shifts create, and strategic intervention points where protocol-level mechanisms can achieve ecosystem-wide impact. We argue that MCP’s architecture provides a foundation for trustworthiness and propose practical improvements to strengthen it. This position paper posits that building trustworthy MCP enables responsible agentic AI deployments.
Position: Responsible AI for AI companions must actively combat violence toward intimate partners
Atmadeep Ghoshal ⋅ Anasmita Ghoshal ⋅ Volodymyr Shevchenko ⋅ Ashwini B ⋅ Arshia Dutta ⋅ Ruba Abu-Salma ⋅ Martim Brandao
AI companions function differently from earlier interactive technologies by establishing sustained relational environments through anthropomorphism and continuous validation. This position paper argues that **Responsible AI for AI companions must actively combat violence toward intimate partners** who may never directly engage with these systems but may experience the consequences of behaviorally conditioned users. We examine how these systems create conditions where users rehearse violence without encountering resistance, and we identify structural gaps in existing safety approaches that focus exclusively on direct user protection. Drawing on research on intimate partner violence (IPV), coercive control, and technology-facilitated abuse, we propose three intervention pathways: involving IPV survivors in red-teaming and benchmark development; implementing behavioral monitoring with graduated enforcement mechanisms; and reorienting AI safety research toward granular harm taxonomies capable of detecting longitudinal patterns of violence across extended interactions. Together, these recommendations center non-user security alongside user well-being.
Position: Safety Must Precede the Deployment of Open-Ended AI Agents
Ivaxi Sheth ⋅ Jan Wehner ⋅ Sahar Abdelnabi ⋅ Ruta Binkyte ⋅ Mario Fritz
AI advancements have been significantly driven by a combination of foundation models and curiosity-driven learning aimed at increasing capability and adaptability. Within this landscape, open-endedness, where AI agents autonomously and indefinitely generate novel behaviors, representations, or solutions, has gained increasing interest. This has become relevant in the context of self-evolving agents and long-horizon discovery. This position paper argues that the defining properties of open-ended AI systems introduce a distinct and underexplored class of safety challenges, including loss of predictability, emergent misalignment, and difficulties in maintaining effective control as systems evolve beyond their initial design assumptions, that must be addressed preemptively. These challenges differ qualitatively from those associated with task-bounded or static models and are unlikely to be addressed by existing safety frameworks alone, which is why these risks must be examined proactively, before large-scale deployment. The paper outlines key challenges, discusses research opportunities, and calls for coordinated action to support the safe and responsible development of open-ended AI.
Position: AI Lock-In Is in Progress, and We Must Be Prepared
Jaeho Kim ⋅ Seokhyun Lee ⋅ Jieun Lee ⋅ Changhee Lee
AI safety research has mainly focused on two areas: technical alignment (ensuring AI systems produce human-aligned outputs) and the regulation of generative AI's societal impacts (including unemployment risk and labor market disruption). However, an equally important dimension remains underexplored: the risk inherent in dependence on AI systems themselves. In this position paper, we argue that AI safety research should address ***AI Lock-In***, the phenomenon whereby excessive reliance on AI systems leads to human deskilling, diminishes human capacity for independent functioning, and creates systemic vulnerabilities when AI systems become unavailable or compromised. We highlight that AI Lock-In is a systemic threat that is already emerging at individual, societal, and national levels, one that could be dramatically amplified by AI service disruptions or geopolitical conflicts. Drawing on detailed scenarios, we investigate how AI Lock-In emerges and escalates across multiple levels, ranging from individual skill atrophy to national-scale infrastructure failures. To address this, we provide guidance on how such risks can be mitigated and prepared for at each level. We contend that proactively addressing AI Lock-In before such dependencies become entrenched and irreversible is essential for preserving individual autonomy and national security.
Position: Bridge the Gaps between AI Development and Regulation
Mansur Khan ⋅ Mehmet Akengin ⋅ Osman Salahuddin ⋅ Ahmad A. Rushdi
While AI models advance at unprecedented rates, AI safety legislation in the United States remains largely stalled or unrealized. We observe that AI policy activity is increasing globally, yet binding enactments remain limited relative to the pace of technical capability releases. We argue for the need to bridge this gap between AI development and its regulation. Specifically, we support our position through a technical analysis of all U.S. AI-related bills introduced from 2017 to 2025, showing that only 4.23% of U.S. AI bills reach any terminal outcome. We identify that procedural bottlenecks, including committee pigeonholing, multi-sponsor coordination challenges, and expertise asymmetries, are primary correlates of legislative stalling. Our comprehensive analysis of institutional, economic, political, and informational constraints shows factors exacerbating these regulatory delays. To address this multi-faceted gap, we propose policy recommendations grounded in planned adaptation, preemptive enactment, and independent AI oversight. Finally, we highlight the need for coordinated action across policymakers, developers, and industry stakeholders so that AI safety governance keeps pace with technological innovation.
Position: LLM-Safety Evaluations Lack Robustness
Tim Beyer ⋅ Sophie Xhonneux ⋅ Simon Geisler ⋅ Gauthier Gidel ⋅ Leo Schwinn ⋅ Stephan Günnemann
In this position paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing research progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field’s ability to generate easily comparable results and make measurable progress.
Position: AI Governance Needs ISO-like Interoperability Protocols, Not Just Laws
Azmine Toushik Wasi ⋅ Mst Islam ⋅ Mahfuz Anik ⋅ Manjurul Ahsan ⋅ Taki Hasan Rafi ⋅ Dong-Kyu Chae
As Artificial Intelligence (AI) becomes increasingly embedded in global infrastructure, the urgency for robust governance frameworks has intensified. However, current approaches, led by jurisdiction-specific laws such as the EU AI Act, China's algorithm governance, and the NIST AI Risk Management Framework in the U.S., create a fragmented regulatory landscape. In this position paper, we argue that ***AI governance must be built not on laws alone, but on ISO-like interoperability protocols that enable standardized, machine-readable risk communication across borders***. Drawing on the success of the GDPR, which was operationalized through standards like ISO 27001 and Privacy by Design, we propose the development of standardized AI *nutrition labels* containing unified metrics for bias, energy usage, and data provenance to facilitate cross-jurisdictional compliance. These manifests would lower barriers for small and medium enterprises (SMEs), reduce redundant regulatory efforts, and build public trust. The paper addresses concerns that standards may stifle innovation by advocating for modular, versioned protocols designed to evolve in tandem with technological change. Overall, we call for a shift from siloed legal compliance toward interoperable technical conformance, enabling a shared global language for responsible AI deployment.
Position: Current Model Cards Are Insufficient for Downstream Governance of Open-Weight Foundation Models
Sungwon Chae ⋅ Keonwoo Kim ⋅ Hoki Kim ⋅ Jaeyeon Ju ⋅ Sangchul Park
The growth of open-weight foundation models (OWFMs) has prompted the AI community to re-evaluate strategies for effective downstream governance. Although model cards have been widely adopted as transparency artifacts in model repositories, existing frameworks often fail to adequately inform downstream developers and users about the distinct safety challenges posed by OWFMs. This position paper analyzes 500 model cards hosted on Hugging Face and argues that effective governance of OWFMs requires a multi-layered approach integrating three complementary components: (i) model cards, (ii) acceptable use policies (AUPs), and (iii) licenses. To motivate this claim, we identify a safety gap left by existing regulatory approaches, including model heritage, alignment provenance, and empirically observed behaviors, through an analysis of model cards with safety-critical information. We further argue that standard open-source licenses (OSLs) are poorly suited to OWFMs and often undermine the enforceability of AUPs. Building on these observations, we outline directions for evolving model cards, AUPs, and licenses into integrated safety artifacts to enable a more comprehensive governance framework that coherently integrates informational, normative, and legal dimensions.
Position: Agent Security Needs Redefinition through a Holistic Framework
Vincent Siu ⋅ Jingxuan He ⋅ Kyle Montgomery ⋅ Zhun Wang ⋅ Chenguang Wang ⋅ Dawn Song
Existing definitions of agent security are ambiguous because they do not fully capture the holistic view across agent components. For instance, current work fails to distinguish between potentially legitimate administrative tasks and malicious exploitation of the same command. A command to "delete user data" could be either instruction following to reset a sandbox or a prompt injection attacking production systems. We argue that agent security must be redefined through a holistic framework including four core components: identity (who: authority and authentication), task (what to do: authorized objectives), trajectory (progress: action-observation boundaries), and memory (what can be retrieved: information access control). Our framework redefines existing security violations (e.g., reframing prompt injection as an identity violation), enables discovery of new attack vectors, and distinguishes legitimate capabilities like instruction following from security violations like prompt injection attacks. Critically, we demonstrate that temporal aspects are essential: attacks can be misdefined or unnoticed without accounting for how security components in our framework evolve over time. Our framework further identifies that agentic task decomposition and data and control flow patterns are crucial to rigorous security definitions, aspects previous frameworks fail to address, and provides a new foundation for future agent security work.
Position: To Defend Against Cyber Attacks, We Must Teach AI Agents to Hack
Terry Yue Zhuo ⋅ Yangruibo Ding ⋅ Wenbo Guo ⋅ Ruijie Meng
For over a decade, cybersecurity has relied on human labor scarcity to limit attackers to either manual attacks on high-value targets or generic automated attacks at scale. Building sophisticated exploits requires deep expertise and manual effort, leading defenders to assume adversaries cannot afford tailored attacks at scale. AI agents break this balance by automating vulnerability discovery and exploitation across thousands of targets, needing only small success rates to remain profitable. Current developers focus on preventing misuse through data filtering, safety alignment, and output guardrails. Such protections fail against adversaries who control open-weight models, bypass safety controls, or develop offensive capabilities independently. We argue that AI-agent-driven cyber attacks are inevitable, requiring a fundamental shift in defensive strategy. In this position paper, we identify why existing defenses cannot stop adaptive adversaries and demonstrate that defenders must develop offensive security intelligence. We propose three actions for building frontier offensive AI capabilities responsibly. First, construct comprehensive benchmarks covering the full attack lifecycle. Second, advance from workflow-based to trained agents for discovering in-the-wild vulnerabilities at scale. Third, implement governance restricting offensive agents to audited cyber ranges, staging release by capability tier, and distilling findings into safe defensive-only agents. We strongly recommend treating offensive AI capabilities as essential defensive infrastructure, as containing cybersecurity risks requires mastering them in controlled settings before adversaries do.
Position: Retire the "Positive Backdoor" Label—Secret Alignment Requires Strict and Systematic Evaluation
Jianwei Li ⋅ Jung-Eun Kim
This position paper argues that the AI/ML community should stop overclaiming and retire the label “positive backdoor”, and instead treat trigger-activated hidden behaviors as **Secret Alignment**. Crucially, protective claims based on Secret Alignment should be presumed *not secure by default* unless supported by rigorous, standardized evaluation. The Private AI era, enabled by open-weight LLMs and accessible training/inference stacks, turns language models into privately owned digital assets, creating security concerns around unauthorized access, model theft, and behavioral misuse. Recently, a line of work framed as “positive backdoors” has been proposed to address these challenges. To ground our position in evidence, we unify these proposals as covert trigger--behavior associations for access gating, ownership attribution, and safety enforcement, and evaluate three representative applications across six core properties: effectiveness, harmlessness, persistence, efficiency, robustness, and reliability. Our results reveal substantial brittleness---especially in the confidentiality, integrity, and availability (CIA)---of trigger--behavior mappings often underrepresented by existing claims. We further relate these outcomes to **behavior density** and **decision complexity**, offering a behavioral lens for understanding deployment-time risks and motivating community-wide evaluation that makes Secret Alignment claims provable.
Position: Quantum Kernel Machines Should Move Beyond Scalar-Valued Kernels to Realize Their Potential
Hachem Kadri ⋅ Joachim Tomasi ⋅ Yuka Hashimoto ⋅ Sandrine Anthoine
Quantum kernels are reproducing kernel functions built using quantum-mechanical principles and have emerged as a centerpiece of quantum machine learning. The initial enthusiasm for quantum kernel machines has been tempered by recent studies suggesting that quantum kernels could not offer significant computational or statistical advantages when learning from classical data. However, most of the research in this area has been devoted to scalar-valued kernels in standard classification or regression settings for which classical kernel methods are efficient and effective, leaving very little room for improvement with quantum kernels. In this position paper, we argue that progress in this field requires moving beyond scalar-valued kernels toward more expressive kernel frameworks. Scalar-valued kernels lack the degrees of freedom necessary to fully exploit intrinsically quantum resources such as entanglement and are not rich enough to deal with complex learning tasks where classical learning methods struggle. Building on recent advances in operator-valued kernel learning and C*-algebraic kernel representations, we propose a roadmap for designing quantum kernels capable of leveraging entanglement and non-commutative structures to tackle complex structured prediction problems. To support this viewpoint, we present an initial proof-of-concept illustrating how quantum entangled operator-valued kernel formulations can reveal structural dependencies that remain difficult to access for scalar-valued kernel methods. This shift in focus could open a pathway toward a new generation of quantum kernel machines and a more faithful exploration of their potential advantages.
Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management
Guanyu Cui ⋅ Zhewei Wei ⋅ Kun He
Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a *fixed Transformer system* setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method to process inputs of different lengths step by step, and (ii) a *scaling-family* setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are frequently established in setting (ii), whereas real-world LLM deployment and the standard notion of Turing-completeness correspond more naturally to setting (i). In this paper, we first formalize the fixed-system setting, thereby providing a concrete characterization of how real-world LLMs operate. We then argue that results proved in the scaling-family setting do not establish Turing-completeness, clarifying a common misinterpretation of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we advocate the position that context management is a central component that critically determines the computational power of real-world autoregressive Transformers.
Position: Machine Learning for Heart Transplant Allocation Policy Optimization Should Account for Incentives
Ioannis Anagnostides ⋅ Itai Zilberstein ⋅ Zachary Sollie ⋅ Arman Kilic ⋅ Tuomas Sandholm
The allocation of scarce donor organs constitutes one of the most consequential algorithmic challenges in healthcare. While the field is rapidly transitioning from rigid, rule-based systems to machine learning and data-driven optimization, we argue that current approaches often overlook a fundamental barrier: incentives. In this position paper, we highlight that organ allocation is not merely an optimization problem, but rather a complex game involving organ procurement organizations, transplant centers, clinicians, patients, and regulators. Focusing on US adult heart transplant allocation, we identify critical incentive misalignments across the decision-making pipeline, and present data showing that they are having adverse consequences today. Our main position is that the next generation of allocation policies should be incentive aware. We outline a research agenda for the machine learning community, calling for the integration of mechanism design, strategic classification, causal inference, and social choice to ensure robustness, efficiency, fairness, and trust in the face of strategic behavior from the various constituent groups.
Position: Uncertainty is a Strategic Signal in Human–AI Decision Making
Achref Doula ⋅ Otthein Herzog ⋅ Siegfried WU ⋅ Max Mühlhäuser
AI-assisted decision-making is subject to AI model uncertainty. Prior works proposed to make this uncertainty explicit for increasing trust and transparency, but its behavioral role was rarely treated. This position paper argues, from a game-theoretic perspective, that human–AI decision support should be viewed as a repeated mechanism in which AI uncertainty functions as a strategic signal that shapes how users adopt reliance policies over time. We formalize a framework in which the interface specifies uncertainty signals, user response such as accepting versus verifying, and the resulting policy-shaping consequences. These repeated steps are used to characterize near-separating reliance regimes. A first pilot study conducted with 180 participants supports our proposition: Our game-theoretic mechanism increased verification and sharply reduced blind acceptance of wrong AI outputs. These initial results support treating human–AI interaction as a game-theoretic mechanism with uncertainty as a strategic signal, rather than a static model property or purely informational label.
Position: Enabling Fair Revenue Sharing for Data Providers in GenAI Systems
Gengrui (Edward) Zhang
GenAI systems, particularly LLMs, rely heavily on vast amounts of publicly available digital content as training data. A significant portion of this content is protected by copyright. While large-scale data scraping may be lawful under certain jurisdictions, the use of copyrighted works to generate outputs that compete with or replicate original creations raises unresolved legal, economic, and ethical concerns. In this position paper, we argue that data providers should be fairly compensated based on their measurable contribution to inference-time outcomes, rather than through coarse, one-time licensing or blanket agreements. We examine alternative perspectives on data ownership, fair use, and model training, and discuss why existing approaches fail to align incentives between GenAI developers and content creators. We then outline concrete roadmaps for developing decentralized systems that enable contribution-aware revenue sharing, including mechanisms for attribution, accounting, and payout at scale. We argue that fair revenue distribution for data providers will not only help resolve ongoing legal disputes surrounding GenAI systems, but also foster a new era of collaboration, rather than competition, between model developers and data creators. By incentivizing the production and sharing of high-quality datasets, such mechanisms can ultimately accelerate the development of more robust, trustworthy, and socially sustainable GenAI systems.
Position: The Machine Learning Community Must Treat Compute Inequality as a First-Class Research Problem
Md Muntaqim Meherab ⋅ Noor Islam S. Mohammad
This is a position paper. We argue that compute inequality (systematic disparities in who can access modern machine-learning compute and at what cost) should be treated as a first-class research problem by the ML community. Training compute for state-of-the-art models has grown dramatically, while the practical ability to run large experiments remains concentrated in a small set of well-resourced labs and regions. This concentration shapes what questions get asked, what results can be reproduced, and who gets to participate in setting research agendas. We propose that conferences, funders, and model developers adopt concrete norms: low-compute benchmark tracks, mandatory lightweight baselines, and standardized reporting of compute and energy. We also address the common view that cheaper hardware or ad hoc cloud credits will resolve the problem on their own, and explain why that expectation is incomplete.