Position: The Data Provenance–Parametric Divide in Large Language Models
Abstract
This position paper argues that as Large Language Models (LLMs) increasingly consume synthetic data, parametric representations can no longer serve as reliable witnesses of factual provenance. Current architectures, which treat fluent outputs as implicitly grounded, create a critical epistemic failure mode: systems emit accurate-looking claims with no recoverable lineage to verifiable sources. We advance the position that referenceability and explicit traceability of claims to accessible evidence must be enforced as a non-negotiable system invariant. Distinct from Retrieval-Augmented Generation (RAG), which enriches generation with external context, we propose a negative safety constraint: in factual settings, no atomic claim should be emitted unless it is evidence-gated, that is, backed by source identifiers resolving to evidence that entails it; otherwise, the system must abstain. To operationalize this, we introduce a “separation-of-powers” architecture that decouples parametric generation from factual authorization, along with a diagnostic metric, the Parametric Leakage Ratio (PLR), which quantifies ungrounded factual emissions. We conclude that enforcing a strict provenance–parametric divide is essential to prevent safety certifications from legitimizing unverifiable outputs in high-stakes domains such as healthcare.
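
A minimal sketch of the two mechanisms named above, under stated assumptions: the abstract does not give formal definitions, so the claim representation (AtomicClaim), the resolve and entails callables, and the normalization of PLR as ungrounded claims over total emitted claims are illustrative choices, not the paper's specification.

    from dataclasses import dataclass
    from typing import Callable, List, Optional

    @dataclass
    class AtomicClaim:
        text: str
        evidence_ids: List[str]  # identifiers of the sources offered as support

    def evidence_gate(claim: AtomicClaim,
                      resolve: Callable[[str], Optional[str]],
                      entails: Callable[[str, str], bool]) -> bool:
        """Authorize emission only if at least one cited identifier resolves
        to evidence that entails the claim; otherwise the caller must abstain."""
        for eid in claim.evidence_ids:
            evidence = resolve(eid)
            if evidence is not None and entails(evidence, claim.text):
                return True
        return False

    def parametric_leakage_ratio(claims: List[AtomicClaim],
                                 resolve: Callable[[str], Optional[str]],
                                 entails: Callable[[str, str], bool]) -> float:
        """PLR: fraction of emitted atomic factual claims that are not
        evidence-gated (one plausible normalization; the paper may define
        the metric differently)."""
        if not claims:
            return 0.0
        ungrounded = sum(1 for c in claims
                         if not evidence_gate(c, resolve, entails))
        return ungrounded / len(claims)

The gate is deliberately negative: the default outcome is abstention, and emission requires an identifier whose resolved evidence entails the claim, which is what separates this constraint from retrieval that merely enriches the generation context.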