Echoes within the Reasoning: Stealth and Effective Watermarking via Chain of Thought
Abstract
Large Language Models (LLMs) with proprietary Chain-of-Thought (CoT) capabilities constitute high-value intellectual property, yet protecting them against theft and unlicensed commercialization remains a critical challenge. Existing watermarking paradigms are ill-suited to safeguarding these models: direct logit perturbations inevitably fracture the fragile logical consistency that complex reasoning requires, while more superficial schemes are easily erased by fine-tuning. In this paper, we propose BiCoT, a framework that embeds ownership directly into the reasoning representations via bi-level variational alignment. Rather than adding external perturbations, our method optimizes the model's internal states to collapse onto a signature subspace. This creates a functional entanglement in which the watermark becomes a prerequisite for the model's reasoning utility: removing the signature destroys the capability. To handle representation drift in stolen models, we further introduce a Robust Subspace Registration (RSR) verifier. Experiments demonstrate that BiCoT incurs negligible fidelity loss while maintaining strong robustness against diverse attacks on both in-domain and out-of-distribution data.
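To make the subspace intuition concrete, the following is a minimal sketch of what "collapsing internal states onto a signature subspace" could mean geometrically. It is not the BiCoT method itself: the basis `U`, the `alignment` score, and the thresholds are all illustrative assumptions, standing in for the paper's variational alignment and RSR verifier.

```python
# Illustrative sketch (not the paper's algorithm): hidden states are vectors
# in R^d, and the owner's hypothetical "signature subspace" is spanned by an
# orthonormal basis U of shape (d, k).
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4

# Hypothetical signature subspace: orthonormal basis via QR decomposition.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

def alignment(h, U):
    """Fraction of a hidden state's energy lying in the signature subspace."""
    proj = U @ (U.T @ h)                    # orthogonal projection onto span(U)
    return float(np.dot(proj, proj) / np.dot(h, h))

# A state collapsed onto the subspace scores high; an unrelated state does not.
h_marked = U @ rng.standard_normal(k) + 0.05 * rng.standard_normal(d)
h_random = rng.standard_normal(d)

print(alignment(h_marked, U) > 0.9)   # high alignment suggests ownership
print(alignment(h_random, U) < 0.5)   # random states have low alignment
```

Verification in this toy picture reduces to thresholding the alignment score; the abstract's RSR verifier additionally compensates for representation drift in stolen models, which this sketch does not model.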