WET: Mitigating World-Conditioned Knowledge Conflicts via World Entropy Tethering
Abstract
Large language models (LLMs) face a "loyalty dilemma" when correctness is conditioned on an active world-of-discourse. We identify a systemic failure mode, world misattribution, in which a model implicitly grounds generation in an incompatible regime and drifts away from the target world. We propose World Entropy Tethering (WET), an inference-time monitor-and-tether procedure: a world-entropy probe flags drift risk at prompt anchors, and a conditional score-matching geometry model identifies tethering heads for entropy-gated rescaling. Experiments show: (I) Linear Separability: world labels are linearly decodable from internal states; (II) Geometric Drift: hallucinations are preceded by measurable deviations from the target world region; and (III) Targeted Mitigation: WET improves world consistency and reduces hallucination rates by up to 22.4% without compromising generation quality. Code is available at https://anonymous.4open.science/r/WET-ADA0/.
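The monitor-and-tether idea can be illustrated with a minimal sketch. All names here (the probe parameters, the entropy threshold, and the scaling factor `alpha`) are illustrative assumptions, not the paper's implementation: a linear probe over a hidden state yields a distribution over world labels, its entropy serves as the drift-risk signal, and when that entropy exceeds a threshold the outputs of the identified tethering heads are rescaled.

```python
import numpy as np

def world_entropy(hidden_state, probe_W, probe_b):
    """Entropy of a linear world-probe's softmax over world labels.
    probe_W and probe_b are hypothetical learned probe parameters."""
    logits = hidden_state @ probe_W + probe_b
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

def entropy_gated_rescale(head_outputs, tether_idx, entropy,
                          threshold=1.0, alpha=0.5):
    """Rescale the outputs of the tethering heads only when the probe's
    world-entropy exceeds the gating threshold (i.e., drift risk is flagged).
    threshold and alpha are illustrative hyperparameters."""
    out = head_outputs.copy()
    if entropy > threshold:
        out[tether_idx] *= alpha
    return out
```

A toy usage: with an uninformative hidden state the probe is maximally uncertain, the entropy gate fires, and only the selected heads are rescaled; in low-entropy (on-world) states the heads pass through unchanged.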