Entropy-aware Span-Constrained Optimal Transport for Robust Cross-Tokenizer Knowledge Distillation
Abstract
Existing Cross-Tokenizer Knowledge Distillation (CTKD) methods fail to outperform simple supervised fine-tuning when vocabulary overlap is low, owing to severe alignment noise; we identify this phenomenon as the ``low-overlap negative transfer regime.'' To overcome it, we propose Entropy-aware Span-Constrained Optimal Transport (E-SCOT), a robust framework that treats distillation as a sparse transport problem with a vocabulary-agnostic ground metric. Unlike prior OT approaches that incur quadratic cost through dense optimization, E-SCOT employs span-anchored lexical alignment to construct a deterministic, locality-preserving coupling in linear time. We further introduce R\'enyi-entropy adaptive reweighting, which dynamically concentrates the distillation budget on informative positions exhibiting large uncertainty-profile gaps. Extensive experiments demonstrate that E-SCOT achieves state-of-the-art performance across diverse model families, effectively eliminating negative transfer even in challenging low-overlap scenarios.
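A minimal Python sketch of span-anchored coupling, assuming each tokenizer exposes character-offset spans for its tokens (e.g., an offsets mapping over the same source text). The function name \texttt{span\_anchored\_coupling} and the overlap-normalized weights are illustrative assumptions, not the exact construction; the sweep is linear when each character is covered by a bounded number of tokens.

\begin{verbatim}
def span_anchored_coupling(teacher_spans, student_spans):
    # Deterministic, locality-preserving coupling between two tokenizations
    # of the same text, built in one sweep over sorted character-offset
    # spans. Returns sparse (teacher_idx, student_idx, weight) triples,
    # with weights normalized per teacher token by character overlap.
    pairs, j = [], 0
    for i, (ts, te) in enumerate(teacher_spans):
        # Skip student tokens that end before this teacher span starts.
        while j < len(student_spans) and student_spans[j][1] <= ts:
            j += 1
        k, overlaps = j, []
        while k < len(student_spans) and student_spans[k][0] < te:
            ss, se = student_spans[k]
            ov = min(te, se) - max(ts, ss)
            if ov > 0:
                overlaps.append((k, ov))
            k += 1
        total = sum(ov for _, ov in overlaps)
        for k_idx, ov in overlaps:
            pairs.append((i, k_idx, ov / total))
    return pairs
\end{verbatim}

Because the coupling touches only lexically overlapping token pairs, the resulting transport plan is sparse by construction rather than obtained by pruning a dense solution.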
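A minimal sketch of the entropy-based reweighting, assuming per-position next-token distributions from teacher and student are available; the names \texttt{renyi\_entropy} and \texttt{entropy\_gap\_weights}, the choice $\alpha = 0.5$, and the exponential normalization with temperature \texttt{tau} are illustrative, not the paper's exact formulation.

\begin{verbatim}
import numpy as np

def renyi_entropy(p, alpha=0.5, eps=1e-12):
    # Renyi entropy H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha),
    # which recovers Shannon entropy in the limit alpha -> 1.
    p = np.clip(p, eps, None)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def entropy_gap_weights(teacher_probs, student_probs, alpha=0.5, tau=1.0):
    # Per-position weights from the teacher/student uncertainty-profile
    # gap. The two vocabularies may differ: only entropies are compared.
    gaps = np.array([abs(renyi_entropy(t, alpha) - renyi_entropy(s, alpha))
                     for t, s in zip(teacher_probs, student_probs)])
    w = np.exp(gaps / tau)            # emphasize high-gap positions
    return w * (len(w) / w.sum())     # mean weight 1 keeps the loss scale
\end{verbatim}

Normalizing the weights to mean one leaves the overall distillation loss scale unchanged while redistributing the budget toward positions where the uncertainty profiles disagree most.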