Context Distillation Retains Post-Training Capabilities in Continually Trained LMs
Abstract
Post-training endows pretrained LLMs with a variety of desirable skills, such as instruction-following and reasoning. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot effectively learn new knowledge from adaptation document corpora while simultaneously mitigating the forgetting of previously learned capabilities. To address this, we introduce Distillation via Split Contexts (DiSC), a simple context-distillation-based approach for continual knowledge adaptation. DiSC derives student and teacher distributions by conditioning on distinct segments of the same training example and minimizes the KL divergence between them for the tokens common to both. This formulation lets us apply context distillation efficiently, without requiring explicit generation steps during training. We run experiments on three post-trained models and two adaptation domains. Compared to prior finetuning and distillation methods for continual adaptation, DiSC consistently achieves the best trade-off between learning new knowledge and mitigating the forgetting of previously learned skills, such as instruction-following and reasoning, as well as factual knowledge.
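As a rough illustration of the idea described above (not the paper's actual implementation), the following minimal PyTorch-style sketch shows one way a split-context distillation loss could be computed: the teacher and student distributions come from the same model conditioned on different segments of an example, and the KL divergence is taken only over the tokens both contexts share. The HuggingFace-style model interface and all names here (split_context_kl, student_ids, teacher_ids, common_len) are hypothetical assumptions, not from the paper.

    # Hypothetical sketch of a split-context distillation loss, assuming a causal LM
    # `model` with a HuggingFace-style interface (model(input_ids).logits).
    import torch
    import torch.nn.functional as F

    def split_context_kl(model, student_ids, teacher_ids, common_len):
        """KL between teacher and student next-token distributions over the last
        `common_len` tokens, which both contexts share.

        student_ids: one segment of the example followed by the common tokens
        teacher_ids: a different segment followed by the same common tokens
        """
        # Teacher distribution: conditioned on its own segment, no gradient.
        with torch.no_grad():
            teacher_logits = model(teacher_ids).logits[:, -common_len - 1:-1, :]
        # Student distribution: conditioned on the other segment, gradients flow.
        student_logits = model(student_ids).logits[:, -common_len - 1:-1, :]

        teacher_logprobs = F.log_softmax(teacher_logits, dim=-1)
        student_logprobs = F.log_softmax(student_logits, dim=-1)
        # KL(teacher || student), averaged over the batch of common-token positions.
        return F.kl_div(student_logprobs, teacher_logprobs,
                        log_target=True, reduction="batchmean")

Because both distributions come from forward passes over already-given text, no sampling or generation step is needed, which is consistent with the efficiency claim in the abstract.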