Submodular Optimization for Minimal Augmentation in Robust Language Model Alignment
Ching-Chia Kao ⋅ Chia-Mu Yu ⋅ Chun-Shien Lu ⋅ Chu-Song Chen
Abstract
Safety alignment of large language models is fragile: even small fine-tuning perturbations elastically revert behaviors toward those of the pre-trained model, with degradation inversely proportional to the size of the alignment set. We ask how to achieve safety alignment with \emph{minimal augmentation}. To this end, we model augmentation as a set of group actions on sequences and formalize robustness gains as a normalized, monotone submodular function over transformations. We then leverage submodular optimization to select minimal augmentations that provably improve robustness. Experiments confirm that our approach efficiently restores safety alignment while minimizing augmentation overhead.
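The abstract does not spell out the selection procedure; the sketch below illustrates how a standard greedy maximizer of a normalized, monotone submodular gain function could pick a small augmentation set under a cardinality budget. The names candidates, robustness_gain, and budget are illustrative assumptions, not the paper's API; for monotone submodular gains the greedy rule carries the classic (1 - 1/e) approximation guarantee.

# A minimal sketch (not the paper's implementation): greedy selection of
# augmenting transformations under a cardinality budget, assuming the
# robustness gain is a normalized, monotone submodular set function.
from typing import Callable, FrozenSet, Hashable, Iterable, Set


def greedy_min_augmentation(
    candidates: Iterable[Hashable],
    robustness_gain: Callable[[FrozenSet[Hashable]], float],
    budget: int,
) -> Set[Hashable]:
    """Select at most `budget` transformations, each maximizing marginal gain."""
    selected: Set[Hashable] = set()
    remaining = set(candidates)
    current_value = robustness_gain(frozenset(selected))

    for _ in range(budget):
        # Marginal gain of adding each remaining transformation.
        best, best_gain = None, 0.0
        for t in remaining:
            gain = robustness_gain(frozenset(selected | {t})) - current_value
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:  # No transformation improves robustness; stop early.
            break
        selected.add(best)
        remaining.discard(best)
        current_value += best_gain
    return selected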