Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging
Abstract
Instruction tuning aligns large (multimodal) language models with diverse user intents, but scaling to heterogeneous mixtures is hindered by (i) gradient interference that causes negative transfer and stiff high-curvature dynamics, and (ii) bandwidth-heavy synchronization that is often impractical on fragmented compute. We propose MERIT, a decentralized, merge-ready pipeline that splits mixtures before fine-tuning. Starting from a merge-ready initialization, MERIT estimates dataset-level gradients, builds a cosine-similarity conflict matrix, applies PCA to extract dominant conflict axes, and partitions datasets accordingly. Each partition is fine-tuned independently with no inter-partition communication, and the resulting checkpoints are merged via one-shot token-weighted averaging. A local quadratic flat-basin analysis shows that merging acts as a curvature-weighted spectral filter, and that PCA-aligned splitting amplifies cancellation of high-curvature disagreement components. On Qwen2.5-VL-3B fine-tuned on 136 Vision-FLAN tasks, MERIT improves the overall benchmark average from 54.7 (centralized joint training) to 57.0 while enabling communication-free parallel fine-tuning. We further validate MERIT at 7B scale on a 1.6M-example mixture and on text-only instruction mixtures.
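The splitting and merging steps described above can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it assumes dataset-level gradients are available as flat vectors, uses the sign of the projection onto the leading principal component of the conflict matrix as the partitioning rule, and merges by token-weighted parameter averaging. All function names and the two-way partition are illustrative assumptions.

```python
import numpy as np

def conflict_partition(grads):
    """Split datasets into two groups along the dominant conflict axis.

    grads: (n_datasets, dim) array of dataset-level gradient estimates.
    Returns two index lists. The two-way sign-based rule here is an
    illustrative assumption, not necessarily the paper's exact scheme.
    """
    # Cosine-similarity conflict matrix: C[i, j] near -1 means datasets
    # i and j push the model in opposing directions.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)
    C = unit @ unit.T

    # PCA on the (column-centered) conflict matrix: the top right
    # singular direction is the dominant conflict axis.
    Cc = C - C.mean(axis=0, keepdims=True)
    _, eigvecs = np.linalg.eigh(Cc.T @ Cc)  # eigenvalues ascending
    axis = eigvecs[:, -1]                   # leading direction

    # Score each dataset by its projection onto that axis and
    # partition by sign.
    scores = Cc @ axis
    group_a = [i for i, s in enumerate(scores) if s >= 0]
    group_b = [i for i, s in enumerate(scores) if s < 0]
    return group_a, group_b

def token_weighted_merge(state_dicts, token_counts):
    """One-shot merge: average each parameter across partitions,
    weighted by the number of training tokens each partition saw."""
    w = np.asarray(token_counts, dtype=float)
    w /= w.sum()
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(wi * sd[name] for wi, sd in zip(w, state_dicts))
    return merged
```

On two mutually opposed gradient clusters, the sign rule recovers the clusters; the merge then collapses the independently fine-tuned checkpoints into one model in a single communication round.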