Can Muon Fine-tune Adam-Pretrained Models?
Abstract
Muon has emerged as an efficient alternative to Adam for pretraining, yet it remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning degrades performance due to an optimizer mismatch. We study this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that fine-tuning with a mismatched optimizer disrupts pretrained knowledge, and show that constraining updates with Low-Rank Adaptation (LoRA) mitigates this issue. Across language and vision tasks, LoRA with Muon matches or outperforms LoRA with Adam when fine-tuning Adam-pretrained models. Furthermore, in settings with pronounced mismatch, Muon's advantage diminishes as LoRA updates approach full fine-tuning. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available here.
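To make the setup concrete, the sketch below illustrates what "LoRA with Muon" can look like operationally: the pretrained weight stays frozen while only the low-rank adapter factors are trained with a Muon-style optimizer that orthogonalizes its momentum via Newton-Schulz iteration. This is an illustrative sketch under stated assumptions, not the paper's implementation; the Muon class, the orthogonalize helper, and all shapes and hyperparameters are placeholders, though the Newton-Schulz coefficients follow widely used open-source Muon code.

```python
import torch

def orthogonalize(G, steps=5, eps=1e-7):
    # Newton-Schulz iteration approximating U V^T from the SVD of G
    # (quintic coefficients taken from public Muon implementations).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T  # keep the Gram matrix X X^T small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

class Muon(torch.optim.Optimizer):
    # Simplified Muon: SGD with momentum whose update is orthogonalized for
    # 2-D parameters. Omits the Nesterov variant and shape-based learning-rate
    # scaling found in public implementations.
    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                buf = self.state[p].setdefault(
                    "momentum_buffer", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)
                update = orthogonalize(buf) if p.ndim == 2 else buf
                p.add_(update, alpha=-group["lr"])

# Illustrative LoRA fine-tuning step: the pretrained weight W is frozen and
# only the low-rank factors A and B (W_eff = W + B @ A) receive Muon updates.
W = torch.randn(256, 256)                       # frozen Adam-pretrained weight
A = (0.01 * torch.randn(8, 256)).requires_grad_()
B = torch.zeros(256, 8, requires_grad=True)     # zero-init so W_eff starts at W
opt = Muon([A, B], lr=0.02)

x = torch.randn(32, 256)
loss = (x @ (W + B @ A).T).pow(2).mean()        # stand-in fine-tuning loss
loss.backward()
opt.step()
```

Restricting the trainable parameters to the rank-8 factors here is the constraint the abstract credits with mitigating optimizer mismatch; full fine-tuning would instead update W directly.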