Modular Pretraining Enables Access Control
Abstract
AI developers face a dual-use dilemma: the same capability that helps one user cure a disease can help another synthesize a pathogen. This dilemma could be resolved by access control, granting different users access to different AI capabilities. A gold standard for access control would be to serve models with different capabilities to different users, but training and deploying multiple models is prohibitively expensive. We address this challenge with gradient-routed mixture-of-experts (GR-MoE), a pretraining method that selectively updates experts to induce specialization. Ablating an expert at inference time removes its capability, approximating a model trained on filtered data. We evaluate GR-MoE on synthetic stories and on realistic dual-use data spanning biology, cybersecurity, nuclear physics, and code. On realistic data, GR-MoE preserves performance on retained capabilities while achieving 30% compute efficiency on forget capabilities. It limits capability recovery more effectively than post-hoc unlearning and preserves retained capabilities better than LoRA. GR-MoE's advantages grow when scaled from 48M to 2B parameters, approaching multiple data-filtered models in a single training run.
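The core mechanism of the abstract can be illustrated with a deliberately tiny sketch: domain-labeled examples update only their assigned expert, so zeroing out the "forget" expert at inference time removes that capability while leaving the "retain" expert intact. This is a hypothetical toy with scalar experts and hard domain routing, not the paper's transformer MoE architecture; the names `forward`, `train_step`, and the two linear "experts" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two scalar "experts"; a domain label routes each example to one expert.
# (Toy illustration only: real GR-MoE routes updates inside MoE layers
# of a transformer, not between standalone scalar weights.)
w = {"retain": 0.0, "forget": 0.0}

def forward(x, domain, ablated=()):
    """Predict with the expert assigned to `domain`; an ablated
    expert contributes nothing, removing its capability."""
    if domain in ablated:
        return 0.0
    return w[domain] * x

def train_step(x, target, domain, lr=0.1):
    """Route the squared-error gradient only to the domain's expert,
    so each expert specializes on its own data."""
    pred = forward(x, domain)
    w[domain] -= lr * 2 * (pred - target) * x

# Retain task: y = 2x ; forget task: y = -3x.
for _ in range(500):
    x = rng.uniform(-1, 1)
    train_step(x, 2 * x, "retain")
    x = rng.uniform(-1, 1)
    train_step(x, -3 * x, "forget")

# After training, each expert has absorbed only its own task, so
# ablating the forget expert zeroes forget-domain predictions
# while retain-domain predictions are unaffected.
print(w, forward(1.0, "forget", ablated=("forget",)))
```

Ablating the forget expert here plays the role of the paper's inference-time expert removal: the resulting model behaves like one never trained on the forget-domain data.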