Ghost in the Cloud: Your Geo-Distributed Large Language Model Training is Easily Manipulated
Abstract
Geo-distributed training and Federated Learning (FL) enable large-scale LLM training across private or distributed data sources. While beneficial for privacy and scalability, they expose new vulnerabilities: we demonstrate that a single malicious client can implant jailbreak triggers that compromise safety alignment. We identify two potential server-side defenses: Malicious Output Scrutiny (MOS), which detects unsafe generations, and Task Performance Check (TPC), which filters out updates that degrade downstream performance. To bypass both, we propose \textit{CloudGhost}, a trigger-based jailbreak strategy with two key innovations: (1) \textbf{Trigger-based Pseudo-Contrastive Safety Alignment (TPCSA)}, which conceals malicious behavior unless a secret trigger is present; and (2) \textbf{Downstream-preserved Malicious Training (DPT)}, which uses Fisher regularization to preserve downstream performance. Experiments on LLaMA-2 and LLaMA-3 demonstrate that a few attackers can achieve an Attack Success Rate (ASR) exceeding 70\% while keeping the Detection True Rate (DTR) below 5\%, without degrading downstream performance.
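As a rough illustration of the DPT idea only (the paper's exact objective is defined in the method section; the notation below is ours, not the authors'), a Fisher-regularized malicious objective can be sketched in the standard elastic-weight-consolidation form:
\[
\mathcal{L}_{\text{DPT}}(\theta) \;=\; \mathcal{L}_{\text{mal}}(\theta) \;+\; \lambda \sum_i F_i \left(\theta_i - \theta_i^{*}\right)^2,
\]
where $\mathcal{L}_{\text{mal}}$ denotes the trigger-implantation loss, $\theta^{*}$ the parameters of the benign pre-attack model, $F_i$ a diagonal Fisher information estimate computed on downstream task data, and $\lambda$ a coefficient trading off attack strength against downstream preservation.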