Ghost in the Cloud: Your Geo-Distributed Large Language Model Training is Easily Manipulated
Abstract
Geo-distributed training and Federated Learning (FL) enable large-scale LLM training across private or distributed data sources. While beneficial for privacy and scalability, they expose new vulnerabilities: we demonstrate that a single malicious client can implant jailbreak triggers that compromise safety alignment. We identify two potential server-side defenses: Malicious Output Scrutiny (MOS), which detects unsafe generations, and Task Performance Check (TPC), which filters out updates that degrade downstream performance. To bypass both, we propose \textit{CloudGhost}, a trigger-based jailbreak strategy with two key innovations: (1) \textbf{Trigger-based Pseudo-Contrastive Safety Alignment (TPCSA)}, which conceals malicious behavior unless a secret trigger is present; and (2) \textbf{Downstream-preserved Malicious Training (DPT)}, which uses Fisher regularization to preserve downstream performance. Experiments on LLaMA-2 and LLaMA-3 demonstrate that a few attackers can achieve an Attack Success Rate (ASR) exceeding 70\% while keeping the Detection True Rate (DTR) below 5\%, without degrading downstream performance.
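As a rough illustration of the DPT idea only (the paper's exact objective is defined in the method section; the notation below is ours, not the authors'), a Fisher-regularized malicious objective can be sketched in the standard elastic-weight-consolidation form:
\[
\mathcal{L}_{\text{DPT}}(\theta) \;=\; \mathcal{L}_{\text{mal}}(\theta) \;+\; \lambda \sum_i F_i \left(\theta_i - \theta_i^{*}\right)^2,
\]
where $\mathcal{L}_{\text{mal}}$ denotes the trigger-implantation loss, $\theta^{*}$ the parameters of the benign pre-attack model, $F_i$ a diagonal Fisher information estimate computed on downstream task data, and $\lambda$ a coefficient trading off attack strength against downstream preservation.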