Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
Abstract
Large language model (LLM) red-teaming, which proactively identifies vulnerabilities in LLMs, is an essential process for ensuring their safety. Red-teaming attacks should be both effective and diverse, but achieving both is challenging. Generative Flow Networks (GFlowNets, GFN), which perform distribution matching, are a promising approach, but they are notorious for training instability and mode collapse. In particular, the unstable reward functions common in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates the estimation of the partition function Z in GFN training and thereby reduces instability. S-GFN avoids Z-estimation through pairwise trajectory comparisons and employs a masking methodology that is robust to noisy rewards. Additionally, we propose a fluency stabilizer that prevents the model from getting stuck in local optima that produce gibberish. S-GFN trains more stably while preserving the optimal policy of the GFN objective. We demonstrate the superior attack performance and diversity of S-GFN across various settings.
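As an illustrative sketch of how pairwise comparison can remove Z (notation assumed here, not given in the abstract): the standard trajectory-balance loss for a trajectory $\tau = (s_0 \to s_1 \to \cdots \to s_n = x)$ is

$$
\mathcal{L}_{\mathrm{TB}}(\tau) = \left( \log Z + \sum_{t} \log P_F(s_{t+1} \mid s_t) - \log R(x) - \sum_{t} \log P_B(s_t \mid s_{t+1}) \right)^2 .
$$

Writing the Z-free residual as $\delta(\tau) = \sum_{t} \log P_F(s_{t+1} \mid s_t) - \sum_{t} \log P_B(s_t \mid s_{t+1}) - \log R(x)$, the TB residual is $\log Z + \delta(\tau)$. A pairwise (contrastive) objective over two trajectories,

$$
\mathcal{L}_{\mathrm{pair}}(\tau, \tau') = \left( \delta(\tau) - \delta(\tau') \right)^2 ,
$$

cancels the $\log Z$ term, since it is identical for both trajectories, so no learned estimate of $Z$ is needed. This sketch follows the general GFlowNet trajectory-balance formulation; the exact S-GFN objective may differ in its details.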