LLM Scheming Inversely Scales with Pretraining Language Coverage
Abstract
As frontier models grow more capable, AI alignment becomes increasingly critical in high-risk deployment settings. Recent work has empirically demonstrated in-context scheming (the covert pursuit of misaligned objectives while feigning alignment) in frontier language models, but most of it has been conducted exclusively in English, leaving a major gap in multilingual safety. We apply Petri, an open-source automated auditing framework, to Qwen3-30B-A3B to evaluate deceptive and scheming behaviors across multiple languages. Our findings suggest that scheming scores are inversely related to pretraining language coverage, with mean scheming scores in lower-resource languages up to 39.7\% higher than in higher-resource languages. Furthermore, we find that pretraining language coverage does not affect all scheming behaviors uniformly.