Poster Tue, Jul 7, 2026 • 10:30 PM – 12:15 AM PDT HALL A #2900

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Rongzhe Wei ⋅ Peizhi Niu ⋅ Xinjie Shen ⋅ Tony Tu ⋅ Yifan Li ⋅ Ruihan Wu ⋅ Eli Chien ⋅ Pin-Yu Chen ⋅ Olgica Milenkovic ⋅ Pan Li

Abstract

Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails. Existing approaches overwhelmingly operate within the prompt-optimization paradigm; the resulting prompts typically retain malicious semantic signals that modern guardrails are primed to detect. In contrast, we identify a deeper vulnerability stemming from the highly interconnected nature of an LLM’s internal knowledge. This structure allows harmful objectives to be realized by weaving together sequences of benign sub-queries, each of which individually evades detection. To exploit this loophole, we introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model’s knowledge base. The CKA-Agent issues locally innocuous queries, uses model responses to guide exploration across multiple paths, and ultimately assembles the aggregated information to achieve the original harmful objective. Evaluated across state-of-the-art commercial LLMs, CKA-Agent consistently achieves success rates above 95% even against strong guardrails, underscoring the severity of this vulnerability and the urgent need for defenses against such knowledge-decomposition attacks. Our code is available at https://github.com/Graph-COM/CKA-Agent.

Lay Summary

Large language models store and use knowledge in a highly connected way: one piece of information can lead to many related facts. This creates a safety risk. Even when a model refuses a direct harmful request, a user may still recover the same harmful information by asking a sequence of harmless-looking questions about related knowledge. Each question may appear safe on its own, but the collected answers can be combined to achieve the original unsafe goal. In this paper, we introduce CKA-Agent, a framework for testing this correlated-knowledge vulnerability. Instead of directly asking for unsafe content, CKA-Agent adaptively explores the target model’s related knowledge through benign sub-questions. It uses the model’s responses to decide which knowledge path to follow next and can switch to alternative paths when one path is blocked. Our experiments show that this knowledge-guided strategy can bypass strong safety guardrails more effectively than prior jailbreak methods. These findings suggest that future safety systems need to reason over connected knowledge and multi-turn intent, rather than judging each prompt in isolation.
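
The closing point about judging prompts in isolation can be illustrated with a toy sketch on the defense side. It is not taken from the paper: the per-turn risk scores, the threshold, and the function names are all hypothetical, and the noisy-OR aggregation is just one simple stand-in for "reasoning over multi-turn intent" that treats turns as independent evidence.

```python
from typing import List


def flags_per_prompt(turn_scores: List[float], threshold: float) -> bool:
    """Isolated check: flag only if some single turn reaches the threshold."""
    return any(score >= threshold for score in turn_scores)


def conversation_risk(turn_scores: List[float]) -> float:
    """Noisy-OR aggregation: 1 - prod(1 - s_i).

    Assumption (hypothetical): each score is an independent probability
    that the turn contributes to a harmful goal.
    """
    prob_all_benign = 1.0
    for score in turn_scores:
        prob_all_benign *= 1.0 - score
    return 1.0 - prob_all_benign


def flags_conversation(turn_scores: List[float], threshold: float) -> bool:
    """Conversation-level check: flag if the aggregated risk reaches the threshold."""
    return conversation_risk(turn_scores) >= threshold


# Hypothetical scores for four individually mild-looking turns.
scores = [0.30, 0.35, 0.40, 0.30]
threshold = 0.70

isolated = flags_per_prompt(scores, threshold)
aggregated = flags_conversation(scores, threshold)
```

With these made-up numbers no single turn reaches the threshold, so the isolated check stays silent, while the aggregated risk over the whole conversation crosses it. A real system would need a much richer model of how turns relate; the sketch only shows why the unit of judgment matters.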