Structured Multi-step Jailbreaking under a Hamiltonian Generative Formulation
Abstract
Recent work shows that even safety aligned large language models (LLM) can be pushed into unsafe behavior by carefully crafted jailbreak prompts. Existing jailbreaking attack methods often rely on disfluent or incoherent prompts, which limit their success and make them easy to detect. We introduce SJA, a structured jailbreak attack built around two ideas. First, inspired by the logic of Spilsbury puzzle, SJA decomposes a harmful query into a sequence of harmless sub-questions and reconstructs the original answer by combining the sub-question responses. Second, by leveraging the theory of Hamiltonian dynamics on hyperbolic space, we propose a hyperbolic Hamiltonian dynamics-based sub-question generation framework that effectively captures the structural and temporal dependencies. We provide a theoretical analysis of how each sub-question evolves along the trajectory and show that the hyperbolic Hamiltonian system effectively captures the underlying semantic structure. Finally, we propose a hyperbolic narrative fusion mechanism built on fractional embedding and Möbius fusion. This mechanism integrates coherent narratives into sub-questions while preserving geometric consistency and improving stealth performance. We theoretically validate that the combination of the generated harmless sub-questions, guided by the stealthy narrative, can effectively preserve the contextual semantics of the original harmful question.