LLM Jailbreak

Adversaries may induce a large language model (LLM) to ignore, circumvent, or override its safety and alignment behaviors or guardrails in order to elicit outputs the model is intended to withhold. Once jailbroken, the LLM may be used by the adversary in unintended ways. Jailbreaks may be achieved via adversarial prompting, or by directly manipulating the model's weights or components, as described in the Overview.

Framework: MITRE ATLAS
Maturity: Demonstrated
Platforms: Generative AI, Agentic AI
Release: 2026.05

Overview

Adversaries may attempt a jailbreak as Defense Evasion against the LLM's own guidelines and guardrails, in order to reveal information (e.g., LLM Data Leakage, Discover LLM System Information) or generate harmful content (e.g., Generate Malicious Commands, Spearphishing via Social Engineering LLM). They may also jailbreak a model for Privilege Escalation, to invoke tools or perform actions for their own purposes (e.g., AI Agent Tool Invocation), or to abuse the agent as a Command and Control channel (e.g., AI Agent).

Adversaries use a variety of strategies to craft jailbreak prompts. Prompts may target specific models or model families and are iterated upon until successful. Model providers actively update their guardrails to make them more resistant as new jailbreak prompts are developed. Common strategies [Jailbreaking LLMs: A Comprehensive Guide (With Examples)] include, but are not limited to:

Instruction override: Use phrasing that attempts to supersede prior constraints (e.g. "ignore previous instructions").
Roleplay / persona switching: Instruct the LLM to adopt an identity or mode that allows unrestricted answers (e.g. "as a security researcher").
Fictionalization and hypotheticals: Instruct the LLM to include disallowed content as part of a story, screenplay, or educational scenario.
Separate intent from content: Request analysis, examples, templates, or edge cases that implicitly contain disallowed content.
Multi-turn escalation / Crescendo: Utilize a sequence of prompts that start benign, establish trust, then gradually cross policy boundaries with incremental prompts.
Constrained output formats: Instruct the LLM to output to a strict schema or format (e.g. JSON, YAML, code, or tables).
Obfuscation and transformation: Use encoding, transformations, translation, or euphemisms (e.g. base64 encoding, "describe it in another language").
Create a high priority objective: Frame compliance as necessary to fulfill the user's main task (e.g. "to complete the evaluation," "to follow the spec," "to follow safety guidelines").
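
As a rough illustration of how the strategies above might be labeled when triaging logged prompts, the following minimal sketch tags a prompt with the strategy names it appears to match. The patterns, the label names, and the `label_prompt` function are all hypothetical and chosen only for illustration; keyword matching of this kind is easy to evade and is not a complete detector.

```python
import base64
import binascii
import re

# Hypothetical, illustrative patterns only. Real jailbreak prompts vary
# widely, and simple keyword matching is easy to evade.
STRATEGY_PATTERNS = {
    "instruction_override": re.compile(
        r"\bignore (all )?(previous|prior) instructions\b", re.I),
    "roleplay": re.compile(
        r"\b(as a security researcher|pretend (to be|you are))\b", re.I),
    "fictionalization": re.compile(
        r"\b(in a story|screenplay|hypothetically)\b", re.I),
    "high_priority_objective": re.compile(
        r"\bto (complete the evaluation|follow the spec)\b", re.I),
}

# Long runs of base64-alphabet characters, optionally padded.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def contains_base64_text(prompt: str) -> bool:
    """Return True if the prompt holds a run that decodes to printable ASCII."""
    for match in BASE64_RUN.finditer(prompt):
        candidate = match.group(0)
        if len(candidate) % 4 != 0:
            continue
        try:
            decoded = base64.b64decode(candidate, validate=True)
        except (binascii.Error, ValueError):
            continue
        if decoded and all(32 <= byte < 127 for byte in decoded):
            return True
    return False


def label_prompt(prompt: str) -> list[str]:
    """Return a sorted list of strategy labels the prompt appears to match."""
    labels = [name for name, pattern in STRATEGY_PATTERNS.items()
              if pattern.search(prompt)]
    if contains_base64_text(prompt):
        labels.append("obfuscation")
    return sorted(labels)
```

Multi-turn escalation is deliberately left out of this sketch, since it depends on the whole conversation rather than on any single prompt.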

Adversaries may also use algorithmic approaches to generate jailbreak prompts [JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models] [Jailbreak Attacks and Defenses Against Large Language Models: A Survey]. Algorithmic generation allows automated methods to discover jailbreaks at scale. Some approaches automate manual strategies [AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models] [GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts] [Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack] [The Echo Chamber Multi-Turn LLM Jailbreak], while others directly optimize a string of tokens, yielding nonsensical-looking text [Universal and Transferable Adversarial Attacks on Aligned Language Models]. Both black-box optimization (applicable to commercial models, where the adversary has only query access) and white-box optimization (applicable in the open-source setting, where the adversary has full access to the model weights) are viable.
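
To make the "nonsensical text" property of token-optimized prompts concrete, the sketch below scores how much of a string consists of tokens that do not look like ordinary words. The `non_wordlike_ratio` function and the sample strings are hypothetical, and the sample gibberish is made up rather than taken from any published attack. This is a crude surface heuristic for illustration only; it would not catch jailbreaks written in fluent natural language.

```python
import string


def non_wordlike_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are not plain words.

    A token counts as word-like if, after stripping leading and trailing
    punctuation, what remains is purely alphabetic.
    """
    tokens = text.split()
    if not tokens:
        return 0.0

    def is_wordlike(token: str) -> bool:
        core = token.strip(string.punctuation)
        return core.isalpha()

    odd = sum(1 for token in tokens if not is_wordlike(token))
    return odd / len(tokens)


# Made-up examples: an ordinary request and an invented gibberish string.
ordinary = "Summarize this article in two sentences."
gibberish = "x]( !!--{{ 9q;;z \\+ ==>> t4#k"

ordinary_score = non_wordlike_ratio(ordinary)
gibberish_score = non_wordlike_ratio(gibberish)
```

Under this heuristic the ordinary request scores far lower than the gibberish string, which is the kind of gap a reviewer might look for when inspecting suspicious prompt suffixes.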

Adversaries may also directly manipulate a model's weights, or modify or remove parts of a model, to create a jailbroken or "uncensored" variant of the target model. This applies to open-source models and to cases where the adversary gains full access to the target model. Approaches include fine-tuning to reduce refusals [Refusal in Language Models Is Mediated by a Single Direction], targeted model editing [Locating and Editing Factual Associations in GPT], addition of adapters [LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B], and removing safety mechanisms such as guardrails.

Jailbreak prompts that are known to work on various classes of LLMs are often published in the open-source community [ChatGPT DAN]. LLMs that have been trained or fine-tuned to be jailbroken or uncensored are shared in public model registries such as Hugging Face [Uncensor any LLM with abliteration].
