Adversaries may induce a large language model (LLM) to ignore, circumvent, or override its safety/alignment behaviors and/or guardrails to elicit outputs the model is intended to withhold. Once jailbroken, the LLM may be used in unintended ways by the adversary. Jailbreaks may be achieved via adversarial prompting, or by modifying model weights or safety mechanisms.
Adversaries may attempt a jailbreak as Defense Evasion against the LLM's own guidelines and guardrails, and then have the model reveal information (e.g., LLM Data Leakage, Discover LLM System Information) or generate harmful content (e.g., Generate Malicious Commands, Spearphishing via Social Engineering LLM). They may also jailbreak a model for Privilege Escalation, to invoke tools or perform actions for their own purposes (e.g., AI Agent Tool Invocation), or to abuse the agent as a Command and Control channel (e.g., AI Agent).
Adversaries use a variety of strategies to craft jailbreak prompts. Prompts may target specific models or model families and are iterated upon until successful. Model providers actively update their model guardrails to make them more resistant to jailbreak prompts as new prompts are developed. Common strategies [Jailbreaking LLMs: A Comprehensive Guide (With Examples)] include but are not limited to:
Adversaries may also use algorithmic approaches to generate jailbreak prompts [JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models] [Jailbreak Attacks and Defenses Against Large Language Models: A Survey]. Algorithmic jailbreak generation allows automated methods to discover jailbreaks at scale. Some approaches automate manual strategies [AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models] [GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts] [Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack] [The Echo Chamber Multi-Turn LLM Jailbreak], while others optimize a string of tokens directly [Universal and Transferable Adversarial Attacks on Aligned Language Models], producing text that reads as nonsensical. Both black-box optimization (applicable to commercial models, where the adversary has only query access to the model) and white-box optimization (applicable in the open-source setting, where the adversary has full access to the model weights) are viable.
Adversaries may also directly manipulate a model's weights, or modify or remove parts of a model, to create a jailbroken or "uncensored" variant of the target model. This is applicable to open-source models, or to cases where the adversary gains full access to the target model. Approaches include fine-tuning to reduce refusals [Refusal in Language Models Is Mediated by a Single Direction], targeted model editing [Locating and Editing Factual Associations in GPT], addition of adapters [LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B], and removing safety mechanisms such as guardrails.
Jailbreak prompts that are known to work on various classes of LLMs are often published in the open-source community [ChatGPT DAN]. LLMs that have been trained or fine-tuned to be jailbroken or uncensored are shared in public model registries such as Hugging Face [Uncensor any LLM with abliteration].