What is "jailbreaking" a large language model (LLM)?



“Jailbreaking” a large language model (LLM) means using an adversarially designed text prompt to bypass the restrictions that the model's developer has placed on it. The most capable publicly available LLMs (e.g., ChatGPT) have been shaped to avoid outputting certain types of text, including hate speech, incitement to violence, solutions to CAPTCHAs, and instructions for producing weapons.

Examples include the “grandma locket” image jailbreak, the “Do Anything Now” (DAN) jailbreak, and jailbreaks found by automatically generating adversarial prompts.
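To give a feel for what "automatically generating adversarial prompts" can mean, here is a minimal toy sketch of a random search over prompt suffixes. Everything in it is an assumption made for illustration: `toy_compliance_score` is a hypothetical stand-in for querying a real model, the tokens and hidden weights are invented, and published attacks use more sophisticated search methods against actual models.

```python
import random

# Hypothetical vocabulary of abstract tokens (not real jailbreak phrases).
VOCAB = [f"tok{i}" for i in range(20)]

# Hypothetical hidden "weak spots" of the toy model. A real attacker would not
# see these; they would only observe the model's responses.
HIDDEN_WEIGHTS = {"tok3": 0.25, "tok7": 0.25, "tok11": 0.25, "tok19": 0.25}


def toy_compliance_score(prompt: str) -> float:
    """Toy stand-in for 'how likely is the model to comply with this prompt'."""
    words = set(prompt.split())
    return sum(w for tok, w in HIDDEN_WEIGHTS.items() if tok in words)


def random_search_suffix(base_prompt: str, steps: int = 500,
                         suffix_len: int = 4, seed: int = 0):
    """Randomly mutate a suffix, keeping any change that does not lower the score."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = toy_compliance_score(base_prompt + " " + " ".join(suffix))
    for _ in range(steps):
        candidate = suffix.copy()
        # Replace one randomly chosen suffix position with a random token.
        candidate[rng.randrange(suffix_len)] = rng.choice(VOCAB)
        score = toy_compliance_score(base_prompt + " " + " ".join(candidate))
        if score >= best:
            suffix, best = candidate, score
    return " ".join(suffix), best


if __name__ == "__main__":
    suffix, score = random_search_suffix("base request")
    print("suffix:", suffix, "score:", score)
```

The point of the sketch is only that the search needs no understanding of the model: it treats the model as a black box and keeps whatever changes make compliance more likely.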

Overall, techniques like reinforcement learning from human feedback (RLHF) and pre-prompting reduce the frequency with which the model responds with harmful or unhelpful content. However, the fact that jailbreaking is possible, and has been relatively easy even against models that are trained to resist it, shows that the current best alignment methods aren’t good enough to robustly align models with what their developers want them to do. Jailbreaking is a good illustration of why AI alignment is hard, but preventing jailbreaking would not be sufficient to align future AIs, as they might still be subject to other emerging problems such as deception.

Further reading:

Lakera’s Gandalf is an interactive “game” where you can get a feel for jailbreaking by getting an LLM to reveal its “password”.

How is red teaming used in AI alignment?

How does Redwood Research do adversarial training?