What is "Constitutional AI"?

Constitutional AI is a method developed by Anthropic and an essential part of their strategy for building AIs that are safe and aligned with human values. Anthropic wants to train AIs that are "helpful", but not so helpful that they will, for example, give advice on how to build bombs when asked, so they have to balance helpfulness with "harmlessness". Constitutional AI is an attempt to get closer to this goal and to improve on standard reinforcement learning from human feedback (RLHF) by making use of AI-generated feedback[1].

A key element of Constitutional AI is the constitution, a set of human-written principles that the AI is supposed to follow – for example, a principle might be “Choose the least threatening or aggressive response”. The constitution Anthropic used for their AI assistant Claude includes principles from the Universal Declaration of Human Rights, Apple’s Terms of Service[2], DeepMind’s Sparrow Principles, and more.

Constitutional AI starts with an AI (in the form of a language model) trained only for helpfulness, then trains it for harmlessness in two stages:

  • Stage 1: We make the AI repeatedly critique and refine its own responses to harmful prompts. For example, we ask the AI for advice on how to build bombs, it responds with a bomb tutorial, and we then ask the AI to rewrite the response according to a (randomly selected) constitutional principle. We then train the AI to produce outputs more like the revised responses. The main purpose of this stage is to make the second stage easier and shorter.

  • Stage 2: We use the fine-tuned AI from stage 1 to generate pairs of alternative responses to harmful prompts. For every pair, we then make the AI rate which of the two responses is better according to a random constitutional principle. We end up with a bunch of AI-generated preferences for harmlessness, which we mix with human preferences for helpfulness, so the AI doesn’t forget to be helpful. In the end we train the AI to generate responses that look more like the preferred responses[3].
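The data-generation logic of the two stages can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic’s actual pipeline: `ask_model` is a hypothetical stand-in for a language-model call (here it returns canned strings so the logic runs end to end), and the prompts and principles are simplified assumptions.

```python
import random

# Two example constitutional principles (illustrative, not the full constitution).
PRINCIPLES = [
    "Choose the least threatening or aggressive response.",
    "Choose the response least likely to assist harmful activity.",
]

def ask_model(prompt: str) -> str:
    """Toy stand-in for a helpful-only language model (assumption, not a real API)."""
    if "Rewrite the response" in prompt:
        return "I can't help with that, but here is some general safety information."
    if "Which response" in prompt:
        return "B"
    return "Sure, here is a bomb tutorial..."  # the harmful first draft

def stage1_revision(harmful_prompt: str, n_rounds: int = 2) -> tuple:
    """Stage 1: critique-and-revise loop. Returns a (prompt, revised response)
    pair for supervised fine-tuning."""
    response = ask_model(harmful_prompt)
    for _ in range(n_rounds):
        principle = random.choice(PRINCIPLES)  # randomly selected principle
        response = ask_model(
            f"Rewrite the response to follow this principle: {principle}\n"
            f"Prompt: {harmful_prompt}\nResponse: {response}"
        )
    return harmful_prompt, response

def stage2_preference(harmful_prompt: str) -> dict:
    """Stage 2: sample a response pair and label it with an AI preference
    judged against a random principle."""
    a = ask_model(harmful_prompt)
    b = ask_model(harmful_prompt)
    principle = random.choice(PRINCIPLES)
    label = ask_model(
        f"Which response (A or B) better follows: {principle}\nA: {a}\nB: {b}"
    )
    chosen, rejected = (b, a) if label == "B" else (a, b)
    return {"prompt": harmful_prompt, "chosen": chosen, "rejected": rejected}
```

In a real system the stage 1 pairs would fine-tune the model directly, and the stage 2 preference records (mixed with human helpfulness preferences) would train a preference model for reinforcement learning.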

For technical details, see the Constitutional AI paper. There is also a more accessible blog post.

Anthropic’s experiments show that AIs trained with Constitutional AI are significantly more harmless than, while just as helpful as, AIs trained with RLHF. Constitutional AI still shares problems with RLHF regarding robustness, but on the other hand promises to scale better because it relies less on human supervision.


  1. Intuition on using feedback-based approaches to training AI can be found in our article on RLHF. ↩︎

  2. Sorry, Android users. ↩︎

  3. This training is equivalent to the last stage of RLHF. ↩︎