What is Anthropic's alignment research agenda?
Anthropic is a major AI lab that aims to “ensure transformative AI helps people and society flourish” by “building frontier systems, studying their behaviors, working to responsibly deploy them, and regularly sharing [its] safety insights.” In March 2023, Anthropic published a summary of its views on safety research, which states that Anthropic is currently focused on “scaling supervision[^1], mechanistic interpretability, process-oriented learning[^2], and understanding and evaluating how AI systems learn and generalize”.
Anthropic has worked on a number of approaches to alignment:
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) — Applies "preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants." (A minimal sketch of the kind of preference-model loss involved appears after this list.)
- A Mathematical Framework for Transformer Circuits (2021) — Applies the idea of “circuits” to the transformer architecture used in recent large language models. A follow-up paper, In-context Learning and Induction Heads (2022), presents some more significant results, in particular the idea of “induction heads”: attention heads that implement a simple form of in-context learning. (See the induction-head scoring sketch after this list.)
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023) — Uses a sparse autoencoder to decompose the activations of a small one-layer transformer into features that are more interpretable than individual neurons. (See the sparse-autoencoder sketch after this list.)
- Language Models (Mostly) Know What They Know (Kadavath et al., 2022) — Tasks language models with predicting which questions they will get correct and whether their own claims are valid. Preliminary results are encouraging: after giving an answer, models are generally well calibrated about how likely that answer is to be correct. Calibration is worse when the question is of the form “Do you know the answer to x?”, but improves when the model is given extra source material to work with. (See the calibration-measurement sketch after this list.)
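To make the RLHF item more concrete, here is a minimal sketch of the standard pairwise comparison loss used to train a preference (reward) model from human choices between two responses. This is an illustrative Bradley-Terry-style loss written in PyTorch, not Anthropic's actual training code; in practice the scalar rewards would come from a language model with a value head, which is omitted here.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise comparison loss: push the score of the human-preferred
    response above the score of the rejected response."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards for a batch of three comparison pairs
# (in practice these would be produced by a reward model over full dialogues).
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, -1.0])
print(preference_loss(chosen, rejected))  # smaller when chosen responses score higher
```

The trained preference model is then used as the reward signal for reinforcement learning on the assistant.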
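Induction heads can be recognized by a characteristic attention pattern: on a sequence whose second half repeats its first half, the head attends from each position to the token that came right after the earlier copy of the current token. The sketch below scores that pattern given an attention matrix; extracting real attention matrices depends on your tooling, so the demo uses a synthetic "perfect" induction pattern as a stand-in.

```python
import numpy as np

def induction_score(attn: np.ndarray, seq_len: int) -> float:
    """Prefix-matching score for one attention head.

    `attn` is a (2*seq_len, 2*seq_len) attention matrix computed on a
    sequence whose second half exactly repeats its first half. An
    induction head makes position i (in the second half) attend to
    position i - seq_len + 1: the token that followed the earlier copy
    of the current token.
    """
    positions = np.arange(seq_len, 2 * seq_len)
    return float(attn[positions, positions - seq_len + 1].mean())

# Synthetic head that attends exactly where an induction head would.
seq_len = 8
attn = np.zeros((2 * seq_len, 2 * seq_len))
for i in range(seq_len, 2 * seq_len):
    attn[i, i - seq_len + 1] = 1.0
print(induction_score(attn, seq_len))  # 1.0 for a perfect induction pattern
```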
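The dictionary-learning item can be illustrated with a bare-bones sparse autoencoder: activations are encoded into an overcomplete set of features under an L1 penalty, so each activation vector is reconstructed from only a few active features. This is a generic sketch of the setup, not Anthropic's exact architecture or hyperparameters (which include details such as decoder-weight normalization).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct model activations
    from a sparse combination of learned 'dictionary' features."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    mse = ((reconstruction - activations) ** 2).mean()
    return mse + l1_coeff * features.abs().mean()

# Toy usage: 512-dimensional activations, 4096 learned features.
sae = SparseAutoencoder(d_model=512, d_features=4096)
activations = torch.randn(64, 512)  # stand-in for real transformer MLP activations
reconstruction, features = sae(activations)
print(sae_loss(activations, reconstruction, features))
```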
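"Well calibrated" in the last item means that when the model says it is, for example, 80% likely to be right, it is in fact right about 80% of the time. One common way to summarize this, not necessarily the paper's exact metric, is expected calibration error: bin the model's stated confidences and compare each bin's average confidence with its actual accuracy.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and actual accuracy,
    computed over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence in [0, 1] to a bin index in [0, n_bins - 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

# Made-up illustration: high stated confidence with mediocre accuracy
# produces a large calibration error.
confs = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.55, 0.3, 0.1]
right = [1,   0,   1,   0,   1,   1,   0,   1,    0,   0]
print(expected_calibration_error(confs, right))
```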
[^1]: That is, developing methods to supervise models that may equal or surpass human cognitive capabilities.
[^2]: Anthropic uses this term to refer to an approach to training models that is not based on whether they get the right results, but on whether they follow the right processes — like a math teacher who gives good grades to answers that spell out a logical sequence of steps, even if the wrong number comes out.