What is outer alignment?

Outer alignment, also known as the reward misspecification problem, is the problem of specifying the right optimization objective when training an AI, i.e., “Did we tell the AI the correct thing to do?” It is distinct from the inner alignment problem, which asks whether the AI in fact ends up trying to accomplish the objective we specified (as opposed to some other objective).

Outer alignment is a hard problem. It has even been argued that conveying the full “intention” behind a human request would require conveying all human values, which are themselves not well understood. Additionally, since most AI systems are trained as optimizers of a specified objective, they are susceptible to Goodhart’s Law: even if we specify a goal in a way that looks good to humans, excessive optimization of that goal can produce negative consequences we failed to foresee.
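As a loose illustration of Goodhart’s Law, consider a toy setup (invented for illustration, not drawn from any particular system) in which the designers care about task quality, but the proxy reward they actually specify can also be inflated by gaming the metric. Under light selection pressure the proxy tracks the true objective reasonably well; under heavy selection pressure, the highest-scoring candidates are increasingly the ones that exploit the proxy:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_objective(policy):
    # What the designers actually care about.
    return policy["task_quality"]

def proxy_reward(policy):
    # What they wrote down: a measured score that can also be
    # inflated by gaming the metric.
    return policy["task_quality"] + policy["metric_gaming"]

def random_policy():
    return {
        "task_quality": rng.normal(0, 1),
        # Gaming the metric is rare, but when it happens it can be large.
        "metric_gaming": max(0.0, rng.normal(-2, 2)),
    }

# Increasing the pool size stands in for increasing optimization pressure.
for n in (10, 1_000, 100_000):
    pool = [random_policy() for _ in range(n)]
    best = max(pool, key=proxy_reward)
    print(f"pressure n={n:>7}: proxy={proxy_reward(best):5.2f}  "
          f"true={true_objective(best):5.2f}")
# As n grows, the proxy score of the selected policy keeps climbing
# while its true quality does not.
```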

Some sub-problems of outer alignment on which we would have to make progress include specification gaming, value learning, and reward shaping/modeling. Paul Christiano, a researcher who focuses on outer alignment, has proposed approaches such as HCH (Humans Consulting HCH) and Iterated Distillation and Amplification. Other proposed approaches aim to approximate human values through imitation learning and learning from human feedback.
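As a rough sketch of the reward-modeling idea (the features, data, and Bradley-Terry-style preference loss here are illustrative assumptions, not a description of any specific proposal), one can fit a reward function to pairwise human preference comparisons:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each trajectory is summarised by a feature vector; the "human" in this toy
# example secretly prefers trajectories according to hidden weights.
true_w = np.array([1.0, -2.0, 0.5])

def features(n):
    return rng.normal(size=(n, 3))

# Pairwise comparisons: the human labels which of two trajectories is better.
A, B = features(500), features(500)
prefers_A = (A @ true_w > B @ true_w).astype(float)

# Fit a linear reward model with a logistic (Bradley-Terry) preference loss.
w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    logits = (A - B) @ w                  # predicted margin for A over B
    p = 1.0 / (1.0 + np.exp(-logits))     # P(A preferred | w)
    grad = (A - B).T @ (p - prefers_A) / len(A)
    w -= lr * grad

print("recovered reward direction:", w / np.linalg.norm(w))
print("true reward direction:     ", true_w / np.linalg.norm(true_w))
```

The learned reward model can then be used as a training signal in place of a hand-written objective, though it inherits the limits of the feedback it was trained on.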