What is alignment failure?

Alignment failure, at its core, is any time an AI's output deviates from what we intended. We have already witnessed alignment failure in simple AIs. Mostly, these failures amount to the AI mistaking correlation for causation. A good example was an AI built by YouTube to recognize videos of animals being forced to fight for sport. The videos given to the AI were always set in some kind of arena, so the AI drew the simplest conclusion and flagged videos featuring similar arenas, such as robot combat tournaments.
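
To make that failure mode concrete, here is a minimal sketch in Python. It has nothing to do with YouTube's actual system; the feature names and data are invented. In the toy training set every fighting video happens to be filmed in an arena, so the classifier learns to key off the arena background and ends up flagging a robot combat video at deployment.

```python
# Toy sketch of a classifier latching onto a spurious background feature.
# Feature names and data are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)                      # 1 = animal-fighting video

# In the training data, every positive video happens to be filmed in an
# arena, so "arena background" correlates perfectly with the label ...
arena_background = y.copy()

# ... while the genuine content cue is noisy (the detector misses ~20%).
fighting_cue = np.where(rng.random(n) < 0.8, y, 1 - y)

X = np.column_stack([arena_background, fighting_cue])
clf = LogisticRegression().fit(X, y)
print("learned weights [arena, fighting_cue]:", clf.coef_[0])

# Deployment: a robot combat tournament -- arena present, no animals at all.
robot_combat = np.array([[1, 0]])
print("P(flag robot video) =", clf.predict_proba(robot_combat)[0, 1])
```

The model gives most of its weight to the arena feature, because in training that feature was a more reliable signal than the actual content, so the robot video gets flagged with high confidence.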

Another set of problems stems from an AI fulfilling the literal specification without achieving the actual intended goal. DeepMind has a good article describing various such situations; for example, the animation below is from a game where the AI was given a reward for hitting green blocks on the race track, so it went round in circles collecting them rather than actually winning the race.
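
As a rough illustration of the same pattern, here is a toy sketch. It is not DeepMind's environment; the track layout and the agent are invented. The specified reward pays for touching green blocks, while the intended goal is to reach the finish line, and an agent that greedily maximizes the specified reward never finishes the race.

```python
# Toy sketch of specification gaming: the literal reward is satisfied while
# the intended goal (finish the race) is never achieved.

TRACK_LENGTH = 10
GREEN_BLOCKS = {2, 3}          # respawning reward pickups near the start
FINISH = TRACK_LENGTH - 1

def specified_reward(pos):
    """What we literally told the agent: +1 for touching a green block."""
    return 1 if pos in GREEN_BLOCKS else 0

def greedy_agent(pos):
    """One-step lookahead agent maximizing the specified reward.
    Ties are broken in favor of moving forward."""
    candidates = [min(pos + 1, TRACK_LENGTH - 1), max(pos - 1, 0)]
    return max(candidates, key=specified_reward)

pos, total_reward = 0, 0
for step in range(30):
    pos = greedy_agent(pos)
    total_reward += specified_reward(pos)

print(f"after 30 steps: position={pos}, reward={total_reward}, "
      f"finished={pos == FINISH}")
# The agent shuttles between the two green blocks racking up reward,
# and never reaches the finish line.
```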

These examples may seem trivial, but they illustrate the general idea.

Obvious examples of misalignment will be caught during training and never deployed. More sinister problems can be caused by AIs with authority that are subtly misaligned, or misaligned in ways that only become apparent after deployment. One example is police systems, which often have inherent biases stemming from their training data.
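
A toy illustration of how that kind of bias can arise, with made-up numbers and not modeled on any real deployment: if arrests are used as the training label for "crime", the model absorbs the patrolling pattern rather than the underlying crime rate.

```python
# Toy sketch (hypothetical numbers) of training-data bias: arrests are used
# as a proxy label for crime, but arrest counts also reflect how heavily
# each district was patrolled.
true_crime_rate  = {"district_A": 0.05, "district_B": 0.05}  # actually identical
patrol_intensity = {"district_A": 3.0,  "district_B": 1.0}   # A is over-patrolled

# Historical "training data": recorded arrests per 1000 residents.
arrests = {d: round(1000 * true_crime_rate[d] * patrol_intensity[d])
           for d in true_crime_rate}
print("recorded arrests:", arrests)          # district_A looks 3x worse

# A model fit to these labels learns the recording bias, not the reality,
# and would direct even more patrols to district_A once deployed.
predicted_risk = {d: arrests[d] / 1000 for d in arrests}
print("predicted risk:", predicted_risk)
```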

Check this video for more.