Aren't there easy solutions to AI alignment?

3 min read

Suggest changes in Google Docs

If you've been learning about AI alignment, you might have had a thought like, "Why can't we just do [this thing that seems like it would solve the problem]?"

Unfortunately, many AI alignment proposals turn out to have hidden difficulties. It’s surprisingly easy to come up with “solutions” that don’t actually solve the problem. Some intuitive alignment proposals (that are generally agreed to be flawed, incomplete, or hard to implement) include:

Why can’t we just program the AI not to harm us?
Why can’t we just tell the AI to do what is moral?
Why can’t we just tell the AI to be friendly?
Why can’t we just turn the AI off if it starts to misbehave?
Why can’t we just tell an AI just to figure out what we want and then do that?
Why can’t we just keep an AI contained in a box?
Why can’t we just block an AI from using the internet?
Why can’t we just tell the AI not to lie?
Why can’t we just not build the AI a body?
Why can’t we just raise AI like a child?
Why can’t we just live in symbiosis with AGI?
Why can’t we just use a more powerful AI to control a potentially dangerous AI?
Why can’t we just use Asimov’s laws?
Why can’t we just treat AI like any other dangerous technological tool?
Why can’t we just not build AI?
Why can’t we just solve alignment through trial and error?

Some common ways alignment proposals fail are that the proposed solution:

Requires human observers to be smarter than the AI. Many safety solutions only work when an AI is relatively weak, but break when the AI reaches a certain level of capability (for many reasons, e.g., deceptive alignment).
Appears to make sense in natural language, but when properly unpacked is not philosophically clear enough to be usable.
Only solves a subcomponent of the problem, but leaves the core problem unresolved.
Solves the problem only as long as the AI is operating “in distribution” with respect to the original training data (distributional shift will break it).
Might work eventually, but we can’t expect it to work on the first try (and we'll likely only get one try at aligning a superintelligence).