What is AI alignment?
Informally, AI alignment means making an AI’s goals line up with some target set of values, such as those of its creators.1
A simple model
Imagine a hypothetical AI system with two separate parts:
-
A set of “goals”, “preferences”, or “values”2, meaning which outcomes it will act to achieve over other outcomes.
-
A set of “beliefs” or a “world model”, meaning what it considers to be true about the world and what it predicts will happen.
When this AI makes a decision, it considers each possible action it could take, uses its beliefs about the world to predict the result of that choice, and then uses its preferences to judge how good that result is. It then picks the choice it expects to lead to the best result.
The concept of “alignment” is relatively straightforward here: the system is “aligned” with you to the extent that its values are the same as yours, and “misaligned” with you to the extent that they are different.
Historically, discussions of the danger of AI misalignment have often used scenarios involving AI systems with this structure. In one such scenario, you have a very powerful AI, and you want to use it to cure cancer. The most naive strategy for achieving this might be to give it the goal of “minimize the number of cancer cases” — which the AI might conclude would be most effectively achieved by killing all humans. More sophisticated alignment strategies3 could involve coding in more complex specifications of human values, or having the AI learn values over time.
This simple model is a type of goal-directed AI. If powerful AIs necessarily behave in goal-directed ways, then it becomes easy to see how they would be very dangerous. The powerful AI learns the wrong goal, and for most goals, human flourishing isn’t how you maximize them. But will powerful AI be goal-directed in this way?
Current systems
Current frontier AI systems don’t seem to have the "values" and "world model" structure described above. So it’s unclear whether the idea of “aligning” such systems is meaningful.
For example, an LLM like ChatGPT is created by training a huge neural network to predict human text as accurately as possible, and then fine-tuning it to favor text completions that were rated highly by human evaluators. We don’t know how the resulting system works. Superficially, the LLM imitates many patterns associated with purposeful decision-making, which is what we’d expect from an LLM which scored highly according to human evaluators. But in many ways, the LLM appears to have incoherent preferences or fail to choose good outcomes for any goal. So we have no special reason to think that it’s trying to predict the consequences of its decisions and check them against some set of values in any systematic way.
Still, the concept of “alignment” has some meaning for current systems. We can say that ChatGPT “is capable of” spewing abuse, because it has learned how to predict abusive internet text. Yet its cognition has been chiseled by RLHF in such a way that it (usually) “chooses” not to do so. In that sense, it is (mostly) “aligned” with OpenAI’s intended values.
Future systems
It’s not clear how similar future, smarter-than-human AI systems will be either to current AI systems, or to the simple model described above. Some argue that future systems will systematically optimize their environments, like in the simple model. Typically, these arguments are based on coherence theorems and selection theorems. Others think this sort of goal-directed consequentialism won’t appear by default, or can’t appear at all. Some of those arguments focus on how non-goal-directed current systems seem, and argue that future systems will behave like current ones.
There are a lot of subtleties to the concept, and different people use it in different ways. In this article, we’ll just try to convey a basic idea. ↩︎
When we're talking about humans, words like “preferences” and “values” sometimes have connotations that we don’t mean to invoke here. For example, we’re not saying an AI's preferences are a kind of emotional state, or that it has "values" in an ethical or moral sense. “Preferences” and “values” here are defined purely in terms of which outcomes the system will tend to choose over others. ↩︎
Some sophisticated strategies are the end-result of a long sequence of patches fixing one problem after another with a poor, initial strategy. Such strategies are likely to suffer from the problem of “The Nearest Unblocked Strategy”. ↩︎