Isn't it futile to align a superintelligence because its values will inevitably drift or evolve in the long run?

It seems far from inevitable that values will drift in arbitrary directions. If an AGI can predict that its values will drift somewhere undesirable, it has a motive to stabilize them. In the long run, with access to its own source code and the ability to rewrite itself or design its successors, an AGI can prevent various kinds of changes to its values:

  • Hardware malfunction and other random events. As long as such events are rare, error correction can drive their probability down to a negligible level. For example, if each individual system has a one in a thousand chance of failing, the probability that a majority of a hundred independent systems fail at the same time is essentially zero (a quick calculation is sketched after this list).

  • Side effects of new information. If the AI is built with indirect normativity, it will deliberately make value changes that reflect updates on new information or insight. But it still cares about preserving the integrity of this updating process, and it can prevent other, unintended value changes by choosing an architecture where its values are kept separate from its world model, or one that is highly interpretable and understandable. It can also use techniques like running experiments on copies of itself and resetting itself to known-good states (a toy sketch of this kind of integrity check appears after this list).

  • Interactions with other agents. Deals with trade partners, or attacks by intelligent adversaries, could cause value changes. For example, different AI systems could merge into a system whose values were a compromise between those of the individual systems. However, if an AI with a particular set of values has stable control over the world, it has no need to compromise in this way.
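
To make the redundancy arithmetic in the first bullet concrete, here is a minimal calculation. It assumes the numbers from that bullet (a 1-in-1,000 independent failure chance per system, 100 redundant systems) and takes "majority" to mean 51 or more failing at once:

```python
from fractions import Fraction
from math import comb

p_fail = Fraction(1, 1000)  # chance that any one system fails
n = 100                     # number of independent, redundant systems

# Probability that a majority (51 or more) of the 100 systems fail at the
# same time, assuming failures are independent: a binomial tail sum.
p_majority_fails = sum(
    comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
    for k in range(n // 2 + 1, n + 1)
)

print(float(p_majority_fails))  # on the order of 1e-124, i.e. effectively zero
```

The particular counts and threshold are stand-ins; the point is that with independent failures, the probability of losing a majority shrinks extremely fast as redundancy is added.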
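
As a loose illustration of the second bullet's "keep values separate from the world model and reset to known states" idea, here is a toy sketch. It is not a proposal for how an AGI would actually be built; the Agent class, its fields, and the integrity_check method are invented purely for illustration. The value module is checkpointed and fingerprinted separately from the world model, so unintended changes to it can be detected and rolled back while belief updates proceed normally:

```python
import copy
import hashlib
import pickle

class Agent:
    """Toy agent whose values live in a module separate from its world model."""

    def __init__(self, values, world_model):
        self.values = values            # fixed goal specification
        self.world_model = world_model  # beliefs, updated as information arrives
        # Snapshot the values at a known-good state and record a fingerprint.
        self._values_checkpoint = copy.deepcopy(values)
        self._values_digest = self._digest(values)

    @staticmethod
    def _digest(obj):
        return hashlib.sha256(pickle.dumps(obj)).hexdigest()

    def learn(self, observation):
        # New information only ever touches the world model, never the values.
        self.world_model.append(observation)

    def integrity_check(self):
        # Detect unintended value changes (bit flips, buggy updates, side
        # effects of learning) and reset to the last known-good checkpoint.
        if self._digest(self.values) != self._values_digest:
            self.values = copy.deepcopy(self._values_checkpoint)
            return False  # drift was detected and reverted
        return True


agent = Agent(values={"goal": "original objective"}, world_model=[])
agent.learn("new observation")       # belief update; values untouched
agent.values["goal"] = "corrupted"   # simulate an unintended value change
print(agent.integrity_check())       # False: drift detected and reverted
print(agent.values)                  # {'goal': 'original objective'}
```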

Once a superintelligence has secured its power over potential competitors, it’s in an extremely strong position to solve value drift problems: it can choose to replace itself with any possible program or set of programs, and it has vast amounts of time to invent, test, and carefully implement its strategies. Even if any particular plan has a flaw, it can probably come up with others.

The dynamics determining which values would succeed in a world dominated by AI would be very different from those of biological evolution. An AI could easily make perfect copies of itself. Under a singleton, agents would gain influence only to the extent that the singleton chose to grant it, for example because their values were aligned with its own; other selection pressures on agents' values would become unimportant. Even in a multipolar scenario, agents with a wide range of possible terminal values could use the same instrumental strategies and be similarly successful at propagating their values.

For a more detailed argument for why values could probably be locked into an AI indefinitely, see the report AGI and Lock-In.