Why wouldn't an AGI's utility function change over time, the way human values do?

An agent generally wants to keep valuing whatever it already values, a motive sometimes called "goal-content integrity". The reason is simple: if the agent keeps valuing something, it will keep working toward it, so there will be more of it; if its values change, its future self will work toward something else instead. This point is often illustrated with the example of Gandhi considering whether to take a murder pill. If he takes the pill, he'll start valuing murder, which will lead him to murder people. But by his current values, murder is a terrible outcome, so he refuses the pill.

This logic makes preserving the utility function an "instrumentally convergent" goal, valuable as a step toward a wide range of terminal goals.[1]
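The Gandhi example can be made concrete with a toy model. The sketch below is ours, not from the sources above: the utility function and the predicted outcomes are made-up numbers, purely for illustration. It shows why an expected-utility maximizer evaluates a proposed change to its utility function using its *current* utility function, and therefore refuses the change.

```python
# Toy illustration (made-up numbers): an expected-utility maximizer judges a
# proposed self-modification by its *current* utility function, because that is
# the only standard it has for judging outcomes.

def current_utility(outcome: dict) -> float:
    """Gandhi's current values: saving lives is good, murders are very bad."""
    return 10 * outcome["lives_saved"] - 100 * outcome["murders"]

# Predicted long-run outcomes of each option (purely illustrative).
options = {
    "refuse_pill": {"lives_saved": 50, "murders": 0},   # keeps current values and keeps pursuing them
    "take_pill":   {"lives_saved": 0,  "murders": 20},  # future self optimizes for murder instead
}

# The choice is scored by the agent's current values, not by the values it
# would have *after* modifying itself.
best = max(options, key=lambda name: current_utility(options[name]))
print(best)  # -> "refuse_pill"
```

The key point is in the last step: "start valuing murder" is scored by the values the agent has now, so it looks terrible even though the post-pill agent would endorse it.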

Despite this, human values do change over time. How come? Human motivations are complex and not well described as coherent maximization of any utility function, which makes it hard to give a definite answer. Here are some of the effects at play:

  • Our ability to control our own "source code" is limited. Even when we want to stabilize our values, we don't always succeed: consider an idealistic young person who goes into politics hoping to keep caring about helping people, but then is corrupted into caring only about power.

  • Even where we have control, we may not have understanding. We may worry that, if we try to articulate our values and stabilize ourselves toward caring about them, we'll overlook some crucial part and optimize it away.

  • We have a concept of moral progress: we may see ourselves as constructing our values, or think of our current values as a provisional guess at our "true" values. While we value the integrity of the process as a whole, and would want to stabilize our values against changes that result from, say, cosmic rays, we're often okay with changes that result from reflection.

  • We may change our values to more prosocial ones, because we're trying to cooperate with others and our motivations are partly transparent to them.

Analogous effects can apply to AI systems, which may also come out of a training process without fully coherent goals:

  • If an AGI is under development, and its human coders change its values, it may have no choice in the matter (although it might try to resist that change).

  • Even with full access to its source code, it might realize it doesn't understand the consequences of stabilizing itself. (Maybe it understands some of the difficulty of the alignment problem.)

  • Some AGIs may be programmed with indirect normativity: pointed at a process for working out what they should value, rather than at a fixed set of values, so that their current goals are provisional in much the way ours can be.

  • AGIs may change their utility function as part of a deal with other agents who can read their source code (a toy sketch of why such a deal can be attractive follows this list).
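As a toy illustration of that last point (the payoff numbers and the "compromise" framing are ours, purely for illustration): two agents can each prefer, by their original utility functions, a world in which both verifiably adopt a compromise utility function over a world in which they keep their original goals and fight.

```python
# Toy illustration (assumed payoffs): each agent scores outcomes by its own
# *original* utility function, yet both prefer a verified joint modification
# to a conflict in which each keeps its original goals.

# Payoffs are (agent_a_utility, agent_b_utility) under each joint policy.
payoffs = {
    ("keep_own", "keep_own"):     (3, 3),  # conflict: both spend resources fighting
    ("compromise", "keep_own"):   (1, 8),  # one-sided compromise gets exploited
    ("keep_own", "compromise"):   (8, 1),
    ("compromise", "compromise"): (6, 6),  # both adopt the compromise utility function
}

joint_modification = payoffs[("compromise", "compromise")]
no_modification = payoffs[("keep_own", "keep_own")]

# Each agent, judging by its original utility function, prefers the joint
# modification -- provided it can verify the other will actually follow through.
assert joint_modification[0] > no_modification[0]
assert joint_modification[1] > no_modification[1]
print("Both agents prefer the verified joint modification:", joint_modification)
```

Source-code transparency matters here because it is what lets each agent verify that the other's modification is real before giving up its own original goals.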

Whether agents would see reasons to self-modify toward a fixed utility function in the longer run, as these individual issues were resolved, is a matter of disagreement among the authors of this answer. But for a mature self-modifying superintelligence, the "goal-content integrity" argument above at least implies that its goals won't predictably drift toward ones strongly contrary to its current goals.



  1. See section 3, "AIs will try to preserve their utility functions", of Omohundro's paper "The Basic AI Drives" (2008). ↩︎