Will an AI's utility function change over time?

An agent generally wants to keep valuing whatever it already values, a motive sometimes called "goal-content integrity". The reason is simple: an agent that keeps its current values will go on acting on them, so, judged by those values, the future will contain more of what it cares about than if its values were replaced. This point is often illustrated with the example of Gandhi considering whether to take a murder pill. If he takes the pill, he'll start valuing murder and will therefore murder people. But by his current values, those murders are a terrible outcome, so he refuses the pill.
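
To make the reasoning concrete, here is a toy sketch in Python (the numbers, function names, and payoffs are all invented for illustration, not drawn from any real agent design). The key point is that both options are scored by the agent's current utility function, so taking the pill comes out worse even though the post-pill values would approve of it.

```python
# Toy model of the murder-pill argument. All numbers and function names are
# invented for illustration; nothing here describes a real agent design.

def murders_caused(takes_pill: bool) -> int:
    """Toy world model: taking the pill leads the agent to commit murders."""
    return 10 if takes_pill else 0

def current_utility(murders: int) -> float:
    """The agent's current values: each murder is very bad."""
    return -100.0 * murders

def post_pill_utility(murders: int) -> float:
    """The values the pill would install: murder becomes good."""
    return 100.0 * murders

def score(takes_pill: bool) -> float:
    # The choice is evaluated with the agent's *current* utility function,
    # even though taking the pill would change which function it uses later.
    return current_utility(murders_caused(takes_pill))

print(max([False, True], key=score))  # False: refusing the pill wins under current values

# The post-pill values would endorse taking it, but they don't get a vote:
print(post_pill_utility(murders_caused(True)) > post_pill_utility(murders_caused(False)))  # True
```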

This logic makes preserving its utility function an "instrumentally convergent" goal for an agent: one that is valuable as a step toward a wide range of terminal goals.

That said, this general motivation can be outweighed by specific circumstances. For example, an agent could benefit from changing its utility function in negotiations with other agents, since genuinely altering its values is one way to make its commitments credible. Agents can also be designed with "indirect normativity": they keep their overall framework stable while updating their best guess about which utility function they should maximize.
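
As a rough sketch of how indirect normativity keeps the framework stable while the best guess shifts, consider the toy Python example below (the candidate value systems, credences, and numbers are all hypothetical). The decision rule, maximizing expected utility over candidate utility functions, never changes; evidence only updates the credences over the candidates.

```python
# Toy sketch of indirect normativity: a fixed decision framework (maximize
# expected utility over candidate utility functions) plus updatable credences.
# All candidates, outcomes, and numbers here are hypothetical.

from typing import Callable, Dict

Outcome = str
UtilityFn = Callable[[Outcome], float]

# Candidate utility functions the agent is uncertain between.
candidates: Dict[str, UtilityFn] = {
    "hedonism": lambda o: 1.0 if o == "maximize_pleasure" else 0.0,
    "preference_satisfaction": lambda o: 1.0 if o == "satisfy_preferences" else 0.0,
}

# The agent's current credence in each candidate being the right one.
credence: Dict[str, float] = {"hedonism": 0.5, "preference_satisfaction": 0.5}

def expected_utility(outcome: Outcome) -> float:
    """The fixed framework: score an outcome by expected utility over candidates."""
    return sum(p * candidates[name](outcome) for name, p in credence.items())

def update_credence(likelihoods: Dict[str, float]) -> None:
    """Learning updates the credences (a Bayesian update), never the framework."""
    total = sum(credence[n] * likelihoods[n] for n in credence)
    for n in credence:
        credence[n] = credence[n] * likelihoods[n] / total

options = ["maximize_pleasure", "satisfy_preferences"]
print(max(options, key=expected_utility))   # exact tie, so the first option is returned
update_credence({"hedonism": 0.2, "preference_satisfaction": 0.8})
print(max(options, key=expected_utility))   # now "satisfy_preferences" wins
```

Changing the credences can change which action wins, but it never changes the rule the agent uses to choose.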

Whether AI systems will maximize a utility function in the first place is also a matter of debate.