What is corrigibility?
A 'corrigible' agent is one that doesn't interfere with what we would intuitively see as attempts to correct the agent, or to correct our mistakes in building it, despite the instrumentally convergent reasons to oppose these corrections.
-
If we try to pause or “hibernate” a corrigible AI, or shut it down entirely, it will let us do so. This is not something that an AI is automatically incentivized to let us do, since if it is shut down, it will be unable to fulfill its goals.
-
If we try to reprogram a corrigible AI, it will not resist this change and will allow this modification to go through. If this behavior is not specifically incentivized, an AI might try to prevent this, such as by attempting to fool us into believing its utility function was modified successfully, while actually keeping its original utility function as obscured functionality. By default, this deception could be a preferred outcome according to the AI's current preferences.
Further reading:
- Corrigibility As Singular Target (CAST) by Max Harms