Might AI alignment research lead to outcomes worse than extinction?

AI alignment is mostly oriented toward avoiding existential risks, but an unaligned AGI could also bring about scenarios that lead to s-risks (risks of astronomical suffering). However, some have theorized that attempting to align an AI could itself be a cause of s-risks.

S-risks are unlikely if everyone is dead[1], as might happen if an unaligned AGI is deployed, or in a host of other AI-related extinction scenarios. In contrast to a nuclear apocalypse, where the worst possible outcome is that we all die, powerful AI might be a source of s-risk in and of itself.

On the one hand, AGI could be a source of s-risk through, e.g., AI-enabled totalitarianism. One could imagine a malevolent human or AI enslaving humanity and torturing humans, as explored in the science-fiction story "I Have No Mouth, and I Must Scream".

On the other hand, Yudkowsky has theorized that if we were to attempt to give a powerful AI a good specification of human values V, then a bug in the programming might lead it to optimize for -V instead[2], i.e., the exact opposite of human values. On this view, specifying a maximally desirable outcome makes it marginally more likely that a maximally undesirable outcome will happen. This idea also appears in the Waluigi effect.
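As a toy sketch of how small such a bug can be (purely illustrative; the function and variable names below are hypothetical and not taken from any real system), a single stray minus sign in a hand-written value function is enough to make an otherwise correct optimizer pursue the worst-scoring outcome instead of the best-scoring one:

```python
# Toy illustration of a sign-flip bug (hypothetical code, not from any real
# alignment system): the optimizer behaves exactly as designed, but a single
# "-" in the hand-written value function inverts what it ends up pursuing.

def human_value(outcome: float) -> float:
    """Intended specification V: higher outcomes are better for humans."""
    return outcome


def buggy_value(outcome: float) -> float:
    """The same specification with a sign error: effectively -V."""
    return -outcome  # the bug: one stray minus sign


def choose_outcome(candidates, value_fn):
    """Stand-in for a powerful optimizer: pick whatever scores highest."""
    return max(candidates, key=value_fn)


candidates = [-10.0, 0.0, 10.0]  # -10 = very bad for humans, 10 = very good

print(choose_outcome(candidates, human_value))  # 10.0, the best outcome
print(choose_outcome(candidates, buggy_value))  # -10.0, the worst outcome
```

The point is not that real systems look like this, but that the gap between "optimize V" and "optimize -V" can be a one-character difference in the specification, while the optimizer itself works flawlessly in both cases.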

Finally, one could imagine that, in the process of figuring out the best way to achieve an objective related to human values, an aligned AGI might simulate entire populations to see how they react to certain situations. If these simulations are detailed enough, the simulated beings might have qualia, and simulations in which they suffer (sometimes called "suffering subroutines") could constitute an s-risk.


  1. There could be situations where all humans die but s-risks still arise. For instance, cryogenically preserved humans might be revived and mistreated, or simulations of humans or other beings with qualia might be tortured. ↩︎

  2. This has already happened in an OpenAI model, albeit in a harmless way. ↩︎