Could AI alignment research be bad? How?

While the hope is that AI alignment research will lead to good outcomes, there are a few noteworthy ways it could be bad.

1. Accelerating capabilities

Many aspects of alignment research are also relevant for increasing AI capabilities. If a particular approach to alignment research also accelerates the development of AI capabilities, it could be net negative for safety.

For example, reinforcement learning from human feedback (RLHF) is the technique of using human feedback to try to align an AI's goals with human desires. But RLHF has also been shown to increase an AI's learning efficiency. If RLHF is not sufficient to align an AI (for example, if it leads to deception), then this research might simply result in more capable, unaligned AIs. Even if the research is helpful for alignment, the tradeoff might not be worth it. Arguing that the research is more useful for alignment than for improving capabilities is not enough, since not all research can be done in parallel: if the research doesn't help with the parts of alignment that need the most time to develop, it won't actually speed up alignment on net.
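For readers unfamiliar with RLHF, here is a minimal toy sketch of its two stages. It is purely illustrative, assuming a made-up setting: the candidate responses, preference comparisons, and linear reward model below are stand-ins for the large neural networks and human-labeled data used in practice, not how real RLHF pipelines are implemented.

```python
# Toy sketch of the two RLHF stages: (1) fit a reward model to pairwise human
# preference labels, (2) push a policy toward outputs the reward model scores
# highly. All data here is made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Each candidate "response" is summarized by a small (hypothetical) feature vector.
responses = rng.normal(size=(6, 4))

# Stage 1: human comparisons saying response i was preferred over response j.
comparisons = [(0, 3), (1, 4), (2, 5), (0, 4)]
w = np.zeros(4)  # parameters of a linear reward model

for _ in range(500):  # Bradley-Terry style logistic fit to the preferences
    grad = np.zeros_like(w)
    for preferred, rejected in comparisons:
        diff = responses[preferred] - responses[rejected]
        p = 1.0 / (1.0 + np.exp(-w @ diff))  # model's P(preferred beats rejected)
        grad += (1.0 - p) * diff
    w += 0.1 * grad

# Stage 2: the "policy" is just a softmax over the candidate responses;
# gradient ascent on expected learned reward stands in for RL fine-tuning.
logits = np.zeros(len(responses))
for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    rewards = responses @ w
    logits += 0.05 * probs * (rewards - probs @ rewards)

print("learned reward per response:", np.round(responses @ w, 2))
print("policy probabilities:", np.round(np.exp(logits) / np.exp(logits).sum(), 2))
```

The capabilities concern maps onto stage 2: the same machinery that steers the policy toward human-preferred outputs also makes the policy more effective at producing whatever the reward model rewards.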

Similarly, interpretability research yields a better understanding of how current models work, which might also help researchers design more capable new models.

Another way that alignment research could accelerate capabilities research is by drawing more investment into the field of AI in general.

2. A false sense of security

AI alignment research could also make companies more comfortable with deploying an unsafe system. If a system were completely unaligned and couldn’t reliably do what its designers wanted, they would be unlikely to deploy it. However, if alignment research leads to incompletely or deceptively aligned systems, those systems might appear to behave correctly, and a company might choose to release them.

For example, RLHF could remove all of the obvious problems during training, but the deployed system could still cause problems in situations very different from its training environment.

3. A near miss inducing an s-risk

A third problem could arise if AI alignment research reaches a high level but is not perfect. That could lead to a “near miss” scenario in which a system is aligned with something close to human values but is missing a critical element. A near miss could be worse than a completely unaligned AI, since it could lead to extreme suffering (an "s-risk"), an outcome arguably worse than the extinction a completely unaligned AI might cause.

This means that when developing an AI, we want to make sure our solution sits far away from catastrophic near-miss “solutions” in the space of possible values, so that a small error doesn’t land us in one of those outcomes.
