Could weak AI systems help with alignment research?

This is a subject of dispute among AI alignment researchers.

Some alignment strategies, such as AI safety via debate and iterated distillation and amplification (IDA), are built around exactly this mechanism.

In fact, for some organizations this is a key part of their research strategy. For example, OpenAI’s approach to alignment focuses on training models to assist human evaluation and, ultimately, training models to do alignment research itself.

Other researchers consider this approach not only unhelpful but actively harmful: any AI advanced enough to meaningfully contribute to alignment research would be just as difficult to align as the AGI it is supposed to help with, and would therefore carry essentially the same risks.