What are the possible levels of difficulty for the alignment problem?

Chris Olah, Anthropic, and others have tried to characterize a “spectrum” of opinions on how difficult the alignment problem might turn out to be. I’ll frame them as a series of “tiers”, based on this post, where ‘F’ is the easiest scenario and ‘S’ is the hardest.

F-Tier: Alignment by default.

This is the idea that AI systems will align to human interests (or else not be seriously dangerous) by default, with no interventions needed.

E-Tier: Current alignment techniques will scale.

This is the perspective that current techniques such as RLHF and Constitutional AI will continue to work as models become more capable, with no new techniques needed.

D-Tier: Oversight needed.

Fine-tuning a model’s behavior via RLHF or Constitutional AI is not enough. Human supervision is needed to ensure alignment, or else we need specialized AI assistance to help with oversight, whether by amplifying specialized data to simulate a human overseer, automating red teaming, or helping us better understand the model.

C-Tier: Advanced interpretability needed.

Current forms of interpretability are limited and generally cannot predict what a larger model can or can’t do; some think these methods hold little promise at all. C-Tier is the view that AI-assisted oversight and current interpretability are not enough, and that we will need far more advanced forms of mechanistic interpretability to ensure models are safe.

B-Tier: Pre-deployment experiments needed.

It is not enough to understand how AI systems work; potentially dangerous experiments need to be run before deployment to test how the systems will generalize.

A-Tier: Sharp left turn.

There will be a “sharp left turn” that renders experiments and other current techniques useless once the AI reaches a certain level of intelligence. Advanced theoretical research is needed, or else an entirely new paradigm for how AIs function must be created.

S-Tier: Alignment is impossible.

No attempt at aligning an AI will scale to superintelligence: any solution is either theoretically impossible or not humanly achievable. AI research must be banned, with focus directed instead toward improving human intelligence and coordination technology.

Chris Olah’s original categories were “Easy”, “Intermediate”, and “Pessimistic” scenarios. “Easy” covers methods such as RLHF and CAI; “Intermediate” holds that dangerous outcomes are possible but that safety is achievable with enough researchers working on the problem; “Pessimistic” holds that safe outcomes are not feasible at current rates of progress, even with sufficient numbers working on it. Anthropic’s original piece was similar, except that it categorized “Pessimistic” as alignment being fundamentally unsolvable, placing strategies that would need significant amounts of time under “near pessimistic”.