What is OpenAI's alignment research agenda?

The safety team at OpenAI plans to build a "minimum viable product for alignment" in the form of a "sufficiently aligned AI system that accelerates alignment research to align more capable AI systems."

They want to do this using reinforcement learning from human feedback (RLHF): ask humans which of an AI's outputs are better, and train the AI to produce outputs that this feedback rates highly. A central difficulty with RLHF is the informed oversight problem: AIs, especially as they become smarter than us, might make decisions that we don't understand well enough to evaluate as either good or bad. Jan Leike, who leads this team, views this as the core difficulty of alignment.

Their proposed solution is an AI-assisted oversight scheme built around a recursive hierarchy of AIs, in which each AI helps evaluate another AI that is only slightly smarter or more capable than itself, "bottoming out" at human evaluators for the least capable AI in the hierarchy. They are already working on getting current AIs to do useful evaluation-related work, such as summarizing books and critiquing AI-generated summaries of text.
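
To make these two ideas more concrete, here is a toy Python sketch; it is not OpenAI's actual code. Part (1) shows the shape of the basic RLHF loop (sample outputs, collect human preferences, fit a reward model, improve the policy against it), and part (2) shows a schematic recursive oversight hierarchy in which each model's evaluator is assisted by the previous, slightly weaker model. Every function name, and the stand-in rule that humans prefer longer outputs, is an illustrative assumption.

```python
"""
Toy illustration of (1) the basic RLHF loop and (2) a recursive oversight
hierarchy where each model is evaluated with help from the slightly weaker
model below it, bottoming out at human judges. Not OpenAI's code; all names
and the "prefer the longer output" rule are made up for the example.
"""
import random

# --- (1) Basic RLHF -------------------------------------------------------
def sample_outputs(policy, prompt, n=2):
    """Draw n candidate outputs from the current policy."""
    return [policy(prompt) for _ in range(n)]

def human_prefers(a, b):
    """Stand-in for a human comparison; here 'longer summary is better'."""
    return a if len(a) >= len(b) else b

def fit_reward_model(preferred_outputs):
    """Toy reward model: ignore the data and just score by length,
    which happens to match the toy preference rule above."""
    return lambda output: len(output)

def rl_step(policy, reward_model, prompt):
    """One crude policy-improvement step: keep the higher-scoring candidate."""
    a, b = sample_outputs(policy, prompt)
    best = a if reward_model(a) >= reward_model(b) else b
    return lambda _prompt: best

prompt = "Summarize this book."
policy = lambda _p: random.choice(["short summary", "a longer, detailed summary"])
preferred = [human_prefers(*sample_outputs(policy, prompt)) for _ in range(20)]
reward_model = fit_reward_model(preferred)
policy = rl_step(policy, reward_model, prompt)

# --- (2) Recursive, AI-assisted oversight (schematic) ---------------------
def make_evaluator(assistant):
    """Return a comparison function: humans alone, or humans plus AI critiques."""
    def evaluate(a, b):
        if assistant is None:
            return human_prefers(a, b)  # weakest level: humans judge directly
        # The assistant critiques each output; the human judges output + critique.
        return human_prefers(a + " | " + assistant(a), b + " | " + assistant(b))
    return evaluate

assistant = None
for level in range(3):
    evaluator = make_evaluator(assistant)
    # Train the level-`level` model against `evaluator` (training loop omitted).
    model = lambda text, lvl=level: f"[level {lvl} critique of: {text}]"
    assistant = model  # this model helps oversee the next, more capable one

print(policy(prompt))
print(evaluator("short summary", "a longer, detailed summary"))
```

In the real scheme, the "policy improvement" and "training" steps are full reinforcement learning and fine-tuning runs, and the AI assistance takes forms like written critiques of summaries rather than string concatenation.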

Further reading:


