What is value learning?

If you write down rules or an explicit utility function for an AI system to follow, it is often possible to find ways for the AI to satisfy those rules literally while producing outcomes we never wanted. A classic example is the paperclip maximizer: an advanced AI tasked with creating as many paperclips as possible proceeds to turn the entire universe into paperclips. A core challenge with explicitly specifying our preferences to an AI is that even small errors in the specification can lead to badly unintended results.

Value learning is a proposed approach to this challenge: instead of being given an explicit specification, the AI learns human values by observing human behavior. It would be intractable for us to take millions of human interactions and convert them directly into a utility function by hand, but an AI system designed to learn from human interactions might be able to infer the right utility function on its own.
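To make this concrete, below is a minimal toy sketch in Python (using only NumPy) of one way value learning can work: the system never sees the human's utility function, only the human's choices, and it fits a utility model that makes those observed choices likely. The linear-utility setup, the softmax model of human choice, and all of the numbers are illustrative assumptions for this sketch, not a description of any real system.

```python
# Toy value learning sketch: infer a hidden utility function from observed choices.
# Everything here (features, weights, softmax choice model) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 4                                   # each option is described by 4 features
TRUE_WEIGHTS = np.array([2.0, -1.0, 0.5, 0.0])   # the human's hidden values (never shown to the learner)

def human_choice(options: np.ndarray) -> int:
    """Simulated human: picks among options softmax-rationally according to true utility."""
    utilities = options @ TRUE_WEIGHTS
    probs = np.exp(utilities - utilities.max())
    probs /= probs.sum()
    return int(rng.choice(len(options), p=probs))

# Collect demonstrations: sets of candidate options plus the human's choice in each.
demos = []
for _ in range(1000):
    options = rng.normal(size=(3, N_FEATURES))   # three candidate options per situation
    demos.append((options, human_choice(options)))

# Value learning step: fit weights w by maximizing the likelihood of the observed
# choices under the same softmax choice model, via simple gradient ascent.
w = np.zeros(N_FEATURES)
learning_rate = 0.5
for _ in range(200):
    grad = np.zeros(N_FEATURES)
    for options, chosen in demos:
        utilities = options @ w
        probs = np.exp(utilities - utilities.max())
        probs /= probs.sum()
        # Gradient of log P(chosen): features of the chosen option minus expected features.
        grad += options[chosen] - probs @ options
    w += learning_rate * grad / len(demos)

print("true weights:   ", TRUE_WEIGHTS)
print("learned weights:", np.round(w, 2))   # should approximately recover the hidden values
```

The point of the sketch is the division of labor it illustrates: the utility function is inferred from observed behavior rather than written down by hand, which is what value learning proposes to do at a much larger and messier scale.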

We don’t get this kind of AI system design for free, though. For example, a chess-playing AI may not have anything in its source code that allows it to take examples of human behavior as input at all. A more general system that is capable of observing human behavior may not care to follow human values and may use the information to better manipulate and deceive humanity instead. Understanding how we can safely and reliably impart human values is a hard problem and is an important component of aligning advanced AI.

Further reading: