In "aligning AI with human values", which humans' values are we talking about?

Imagine tomorrow we had an AGI that we could program with values. What should these values be? Which of the 8 billion or so humans alive today should be represented in these values? Should people of the future be represented if we are forced to lock in its values? These are important questions that concern both philosophers[1] and AI safety engineers.

One view is that we should focus on developing techniques to train an AI with an arbitrary goal, and only worry about the specific goals once we have those techniques. According to this view, the hard problem is how to reliably get a superintelligence to pursue any goal whatsoever (without causing human extinction in the process). For these researchers, worrying about which values to implant is like worrying about where to send a space mission when our rockets still explode on the launch pad.

A second view maintains that we can’t develop universal methods of instilling values into an AI system, because instilling different types of values requires different training methods. Specifically, human values (with all of their nuance and complexity) would need to be learned differently than a specific, simple task: the training techniques themselves demand a concrete understanding of human values (and how they are learned) to be successful.

Some AI researchers are worried about this problem for another reason: even our own values change over time, and we want to make sure that there is room for values to develop as humanity matures ethically. One possible solution is coherent extrapolated volition (CEV): having an AI aim at the goals that we, humanity, would pursue if we knew more, were more morally mature, and lived more in line with who we wish we were, thus extrapolating an ideal form of human values (which all humans would naturally converge towards). However, since this extrapolation might not match the current set of human values, it could result in the AI following future values that appear entirely alien to present-day humans.

A further difficulty is how to aggregate different people’s values. For instance, if Alice wants all humans (not merely herself) to live according to certain principles (e.g. religious ones) and Bob does not want this, such preferences could never both be maximally satisfied, even if the superintelligence were to align large chunks of the universe to each of their preferences. Changing Alice’s preference so that she does not care whether Bob acts in that way would resolve the contradiction between their preferences, but would involve changing Alice in a way that she would not approve of (at least before the change).
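To make the conflict concrete, here is a toy sketch in Python. The world model and utility functions are hypothetical illustrations invented for this example, not part of any real alignment system: Alice's utility depends on how everyone acts, while Bob's depends only on his own actions, so no single outcome gives both of them their maximum at once.

```python
# Toy model (hypothetical): Alice wants everyone to follow certain principles,
# Bob wants not to follow them himself. No world state maximizes both utilities.
from itertools import product

# Each "world" is a pair (alice_follows_principles, bob_follows_principles).
worlds = list(product([True, False], repeat=2))

def alice_utility(world):
    alice_acts, bob_acts = world
    # Alice wants both herself and Bob to follow the principles.
    return int(alice_acts) + int(bob_acts)

def bob_utility(world):
    alice_acts, bob_acts = world
    # Bob only cares about not following the principles himself.
    return int(not bob_acts)

best_for_alice = max(alice_utility(w) for w in worlds)
best_for_bob = max(bob_utility(w) for w in worlds)

# Worlds that give both agents their personal maximum simultaneously:
jointly_optimal = [
    w for w in worlds
    if alice_utility(w) == best_for_alice and bob_utility(w) == best_for_bob
]
print(jointly_optimal)  # [] -- the preferences cannot both be maximally satisfied
```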

This is the subject of social choice theory, a subfield of economics and philosophy dedicated to formally studying how to aggregate the preferences of many individuals into a single ‘social welfare function’. On the more empirical side, a 2022 paper from researchers at DeepMind investigated how we might use LLMs to “help people with diverse views find agreement”. This is still an unsolved problem, but one proposed first step is simulated deliberative democracy. It opens the door to further ethical questions, such as who will count as a human being and what to do with internally conflicted values, as well as practical problems, such as ML systems being trained on datasets that reflect the opinions of the groups most active on the internet, who are not representative of humanity as a whole.
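For a flavor of what a formal aggregation rule looks like, here is a short Python sketch of the Borda count, one standard rule from social choice theory (the voters and policy options below are hypothetical examples chosen for illustration). Each voter ranks the options, options score points by rank, and the totals define a collective ordering. With the cyclically rotated profile used here, every option ties, a small illustration of why aggregation is hard.

```python
# Minimal sketch of a social welfare function: the Borda count.
# Voters and options are hypothetical, chosen only to illustrate aggregation.
from collections import defaultdict

def borda_scores(rankings):
    """Each ranking lists options from most to least preferred.
    An option in position p (0-indexed) of a ranking over n options
    receives n - 1 - p points; the totals define the collective ordering."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] += n - 1 - position
    return dict(scores)

# Three voters with cyclically rotated value orderings.
rankings = [
    ["freedom", "equality", "tradition"],
    ["equality", "tradition", "freedom"],
    ["tradition", "freedom", "equality"],
]

print(borda_scores(rankings))
# {'freedom': 3, 'equality': 3, 'tradition': 3} -- a three-way tie:
# this rule alone cannot pick a winner for this preference profile.
```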


  1. Bostrom calls this “Philosophy with a deadline” on page 255 of his book Superintelligence. ↩︎