What is “Do what I mean”?

“Do what I mean” is an alignment strategy in which the AI is programmed to try to do what the human meant by an instruction, rather than to follow the instruction’s literal interpretation (akin to following the spirit of the law over the letter). This could help with alignment in two ways. First, it might allow the AI to learn more subtle goals that you might not have been able to state explicitly. Second, it might make the AI corrigible: because it is trying to do what you mean, it remains interested in what people actually want, open to having its goals or programming corrected, and willing to allow itself to be shut off if need be.

This approach contrasts with the more typical “do what I say” approach of programming an AI by giving it an explicit goal. The problem with an explicit goal is that if it is misstated or leaves out some detail, the AI will optimize for something we don’t want. Think of the story of King Midas, who wished that everything he touched would turn to gold, and died of starvation as a result.

One specific “do what I mean” proposal is “Cooperative Inverse Reinforcement Learning” (CIRL), in which the goal is hidden from the AI. Since it doesn’t have direct access to its reward function, the AI tries to infer the goal from the things you tell it and from the examples you give it, gradually getting closer to doing what you actually want.
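To make the flavor of this concrete, here is a minimal sketch of goal inference under a deliberately toy setup. It is not the actual CIRL algorithm, and the candidate goals, actions, rewards, and rationality parameter below are all invented for illustration. The AI keeps a probability distribution over a few candidate goals, updates it from observed human choices under the assumption that the human is noisily rational, and picks the action that maximizes expected reward under its current beliefs:

```python
import math

# Hypothetical candidate goals (invented for illustration): each maps
# an action to the reward that goal would assign it.
CANDIDATE_GOALS = {
    "fetch coffee":        {"make coffee": 1.0, "make tea": 0.0, "shut down": 0.0},
    "fetch tea":           {"make coffee": 0.0, "make tea": 1.0, "shut down": 0.0},
    "stay out of the way": {"make coffee": 0.0, "make tea": 0.0, "shut down": 1.0},
}
ACTIONS = ["make coffee", "make tea", "shut down"]

def likelihood(goal, chosen_action, rationality=5.0):
    """P(human chooses this action | goal), for a noisily rational
    (Boltzmann) human: better actions are exponentially more likely."""
    rewards = CANDIDATE_GOALS[goal]
    z = sum(math.exp(rationality * rewards[a]) for a in ACTIONS)
    return math.exp(rationality * rewards[chosen_action]) / z

def update(posterior, observed_action):
    """Bayesian update: goals that explain the human's choice gain weight."""
    unnormalized = {g: p * likelihood(g, observed_action)
                    for g, p in posterior.items()}
    total = sum(unnormalized.values())
    return {g: p / total for g, p in unnormalized.items()}

def best_action(posterior):
    """Act to maximize expected reward under the current belief over goals."""
    def expected_reward(action):
        return sum(p * CANDIDATE_GOALS[g][action] for g, p in posterior.items())
    return max(ACTIONS, key=expected_reward)

# Start maximally uncertain: a uniform prior over what the human means.
posterior = {g: 1.0 / len(CANDIDATE_GOALS) for g in CANDIDATE_GOALS}

# Each observed human choice is evidence about the hidden goal.
for demo in ["make tea", "make tea"]:
    posterior = update(posterior, demo)
    print({g: round(p, 3) for g, p in posterior.items()},
          "-> best action:", best_action(posterior))
```

After a couple of demonstrations of the human making tea, the posterior concentrates on “fetch tea” and the AI’s best action shifts accordingly. Because the AI is never fully certain of the goal, new evidence can always change its behavior, which is the property that makes this style of system plausibly open to correction.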

For more information, see Do what we mean vs. do what we say by Rohin Shah, in which he defines a "do what we mean" system, shows how it might help with alignment, and discusses how it could be combined with a "do what we say" subsystem for added safety.

For a discussion of a spectrum of different levels of "do what I mean" ability, see Do What I Mean hierarchy by Eliezer Yudkowsky.


