What is "metaphilosophy" and how does it relate to AI safety?

Metaphilosophy is the study of the nature and methods of philosophy itself: roughly, how philosophical reasoning works and how philosophical progress is made. Wei Dai has argued that metaphilosophy is important for thinking about AI safety. The concern is that an AI might be committed to mistaken philosophical assumptions about, e.g., ethics, decision theory, infinity, or the computability of the universe, yet still have enough practical intelligence to cause an existential disaster.

While humans have made progress in philosophy, we don't have an algorithm we can follow to guarantee further progress. Dai proposes that by better understanding metaphilosophy, we might be able to formalize a method for making philosophical progress and program it into a "white-box metaphilosophical AI".

Another approach is to have an AI learn how to make philosophical progress from human examples, which Dai calls “black-box metaphilosophical AI” because we wouldn’t understand how it worked on the inside. However, this approach would require an already advanced AI whose safety would be hard to ensure.

Alternatively, we could try to solve the philosophical problems ourselves. Dai has argued that this approach is also unpromising: the track record of human philosophy suggests that, without AI assistance, we’re unlikely to anticipate all the relevant problems that could arise with the emergence of AGI.


