What is John Wentworth's research agenda?

John Wentworth's alignment plan is to (1) resolve confusions about agency, and (2) do ambitious value learning (i.e. build an AI that correctly learns human values and optimizes for them).

His current approach to (1) is based on selection theorems, which describe what types of agents are selected for across a broad range of environments. Examples of selection pressures include biological evolution, stochastic gradient descent, and economic markets. This approach to agent foundations focuses on observing existing structures, whether mathematical objects or real-world systems like markets or E. coli. This contrasts with the approach taken by, e.g., MIRI, of listing desiderata and then searching for mathematical formalizations that satisfy them.

Two key properties that might be selected for are (1) the use of particular "abstractions" in modeling the world, and (2) modularity.


Abstractions are the higher-level concepts people use to describe the world, like "tree", "chair", and "person". Each abstraction covers many distinct instances while discarding low-level detail, which makes it useful for compactly narrowing down what is being talked about. Humans tend to use similar abstractions, even across different cultures and societies. The natural abstraction hypothesis states that a wide variety of cognitive architectures will tend to use similar abstractions to reason about the world. If some form of the natural abstraction hypothesis were true, it would imply that we could use ordinary terms like "person" in our communication and instructions to an AI and expect the AI to generally have the same concept ("abstraction") of "person" that we do, without us needing to rigorously define "person". This could plausibly make aligning that AI easier.

The natural abstraction hypothesis seems plausible for physical objects in the world, and human values seem to take such objects as inputs, so it may hold for the inputs to human values. If so, it would be helpful for AI alignment because it would help solve the ontology identification problem: if we can understand which environments induce which abstractions, we can design an AI's environment so that it acquires the same abstractions as humans.


Many selection environments produce systems that exhibit modularity: for example, organisms have cells, organs, and limbs, and companies have departments. We might therefore predict that artificial neural networks are also modular, but in practice it has proven hard to identify modules in today's networks. A reliable method for identifying modularity in neural networks could improve our ability to interpret them.
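To make "modularity" concrete, here is a toy sketch, not Wentworth's actual method, using a standard measure from network science (Newman's modularity score). We treat the absolute weights between neurons as a weighted graph and check that a partition aligning with the graph's dense blocks scores much higher than a shuffled partition. The weight matrix here is synthetic, built to contain two modules by construction.

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q of a partition (labels) on a weighted,
    symmetric adjacency matrix A with zero diagonal."""
    m = A.sum() / 2.0          # total edge weight
    k = A.sum(axis=1)          # weighted degree of each node
    Q = 0.0
    for c in np.unique(labels):
        idx = labels == c
        # fraction of weight inside community c, minus the fraction
        # expected if edges were placed at random given the degrees
        Q += A[np.ix_(idx, idx)].sum() / (2 * m) - (k[idx].sum() / (2 * m)) ** 2
    return Q

rng = np.random.default_rng(0)
n = 20
# Synthetic "inter-neuron weights": two dense blocks (modules)
# connected by weak cross-block weights.
A = rng.uniform(0.0, 0.1, (n, n))
A[:10, :10] += rng.uniform(0.5, 1.0, (10, 10))
A[10:, 10:] += rng.uniform(0.5, 1.0, (10, 10))
A = (A + A.T) / 2.0
np.fill_diagonal(A, 0.0)

block = np.array([0] * 10 + [1] * 10)   # partition matching the true modules
shuffled = rng.permutation(block)       # same sizes, random assignment

# The block partition scores much higher than the shuffled one.
print(modularity(A, block), modularity(A, shuffled))
```

In a real network the hard part is the reverse direction: the true partition is unknown, and one must search for groupings of neurons that score highly, which is where current methods struggle.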