What are the differences between subagents and mesa-optimizers?

A subagent is an agent which combines with other subagents to compose a larger agent. For example, in shard theory, each shard is a subagent which pursues its own goal, and the goals of the system as a whole emerge from the negotiation between these shards.

A mesa-optimizer is similar to a subagent in that it also optimizes for its own goals. However, unlike a subagent, it is a separate trained model. The mesa-optimizer is shaped by the base optimizer, but is not part of it. The base optimizer might be an AI system searching for the best solution to some problem defined by its human designers. In some cases, that solution will be a simple algorithm that is not an optimizer, but in other cases, the best solution is itself an optimizer; such a solution is a mesa-optimizer. This mesa-optimizer may be optimizing for a goal that differs from the problem definition given by the designers, and it need not be agent-like in a narrow sense.

Although we have no particular reason to expect subagents to emerge from a process of gradient descent[1], there is a more plausible story as to why mesa-optimizers would emerge. For example, if a program is designed to solve a problem in a very unpredictable environment, the optimal solution might be to create a planner which generates new solutions in real time. This planner is itself an optimizer since it searches through possible plans and selects the best one. The standard by which the planner judges how good a plan is serves as a proxy for the base optimizer’s goal, but is not necessarily identical to that goal.
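The planner story can be made concrete with a toy sketch. Everything in it is a hypothetical illustration rather than a model of any real system: an agent walks along a number line, the base objective rewards ending at position 2, and the planner's internal proxy simply prefers ending as far right as possible. With a two-step horizon (standing in for training) the proxy and the base objective pick equally good plans; with a three-step horizon (standing in for a new environment) they come apart.

```python
from itertools import product

# Hypothetical toy world: start at 0 on a number line, move -1, 0, or +1 per step.
ACTIONS = (-1, 0, 1)

def base_objective(final_position):
    """What the base optimizer rewards: ending exactly at position 2."""
    return -abs(final_position - 2)

def mesa_objective(final_position):
    """The planner's internal proxy (an assumption for illustration): further right is better."""
    return final_position

def plan(objective, horizon):
    """Search every action sequence of length `horizon` and return the best one under `objective`."""
    return max(product(ACTIONS, repeat=horizon), key=lambda p: objective(sum(p)))

# "Training": with only two steps, going as far right as possible lands exactly on 2.
train_plan = plan(mesa_objective, horizon=2)
train_score = base_objective(sum(train_plan))

# "Deployment": with three steps, the same proxy overshoots the target.
deploy_plan = plan(mesa_objective, horizon=3)
deploy_score = base_objective(sum(deploy_plan))

# For comparison, a planner that optimized the base objective directly.
ideal_score = base_objective(sum(plan(base_objective, horizon=3)))

print(train_score, deploy_score, ideal_score)
```

The `plan` function is an optimizer in the sense used above: it searches through possible plans and selects the best one by its own standard. Whether that standard matches the base objective depends on the situation it is run in.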

In short, a subagent is an agent that is a part of an agent; a mesa-optimizer is an optimizer that is optimized by an optimizer.


  1. This is because having some part of the model turn into an agent probably offers no advantage in achieving the base goal.