What is Aligned AI / Stuart Armstrong working on?

One of the key problems in AI safety is that there are many ways for an AI to generalize off-distribution, so an arbitrary generalization is very likely to be unaligned. See the model splintering post for more detail. Aligned AI's plan to solve this problem is as follows:

  1. Maintain a set of all possible extrapolations of reward data that are consistent with the training process.

  2. Pick a safe reward extrapolation from among these.

They are currently working on algorithms to accomplish step 1: see Value Extrapolation.
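
To make the two-step structure concrete, here is a minimal Python sketch of it. All of the names (`consistent_with_data`, `extrapolation_set`, `pick_safe`) are hypothetical stand-ins for illustration, not Aligned AI's actual code or API.

```python
# A minimal sketch of the two-step structure, not Aligned AI's actual method.
# Step 1: keep every candidate reward function that fits the training data.
# Step 2: pick a safe one from that set (here: a placeholder).

from typing import Callable, List, Tuple

State = Tuple[float, ...]          # toy stand-in for an environment state
Reward = Callable[[State], float]  # a candidate reward extrapolation

def consistent_with_data(candidate: Reward,
                         data: List[Tuple[State, float]],
                         tol: float = 1e-6) -> bool:
    """Step 1 filter: does this candidate reproduce the observed rewards?"""
    return all(abs(candidate(s) - r) <= tol for s, r in data)

def extrapolation_set(candidates: List[Reward],
                      data: List[Tuple[State, float]]) -> List[Reward]:
    """All candidate extrapolations consistent with the training process."""
    return [c for c in candidates if consistent_with_data(c, data)]

def pick_safe(candidates: List[Reward]) -> Reward:
    """Step 2 placeholder: conservatism or deference to humans would go here."""
    return candidates[0]

# Toy usage: two candidates agree on the training data but disagree elsewhere.
data = [((0.0, 0.0), 0.0), ((1.0, 1.0), 1.0)]
candidates = [lambda s: s[0], lambda s: s[1]]   # both fit the data above
survivors = extrapolation_set(candidates, data)
chosen = pick_safe(survivors)
```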

Their initial operationalization of this problem is the lion and husky problem: suppose you train an image classifier on a dataset of lions and huskies, where the lions always appear in the desert and the huskies always appear in the snow. The learning problem is then under-determined: should the classifier distinguish images by the background environment (snow vs. sand), or by the animal in the image?
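
A toy numerical version of the same underdetermination (not the actual lion/husky image dataset): when two features are perfectly correlated in training, a rule based on either feature fits the data equally well, and the two rules only come apart off-distribution.

```python
# Toy stand-in for the lion/husky problem (not the real image data):
# feature 0 encodes the animal, feature 1 encodes the background, and the
# two are perfectly correlated in the training set.

train = [
    # ((animal: 0=lion, 1=husky), (background: 0=desert, 1=snow)) -> label
    ((0, 0), "lion"),
    ((1, 1), "husky"),
]

def classify_by_animal(x):
    return "husky" if x[0] == 1 else "lion"

def classify_by_background(x):
    return "husky" if x[1] == 1 else "lion"

# Both rules achieve 100% training accuracy...
assert all(classify_by_animal(x) == y for x, y in train)
assert all(classify_by_background(x) == y for x, y in train)

# ...but they disagree on a husky photographed in the desert.
off_distribution = (1, 0)
print(classify_by_animal(off_distribution))      # "husky"
print(classify_by_background(off_distribution))  # "lion"
```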

On this problem, a good extrapolation algorithm would generate classifiers covering all the different ways of extrapolating[4], so that the 'correct' extrapolation is guaranteed to be in the generated set. They have also introduced a new dataset built on a similar idea: Happy Faces.
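
One way such a set of differently-extrapolating classifiers could be produced is to train several heads that all fit the labeled training data while being pushed to disagree on unlabeled off-distribution inputs, loosely in the style of diversity-through-disagreement methods such as DivDis. The sketch below is illustrative only and is not Aligned AI's published algorithm; `MultiHead`, `loss_fn`, and the particular disagreement penalty are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_HEADS = 4  # how many distinct extrapolations to look for

class MultiHead(nn.Module):
    """One shared trunk with several classification heads."""
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, 2) for _ in range(N_HEADS))

    def forward(self, x):
        z = self.trunk(x)
        return [head(z) for head in self.heads]  # one logit pair per head

def loss_fn(labeled_logits, y_labeled, unlabeled_logits):
    # Every head must fit the labeled (spuriously correlated) training data...
    fit = sum(F.cross_entropy(logits, y_labeled) for logits in labeled_logits)
    # ...while a penalty on unlabeled off-distribution inputs gets more
    # negative (lowering the loss) as heads spread away from their mean
    # prediction, rewarding heads that generalize differently.
    probs = torch.stack([F.softmax(logits, dim=-1) for logits in unlabeled_logits])
    mean_probs = probs.mean(dim=0, keepdim=True)
    diversity = -(probs - mean_probs).abs().mean()
    return fit + diversity

# Toy usage with random tensors standing in for image features.
model = MultiHead(in_dim=8)
x_lab, y_lab = torch.randn(16, 8), torch.randint(0, 2, (16,))
x_unlab = torch.randn(16, 8)
loss = loss_fn(model(x_lab), y_lab, model(x_unlab))
loss.backward()
```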

Step 2 could be done in several ways. Possibilities include conservatism, generalized deference to humans, or an automated process for removing clearly bad goals such as wireheading, deception, or killing everyone.
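
As one possible reading of the conservatism option: score actions by their worst-case reward across the whole set kept from step 1, a maximin rule. The sketch below is an illustration of that idea with hypothetical helper names, not Aligned AI's method.

```python
# Conservative (maximin) selection over the set of reward extrapolations:
# prefer actions that every surviving extrapolation considers acceptable.

from typing import Callable, List, Tuple

State = Tuple[float, ...]
Reward = Callable[[State], float]

def conservative_choice(actions: List[State],
                        reward_set: List[Reward]) -> State:
    """Pick the action whose minimum reward over all extrapolations is highest."""
    return max(actions, key=lambda a: min(r(a) for r in reward_set))

# Toy usage: the two extrapolations disagree sharply about the second action,
# so the conservative pick avoids it.
reward_set = [lambda s: s[0], lambda s: -s[0]]
actions = [(0.1,), (5.0,)]
print(conservative_choice(actions, reward_set))  # (0.1,)
```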