What if we put the AI in a box and have a second, more powerful, AI with the goal of preventing the first one from escaping?

Preventing an AI from escaping by using a more powerful AI gets points for creative thinking, but unfortunately we would need to have already aligned the second AI. Even if the second AI's only terminal goal were to prevent the first AI from escaping, it could still develop dangerous instrumental goals, for example, converting the rest of the universe into computer chips so that it would have more processing power with which to contain the first AI.

One could instead try to bind a stronger AI with a weaker one, but this is unlikely to work: the stronger AI could likely outwit or overpower its less capable guard. This is essentially the same problem as humans (less intelligent agents) trying to control AI (more intelligent agents) in general.

Further, there is a chance that the two AIs would work out a deal in which the first AI stays in its box while the second AI does whatever the first would have done had it escaped. Let's call the AI in the box the ‘prisoner’ and the AI outside the ‘guard’. You could imagine such AIs boxed like nesting dolls, with each guard itself guarded, but there would always be a highest level at which the outermost guard AI is boxed only by humans. Given this, it is not clear what such a design buys you, as all the problems of containing an AI still apply at that top level.

It’s worth noting that some AI researchers think this kind of collusion could be prevented, and believe that having multiple AIs that are unable to collude may be safer than a single monolithic intelligence.