Why should we prepare for AGI now, instead of waiting until it's closer?

Merge with this live article into here when this is finished . Possibly rename this oneWhy should we prepare for human-level AI technology now rather than decades down the line when it’s closer?

There are a number of reasons for not waiting until AGI is imminent to work on safety:

  • We can make progress on alignment before we have AGI

  • It’s hard to know when AGI will be imminent and the present time might not be particularly early

  • Adding safety as an afterthought does not yield good results

  • Safety solution must be tested by time as well as experts

  • Social and political preparations take time

  • An unknown amount of fundamental work might be needed

First, it’s worth noting that it is not necessary to wait for AGI to make progress on aligning it.

  • Progress on agent foundations does not depend on the capabilities of current AI.

  • Most interpretability work is done on toy models, which are much less capable than current day foundation models.[1]

  • There are known examples of specification gaming with current systems which computer scientists do not currently know how to solve in a systematic way.

  • The invention of RLHF in 2015, years before it was first used in LLMs, illustrates that conceptual breakthroughs can happen before the technology to use them is available.

It is also difficult to know when AGI is near. There’s no universal agreement on what would signal imminent AGI, and AGI might already be fairly close.

If we don’t have a way of understanding how these systems work, we may be unable to detect deception or other signs of misaligned goals.

One reason is that adding safety to a system as an afterthought does not yield good results. Hendrycks et al. explain that “if attention to safety is delayed, its impact is limited, as unsafe design choices become deeply embedded into the system,” citing a report for the Department of Defense which concludes that “approximately 75% of the most critical decisions that determine a system’s safety occur early in development”. They mention the internet as an example of a system which remains unsafe decades after it was built because it was not built to be safe.

Hendrycks continues:

Relying on experts to test safety solutions is not enough—solutions must also be age tested. The test of time is needed even in the most rigorous of disciplines. A century before the four color theorem was proved, Kempe’s peer-reviewed proof went unchallenged for years until, finally, a flaw was uncovered. Beginning the research process early allows for more prudent design and more rigorous testing. Since nothing can be done both hastily and prudently, postponing machine learning safety research increases the likelihood of accidents.

A second reason is that we want to ensure that we have a robust solution to AGI before AGI arrives, which is hard to guarantee since we don’t know for sure either when AGI will arrive or how long it will take to complete safety research. AI alignment may still need fundamental breakthroughs, which take time. If we delay fundamental research until advanced superintelligence is imminent, it will be too late. This problem is compounded by the possibility of an intelligence explosion: once an AI system becomes good enough at AI research and development, it might suddenly jump in capability without us having an opportunity to react. And even before becoming superhuman in all domains, AI might become dangerously capable in key domains like hacking and biotechnology. That means we may need well-developed alignment strategies before then.

Some types of research could be valuable far in advance of understanding exactly which type of system will be built. For example, basic research on the mathematical structure of agency can occur before we build such systems, much as Turing developed his theory of computation before people built computers. Similarly, developing general methods for identifying undesirable behavior can occur before we build the systems we want to test.

Early preparation has already proven to be helpful in the development of RLHF. This technique was proposed as a solution to alignment in 2015 and applied to toy examples, like teaching a virtual robot to do a backflip. It was used 5 years later to train modern LLMs like GPT3 and as a basis for Constitutional AI which is used in Anthropic’s Claude.

In addition to technical preparations, social and political preparations also take time. In making decisions about AI, we face the Collingridge dilemma: if we wait to see how it impacts society, it may become deeply embedded and hard to change. It takes time to properly formulate and build consensus around regulation, and this process has to be completed before harms become entrenched.


  1. OpenAI attempted to use GPT-4 to interpret the much simpler GPT-2 and Anthropic’s work on monosemanticity is done on a toy model with only one layer. ↩︎