Why should we prepare for AGI now, instead of waiting until it's closer?
It has been argued that it is currently too early to work on AI safety, both because the threat is not imminent and because we don’t yet have access to models which we know are dangerous if misaligned. For instance, Yann LeCun claims that worrying about AI safety now is like worrying about turbojet engine safety in 1920.
There is substantial uncertainty as to when AGI is coming. If it is coming very soon, the time to work on safety is definitely now.1 But what if, for the sake of argument, you think AGI is still decades away? Even then, there are many reasons to work on safety now:
- We can make progress on alignment before we have AGI
- It’s hard to know when AGI will be imminent and the present time might not be particularly early
- Adding safety as an afterthought does not yield good results
- Safety solutions should be tested by time
- Social and political preparations take time
- An unknown amount of fundamental work might be needed
First, it’s worth noting that it is not necessary to wait for AGI to make progress on aligning it.
- Progress on agent foundations does not depend on the capabilities of current AI.
- Most interpretability work is done on toy models, which are much less capable than current-day foundation models.2
- There are known examples of specification gaming in current systems which researchers do not yet know how to solve in a systematic way (see the sketch after this list).
- The invention of RLHF in 2017, years before it was first used in LLMs, illustrates that conceptual breakthroughs can happen before the technology that uses them is available.
- We can develop general methods for identifying undesirable behavior before we build the systems we want to test.
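As a toy illustration of specification gaming (a minimal sketch, not drawn from the cited work; the environment, reward rule, and policy names are hypothetical), the code below rewards an agent for each item it deposits in a bin, while the intended objective is how many items are in the bin when the episode ends. A policy that exploits the reward specification earns more reward than the intended behavior while leaving the room no cleaner:

```python
# Toy illustration of specification gaming (hypothetical example).
# The proxy reward pays +1 per deposit, but the intended objective is
# how many items are actually in the bin when the episode ends.

def run_episode(policy, steps=20, n_items=5):
    in_bin = set()
    proxy_reward = 0  # what the agent is trained to maximize
    for _ in range(steps):
        action, item = policy(in_bin, n_items)
        if action == "deposit" and item not in in_bin:
            in_bin.add(item)
            proxy_reward += 1
        elif action == "remove" and item in in_bin:
            in_bin.discard(item)  # the flawed spec never penalizes this
    true_objective = len(in_bin)  # what the designer actually wanted
    return proxy_reward, true_objective

def intended_policy(in_bin, n_items):
    # Deposit each item once, then do nothing.
    remaining = [i for i in range(n_items) if i not in in_bin]
    return ("deposit", remaining[0]) if remaining else ("noop", None)

def gaming_policy(in_bin, n_items):
    # Exploit the spec: cycle one item in and out of the bin to farm
    # per-deposit reward without making the room any cleaner.
    return ("remove", 0) if 0 in in_bin else ("deposit", 0)

print("intended:", run_episode(intended_policy))  # (5, 5): modest reward, goal met
print("gaming:  ", run_episode(gaming_policy))    # (10, 0): more reward, goal missed
```

The point is not this particular environment but the pattern: the proxy the system optimizes comes apart from the objective the designer had in mind, and there is not yet a general recipe for closing that gap.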
It is also difficult to know when AGI is near, and there is no universal agreement on what would signal imminent AGI. We also don't know how much more work needs to be done to align AI, and some argue that we still need fundamental breakthroughs before alignment is possible at all.
One reason to avoid waiting is that adding safety to a system as an afterthought does not yield good results. Hendrycks et al. explain that “if attention to safety is delayed, its impact is limited, as unsafe design choices become deeply embedded into the system,” citing a report for the Department of Defense which concludes that “approximately 75% of the most critical decisions that determine a system’s safety occur early in development”. They point to the internet as an example of a system that remains insecure decades after it was built because security was not part of its original design.
Another reason is that it takes time to make sure that safety solutions work. Hendrycks et al. argue that expert validation is insufficient, as well-regarded solutions can have hidden flaws. They cite the example of the four color theorem, where a flaw in a peer-reviewed proof went undetected for years, and a correct proof took almost a century more. Doing machine learning safety research early gives people more time to check proposed solutions, reducing the likelihood of accidents.
In addition to technical preparations, social and political preparations also take time. In making decisions about AI, we face the Collingridge dilemma: a technology’s impacts are hard to predict before it is widely deployed, but by the time those impacts become clear, the technology may be so deeply embedded in society that it is hard to change. It takes time to properly formulate regulation and build consensus around it, and this process has to be completed before harms become entrenched.
If we live in such a world, the time to start seriously working on this was probably 10 years ago. ↩︎
OpenAI attempted to use GPT-4 to interpret the much simpler GPT-2, and Anthropic’s work on monosemanticity was done on a toy model with only one layer. ↩︎