A case for AI safety

Smarter-than-human AI may come soon and could lead to human extinction

TL;DR: Companies are racing to build smarter-than-human AI. Experts think they may succeed in the next decade. But rather than building AI, they’re growing it — and nobody knows how the resulting systems work. Experts argue over whether we’ll lose control of them, and whether this will lead to our demise. And although some decision-makers are talking about extinction risk, humanity does not have a plan.

Human-Level AI Is Approaching Rapidly

When an AI model called Deep Blue played chess champion Gary Kasparov in 1997, Newsweek called it ‘the brain’s last stand’ (Deep Blue won, obviously). Twenty years later, another AI model took on the much greater challenge of playing the Korean game ‘Go’ and beat Go champion Lee Sedol, one Korean newspaper reported it caused many Koreans to turn to drink.

Today, news of AI outperforming humans in yet another field, or even an entire category, is met with a collective shrug. Object recognition fell in 2016. The SAT and the bar exam fell in 2022 and 2023 respectively. Even the Turing Test, which stood as a benchmark for “human-level intelligence” for decades, was quietly rendered obsolete by ChatGPT in 2023. Our sense of what counts as impressive keeps shifting.

The pace is relentless. Breakthroughs that once required decades now happen quarterly. Research labs scramble to design new, harder benchmarks, only to watch as the latest AI models master them within months.

The big prize is now Artificial General Intelligence (AGI). Not narrow AI that excels at one task, such as playing chess, but AI that can learn anything a human can learn, and do it better.

Tech companies like OpenAI, Anthropic and Google DeepMind are spending billions in a high-stakes race to be the first to build AGI systems that match or exceed human capabilities across virtually all cognitive domains.

People involved with these companies — investors, researchers and CEOs — don't see this as science fiction. They're betting fortunes on their ability to build AGI within the next decade. Read that again: the leaders in this field, who have access to bleeding edge, unpublished systems, believe human-level AI could arrive before you next need to renew your passport. This could happen next decade, or it could happen next year.

What happens if they succeed? Such systems could think thousands of times faster than biological brains and never tire and we could create millions of them. They could access and process vast amounts of publicly available information in seconds, integrating knowledge from books, papers, and databases far faster than any human could. They could automate most intellectual work — including AI research itself.

That last point is crucial — AGI could accelerate AI development even further, leading to superintelligent AI. Don’t think of superintelligent AI as a single smart person, think instead of the entire team of geniuses from the Manhattan Project, working at 1000x speed.

This isn't distant speculation. Independent AI researchers see these predictions as worryingly plausible, and the companies building these systems openly discuss what happens after AGI, planning for a world where digital minds surpass their creators. Yet our social systems, our regulations, and our collective understanding remain focused on current AI capabilities, rather than on the transformative systems that may be developed in the coming years.

What does a society look like with millions of Oppenheimers working around the clock? What will we do when there is an AI that can do more or less any job far better than us, and never needs a lunch break? What happens when AI surpasses its creators both in numbers and capabilities? The truth is, we don’t know.

If things go well, these advances could help solve some of humanity's greatest challenges: disease, poverty, climate change, and more. But if these systems don't reliably pursue the goals we intend, if we fail to ensure they're aligned with human values and interests... it could be game over for humanity.

It’s time to pay attention.

AI Scientists Are Sounding the Alarm

Some of these predictions might sound like science fiction, but many of the world's leading AI experts are genuinely concerned that AI could pose an existential risk to humanity.

Among academic researchers who have spent decades advancing AI, some have shifted from making AI more capable to warning about its risks. Look at Geoffrey Hinton and Yoshua Bengio, both Turing Award winners widely known as "godfathers of AI". These aren't fringe voices — they're among the most cited AI researchers in the world.

  • Hinton left his position at Google in 2023 specifically to speak freely about these risks. When awarded the 2024 Physics Nobel Prize, he used this platform to warn about AI dangers.
  • Bengio has redirected his research focus toward ensuring AI systems remain aligned with human values and beneficial to society.

These concerns are also shared by those actively working to build AGI in industry:

  • Sam Altman (OpenAI CEO) has stated that if things go poorly with advanced AI, it could be "lights out for all of us".
  • Other prominent figures voicing concerns include Ilya Sutskever (OpenAI co-founder), Dario Amodei (Anthropic founder), Demis Hassabis and Shane Legg (Google DeepMind founders), and Elon Musk (xAI founder).

Many of these experts signed a statement declaring that "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war".

As with any complex issue, there are dissenting voices. Some believe that AGI is many decades away, while others think that we can control AGI like any other technology. However, it's no longer reasonable to dismiss extinction risk from AI as a fringe concern when many pioneers of the technology are raising alarms.

So, what is it that has got these researchers so worried?

Let’s look at what makes AI different from other technologies we’ve created in the past, and what the potential risks are.

Modern AI Systems Remain Fundamentally Opaque

With AGI potentially arriving within a decade, we need to make sure that it works in the way we intend it to. This is sometimes called the ‘alignment problem’, essentially: can we ensure that this super powerful intelligence’s objectives are the ones we want it to have?

This is made harder by the fact that we don’t actually know how our most advanced AI systems work — and that makes ensuring their safety extremely difficult.

Modern AI isn’t built like regular software — it’s grown through training on massive datasets. Even though we have complete access to an AI's parameters — the billions of numbers that determine its behavior — we're not even close to understanding how these parameters work together to produce the AI's outputs.

Caption: The internal representation of a neural network identifying digits (source)

Imagine trying to understand how a human brain forms a thought by examining individual neurons — even if you can see all the parts and connections, the ways in which the low-level interactions add up to the higher-level behavior are so incredibly complex as to be inscrutable.

This lack of understanding is a serious problem when we try to make AI systems that reliably do what we want. Our current approach is essentially to provide millions of examples of the desired behavior, tweak the AI's parameters until its behavior looks right, and then hope the AI generalizes correctly to new situations.

The opacity of modern AI systems is particularly concerning because we're explicitly trying to build goal-directed AI. When we talk about an AI system having a "goal", we mean it consistently acts to push toward a particular outcome across different situations. For instance, a chess AI's 'goal' is to win because all its moves push toward checkmate, and it will adapt to respond to the moves of its adversary.

A chess AI is limited in the things it can do: it can read the game board and output legal moves, but it can't access its opponent’s match history or threateningly stare them down. But as AI systems become more capable and given more access to the world, they can take much broader sorts of actions in the world, and can successfully pursue more complicated goals. And of course companies are rushing to build these broader goal-directed systems because these systems are expected to be incredibly useful for solving real-world problems, and thus profitable.

And here's where the danger emerges: there's often a gap between the goals we intend to give AI systems and the goals they actually learn.

Even in fairly simple examples, the AI can be unhelpfully ‘creative’ when it comes to its goals. For example, researchers attempting to build an AI that could win at a boat racing game trained it to collect as many points as possible so that it would learn to steer the boat to victory. Well, it turns out that in this game, spinning around in a circle and crashing into things gets you points as well, so the boat ignored the race entirely!

Caption: The Coast Runners boat looping to pick up power-ups.

Or consider how OpenAI trained ChatGPT to avoid unsavoury responses, like teaching users how to make weapons or role-playing erotic encounters. Days after release, users circumvented these safeguards through creative prompting.

Despite significant research efforts, the alignment problem remains stubbornly difficult. And as AI capabilities grow, the stakes of getting it wrong become increasingly severe.

The Stakes Rise with System Capability

The consequences of misalignment in today’s AIs are usually manageable — an incorrect recommendation or an inappropriate response. We can mostly shrug these off. But this won't remain true for long.

Once AI systems are superhuman and act in situations that are very different from the situations they were trained in, the stakes change dramatically. Subtle misalignments between what we intended and what they learned could be magnified, potentially catastrophically so.

This isn't theoretical — we see it all around us. Social media wasn't built to damage teenage mental health; it was built to maximize engagement. The harm wasn't the goal; it was collateral damage from relentless optimization for something else entirely. The stronger the optimization, the more it prizes apart the gaps between the target and what we want.

Human values are intricate and nuanced. Almost every goal that an AI could optimize for would, when pursued to the limit, fundamentally conflict with what we care about in some important way. Even AI systems designed with seemingly beneficial goals can be dangerous if they miss crucial dimensions of what matters to us.

As AI capabilities grow, the alignment challenge becomes not just important but potentially existential. Almost any goal, when pursued with superintelligent capabilities, naturally gravitates toward seeking greater control and more resources. This isn't because the AI would be malicious — it would simply be pursuing its objectives with ruthless efficiency.

Imagine deploying a misaligned superintelligent AI system to "reduce global conflict". It might begin by mediating international disputes and suggesting novel diplomatic solutions. Remember, this AI is a better strategist than any human ever was, and can craft persuasive arguments for every party. It can also craft and execute massive propaganda campaigns (let's call them 'information initiatives') to ensure the public is on board. To ensure lasting peace, it might gradually gain control over military systems (to prevent accidental escalation), financial networks (to enforce economic incentives for cooperation), and communication infrastructure (to detect and prevent brewing conflicts). Each step would seem beneficial in isolation, but together they would give it unprecedented control over human society.

Not only is there a risk that AI seeks to take control over society, but there are also surprisingly credible reasons for thinking that it might, in the end, do us great harm as well. If humans tried to prevent the AI from gaining so much power, it might see us as a threat to it achieving its stated goals. If it required resources to achieve its goals, it might see us as in competition with it for those resources.

Think about the way we humans have interacted with the natural world: hunting entire species of animals to extinction when they threaten our crops or livestock, or simply destroying their habitats to collect wood or make space for our cities and farm land. How can we be sure super-capable AI won’t, in the end, treat us as we’ve treated other beings less capable than us?

The Dangerous Race to AI Supremacy

Solving AI alignment before AGI is deployed is critical. If AI companies succeed in building AGI in the next decade, this does not leave us a lot of time to solve alignment.

But, here’s the problem. While companies pour billions into making AI more capable, safety research remains comparatively underfunded and understaffed. We're prioritizing making AI more powerful over making it safe.

Companies face intense pressure to deploy their AI products quickly — waiting too long could mean falling behind competitors or losing market advantage. Moreover, AI companies continue pushing for increasingly autonomous, goal-directed systems because such agents are expected to create far greater economic value than non-agentic systems. But these agents pose significantly greater alignment challenges: they are inherently harder to align because of their expanded action-space, and their failures are much more dangerous since they can autonomously take harmful actions without human oversight. This creates a dangerous situation where AI systems that are both very capable and increasingly autonomous might be deployed without being thoroughly tested.

Many employees at the companies building these AIs think that everybody should slow down and coordinate, but there are currently no mechanisms that enable this, so they keep on racing towards ever more powerful systems.

A similar dynamic may emerge between nations racing to build AGI first. Developing ever more powerful AI could confer massive economic, military benefits, which would be a package deal with major societal disruption. Shortly after, someone develops a superintelligence that could outfight all of humanity. If misaligned, such an ASI would threaten everyone, including its creators, regardless of who else possesses the technology. Ironically, while China has signalled reluctance towards developing AGI, a common narrative in the US is that competition is inevitable, potentially creating a self-fulfilling prophecy.

So, what should we do?

Right now, there are disparate plans on how to solve this challenge, but none of them has created a consensus. Here are a few things you can do:

  1. Keep learning: this website is built to help you understand what AGI is and what its consequences might be for the world: the more we know, the more likely we are to steer AI in a positive direction. You can read our more detailed explanation to learn more, or if you already have a good understanding, you can visit our How Can I Help section.
  2. Share this information with others: part of the problem is that not many people realise that there is an issue to be solved. For many people, AI is still that app that turns their photos into Studio Ghibli cartoons. Helping others to be aware of the issue can help build pressure for this to be taken more seriously.


AISafety.info

We’re a global team of specialists and volunteers from various backgrounds who want to ensure that the effects of future AI are beneficial rather than catastrophic.

© AISafety.info, 2022—1970

Aisafety.info is an Ashgro Inc Project. Ashgro Inc (EIN: 88-4232889) is a 501(c)(3) Public Charity incorporated in Delaware.