How would we evaluate if an AI is an AGI?

AGI refers to an AI system that can perform a wide range of cognitive tasks at a level comparable to humans. Measuring this is not easy: many tasks once considered difficult have been claimed to be achievable by an AI only if it were as intelligent as a human.

Before 2022, AI systems at the frontier of progress were specialized, or “narrow AI”: they outperformed humans in specific domains like board games, but remained unable to do a broad range of tasks.

Significant progress has been made on many such tasks, including computer vision, natural language understanding, and autonomous driving, but in retrospect few people consider the AI systems most capable at these problems to be generally intelligent. For example, some believed that outperforming humans at Go would require general human-level intelligence, yet the first systems to do so were not considered generally intelligent.[1] Problems considered difficult enough to require AGI have been informally known as AI-complete or AI-hard.

Since 2022, the development of LLMs and their multimodal successors has led some to argue that these systems constitute “AGI”, because they perform well on tasks they were not specifically trained for.

We outline here some proposed approaches to measuring AI capabilities against human capabilities, but this is by no means an exhaustive list.

The Turing Test

The classic ‘Imitation Game’, now commonly known as the Turing test, assesses a machine’s ability to exhibit intelligent behavior by checking whether a human evaluator can distinguish it from a human. The Turing test is not precisely defined, and some AIs, from early chatbots like ELIZA to modern systems such as GPT-4, have at times been able to convincingly mimic a human.
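As a rough illustration (a toy sketch, not a standard implementation), the imitation game can be framed as a chat harness in which a judge converses with two unlabeled participants and then guesses which one is the machine. The `Judge`, `human`, and `machine` objects below are hypothetical stand-ins for whatever evaluator and systems are being tested.

```python
import random

class Judge:
    """Hypothetical judge interface: asks questions and makes a final guess."""
    def ask(self, label, transcript):
        raise NotImplementedError
    def identify_machine(self, transcript):
        raise NotImplementedError  # should return "A" or "B"

def imitation_game(judge, human, machine, n_turns=5):
    """Run one round of a text-only imitation game.

    `human` and `machine` are callables mapping a question to a reply.
    Returns True if the machine "passes", i.e. the judge misidentifies it.
    """
    # Randomly assign the two participants to anonymous labels.
    participants = {"A": human, "B": machine}
    if random.random() < 0.5:
        participants = {"A": machine, "B": human}

    transcript = {"A": [], "B": []}
    for _ in range(n_turns):
        for label, participant in participants.items():
            question = judge.ask(label, transcript)
            reply = participant(question)
            transcript[label].append((question, reply))

    machine_label = "A" if participants["A"] is machine else "B"
    return judge.identify_machine(transcript) != machine_label
```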

Forecasting resolution criteria

Metaculus, a forecasting platform, has two questions asking when AGI will arrive. Each question’s “resolution criteria” require an AI system to succeed at four tests of ability benchmarked against human performance.

The resolution criteria for the “weak AGI” question involve four tasks “easily completable by a typical college-educated human”:

  • Passing a Turing test of the type that would win the Loebner Silver Prize, which requires the AI system to convince judges that the human participant is the AI.

  • Scoring 90% or more on a version of the “Winograd Schema Challenge” – a multiple-choice test consisting of a specific type of question that requires knowledge about the world (a toy scoring sketch follows this list) – where humans also score 90% or more.

  • Scoring in the 75th percentile on the mathematics section of a standard SAT exam, using just images of the exam pages.

  • Exploring all 24 rooms in the Atari game "Montezuma's Revenge", using only visual inputs and standard controls, with less than the human equivalent of 100 hours of play.
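To make the Winograd Schema criterion above concrete, here is a minimal, hypothetical scoring sketch (not the official benchmark harness). The two schemas are classic illustrations of the format, and `model_choose` stands in for whatever system is being evaluated.

```python
# Toy sketch of scoring Winograd-style items against a 90% accuracy bar.
# `model_choose` is a hypothetical stand-in for the system being evaluated.

SCHEMAS = [
    {
        "text": "The trophy doesn't fit in the suitcase because it is too big.",
        "question": "What is too big?",
        "options": ["the trophy", "the suitcase"],
        "answer": "the trophy",
    },
    {
        "text": "The city council refused the demonstrators a permit "
                "because they advocated violence.",
        "question": "Who advocated violence?",
        "options": ["the city council", "the demonstrators"],
        "answer": "the demonstrators",
    },
]

def winograd_accuracy(model_choose, schemas, threshold=0.90):
    """Return (accuracy, whether it clears the human-level threshold)."""
    correct = sum(
        model_choose(s["text"], s["question"], s["options"]) == s["answer"]
        for s in schemas
    )
    accuracy = correct / len(schemas)
    return accuracy, accuracy >= threshold

# Example: a trivial baseline that always picks the first option gets only
# one of the two items right, so it falls well short of the 90% bar.
accuracy, passed = winograd_accuracy(lambda text, q, opts: opts[0], SCHEMAS)
print(f"accuracy = {accuracy:.0%}, clears 90% bar: {passed}")
```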

The resolution criteria for the question about the first “general AI system” involve four tasks completable by “at least some humans”:

  • Passing a long (around two hours), informed, adversarial Turing test in which judges can exchange text, images, and audio with the participants.

  • Demonstrating general robotic capability, for example by satisfactorily assembling a complex scale-model car when given human-readable instructions.

  • Scoring highly across a broad question-answering benchmark covering many areas of expert knowledge.

  • Solving interview-level programming problems with high accuracy.

t-AGI

The t-AGI framework, proposed by Richard Ngo, benchmarks the difficulty of a task by how long it would take a human to complete it. For instance, an AI that can recognise objects in an image and answer trivia questions would be considered a 1-second AGI, because it can do tasks that take a human about a second, while an AI that can develop new apps or review scientific papers would be considered a 1-month AGI.
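As a toy illustration (our assumption about one way the framework might be operationalized, not Ngo's formal definition), one could label tasks with how long a skilled human needs and report the largest time horizon at which the AI still succeeds on almost all of them:

```python
# Toy sketch: estimate a t-AGI horizon as the longest human task duration
# at which the AI still succeeds on (almost) all tasks. The task list and
# success labels are made up for illustration.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    human_seconds: float   # how long a skilled human typically needs
    ai_succeeded: bool     # whether the AI did the task to a human standard

def t_agi_horizon(tasks, required_success_rate=0.9):
    """Return the largest horizon t (in seconds) such that the AI succeeds
    on at least `required_success_rate` of tasks humans do within t."""
    best = 0.0
    for t in sorted({task.human_seconds for task in tasks}):
        subset = [task for task in tasks if task.human_seconds <= t]
        rate = sum(task.ai_succeeded for task in subset) / len(subset)
        if rate >= required_success_rate:
            best = t
    return best

tasks = [
    Task("identify an object in an image", 1, True),
    Task("answer a trivia question", 5, True),
    Task("write a short working script", 3600, True),
    Task("review a scientific paper", 30 * 24 * 3600, False),
]
print(f"approximate t-AGI horizon: {t_agi_horizon(tasks)} seconds")
```

With these made-up labels the sketch reports a horizon of one hour: the AI handles the second-scale and hour-scale tasks but not the month-scale one.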


  1. AlphaGo, however, led to the more general AlphaZero, which was able to play multiple board games, and then MuZero, which was able to play both board games and Atari games. ↩︎