
The ARC-AGI-2 benchmark is designed to be a difficult test for AI models
Just_Super/Getty Images
The most sophisticated AI models available today score poorly on a new benchmark designed to measure progress towards artificial general intelligence (AGI), and brute-force computing power won't be enough to improve those scores, because the benchmark's evaluation also takes into account the cost of running a model.
There are many competing definitions of AGI, but it is generally taken to refer to an AI capable of performing any cognitive task that humans can do. To measure this, the ARC Prize Foundation previously launched a test of reasoning ability called ARC-AGI-1. Last December, OpenAI announced that its o3 model had scored highly on the test, prompting some to ask whether the company was close to achieving AGI.
But now a new test, ARC-AGI-2, has raised the bar. While the leading AI systems on the market struggle to score more than single digits out of 100 on it, every question has been solved by at least two humans in fewer than two attempts.
In a blog post introducing ARC-AGI-2, ARC Prize Foundation president Greg Kamradt said a new benchmark was needed to test skills that differ from those covered by previous iterations. "To beat it, you need to demonstrate both high levels of adaptability and high efficiency," he writes.
The ARC-AGI-2 benchmark differs from many other AI benchmarks in that it doesn't focus on a model's ability to match world-leading PhD performance, but on its ability to complete simple tasks, such as replicating a change to a new image based on past examples of symbolic interpretation. Current models are strong at the kind of "deep learning" measured by ARC-AGI-1, but fare far worse on the seemingly simpler ARC-AGI-2 tasks, which demand more challenging symbolic reasoning and interaction. For example, OpenAI's o3-low model scored 75.7 per cent on ARC-AGI-1, but just 4 per cent on ARC-AGI-2.
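For readers unfamiliar with the format, ARC-style tasks are essentially small grid puzzles: each task provides a few input-output example pairs, and the solver must infer the transformation and apply it to a fresh input. The Python sketch below illustrates that structure with an invented toy rule (mirroring a grid horizontally); the task data and the rule are illustrative assumptions, not actual ARC-AGI-2 items.

```python
# Minimal sketch of an ARC-style task. Grids are 2D lists of ints
# (each int representing a colour). A task has training pairs and a
# test input; the solver must infer the transformation from the
# examples. The toy task here (horizontal mirroring) is an invented
# illustration, not a real ARC-AGI-2 item.

task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[4, 5, 6]],      "output": [[6, 5, 4]]},
    ],
    "test": [{"input": [[7, 8], [9, 0]]}],
}

def mirror_horizontal(grid):
    """Candidate rule: reverse each row of the grid."""
    return [list(reversed(row)) for row in grid]

# Check the candidate rule against every training example...
assert all(
    mirror_horizontal(pair["input"]) == pair["output"]
    for pair in task["train"]
)

# ...then apply it to the held-out test input.
print(mirror_horizontal(task["test"][0]["input"]))  # [[8, 7], [0, 9]]
```

Humans find inferring rules like this from a couple of examples relatively easy; it is exactly this kind of few-shot abstraction that current models struggle with on ARC-AGI-2.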
The benchmark also adds a new dimension to measuring AI capability by examining the efficiency of problem-solving, as measured by the cost required to complete a task. For example, ARC pays a human tester $17 per task, while it estimates that o3-low would incur around $200 in computing costs for the same task.
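As a back-of-the-envelope illustration of why the cost axis matters, the sketch below combines score and per-task cost into a single comparison using the figures quoted above. The combined "points per dollar" ratio is our own illustrative construction, not ARC's methodology, and the human score is an assumption based on the calibration described earlier.

```python
# Illustrative cost-efficiency comparison using the figures quoted in
# the article ($17/task for a human tester, ~$200/task and a 4 per cent
# score for o3-low). The 100 per cent human score is an assumption
# (every task was solved by at least two humans); combining the two
# dimensions into one ratio is our own construction, not ARC's.

solvers = {
    "human tester": {"score_pct": 100.0, "cost_per_task_usd": 17.0},
    "o3-low":       {"score_pct": 4.0,   "cost_per_task_usd": 200.0},
}

for name, s in solvers.items():
    # Percentage points of ARC-AGI-2 score obtained per dollar spent.
    points_per_dollar = s["score_pct"] / s["cost_per_task_usd"]
    print(f"{name}: {points_per_dollar:.3f} points per dollar")

# Output:
#   human tester: 5.882 points per dollar
#   o3-low: 0.020 points per dollar
```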
"I think the new iteration of ARC-AGI, which now focuses on balancing performance with efficiency, is a major step towards a more realistic evaluation of AI models," says Joseph Imperial at the University of Bath, UK. "This is a sign that we are moving away from one-dimensional evaluations that focus only on performance, towards ones that also consider the cost of computing power."
Models that can pass ARC-AGI-2 will need to be not only highly capable but also smaller and lighter, says Imperial, because efficiency is a key component of the new benchmark. That could help address concerns that AI models are becoming increasingly energy-intensive, sometimes wastefully so, in pursuit of better results.
However, not everyone is convinced the new measure will be useful. "The whole framing of this as a test for intelligence is just not the right framing," says Catherine Flick at Staffordshire University, UK. Rather than simply assessing whether an AI can complete a single task or set of tasks well, she says, these benchmarks are extrapolated to imply general capability across a whole range of tasks.
Performing well on these benchmarks shouldn't be treated as a watershed moment for AGI, says Flick.
Another question is what happens if, or when, ARC-AGI-2 is beaten. Will yet another benchmark be needed? "If they develop ARC-AGI-3, I'm guessing they'll add another axis to the graph besides performance and efficiency: [the] minimum number of humans, whether expert or not, it would take to solve a task," says Imperial. In other words, debates about AGI are unlikely to be settled any time soon.
Source: www.newscientist.com