ARC-AGI-2: Leading AI Models Fail Latest Artificial General Intelligence Test

The ARC-AGI-2 benchmark is designed to be a difficult test for AI models

Just_Super/Getty Images

The most sophisticated AI models available today have scored poorly on a new benchmark designed to measure progress towards artificial general intelligence (AGI), and brute-force computing power will not be enough to improve those scores, because evaluators now take into account the cost of running the model.

There are many competing definitions of AGI, but it is generally taken to mean an AI capable of performing the cognitive tasks that humans can do. To measure this, the ARC Prize Foundation previously launched a test of reasoning ability called ARC-AGI-1. Last December, OpenAI announced that its o3 model had scored highly on the test, leading some to ask whether the company was close to achieving AGI.

But now a new test, ARC-AGI-2, has raised the bar. It is difficult enough that no current AI system on the market can achieve a score above single digits out of 100, even though every question has been solved by at least two people in fewer than two attempts.

In a blog post introducing ARC-AGI-2, ARC president Greg Kamradt said a new benchmark was needed to test skills that differ from those in the previous iteration. “To beat it, you need to demonstrate both high levels of adaptability and high efficiency,” he wrote.

The ARC-AGI-2 benchmark differs from other AI benchmark tests in that it focuses not on an AI’s ability to match world-leading PhD-level performance, but on its ability to complete simple tasks, such as replicating changes in a new image based on past examples of symbolic interpretation. Current models are good at the “deep learning” that ARC-AGI-1 measured, but are not as good at the seemingly simpler tasks in ARC-AGI-2, which require more challenging thinking and interaction. For example, OpenAI’s o3-low model scored 75.7% on ARC-AGI-1, but only 4% on ARC-AGI-2.

The benchmark also adds a new dimension to measuring AI capability by examining the efficiency of problem solving, measured as the cost required to complete a task. For example, ARC paid human testers $17 per task, while it estimates that o3-low costs $200 for the same task.
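To make the cost comparison concrete, the short sketch below works through the arithmetic using only the two per-task figures quoted above. The function names and the batch size of 100 tasks are hypothetical, chosen purely for illustration; they are not part of how ARC reports its results.

```python
# Per-task costs quoted in the article (USD).
HUMAN_COST_PER_TASK = 17.0
O3_LOW_COST_PER_TASK = 200.0


def cost_ratio(model_cost: float, human_cost: float) -> float:
    """Return how many times more the model costs per task than a human."""
    return model_cost / human_cost


def total_cost(cost_per_task: float, n_tasks: int) -> float:
    """Return the total cost of running n_tasks at a fixed per-task cost."""
    return cost_per_task * n_tasks


if __name__ == "__main__":
    ratio = cost_ratio(O3_LOW_COST_PER_TASK, HUMAN_COST_PER_TASK)
    print(f"o3-low costs about {ratio:.1f} times as much per task as a human tester")

    # Hypothetical batch size, assumed only for illustration.
    n = 100
    print(f"Human testers, {n} tasks: ${total_cost(HUMAN_COST_PER_TASK, n):,.0f}")
    print(f"o3-low, {n} tasks: ${total_cost(O3_LOW_COST_PER_TASK, n):,.0f}")
```

On these figures, the model is roughly 12 times more expensive per task than a human tester, which is the kind of gap the efficiency dimension is meant to expose.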

“I think ARC-AGI’s new iteration, which now focuses on balancing performance and efficiency, is a major step towards a more realistic evaluation of AI models,” says Joseph Imperial at the University of Bath, UK. “This is a sign that we are moving away from one-dimensional evaluation tests that focus only on performance, and towards ones that also consider using less computing power.”

Any model that can pass ARC-AGI-2 would need to be not only highly capable, but also smaller and more lightweight, says Imperial, because model efficiency is a key component of the new benchmark. This could help address concerns that AI models are becoming more energy-intensive, sometimes to the point of wastefulness, in order to achieve ever better results.

However, not everyone is convinced that the new measure will be beneficial. “The whole framing of this to test intelligence is not the correct framing,” says Catherine Flick at Staffordshire University, UK. Instead, she says, these benchmarks merely assess an AI’s ability to complete a single task or set of tasks well, and that result is then extrapolated to imply general capability across a range of tasks.

Performing well on these benchmarks should not be seen as a major moment for AGI, says Flick.

Another question is what happens if, or when, ARC-AGI-2 is passed. Will yet another benchmark be needed? “If they develop ARC-AGI-3, I guess they’ll add another axis to the graph denoting [the] minimum number of humans, whether expert or not, it would take to solve the tasks, in addition to performance and efficiency,” says Imperial. In other words, the debate over AGI is unlikely to be settled soon.

Source: www.newscientist.com
