ARC-AGI-3 Ignites a Benchmark Battle Over What Counts as AI Progress

A new benchmark, three clashing numbers

When a small nonprofit posted a new artificial intelligence benchmark to arXiv last week, the headline results looked straightforward. On the test—called ARC‑AGI‑3—human players solve every task. Leading AI models, the benchmark’s creators wrote, score “below 1%.”

Within days, those figures were challenged from two directions. A Bay Area startup said it had reached more than 36%. An independent researcher claimed “human‑level performance” on some of the benchmark’s games, reporting per‑game efficiencies in the mid‑90% range.

All three numbers refer to performance on the same benchmark released publicly in the last week of March. Their collision has pushed ARC‑AGI‑3 from a niche technical test into a proxy battle over a larger question: who gets to decide how close today’s systems are to something like general intelligence.

From grid puzzles to interactive “games”

ARC‑AGI‑3 comes from the ARC Prize Foundation, a nonprofit that grew out of work by Google researcher François Chollet. In his 2019 paper, “On the Measure of Intelligence,” Chollet argued that intelligence should be defined as “skill‑acquisition efficiency over a scope of tasks,” emphasizing how quickly a system learns to solve new problems rather than how well it performs on familiar ones.

The original ARC benchmark (now called ARC‑AGI‑1) used small grid puzzles that are easy for people but hard for systems that rely on brittle pattern matching. ARC‑AGI‑2 became a widely cited reasoning test; major labs including OpenAI, Google DeepMind and Anthropic report scores on it in model releases.

ARC‑AGI‑3 pushes the same philosophy into interactive territory. Rather than static puzzles, it presents turn‑based, partially observable environments—effectively abstract “games”—where an agent must explore, infer hidden goals, build an internal model of the world and plan action sequences with little feedback.

“We introduce an interactive benchmark for studying agentic intelligence through novel, abstract, turn‑based environments,” the ARC‑AGI‑3 paper states.

The benchmark includes dozens of environments grouped into games with multiple levels. Many levels reveal only a limited slice of the world at a time. Rewards can arrive late, sometimes only upon completion, forcing systems to experiment, remember and generalize rather than simply react.

The scoring: human efficiency as the yardstick

ARC‑AGI‑3 rolls performance into a single score using an efficiency metric anchored to human play. The foundation recruited human testers and measured how many actions they took to clear each level on their first attempt. For each environment, the benchmark uses the second‑best of those first‑run human scores as the baseline.

A score is then computed by comparing an AI agent’s action count to that baseline. Community explanations describe a penalty that grows more than linearly (often summarized as a squared penalty) when an agent uses more actions than the human reference. If an agent uses fewer actions, the score is capped—superhuman efficiency does not increase a per‑level score above 100%.
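Under that description, the per-level score can be sketched in a few lines. This is an assumption-laden reconstruction based on community summaries (the squared penalty and the 100% cap), not the foundation's official formula:

```python
def level_score(agent_actions: int, human_baseline: int) -> float:
    """Sketch of a per-level efficiency score for ARC-AGI-3, assuming the
    community-described squared penalty and a cap at 100%. Hypothetical,
    not the official ARC Prize formula."""
    ratio = human_baseline / agent_actions
    # Matching or beating the human reference is capped at 100%:
    # superhuman efficiency earns no extra credit.
    if ratio >= 1.0:
        return 1.0
    # Using more actions than the baseline incurs a super-linear
    # (here, squared) penalty.
    return ratio ** 2

print(level_score(100, 100))  # matches baseline -> 1.0
print(level_score(200, 100))  # twice the actions -> 0.25
print(level_score(50, 100))   # half the actions, still capped -> 1.0
```

Note the asymmetry this creates: doubling the human action count costs 75% of the score, while halving it earns nothing extra, which is precisely the design choice critics later targeted.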

“A 100% score means AI agents can beat every game as efficiently as humans,” the ARC Prize website explains.

Under the protocol described in the paper and documentation, systems interact through a standardized interface and are evaluated as raw models or general‑purpose APIs from the labs. In that setup, the ARC‑AGI‑3 paper reports that humans solved all environments while “frontier AI systems … score below 1%.”

Early public results aligned with the test’s reputation for difficulty. In a community summary of the associated Kaggle competition, one participant reported that a random agent scored around 0.12 on a 0‑to‑1 scale, while an early top submission reached 0.25—barely double the random baseline.

The counterclaims: 36% and “human-level” on some games

Symbolica’s 36% on a public subset

On March 25, Symbolica AI, a San Francisco‑area startup, published a blog post describing performance on a public subset of ARC‑AGI‑3. Using Anthropic’s Claude 3.5 Opus inside its own “agent harness,” Symbolica reported an “unverified competition score” of 36.08%.

The company said its agent passed 113 of 182 playable levels in the evaluation set and completed seven of 25 games. It estimated the run cost about $1,005 in API calls. In the same post, Symbolica compared its approach with chain‑of‑thought baselines using leading large language models, which it said scored around 0.2% to 0.3% while costing nearly $9,000.

Symbolica acknowledged the result had not been independently verified or posted to an official leaderboard, and emphasized it did not train a new model; instead, it wrapped an existing model in a framework that tracks state, generates plans and decides when to query the model.

Seed IQ’s “95%” on some games

A more aggressive claim appeared on Reddit and social media from a project calling itself Seed IQ. In a post to r/ArtificialInteligence, the author said their system achieved “human‑level performance (95% score)” on some ARC‑AGI‑3 games “on day of release.”

In follow‑up comments, the poster—describing themselves as a former OpenAI researcher—said the 95% figure was not an overall benchmark score. Instead, it represented per‑game efficiency on a subset, where their agent took roughly 1.026 actions for each action in the human baseline. By inverting and squaring that ratio, they arrived at an efficiency near 0.95.
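The arithmetic in the poster’s explanation checks out, assuming the same squared-penalty convention. A quick reproduction of the numbers they cited:

```python
# Reproducing the calculation described in the Seed IQ post: the agent
# reportedly took about 1.026 actions for each action in the human
# baseline on a subset of games.
agent_actions_per_human_action = 1.026

# Invert (baseline / agent), then square, per the community-described
# squared-penalty scoring convention.
efficiency = (1 / agent_actions_per_human_action) ** 2

print(round(efficiency, 3))  # prints 0.95
```

That is a per-game figure on a chosen subset, which is why it cannot be read as an overall benchmark score.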

They also defended tool‑rich agents rather than bare models as a valid expression of progress: “AGI can be LLM + harness like how genius can be human + glasses or [programmer] + C,” they wrote.

Neither Symbolica’s nor Seed IQ’s results have been certified by the ARC Prize Foundation or Kaggle.

Why ARC‑AGI‑3 is drawing fire

The dueling claims have concentrated attention on ARC‑AGI‑3’s design, and on two questions in particular: what counts as a valid system, and what the score actually represents.

A widely shared Reddit critique, titled “ARC AGI 3 sucks,” argued that the benchmark’s human baseline and scoring procedure make the headline messaging misleading.

  • “Human baseline is not ‘human,’ it’s near‑elite human,” the author wrote, objecting to using the second‑best first‑run human as the reference while marketing the scale as “humans = 100%.”
  • The post also criticized the asymmetric scoring that caps superhuman gains while penalizing inefficiency sharply, arguing it can “erase big AI wins while amplifying losses.”
  • The author noted the lack of published average or median human scores.

Critics have also pointed to differences in how humans and agents experience the tasks. Human testers see full visual game screens and use intuitive controls, while AI agents may receive a text‑ or JSON‑encoded observation that could be harder for models trained primarily on internet text to interpret.

ARC has argued that anchoring the benchmark to human efficiency is central to its purpose: to measure learning efficiency against a human reference rather than reward occasional superhuman spikes. The foundation has also warned that allowing too much benchmark‑specific engineering can saturate a test quickly and make it less informative.

That position puts ARC at odds with teams emphasizing agent frameworks as the real frontier. Symbolica maintains its harness is a general‑purpose approach meant to work across domains, not code tuned only for ARC‑AGI‑3. Seed IQ supporters make a similar argument: real deployments will nearly always combine models, tools and control logic—and benchmarks should reflect that.

There is also a fairness dispute. ARC’s public materials indicate that “verified” evaluations focus on base models and general‑purpose APIs, particularly from major providers. Critics argue large labs can run sophisticated internal harnesses behind closed APIs while open harnesses may face greater scrutiny.

Why the fight matters beyond the leaderboard

The controversy extends well beyond bragging rights. Benchmarks like ARC increasingly shape how companies, investors and governments talk about AI capability.

ARC scores already appear in model announcements and analyst reports assessing which firms lead in “reasoning.” Safety researchers and some policy proposals have floated the idea of using scores on “AGI‑relevant” benchmarks as triggers for additional oversight.

A benchmark that reads “humans 100%, AI <1%” creates a very different narrative about capability, timelines and risk than one where AI systems—paired with sophisticated wrappers—appear to match human efficiency on substantial slices of the task suite.

For now, ARC‑AGI‑3 is only days old. The Kaggle competition, which offers a $700,000 grand prize for a 100% score, is expected to run for months—leaving time for rule clarifications, official submissions of agent systems, and published results from major labs.

What ARC‑AGI‑3 has already made clear is that no single number will settle where AI stands. The same benchmark producing “below 1%” under one protocol and “36%” or “95% on some games” under others highlights how much interpretation lies between a test and the story told about it.

As AI systems grow more capable—and more embedded in daily life—the fight over ARC‑AGI‑3 points to a broader challenge: agreeing not just on how to build intelligent machines, but on how, and by whom, their intelligence should be measured.

Tags: #ai, #benchmarks, #arcagi3, #agents, #kaggle