Preprint Claims Perfect Scores by Multi‑Agent AI on Multiple Math Competitions; Results Unverified
A newly posted arXiv paper is making an unusually sweeping claim about AI math performance: that a multi-agent system called STAR-PólyaMath achieved perfect scores on several major competition-style benchmarks. But as of May 20, those results remained the authors’ own claims in a preprint, neither independently replicated nor externally confirmed.
What makes the paper notable is its breadth. The reported gains span multiple prestigious math tests at once, including the American Invitational Mathematics Examination, or AIME, for high school students, and the Putnam, a leading undergraduate competition. If confirmed, the results would mark a significant jump on benchmarks widely used to test mathematical reasoning in AI systems.
The paper, titled “STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision,” was posted to arXiv as arXiv:2605.19338v1. The arXiv record shows it was submitted Tuesday, May 19, at 04:20:43 UTC. It lists six authors: Jiaao Wu, Xian Zhang, Hanzhang Liu, Sophia Zhang, Fan Yang and Yinpeng Dong.
In the paper, the authors say STAR-PólyaMath reached state-of-the-art results on eight math competition benchmarks: AIME 2025, AIME 2026, MathArena Apex Shortlist, MathArena Apex 2025, Putnam 2025, IMO 2025, HMMT February 2026 and USAMO 2026. The abstract makes the headline claim directly: “It obtains perfect scores on AIMEs, Putnam, and HMMT.”
The abstract also says the system’s largest advantage over a baseline came on MathArena Apex 2025, where the authors report a score of 93.75% compared with 80.21% for GPT-5.5, a gap of 13.54 percentage points. GPT-5.5 is a publicly released OpenAI model family introduced in April 2026. MathArena is an active evaluation platform that tracks model performance on recent competition problems and is widely used in public comparisons.
The system described in the paper is a multi-agent setup rather than a single model prompt. According to the paper and the project’s public GitHub repository, STAR-PólyaMath uses a Python orchestrator and specialized agents for reasoning, verification and higher-level strategic oversight. The repository README says the framework “couples a Python orchestrator with three LLM agents — a Reasoner, a Verifier, and a persistent Meta-Strategist.”
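To illustrate the general pattern the README describes, the sketch below shows one way an orchestrator could loop a reasoner and a verifier under a supervisor that keeps notes between rounds. It is a hypothetical illustration, not the authors’ code: every name in it (MetaStrategist, orchestrate, toy_reasoner, toy_verifier) is an assumption, the control flow is a guess at the simplest version of such a loop, and the “agents” are plain Python functions rather than LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical sketch: none of these names come from the STAR-PolyaMath repository.


@dataclass
class MetaStrategist:
    """Supervisor that persists across rounds by accumulating critiques."""
    notes: List[str] = field(default_factory=list)

    def advise(self) -> str:
        # First round: no history yet, so give a generic hint.
        if not self.notes:
            return "try a direct approach"
        return "avoid earlier mistakes: " + "; ".join(self.notes)

    def record_failure(self, critique: str) -> None:
        self.notes.append(critique)


def orchestrate(
    problem: str,
    reasoner: Callable[[str, str], str],
    verifier: Callable[[str, str], Tuple[bool, str]],
    strategist: MetaStrategist,
    max_rounds: int = 3,
) -> Optional[str]:
    """Run reason -> verify rounds; return the first accepted answer or None."""
    for _ in range(max_rounds):
        hint = strategist.advise()
        attempt = reasoner(problem, hint)
        accepted, critique = verifier(problem, attempt)
        if accepted:
            return attempt
        strategist.record_failure(critique)
    return None


# Toy stand-ins for the LLM agents (assumptions for demonstration only).
def toy_reasoner(problem: str, hint: str) -> str:
    # Answers wrongly until the strategist passes along a critique.
    return "5" if hint.startswith("try") else "4"


def toy_verifier(problem: str, attempt: str) -> Tuple[bool, str]:
    if attempt == "4":
        return True, ""
    return False, f"{attempt} is not correct"


strategist = MetaStrategist()
answer = orchestrate("2 + 2", toy_reasoner, toy_verifier, strategist)
print(answer, strategist.notes)
```

In this toy run the first attempt is rejected, the supervisor records the critique, and the second attempt is accepted. The sketch captures only the division of labor among the three roles; how the actual system prompts its agents, shares state or decides when to stop is documented in the paper and repository, not here.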
That public repository, Julius-Woo/STAR-PolyaMath, is one reason the claims are likely to draw close attention. The repository exists and includes code as well as a problems directory with benchmark folders named AIME2025, AIME2026, Apex2025, ApexShortlist, HMMT2026, IMO2025, Putnam2025 and USAMO2026. Those are the same families of tests highlighted in the paper’s reported results.
At the same time, public code is not the same as public verification. The repository documents reproducibility constraints, including default model identifiers set to claude-opus-4.7 for the reasoner, verifier and meta-strategist roles. The setup also depends on GitHub Copilot CLI, with a bring-your-own-key option for routing to providers such as Azure, OpenAI and Anthropic.
As of May 20, no independent journalistic coverage, third-party replication or outside public confirmation of the paper’s headline benchmark results was identified in the available source material. That leaves STAR-PólyaMath as a high-profile and now more testable claim, but still an unverified one whose reported performance will need outside reproduction, including access to the same model backends and tooling described by the authors.