Preprint Says AI Model SU-01 Reached ‘Gold-Medal-Level’ on Olympiad Problems — Results Not Independently Certified
A new technical report on arXiv says a model called SU-01 can reach “gold-medal-level” performance on elite math and physics Olympiad problems, but the claim comes from the paper’s authors in a preprint, not from an officially certified competition result.
The paper, “Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling,” was submitted to arXiv on May 13, 2026, as a 77-page technical report listed as arXiv:2605.13301v1. The arXiv entry lists Runzhe Zhan as the submitter contact. Because it is a preprint, the arXiv listing gives no indication that the work has been peer reviewed or published in a journal or conference.
According to the abstract, the researchers trained SU-01 from what they describe as a “30B-A3B backbone” and present it as “a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver.” The central claim in the abstract is that the system is “achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025.”
If borne out, that would matter because Olympiad math and physics problems are among the hardest public tests of step-by-step reasoning. The broader race in AI research has increasingly focused on whether models can do more than answer short benchmark questions — specifically, whether they can sustain long, structured solutions similar to contest proofs and physics derivations.
The authors say they got there with a relatively straightforward three-part training recipe. In plain terms, the abstract describes: supervised fine-tuning (additional training on worked examples) using a “reverse-perplexity curriculum”; a two-stage reinforcement learning process that starts with verifiable rewards and then shifts to proof-level reinforcement learning; and test-time scaling, meaning the model uses more computation while generating answers.
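The abstract does not spell out how the “reverse-perplexity curriculum” orders data, so the sketch below is only one plausible reading: score each training trajectory by its perplexity under a reference model and present higher-perplexity (harder) examples first. The sort direction, the `perplexity` helper, and the trajectory format are all assumptions made for illustration, not details from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over a trajectory's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(trajectories, max_tokens=8000):
    """Keep trajectories within a token budget, then order by descending
    perplexity. Hardest-first is an assumed reading of 'reverse' here."""
    kept = [t for t in trajectories if len(t["logprobs"]) <= max_tokens]
    return sorted(kept, key=lambda t: perplexity(t["logprobs"]), reverse=True)

# Three toy trajectories with per-token log-probs from a scoring model.
data = [
    {"id": "easy",   "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "medium", "logprobs": [-0.8, -0.9, -1.0]},
    {"id": "hard",   "logprobs": [-2.0, -2.5, -2.2]},
]

curriculum = reverse_perplexity_order(data)
print([t["id"] for t in curriculum])  # highest-perplexity trajectory first
```

Whatever the paper's exact rule, the general idea of a perplexity curriculum is the same: use a cheap model-based difficulty score to decide the order in which fine-tuning examples are shown.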
The abstract says training used about 340,000 trajectories shorter than 8,000 tokens, followed by 200 reinforcement learning steps. It also says the model “supports stable reasoning on difficult problems with trajectories exceeding 100K tokens,” suggesting the system can sustain extremely long chains of text while working through a single solution.
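The “test-time scaling” the authors describe is, in the broader literature, often implemented as something like best-of-n sampling: spend extra compute at answer time by drawing several candidate solutions and keeping the one a verifier scores highest. The sketch below illustrates that generic pattern only; it is not a claim about SU-01’s specific mechanism, and the `sample_candidate` and `verifier_score` stubs stand in for a real model and verifier.

```python
import random

def sample_candidate(problem, rng):
    """Stand-in for a model call: returns a (solution, quality) pair.
    Quality is random noise here; a real system would generate text."""
    quality = rng.random()
    return f"candidate solution to {problem!r}", quality

def verifier_score(solution, quality):
    """Stand-in for a learned or rule-based verifier."""
    return quality

def best_of_n(problem, n, seed=0):
    """Draw n candidates and keep the best-scoring one. Larger n means
    more test-time compute and a better expected score."""
    rng = random.Random(seed)
    candidates = [sample_candidate(problem, rng) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(*c))

_, score_1 = best_of_n("olympiad problem", n=1)
_, score_16 = best_of_n("olympiad problem", n=16)
print(score_1 <= score_16)  # more samples can only raise the best score
```

The design point is simple: with a fixed seed, the best of 16 draws includes the single draw as one of its candidates, so the max can only stay equal or improve, which is why test-time compute trades directly against answer quality in this family of methods.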
Still, the key point for readers is verification. The paper’s performance claims are author claims presented in an arXiv preprint. Based on the available arXiv information, there is no indication that Olympiad organizers independently certified the reported scores. That distinction matters in this corner of AI research, where evaluation can depend on how answers are formatted, scored and selected.
As a comparison point, DeepMind separately said its 2025 International Mathematical Olympiad result was officially graded by IMO coordinators before describing it as gold-medal standard. This SU-01 preprint, based on the arXiv entry, does not make that kind of certification claim.
Even with that caveat, the report fits a clear trend from 2024 through 2026: research groups are pushing models toward Olympiad-style reasoning by combining supervised fine-tuning, reinforcement learning, verifiable rewards and extra inference-time compute. The significance is not that AI has definitively “won” Olympiads, but that labs are increasingly using these contests to argue their systems can handle very long, rigorous reasoning.