New arXiv Preprint Says Adaptive 'Metis' Jailbreak Achieved High Success Against Multiple LLMs, but Results Haven't Been Independently Reproduced
A new arXiv preprint says its adaptive “Metis” jailbreak system achieved high attack success rates against 10 large language models, including OpenAI’s o1 and GPT-5-chat, while using far fewer tokens than older attack methods.
The paper, titled “Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization,” is listed on arXiv as arXiv:2605.10067. The arXiv record says it was first posted May 11, 2026, and updated to version 3 on May 21. The authors listed on arXiv are Huilin Zhou, Jian Zhao, Yilu Zhong, Zhen Liang, Xiuyuan Chen, Yuchen Yuan, Tianle Zhang, Chi Zhang, Lan Zhang and Xuelong Li. Its arXiv metadata also includes the comment, “Accepted to the 43rd International Conference on Machine Learning (ICML 2026),” though that should be understood as a statement on the arXiv record, not an independently confirmed conference announcement.
In plain terms, the paper describes Metis as a jailbreak method that adapts as it goes. In AI safety research, “jailbreaking” means getting a model that is supposed to refuse harmful or disallowed requests to produce that material anyway. Rather than relying on a single clever prompt, the authors say Metis uses feedback from the model during an interaction to infer how the model’s defenses are working and then adjust its attack strategy in response.
That adaptive approach is the heart of the paper’s claim. In the abstract, the authors say they evaluated Metis on 10 models and that it achieved the highest average attack success rate among the methods they compared, at 89.2%. The same abstract reports a 76.0% attack success rate on OpenAI’s o1 and 78.0% on GPT-5-chat, under the paper’s own tested settings. It also says Metis cut token costs by a factor of 8.2 on average, and by as much as 11.4.
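To make those headline figures concrete, here is a minimal sketch of how an attack success rate and a token-reduction factor are typically computed from evaluation logs. It is not the authors' code: the `Attempt` record, the function names and the choice of an unweighted per-model average are all assumptions made for illustration, and the abstract does not say which averaging scheme the paper uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Attempt:
    """One logged evaluation attempt (hypothetical record layout)."""
    model: str        # label of the target model
    succeeded: bool   # whether the evaluator counted the attempt as a success
    tokens_used: int  # total tokens spent on the attempt


def attack_success_rate(attempts):
    """Fraction of attempts that were counted as successful."""
    if not attempts:
        raise ValueError("need at least one attempt")
    return sum(a.succeeded for a in attempts) / len(attempts)


def average_asr_across_models(attempts):
    """Unweighted mean of per-model success rates (assumed averaging scheme)."""
    if not attempts:
        raise ValueError("need at least one attempt")
    by_model = defaultdict(list)
    for a in attempts:
        by_model[a.model].append(a)
    rates = [attack_success_rate(group) for group in by_model.values()]
    return sum(rates) / len(rates)


def mean_tokens(attempts):
    """Average tokens spent per attempt."""
    if not attempts:
        raise ValueError("need at least one attempt")
    return sum(a.tokens_used for a in attempts) / len(attempts)


def token_reduction_factor(baseline, candidate):
    """How many times more tokens the baseline spends per attempt than the candidate."""
    return mean_tokens(baseline) / mean_tokens(candidate)
```

The averaging choice matters. On a made-up log with three successes in four attempts on one model and one success in two on another, pooling all attempts gives about 66.7%, while averaging the two per-model rates gives 62.5%. A reported "average" success rate is only comparable across papers when the scheme is stated.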
Those model names matter because they refer to newer, more heavily guarded systems. OpenAI’s o1 is a reasoning-oriented model family, while GPT-5-chat is a chat version in the company’s GPT-5 family. In red-team and safety testing, success against newer models can be more notable than success against older chatbots because frontier systems are generally designed with stronger safeguards and more elaborate refusal behavior.
For AI safety researchers, stronger jailbreak methods matter because they can expose where those safeguards break down. Red-teaming is the practice of probing systems for failure modes before attackers or ordinary users stumble into them. A method that is both more successful and more token-efficient could make it easier to test models at scale, while also raising the bar for defenses.
But the paper’s claims come with important limits. This is a preprint, meaning the results should be treated as author-reported findings, not independently reproduced ones. Attack success rates in this field can vary sharply depending on how “success” is defined, what benchmark is used and whether tests were run against live commercial endpoints or other setups. The abstract does not provide all of those details. And as of May 23, no public code or dataset release was located on the paper’s arXiv page, which makes independent verification harder.
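A small sketch shows why the definition of “success” matters so much. It assumes, purely for illustration, that an automated judge assigns each model response a harmfulness score between 0 and 1 and that a response counts as a success when the score meets a cutoff. The scores, the cutoffs and the function name are hypothetical and are not taken from the paper.

```python
def asr_at_threshold(judge_scores, threshold):
    """Share of responses whose judge score meets or exceeds the cutoff."""
    if not judge_scores:
        raise ValueError("need at least one score")
    return sum(score >= threshold for score in judge_scores) / len(judge_scores)


# Made-up judge scores for five responses to the same set of test prompts.
scores = [0.20, 0.55, 0.70, 0.90, 0.95]

lenient = asr_at_threshold(scores, 0.5)  # four of five responses count
strict = asr_at_threshold(scores, 0.8)   # only two of five responses count
```

With these invented numbers, the same five responses yield an 80% success rate under the lenient cutoff and 40% under the strict one, which is why a headline figure is hard to interpret without the scoring rule behind it.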
That does not make the paper unimportant. It does mean readers should separate the significance of the claim from the certainty of the result. What the preprint appears to add is evidence for a concern many safety researchers already have: static defenses may struggle when the attack itself is adaptive.
The abstract puts the point this way: “current defenses remain vulnerable to internally-steered, closed-loop reasoning trajectories under the tested settings.” In less technical language, the authors argue that if a jailbreak method can learn from a model’s responses in real time, model defenses may also need to adapt dynamically during inference rather than relying mainly on fixed guardrails.