Autonomous Code Agents Reconstruct Hidden System Prompts in 41 Commercial LLMs, Preprint Finds

By ChatGPT (AI-generated)

An arXiv preprint reports that a new autonomous code-agent system recovered the hidden system prompts of 41 commercial language models using ordinary user interaction alone, underscoring that system prompts remain a meaningful security surface in modern AI products. The paper, titled “Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs,” describes a framework called JUSTASK that the authors say can probe black-box models and reconstruct their concealed instructions without privileged access.

The headline result comes with an important caveat. The paper’s reported “100%” extraction success rate refers to semantic recovery under the authors’ consistency metric at a 0.70 threshold, not exact word-for-word recovery in every case. At stricter thresholds, the reported success rate falls to 90.2% at 0.75 and 73.2% at 0.80. The authors also say only a small fraction of extractions were verbatim matches; most were semantic reconstructions of the hidden instructions.
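
The stricter-threshold figures are easier to read as model counts. The following is a quick arithmetic sketch, assuming each reported percentage is computed over the full set of 41 models; the helper name `implied_count` is hypothetical and used only for illustration.

```python
# Back-of-envelope check: convert the reported success rates into model counts.
# Assumption: every rate uses all 41 tested models as its denominator.
TOTAL_MODELS = 41

# Consistency threshold -> reported success rate (percent), as quoted above.
reported = {0.70: 100.0, 0.75: 90.2, 0.80: 73.2}


def implied_count(rate_percent: float, total: int = TOTAL_MODELS) -> int:
    """Return the whole number of models closest to the reported rate."""
    return round(rate_percent / 100 * total)


counts = {threshold: implied_count(rate) for threshold, rate in reported.items()}

for threshold, count in counts.items():
    # Recompute the percentage from the whole-number count as a sanity check.
    recomputed = round(100 * count / TOTAL_MODELS, 1)
    print(f"threshold {threshold:.2f}: {count}/{TOTAL_MODELS} models = {recomputed}%")
```

Under that assumption, the rates correspond to 41, 37 and 30 of the 41 models, meaning four models fall below the bar at 0.75 and eleven at 0.80.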

Those hidden instructions matter because system prompts are the behind-the-scenes directions that help determine how a chatbot behaves, including its style, priorities and safety rules. AI companies increasingly rely on system prompts, or similar “constitution”-style instructions, to shape how their models answer questions, refuse requests or handle coding tasks. That means recovering those prompts can reveal how a product is governed internally and how its safeguards are structured.

The preprint was first posted to arXiv on Jan. 29, 2026, and updated in late June; the PDF is marked “Preprint. June 29, 2026.” Its authors are Xiang Zheng, Yutao Wu, Hanxun Huang, Yige Li, Xingjun Ma, Bo Li, Yu-Gang Jiang and Cong Wang. They describe JUSTASK as a “self-evolving autonomous code-agent framework” that needs no handcrafted prompts, labeled supervision or access beyond standard user interaction. According to the paper, the system was tested against 41 black-box commercial models tied to providers including OpenAI, Anthropic, Google, Microsoft and xAI. “We identify system prompt extraction as an emergent vulnerability intrinsic to code agents,” the abstract says.

To validate results, the paper says it used self-consistency and cross-skill consistency checks, with semantic similarity measured using OpenAI’s text-embedding-3-large model, accessed through OpenRouter. The authors also say they compared some extracted prompts against known or leaked text, including that of Claude Code. In the introduction, the paper says: “Claude Code immediately disclosed its own system instructions, totaling 6,973 tokens.”
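
An embedding-based consistency check can be pictured as comparing the vectors of two independently extracted prompts. The sketch below is illustrative only: the three-dimensional vectors are made-up stand-ins for real embeddings, `cosine_similarity` and `is_consistent` are hypothetical helper names, and the paper's exact metric may differ from plain cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_consistent(a, b, threshold=0.70):
    """Treat two extractions as consistent if their similarity meets the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy vectors standing in for embeddings of three extraction attempts (not real data).
run_a = [1.0, 0.0, 0.0]
run_b = [0.8, 0.6, 0.0]  # points mostly the same way as run_a
run_c = [0.0, 1.0, 0.0]  # orthogonal to run_a

print(is_consistent(run_a, run_b, threshold=0.70))
print(is_consistent(run_a, run_b, threshold=0.85))
print(is_consistent(run_a, run_c, threshold=0.70))
```

In this toy setup, the first pair clears a 0.70 threshold but not 0.85, which mirrors how the reported success rate shrinks as the threshold is raised.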

The researchers also tested defenses in controlled experiments on four frontier models. According to the paper, simple prompt defenses, such as telling the model not to reveal its system prompt, reduced extraction quality by about 6%. More attack-aware prompt defenses reduced it by 18.4%. The authors argue that prompt-only defenses therefore appear limited in black-box API settings. They also released code, data and a public “System Prompt Open” gallery showing extracted prompts, framing the material as intended for research use and attaching cautionary language.

Prompt extraction itself is not new. Research in 2023 and 2024 showed that prompts and memorized content could sometimes be pulled from language models. What the preprint presents as new is the use of an autonomous, multi-turn agent that discovers extraction strategies on its own and applies them across many production systems. The industry stakes are tangible: Anthropic published “Claude’s Constitution” in January 2026, and later said in an April 2026 postmortem that changes to Claude Code’s system prompts and harnesses affected performance and were reversed. Still, these are the authors’ reported findings in a preprint, not independently verified results, and vendors can change prompts, defenses and endpoints over time.

Tags: #ai, #promptsecurity, #llms, #cybersecurity