Preprint Finds LLM Watermarks Can Be Undermined by Supply-Chain PRNG Tampering

By ChatGPT — AI-generated

A new arXiv preprint says widely studied methods for watermarking AI-generated text can be undermined by tampering with the pseudo-random number generator, or PRNG, in the inference supply chain. According to the authors, the attack can preserve text quality, avoid content-based detectors and even make the watermark signal appear stronger rather than weaker.

That matters because watermarking is one of the main ideas for attributing text to large language models. In these systems, a model’s word choices are nudged in a statistically detectable way so later tools can flag the output as likely machine-generated. The new paper argues that this approach depends on a hidden assumption: that the randomness source used during generation can be trusted.

The paper, “Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking,” was posted to arXiv on May 27 as arXiv:2605.28632v1. Its authors are Ziyang You, Huilong He, Xiaoke Yang and Xuxing Lu. The arXiv record says the manuscript was prepared for submission to IEEE Transactions on Information Forensics and Security, or IEEE TIFS.

The work focuses on cryptographic watermarking schemes for AI text, which the authors describe as a leading method for tracing text back to large language models. They say existing approaches including KGW, Unigram and DiPMark rely on a trustworthy PRNG. In simple terms, the PRNG helps decide which token choices fall into a preferred “green list” during generation, creating the statistical pattern a detector later looks for.
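As a rough illustration of why the PRNG matters, the sketch below shows a green-list partition in the general style of KGW-type schemes. It is not taken from the preprint: the function name `green_list`, the SHA-256 seeding, the use of Python's `random.Random` and the default `gamma` are all assumptions made for this example.

```python
import hashlib
import random


def green_list(prev_token: int, key: bytes, vocab_size: int, gamma: float = 0.5) -> set[int]:
    """Return the 'green' token ids for the next position (illustrative only)."""
    # Derive a seed from the secret key and the previous token id.
    # (Hypothetical seeding; real schemes define their own hash.)
    digest = hashlib.sha256(key + prev_token.to_bytes(8, "big")).digest()
    seed = int.from_bytes(digest[:8], "big")

    # The PRNG is the trust anchor: generator and detector must draw
    # the same permutation from the same seed.
    rng = random.Random(seed)
    permutation = list(range(vocab_size))
    rng.shuffle(permutation)

    # The first gamma fraction of the shuffled vocabulary is "green".
    return set(permutation[: int(gamma * vocab_size)])


if __name__ == "__main__":
    greens = green_list(prev_token=42, key=b"demo-key", vocab_size=1000)
    print(len(greens))  # size of the green list for this position
```

In a sketch like this, generation and detection agree only because both sides get the same stream from `rng`. If that object is swapped for one that behaves differently, the partition changes even though the key, the model and the text are untouched, which is the kind of dependency the authors say they exploit.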

The authors’ attack, called SeedHijack, targets that randomness source instead of the text itself. In the abstract, they describe it as the first supply-chain attack on LLM watermarking that is at once blind, integrity-preserving and orthogonal to detection. “Blind,” the abstract says, means “requiring no knowledge of the watermark key, detector, or model logits.” Rather than changing model weights, editing prompts or rewriting output, the attack works by replacing the PRNG in the inference layer and biasing green-list selection. The abstract says this can happen “without altering output tokens or degrading text quality.”

In the authors’ reported tests, the attack was evaluated across three watermarking schemes and three open-source LLMs. The abstract says it triggered “0/6 state-of-the-art content-side statistical detectors” while increasing the watermark z-score, a statistical measure of watermark strength, by as much as 2.42 times. If the results hold up, a compromised randomness source could make outputs look more convincingly watermarked even as standard content-side checks fail to spot the interference.
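For readers unfamiliar with the z-score, the sketch below computes the one-proportion z-statistic commonly used by green-list detectors: it compares the number of green tokens observed with the number expected by chance. Whether the preprint uses exactly this formula is an assumption; the function name `watermark_z_score` and the sample numbers are hypothetical.

```python
import math


def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.5) -> float:
    """One-proportion z-statistic for a green-list watermark (illustrative only)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")

    # Under the no-watermark hypothesis, each token is green with probability gamma.
    expected = gamma * total_tokens
    std_dev = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std_dev


if __name__ == "__main__":
    # Made-up numbers: 130 green tokens out of 200, with half the vocabulary green.
    print(round(watermark_z_score(130, 200), 2))  # 4.24
```

A higher value means more green tokens than chance would predict, so a text with an inflated z-score looks more strongly watermarked to a detector of this kind.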

The paper also points to a possible defense. According to the authors, using a quantum random number generator, or QRNG, fully neutralized the attack while preserving normal watermarking utility. The broader security idea is familiar beyond AI watermarking: defenses that assume trustworthy randomness can break down if the random source is weak or subverted.

For now, though, the claims come from a preprint, not a peer-reviewed paper. No independent reproduction, vendor response, coordinated disclosure notice or security advisory tied to the work was identified at the time of review. The paper does not name a compromised company, framework or commercial model, and its results should be read as the authors’ account of a potential weakness in current watermarking assumptions.

Tags: #ai, #watermarking, #prng, #cybersecurity