
Publicly Verifiable LLM Watermark Detection
December 24, 2025
As large language models become more widespread, distinguishing AI-generated text from human-written content has become a real problem. Watermarking is one of the most promising ideas: during generation, the model is nudged so that its output contains a subtle statistical “signature” that a detector can later pick up.
But there’s a catch: most watermark detectors rely on a secret key. If you reveal the key, anyone can forge watermarks or tailor attacks. If you keep it secret, the public can’t verify your claim that “this text is watermarked.”
So the natural question is:
Can we prove that watermark detection was run correctly without revealing the secret key?
This post explains how I did exactly that for Unigram watermark detection, by building a STARK prover using StarkWare’s Stwo library:
Implementation: https://github.com/raphaelDkhn/zunigram
You’ll see how Unigram detection turns into an Algebraic Intermediate Representation (AIR), so that anyone can verify that an LLM output is watermarked, while the detector’s secret key stays hidden.
What we’re proving
Given a token sequence s = (t₀, …, t_{n-1}), we want to prove the statement:
- “There exists a secret key k such that, if we compute PRF_k(t_i) for every token and count how many land in the green set, the final green count exceeds a detection threshold.”
In STARK terms:
- Public inputs: the token sequence, the threshold parameters, and the final claimed result
- Private inputs: the secret key k
- Witness: all intermediate values (PRF outputs, per-token flags, running sums, etc.)
The Unigram Watermarking Scheme
The Unigram watermark introduced by Zhao et al. (2023) is appealingly simple. Unlike k-gram watermarks that depend on token context, Unigram uses a fixed partition of the vocabulary into “green” and “red” tokens.
Generation
A pseudorandom function (PRF) keyed by a secret k maps each token ID to green or red. The model’s sampler is biased to prefer green tokens, creating a small but detectable imbalance.
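To make the generation-side bias concrete, here is a minimal sketch (not part of this project's prover): a small bump delta is added to the logits of green tokens before sampling. The function name, delta, and the is_green closure are illustrative assumptions, not code from the repository.

```rust
/// Illustrative generation-side bias (not part of the prover): add a small
/// bump `delta` to the logits of green tokens before sampling, where
/// `is_green(token_id)` is decided by the keyed PRF.
fn bias_logits(logits: &mut [f32], delta: f32, is_green: impl Fn(usize) -> bool) {
    for (token_id, logit) in logits.iter_mut().enumerate() {
        if is_green(token_id) {
            *logit += delta; // slightly favor green tokens at sampling time
        }
    }
}
```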
Detection
Given suspect text tokens:
- For each token ID, compute PRF_k(token_id)
- Classify the token as green/red
- Count greens |s|_G
- Compare against the unwatermarked expectation
If the green fraction is γ (often γ = 0.5), then under the null hypothesis (unwatermarked text), the expected number of green tokens is γ·n. A common detection statistic is the z-score:
z = (|s|_G − γ·n) / √(n·γ·(1−γ))
If z is above a chosen threshold, we declare the text watermarked.
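As a plain (non-ZK) reference for this statistic, here is a minimal sketch in Rust; the function names and the threshold parameter are illustrative, not taken from the implementation.

```rust
/// Plain (non-ZK) z-score detection, shown for reference only.
/// `green_count` is |s|_G, `n` the number of tokens, `gamma` the green fraction.
fn z_score(green_count: usize, n: usize, gamma: f64) -> f64 {
    let expected = gamma * n as f64;
    let std_dev = (n as f64 * gamma * (1.0 - gamma)).sqrt();
    (green_count as f64 - expected) / std_dev
}

/// Declare the text watermarked when the z-score exceeds a chosen threshold.
fn is_watermarked(green_count: usize, n: usize, gamma: f64, z_threshold: f64) -> bool {
    z_score(green_count, n, gamma) > z_threshold
}
```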
Why prove detection in zero-knowledge?
In the normal setup, anyone verifying detection must either:
- See the secret key (bad: enables forging / key compromise), or
- Trust the detector (bad: unverifiable claims)
A proof fixes this: the detector can publish a cryptographic proof that:
- the PRF was computed correctly with some key k,
- the green count was computed correctly,
- and the detection threshold rule was applied correctly,
without revealing k.
A quick intro to Stwo
Stwo is StarkWare’s Rust implementation of a Circle STARK prover/verifier system. The mental model is:
- You describe your computation as a trace (a table of values).
- You write polynomial constraints (an AIR) that must hold across that trace.
- Stwo produces a STARK proof that “there exists a trace satisfying these constraints,” and the verifier checks it efficiently.
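To make the trace-plus-constraints mental model concrete, here is a toy sketch in plain Rust (it does not use the Stwo API): a trace is just a table of values, and the constraints are identities checked on every row and between adjacent rows.

```rust
// Toy illustration of the "trace + constraints" model (not the Stwo API).
// The trace encodes a running sum: column 1 accumulates column 0.
fn main() {
    let trace: Vec<[u64; 2]> = vec![[3, 3], [5, 8], [2, 10], [7, 17]];

    for i in 0..trace.len() {
        let [value, sum] = trace[i];
        if i == 0 {
            // Boundary constraint: the sum starts at the first value.
            assert_eq!(sum, value);
        } else {
            // Transition constraint: sum_i = sum_{i-1} + value_i.
            assert_eq!(sum, trace[i - 1][1] + value);
        }
    }
    println!("all constraints satisfied");
}
```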
Two Stwo features matter for this project:
1. Modular AIR composition
You can build separate AIR “components” (Poseidon2, range checks, Unigram accumulation) and connect them.
2. LogUp interactions (lookup arguments)
When one component produces pairs like (token, prf_output) and another component consumes them, LogUp enforces that both sides match—without wiring the entire thing into one monolithic trace.
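Roughly, and omitting multiplicities and the random linear combination of tuple entries, LogUp reduces "the consumed pairs equal the produced pairs as a multiset" to an equality of sums of inverse terms at a random challenge; the simplified identity below is a sketch, not Stwo's exact formulation.

```latex
% Simplified LogUp balance: two components agree on a multiset of pairs iff
% their sums of inverse terms match at a random challenge \alpha
% (multiplicities and the tuple-combining randomness are omitted here).
\sum_{v \in \text{produced}} \frac{1}{\alpha - v}
  \;=\;
\sum_{v \in \text{consumed}} \frac{1}{\alpha - v}
```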
Translating detection into AIR constraints
To prove Unigram detection, we need to constrain these operations:
- PRF evaluation per token: PRF_k(token_id)
- Green classification: is_green = 1 iff PRF output is in the green range
- Accumulation: running count of green tokens
- Threshold check: final count exceeds (or matches) the claimed threshold rule
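As a plain (non-ZK) reference, these four operations look like the loop below. The prf function is a stand-in placeholder for the Poseidon2-based PRF described next, and all names and constants are illustrative, not taken from the repository.

```rust
// Non-ZK reference of the computation the AIR constrains.
const M31: u64 = (1 << 31) - 1; // Mersenne-31 modulus

/// Placeholder PRF for illustration only (NOT Poseidon2, NOT secure).
fn prf(key: u64, token_id: u64) -> u64 {
    key.wrapping_mul(31)
        .wrapping_add(token_id)
        .wrapping_mul(2654435761)
        % M31
}

/// Count green tokens; the caller compares the result to the threshold rule.
fn green_count(key: u64, tokens: &[u64], green_threshold: u64) -> u64 {
    let mut count = 0;
    for &t in tokens {
        let prf_output = prf(key, t);                          // PRF per token
        let is_green = (prf_output < green_threshold) as u64;  // classification
        count += is_green;                                     // accumulation
    }
    count
}
```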
PRF choice: Poseidon2
For the PRF, I use Poseidon2, a zk-friendly hash. For each token ID, we compute:
prf_output = Poseidon2(k, token_id)
Poseidon2 component (AIR)
Conceptually, the Poseidon2 AIR:
- takes (k, token_id) as inputs,
- runs the Poseidon2 permutation rounds,
- outputs prf_output,
- and exposes (token_id, prf_output) through a LogUp table so that other components can consume it.
Each token requires one Poseidon2 invocation, which is the dominant cost.
Green classification via range check
A token is “green” if its PRF output, interpreted as a field element, falls below a threshold.
With γ = 0.5 over the Mersenne-31 field (p = 2³¹ − 1), the green threshold is half the field size:
is_green = 1 ⟺ prf_output < ⌊p / 2⌋ ≈ 2³⁰
This is implemented with a range-check / comparison component that enforces prf_output < threshold when is_green = 1, and prf_output ≥ threshold when is_green = 0.
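A minimal sketch of the classification rule, assuming the PRF output is reduced to a single Mersenne-31 field element (the constant names are illustrative):

```rust
// Green classification over the Mersenne-31 field (illustrative constants).
const M31_MODULUS: u32 = (1 << 31) - 1;       // p = 2^31 - 1
const GREEN_THRESHOLD: u32 = M31_MODULUS / 2; // gamma = 0.5  =>  ~2^30

/// Returns 1 if the PRF output falls in the green range, 0 otherwise.
/// In the AIR, this comparison is enforced by the range-check component.
fn classify_green(prf_output: u32) -> u32 {
    debug_assert!(prf_output < M31_MODULUS);
    (prf_output < GREEN_THRESHOLD) as u32
}
```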
The Unigram component
This is the component that:
- reads tokens,
- consumes the PRF outputs,
- computes is_green,
- and accumulates the running green count.
Trace structure
Main trace columns
- token: current token ID
- is_green: binary flag
- is_padding: binary flag for padding rows (to reach a power-of-two trace length)
- green_count: running sum of green tokens
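For orientation, the same layout as a hypothetical per-row Rust struct; the real trace stores these column-wise as field elements, and the struct name is illustrative.

```rust
/// Hypothetical per-row view of the main trace columns (the actual trace
/// stores them column-wise as Mersenne-31 field elements).
struct UnigramRow {
    token: u32,       // current token ID
    is_green: u32,    // 1 if PRF_k(token) lands in the green range, else 0
    is_padding: u32,  // 1 on rows added to pad the trace to a power of two
    green_count: u32, // running sum of is_green over non-padding rows
}
```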
Interaction (LogUp) columns
- values for consuming (token, prf_output) from the Poseidon2 component
- values for binding public inputs
Constraints
In an AIR, constraints are polynomial identities that must hold at every row (or at boundaries). Below are the core ones.
1. Boolean flags
is_green and is_padding must be bits:
is_green * (is_green - 1) = 0
is_padding * (is_padding - 1) = 0
2. Accumulator initialization
On the first row, initialize the counter. I use is_first as a selector that is 1 on the first row and 0 elsewhere:
// On first row: green_count = is_green (unless padding)
is_first * (green_count - is_green * (1 - is_padding)) = 0
3. Accumulator update
On all non-first rows, the running sum must update correctly:
(1 - is_first) * (green_count - green_count_prev - is_green * (1 - is_padding)) = 0
Meaning: each non-padding row adds is_green to the running count, while padding rows contribute zero.
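A quick sanity-check sketch in plain Rust (not Stwo): build the green_count column from is_green and is_padding, then verify that the boundary and transition constraints above hold on every row. The example values are arbitrary.

```rust
// Plain-Rust sanity check (not Stwo) of the accumulator constraints.
fn main() {
    let is_green:   [u64; 8] = [1, 0, 1, 1, 0, 1, 0, 0];
    let is_padding: [u64; 8] = [0, 0, 0, 0, 0, 0, 1, 1]; // last two rows pad to 2^3

    // Witness generation: running sum over non-padding rows.
    let mut green_count = [0u64; 8];
    for i in 0..8 {
        let contribution = is_green[i] * (1 - is_padding[i]);
        green_count[i] = if i == 0 { contribution } else { green_count[i - 1] + contribution };
    }

    // Constraint checks, written exactly as the AIR identities above.
    for i in 0..8 {
        let is_first = (i == 0) as u64;
        let prev = if i == 0 { 0 } else { green_count[i - 1] };
        // is_first * (green_count - is_green * (1 - is_padding)) = 0
        assert_eq!(is_first * (green_count[i] - is_green[i] * (1 - is_padding[i])), 0);
        // (1 - is_first) * (green_count - green_count_prev - is_green * (1 - is_padding)) = 0
        assert_eq!((1 - is_first) * (green_count[i] - prev - is_green[i] * (1 - is_padding[i])), 0);
    }
    println!("constraints hold; final green count = {}", green_count[7]); // 4
}
```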
4. Correct green classification
is_green must reflect the comparison prf_output < threshold. This is enforced by the range-check / comparison component described above.
Binding the public inputs
A verifier needs to know what statement is being proven. In this construction, public inputs include:
- the token sequence,
- the final green count,
- and the threshold parameters.
These are bound using additional LogUp interactions that force selected trace values to match the claimed public values.
Performance
Benchmarks on my machine (Apple M3 Pro):
| Tokens | Prove Time | Verify Time |
|---|---|---|
| 256 | ~16ms | ~644µs |
| 512 | ~30ms | ~607µs |
| 1024 | ~90ms | ~787µs |
| 2048 | ~320ms | ~1.12ms |
For context, a typical LLM response is ~200–500 tokens. That means we can generate a proof in tens of milliseconds and verify it in under a millisecond.
What the verifier learns (and what they don’t)
After verifying the proof, the verifier is convinced that:
- The prover knows a secret key k
- For the provided token sequence, applying PRF_k yields a green count exceeding the detection threshold
- Therefore the text is watermarked (with high probability)
The verifier does not learn:
- the secret key k
- which individual tokens were classified green vs. red
That’s the intended “zero-knowledge flavor”: the proof reveals the conclusion without exposing the key or per-token classifications.
Important note about “zero-knowledge” in Stwo
As of this writing, Stwo does not provide a true zero-knowledge mode. "Zero-knowledge" means the proof reveals nothing beyond the validity of the statement; Stwo, however, commits to witness values without hiding them (e.g., by adding randomness). Those commitments leak some information about the witness, which could in principle be combined with other information to infer witness values.
Code
The full implementation is here:
https://github.com/raphaelDkhn/zunigram
References
- Provable Robust Watermarking for AI-Generated Text — Zhao et al., ICLR 2024
- Stwo Prover — StarkWare’s Circle STARK implementation
- Stwo Documentation — guide to building AIRs with Stwo
- Provable Watermark Extraction — Ingonyama article on provable watermark extraction