Skip to yearly menu bar Skip to main content


Poster

On the Reliability of Watermarks for Large Language Models

John Kirchenbauer · Jonas Geiping · Yuxin Wen · Manli Shu · Khalid Saifullah · Kezhi Kong · Kasun Fernando · Aniruddha Saha · Micah Goldblum · Tom Goldstein

Halle B #244
[ ] [ Project Page ]
Thu 9 May 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract: As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. _Watermarking_ is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: How reliable is watermarking in realistic settings in the wild? There, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections when enough tokens are observed. For example, after strong human paraphrasing the watermark is detectable after observing 800 tokens on average, when setting a $1\mathrm{e}{-5}$ false positive rate. We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.

Live content is unavailable. Log in and register to view live content