

Poster session B — ICLR 2025 Workshop on GenAI Watermarking (WMARK)

Machine never said that: Defending spoofing attacks by diverse fragile watermark

Yuhang Cai · Yaofei Wang · Donghui Hu · Chen Gu


Abstract:

Misuse of large language models (LLMs) has intensified the need for robust generated-text detection through watermarking. Existing watermarking methods prioritize robustness but remain vulnerable to spoofing attacks, where modified text retains detectable watermarks, falsely attributing malicious content to the LLM. We propose the Multiple-Sampling Fragile Watermark (MSFW), the first framework to integrate local fragile watermarks to defend against such attacks. By embedding context-dependent watermarks through a multiple-sampling strategy, MSFW enables two critical detection capabilities: (1) modification detection via localized watermark fragility, where any modification disrupts adjacent watermarks and is revealed through localized watermark extraction; (2) generated-text detection using the unaffected global watermark. Moreover, our watermarking method is unbiased and improves output diversity through the multiple-sampling strategy. This work bridges the gap between robustness and fragility in LLM watermarking, offering a practical defense against spoofing attacks without compromising utility.
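To make the idea concrete, here is a minimal, hedged sketch of a context-dependent fragile watermark in the spirit the abstract describes. This is not the authors' MSFW algorithm; the functions `embed` and `detect_local`, the sliding `window`, and the hash-parity rule are all illustrative assumptions. At each generation step we sample several candidate tokens and prefer one whose hash with the preceding context carries the watermark bit; because each bit depends on nearby tokens, editing one token perturbs only the watermark bits in its local neighborhood, which is what enables localized modification detection.

```python
import hashlib

def bit(context: str, token: str) -> int:
    """Pseudo-random bit derived from the local context and a candidate token."""
    h = hashlib.sha256((context + "|" + token).encode()).digest()
    return h[0] & 1

def embed(candidates_per_step, window: int = 3):
    """At each step, pick a sampled candidate whose hash bit is 1.
    The bit depends on the last `window` tokens, so the watermark is
    context-dependent and therefore fragile to local edits.
    Falls back to the first candidate if no candidate carries the bit."""
    tokens = []
    for candidates in candidates_per_step:
        context = " ".join(tokens[-window:])
        chosen = next((t for t in candidates if bit(context, t) == 1),
                      candidates[0])
        tokens.append(chosen)
    return tokens

def detect_local(tokens, window: int = 3):
    """Per-position watermark bits; an edit at position i can only flip
    bits at positions i..i+window, localizing the modification."""
    return [bit(" ".join(tokens[max(0, i - window):i]), t)
            for i, t in enumerate(tokens)]
```

A tampered token disturbs only the watermark bits whose context window covers it, so a run of zero bits flags the edited region while bits elsewhere, and hence the global watermark statistic, remain intact. A real scheme would additionally need an unbiased sampling rule over the model's true token distribution, which this sketch does not model.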
