VERIFY: A Novel Multi-Domain Dataset Grounding LTL in Contextual Natural Language via Provable Intermediate Logic
Abstract
Bridging the gap between the formal precision of system specifications and the nuances of human language is critical for reliable engineering, robotics, and AI safety, yet it remains a major bottleneck. Prior efforts to ground formal logic are fragmented, yielding datasets that are small-scale (~2-5k examples), domain-specific, or that translate logic into overly technical forms rather than context-rich natural language (NL), and thus fail to adequately bridge formal methods and practical NLP. To address this gap, we introduce VERIFY, the first large-scale dataset meticulously designed to unify these elements. VERIFY contains more than 200,000 rigorously generated triplets, each comprising a Linear Temporal Logic (LTL) formula, a structured, human-readable 'Intermediate Technical Language' (ITL) representation designed as a bridge between logic and text, and a domain-specific NL description contextualized across 13 diverse domains. VERIFY's construction pipeline ensures high fidelity: LTL formulas are enumerated and verified via model checking, mapped to the novel ITL representation using a provably complete formal grammar, and then translated into context-aware NL via LLM-driven generation. We guarantee data quality through extensive validation protocols, including manual expert verification of 10,000 diverse samples; automated semantic consistency checks judged by Llama 3.3 further confirm an estimated >97% semantic correctness. Initial experiments demonstrate VERIFY's scalability, logical complexity, and contextual diversity, which significantly challenge standard models such as T5 and Llama 3.