Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
Ouxiang Li · Yuan Wang · Xinting Hu · Huijuan Huang · Rui Chen · Jiarong Ou · Xin Tao · Pengfei Wan · Xiaojuan Qi · Fuli Feng
Abstract
Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thus corresponding to two core capabilities: \textbf{\textit{composition}} and \textbf{\textit{reasoning}}. Despite recent advances of T2I models in both composition and reasoning, existing benchmarks remain limited in their evaluation: they fail to provide comprehensive coverage across and within the two capabilities, and they largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose \textbf{\textsc{T2I-CoReBench}}, a comprehensive and challenging benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene-graph elements (\textit{instance}, \textit{attribute}, and \textit{relation}) and reasoning around the philosophical framework of inference (\textit{deductive}, \textit{inductive}, and \textit{abductive}), yielding a 12-dimensional evaluation taxonomy. To increase complexity, in line with the inherent complexity of real-world scenes, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To enable fine-grained and reliable evaluation, we pair each evaluation prompt with a checklist of individual \textit{yes/no} questions, each assessing one intended element independently. In total, our benchmark comprises $1,080$ challenging prompts and around $13,500$ checklist questions. Experiments across 28 current T2I models reveal that their composition capability remains limited in high-density scenarios, while their reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.
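The checklist protocol described above reduces evaluation to aggregating binary judgments. A minimal sketch of this scoring scheme is shown below; the function names and data layout are illustrative assumptions, not the paper's released implementation.

```python
# Sketch of checklist-based scoring: each prompt carries a list of
# yes/no judgments (True = the intended element is present in the image).
# Names here are hypothetical, for illustration only.

def prompt_score(answers: list[bool]) -> float:
    """Fraction of checklist questions answered 'yes' for one prompt."""
    if not answers:
        raise ValueError("checklist must contain at least one question")
    return sum(answers) / len(answers)

def benchmark_score(per_prompt_answers: list[list[bool]]) -> float:
    """Mean checklist score over all evaluated prompts."""
    return sum(prompt_score(a) for a in per_prompt_answers) / len(per_prompt_answers)

# Example: one prompt fully satisfied, one half satisfied.
print(benchmark_score([[True, True], [True, False]]))  # 0.75
```

Scoring each element independently in this way avoids the all-or-nothing judgments of single holistic scores, which is what makes the evaluation fine-grained.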