Two (narrow) heads are better than (an arbitrarily wide) one
Abstract
In this paper, we establish a dimension- and precision-independent impossibility result for a simplified transformer model. Due to their size, a comprehensive understanding of the internal operations of frontier large language models (LLMs) is beyond the reach of current methods, but research into small and interpretable models has proven successful. We study the representational limits of attention, the core of transformer models, through the lens of the Endpoint Selection Problem (ESP), a simple yet expressive learning task defined over the arcs of a directed graph. Our main theoretical results are twofold: (i) 1-head, 1-layer, attention-only transformers cannot solve ESP on any graph containing a cycle, even with unbounded dimension and precision, whereas all DAGs (directed acyclic graphs) are solvable with zero error; and (ii) in contrast, a 2-head, 1-layer, attention-only transformer can solve ESP on arbitrary directed graphs with constant embedding dimension and logarithmic precision. Prior lower bounds were conditional on bounds on dimension and precision. Through a transformation, we extend our impossibility result from ESP to the much-studied 2-hop induction head problem. Further, we uncover a surprising connection to NP-completeness by showing that the optimal error of the 1-head transformer is exactly determined by the size of the MAS (maximum acyclic subgraph) and is hence inapproximable. Finally, we validate our theory with experiments and observe that gradient-based optimization can reliably find 1-head solutions for DAGs and 2-head solutions for arbitrary graphs with cycles, whereas 1-head models struggle to reach the optimal solution on graphs with cycles. We believe that our techniques are of independent interest and have the potential to establish a new fine-grained hierarchy of transformer architectures, each with greater problem-solving power than the last.