

Poster
in
Workshop: SCOPE: SCALABLE OPTIMIZATION FOR EFFICIENT AND ADAPTIVE FOUNDATION MODELS

Layer Normalization Improves Length Generalization

Ruining Li · Gabrijel Boduljak · Jinghao Zhou

Keywords: [ Transformers ] [ attention ] [ long-context generalisation ]


Abstract: It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are genuine _reasoning_ engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a _vanishing variance_ perspective on this issue. To the best of our knowledge, we are the first to demonstrate that, even for today's frontier models, longer sequence lengths lead to lower variance in the outputs of the multi-head attention modules. On the $\operatorname{argmax}$ retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention output leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
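The effect described in the abstract can be illustrated with a minimal sketch (not the authors' code): with random inputs and a standard PyTorch multi-head attention layer, the variance of the attention output tends to shrink as the sequence length grows, while a layer normalization applied after the attention output keeps the output scale roughly constant. The model dimensions, sequence lengths, and random inputs below are illustrative assumptions.

```python
# Illustrative sketch: attention-output variance vs. sequence length,
# with and without a LayerNorm applied after the attention output.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
post_ln = nn.LayerNorm(d_model)  # layer norm placed after the attention output

for seq_len in [16, 64, 256, 1024]:
    x = torch.randn(8, seq_len, d_model)          # batch of random token embeddings
    out, _ = attn(x, x, x, need_weights=False)    # multi-head attention output
    print(f"len={seq_len:5d}  raw var={out.var().item():.4f}  "
          f"post-LN var={post_ln(out).var().item():.4f}")
```

Under this setup, the raw output variance typically decreases as the sequence length grows, whereas the post-LayerNorm variance stays close to 1, which is the intuition behind the proposed fix.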
