The Softmax Bottleneck Does Not Limit the Probabilities of the Most Likely Tokens
Ronen Basri · David Jacobs
Abstract
In many popular transformer architectures, an output projection matrix linearly maps lower-dimensional embeddings into a higher-dimensional space of logits. It has been shown that this leads to a {\em softmax bottleneck} that prevents the production of arbitrary probability distributions. It has been argued that this limits large language models (LLMs) in their ability to express next-token probabilities that perfectly align with the statistics of natural language. We focus on the ability of such models to produce accurate probabilities for just the top-$m$ tokens. We provide theoretical bounds showing that even a randomly initialized projection matrix can successfully do this for rather large values of $m$, supported by empirical results on random and trained matrices. This suggests that the softmax bottleneck does not significantly limit the capabilities of LLMs. We also derive upper bounds on the maximal value of $m$ for which this is possible given an embedding dimension, which constrain the achievable performance of any trained matrix.
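For intuition, the following is a minimal numerical sketch of the phenomenon the abstract describes; it is not code from the paper, and the dimensions ($V=1000$, $d=64$, $m=32$), the squared-error loss, the random seed, and plain gradient descent are all illustrative assumptions. With a fixed random projection $W \in \mathbb{R}^{V \times d}$, we search for an embedding $h \in \mathbb{R}^d$ whose softmax output matches prescribed probabilities on the top-$m$ tokens, leaving all other tokens unconstrained.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, m = 1000, 64, 32  # vocabulary size, embedding dim, number of top tokens

# Fixed random output projection, mirroring the randomly initialized case.
W = rng.standard_normal((V, d)) / np.sqrt(d)

# Prescribed target probabilities for the top-m tokens (tokens 0..m-1 here),
# carrying 90% of the probability mass; the remaining 10% is unconstrained.
p = rng.random(m)
p = 0.9 * p / p.sum()

# Fit the embedding h by gradient descent on the squared error of the
# top-m softmax probabilities only.
h = 0.01 * rng.standard_normal(d)
lr = 100.0
for _ in range(20_000):
    z = W @ h
    z -= z.max()                      # shift logits for numerical stability
    q = np.exp(z)
    q /= q.sum()                      # q = softmax(W h)
    r = q[:m] - p                     # error on the constrained tokens
    # Gradient of sum_i r_i^2 w.r.t. the logits z, using the softmax
    # Jacobian dq_i/dz_k = q_i (delta_ik - q_k):
    g_z = -(2 * r * q[:m]).sum() * q
    g_z[:m] += 2 * r * q[:m]
    h -= lr * (W.T @ g_z)             # chain rule: dz/dh = W

print("max |q_i - p_i| over top-m:", np.abs(q[:m] - p).max())
```

In this regime ($m \le d$) the residual typically becomes very small, illustrating that the bottleneck does not prevent accurate top-$m$ probabilities here; how large $m$ can be pushed for a given embedding dimension is exactly what the paper's bounds quantify.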