Almost Bayesian: Dynamics of SGD Through Singular Learning Theory
Max Hennick · Stijn De Baerdemacker
Abstract
The nature of the relationship between Bayesian sampling and stochastic gradient descent (SGD) in neural networks has been a long-standing open question in the theory of deep learning. We shed light on this question by modeling the long-runtime behaviour of SGD as diffusion in porous media. Using singular learning theory, we show that the late-stage dynamics are strongly influenced by the degeneracies of the loss surface. From this, we show that, under reasonable hyperparameter choices for vanilla SGD, the local steady-state distribution of SGD (if it exists) is effectively a tempered version of the Bayesian posterior over the weights, one which accounts for local accessibility constraints.
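For concreteness, a minimal sketch of the object the abstract invokes: in singular learning theory, the tempered Bayesian posterior at inverse temperature $\beta$ is conventionally written

$$ p_\beta(w) \;=\; \frac{\varphi(w)\, e^{-n\beta L_n(w)}}{Z_n(\beta)}, \qquad Z_n(\beta) = \int \varphi(w')\, e^{-n\beta L_n(w')}\, dw', $$

where $L_n$ is the empirical loss on $n$ samples, $\varphi$ is the prior over the weights $w$, and $Z_n(\beta)$ is the normalizing constant. This notation is standard in the singular learning theory literature but is not fixed by the abstract itself; the claim above is that SGD's local steady state approximates $p_\beta$, with $\beta$ set by the hyperparameters, subject to local accessibility constraints.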