
Stochastic Gradient Descent (SGD’s) Frequency Bias and How Adam Fixes It
Modern language models are trained on data with extremely uneven token distributions. A small number of words appear in almost every sentence, while many rare but meaningful tokens occur only occasionally. This creates a hidden optimization challenge: parameters associated with common tokens...






