In FP16 mixed precision training, activations and gradients are stored as 16-bit floats. The issue: gradients often become too small to represent in FP16’s limited dynamic range (smallest normal value ~6.10e-5, smallest subnormal ~5.96e-8). When underflow happens, gradients become zero, and training stops learning.
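To make the underflow concrete, here is a minimal NumPy sketch. The gradient value `1e-8` and the scale `2**16` are illustrative assumptions, not values from any particular training run. It shows a small gradient flushing to zero when cast to FP16, and the same gradient surviving when it is scaled up before the cast and unscaled afterwards in FP32, which is the basic idea behind loss scaling.

```python
import numpy as np

# Smallest positive *normal* FP16 value (2**-14, about 6.10e-5).
fp16_min_normal = float(np.finfo(np.float16).tiny)

# Hypothetical gradient, chosen to be far below what FP16 can represent.
grad = 1e-8

# Direct cast: the value underflows and becomes exactly zero.
grad_fp16 = np.float16(grad)

# Loss scaling (assumed scale of 2**16): scale up, cast to FP16,
# then unscale in FP32 so the small value is not lost again.
scale = 2.0 ** 16
scaled_fp16 = np.float16(grad * scale)
recovered = np.float32(scaled_fp16) / np.float32(scale)

print("FP16 min normal:", fp16_min_normal)
print("direct FP16 cast:", grad_fp16)
print("scaled, cast, unscaled:", recovered)
```

In a real training loop the scale is applied to the loss, so every gradient is scaled by the chain rule, and frameworks usually adjust the scale dynamically rather than fixing it.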
Many have reported switching from FP16 + loss scaling to BF16 and gaining loss-scaling-free training.
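The reason BF16 can do without loss scaling is its exponent width: it keeps the 8 exponent bits of FP32 (with only 7 mantissa bits), whereas FP16 has 5 exponent bits. The sketch below illustrates this under stated assumptions: NumPy has no native BF16 dtype, so `to_bf16_truncate` is a hypothetical helper that emulates BF16 by keeping the top 16 bits of each FP32 value. Real hardware typically rounds to nearest rather than truncating, and the sample gradient values are made up.

```python
import numpy as np

def to_bf16_truncate(x):
    """Emulate BF16 by zeroing the low 16 bits of each float32 value.

    This is truncation (round toward zero), a simplification of what
    hardware does, but it preserves BF16's exponent range exactly.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Hypothetical gradient magnitudes, from tiny to ordinary.
grads = np.array([1e-8, 1e-3, 1.0], dtype=np.float32)

as_fp16 = grads.astype(np.float16)   # 5 exponent bits: 1e-8 underflows
as_bf16 = to_bf16_truncate(grads)    # 8 exponent bits: 1e-8 survives

print("FP16:", as_fp16)
print("BF16 (emulated):", as_bf16)
```

The trade-off is precision: the emulated BF16 values carry a relative error of up to about 2^-7, which is coarser than FP16 but keeps small gradients nonzero without any scaling.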
The biggest driver of "scaling-free" training is hardware evolution.