When less gives more: bias from small datasets can speed up training
Abstract
As data scaling approaches its natural limits, recent work has explored the potential of data reuse, whose impact on optimization remains poorly understood. This work investigates how data repetition affects learning efficiency in terms of compute. We identify an inverse-scaling effect whereby using less data can lead to faster convergence; the effect is observed across algorithmic tasks, architectures, and optimizers, and it cannot be explained by prior theory. Instead, we argue that the speedup stems from appropriate layer-wise norm growth, which occurs faster when the dataset is smaller. We provide theoretical justification by analyzing the benefits of the sampling biases induced by small datasets, and we present extensive empirical evidence supporting this hypothesis. Together, our results highlight the potential for unlocking further efficiency gains by jointly considering different scaling axes.