Poster in Workshop: DMLR: Data-centric Machine Learning Research

On Data Quality and Speed of Training: Bad Data Slows Training

Newsha Ardalani · Mostafa Elhoushi · Carole-Jean Wu


Abstract:

While the roles of model architecture and hardware systems in training speed are well understood and appreciated, the role of data quality and quantity is often overlooked. In this paper, we quantify data quality along four dimensions, namely label correctness, information density, input coverage, and input-space resolution, and conduct a data-driven analysis of the impact of data quality on both training time and generalization error. In an ablation study spanning multiple domains (vision and text), datasets (Kaggle's Cat vs Dog, CIFAR10, CIFAR100, ImageNet, WikiText, and GLUE), and model architectures (MobileNet_v2, ResNet18, VGG19, ViT, BERT, and OPT), we show that poor data quality can slow training by one to two orders of magnitude.
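To make the label-correctness dimension concrete, the following minimal sketch (not the authors' code) flips a fraction of CIFAR-10 training labels at random and measures how many epochs a small model needs to reach a fixed test accuracy. The flip fractions, the ResNet18 backbone, the optimizer settings, and the 70% accuracy target are all illustrative assumptions, not values from the paper; the point is only that time-to-target grows as label quality degrades.

    # Hypothetical label-noise ablation: flip some CIFAR-10 labels and
    # record epochs needed to reach a target test accuracy.
    import random
    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as T
    from torchvision.models import resnet18

    def corrupt_labels(dataset, flip_fraction, num_classes=10, seed=0):
        # Replace a random subset of labels with uniformly random wrong labels.
        rng = random.Random(seed)
        targets = list(dataset.targets)
        for i in rng.sample(range(len(targets)), int(flip_fraction * len(targets))):
            targets[i] = rng.choice([c for c in range(num_classes) if c != targets[i]])
        dataset.targets = targets
        return dataset

    def epochs_to_accuracy(flip_fraction, target_acc=0.70, max_epochs=50):
        # Illustrative target_acc and epoch budget, not values from the paper.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        tfm = T.ToTensor()
        train = corrupt_labels(
            torchvision.datasets.CIFAR10("data", train=True, download=True, transform=tfm),
            flip_fraction)
        test = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=tfm)
        train_loader = torch.utils.data.DataLoader(train, batch_size=256, shuffle=True)
        test_loader = torch.utils.data.DataLoader(test, batch_size=512)
        model = resnet18(num_classes=10).to(device)
        opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
        loss_fn = nn.CrossEntropyLoss()
        for epoch in range(1, max_epochs + 1):
            model.train()
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
            # Evaluate on clean test labels; stop once the target is reached.
            model.eval()
            correct = 0
            with torch.no_grad():
                for x, y in test_loader:
                    x, y = x.to(device), y.to(device)
                    correct += (model(x).argmax(1) == y).sum().item()
            if correct / len(test) >= target_acc:
                return epoch
        return None  # target never reached within the epoch budget

    for frac in (0.0, 0.2, 0.4):
        print(f"flip_fraction={frac}: epochs_to_target={epochs_to_accuracy(frac)}")

Analogous probes could be built for the other three dimensions, e.g., downsampling images for input-space resolution or subsampling classes for input coverage.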
