

Poster
in
Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Is Aligned Data the Optimal Training Data?

Elyas Obbad · Brando Miranda


Abstract:

In the pursuit of advancing powerful language models, the focus has primarily been on model size and the volume of training data. However, an often-overlooked aspect is the alignment of the training dataset with the test set. We hypothesize that the alignment between the training and test datasets significantly impacts the performance of language models, raising the question: is the most aligned dataset the optimal dataset for training? We present a comprehensive analysis of various datasets, including LeanDojo, Proofnet, WikiText, and others, to understand how alignment influences model performance. By fine-tuning on multiple datasets with different alignment scores, we aim to quantify the relationship between dataset alignment and model efficacy. Our findings reveal strong correlations (R-squared up to 0.90) between GZIP alignment and cross-entropy (CE) loss, indicating that more aligned datasets consistently yield better performance. Furthermore, we demonstrate that increasing the number of tokens from a less aligned dataset, such as LeanDojo, will never achieve the same performance as an optimally aligned dataset like Proofnet, regardless of the training data volume. These insights underscore the importance of dataset alignment in training robust and efficient language models. Our study suggests that understanding and optimizing dataset attributes beyond mere size, such as alignment, can lead to significant improvements in model performance. By focusing on alignment, we can better address the question of whether the most aligned dataset is indeed the optimal dataset, paving the way for more effective and generalizable AI systems.
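The abstract does not specify how the GZIP alignment score is computed. As a rough illustration only, the sketch below assumes a compression-based similarity in the spirit of normalized compression distance (NCD): two text samples are considered more aligned when compressing them together saves more bytes than compressing them separately. The function names (`gzip_size`, `gzip_alignment`) and the mapping to a score in [0, 1] are hypothetical choices for this sketch, not the authors' method.

```python
import gzip


def gzip_size(text: str) -> int:
    # Length in bytes of the gzip-compressed UTF-8 encoding of `text`.
    return len(gzip.compress(text.encode("utf-8")))


def gzip_alignment(train_sample: str, test_sample: str) -> float:
    # Hypothetical GZIP-based alignment score: 1 minus the normalized
    # compression distance (NCD) between the two samples.
    # Higher values indicate that the samples share more compressible structure.
    c_train = gzip_size(train_sample)
    c_test = gzip_size(test_sample)
    c_joint = gzip_size(train_sample + test_sample)
    ncd = (c_joint - min(c_train, c_test)) / max(c_train, c_test)
    return 1.0 - ncd


if __name__ == "__main__":
    # Toy example: two theorem statements in a Lean-like syntax.
    train_text = "theorem add_comm (a b : Nat) : a + b = b + a"
    test_text = "theorem mul_comm (a b : Nat) : a * b = b * a"
    print(f"GZIP alignment: {gzip_alignment(train_text, test_text):.3f}")
```

In practice, such a score would be averaged over many train/test sample pairs before being correlated with fine-tuned CE loss; the abstract reports that correlation reaching an R-squared of up to 0.90.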
