Poster in DMLR Workshop: Data-centric Machine Learning Research
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
Mayee Chen · Nicholas Roberts · Kush Bhatia · Jue Wang · Ce Zhang · Frederic Sala · Christopher Ré
The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, it is unclear which data to select to maximize the model's performance across tasks. To study this, we develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, there exists a natural order in which an LM best learns a set of skills from its training data. If such an order exists, it can be exploited for an improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of their associated data. We demonstrate that these ordered skill sets exist in both synthetic and real data, and that their existence enables skills to be learned with less data when we also train on their prerequisite skills. Building on our framework, we introduce Skill-It, an online data sampling algorithm over mixtures of skills, for learning skills more quickly in both the continual pre-training and fine-tuning regimes, where the goal is to learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic dataset in the continual pre-training setting, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces validation loss on the target skill by 13.6% versus training on the target skill alone. Finally, we apply our skills framework to the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than uniform sampling over data sources achieves with 3B tokens.
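To make the skill-mixture idea concrete, below is a minimal sketch of online sampling over a mixture of skills. The multiplicative-weights update, the skills adjacency matrix `A`, the step size `eta`, and the `eval_loss` helper are all illustrative assumptions standing in for the paper's actual procedure, not the exact Skill-It algorithm.

```python
# Illustrative sketch: online reweighting of a mixture over skills.
# Everything here (update rule, graph, step size, loss stub) is an
# assumption for exposition, not the authors' exact method.
import numpy as np

rng = np.random.default_rng(0)

k = 4                    # number of skills
A = np.eye(k)            # assumed skills graph: A[i, j] > 0 if skill j helps skill i
A[1, 0] = A[2, 1] = 0.5  # hypothetical prerequisite edges (0 -> 1 -> 2)
eta = 0.5                # assumed step size
w = np.ones(k) / k       # start from a uniform mixture over skills

def eval_loss(weights: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: train briefly on data sampled according to
    `weights`, then return per-skill validation losses. A real run would
    train the LM and evaluate on each skill's validation set."""
    return rng.uniform(0.5, 2.0, size=k)

for t in range(10):
    losses = eval_loss(w)
    # Upweight a skill when its own loss, or the loss of skills that
    # depend on it, is still high, so prerequisites get sampled earlier.
    w = w * np.exp(eta * (A.T @ losses))
    w = w / w.sum()  # renormalize to a valid mixture

print("final mixture over skills:", np.round(w, 3))
```

The design intent this sketch tries to capture is that the sampler is online: the mixture is revisited as training proceeds, so data for a skill is emphasized while it (or a skill downstream of it) is still being learned and de-emphasized once its validation loss falls.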