

Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Evaluating $n$-Gram Novelty of Language Models

William Merrill · Noah Smith · Yanai Elazar


Abstract: How novel is the text generated by language models (LMs) compared to their training corpus? In this work, we evaluate generation novelty by measuring the proportion of generated $n$-grams that did not appear in a model's pretraining data, for $1 \leq n \leq 100$. Focusing on the Pythia models, we perform a set of experiments exploring how variables like model size and decoding technique affect the novelty of LM-generated text. For instance, we find that $n$-grams generated with vanilla sampling tend to be more novel than $n$-grams in held-out text because held-out text contains duplicated $n$-grams from the training data. Larger LMs are consistently less novel than smaller LMs, and constrained decoding makes small $n$-grams less novel in LM-generated text than in held-out text. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release our search tool Rusty-DAWG to facilitate further research involving pretraining data.
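To make the metric concrete, below is a minimal sketch of $n$-gram novelty: the fraction of length-$n$ token windows in a generation that never occur in the pretraining corpus. It uses a plain Python set as the membership oracle over training $n$-grams, which only works for toy data; the paper's Rusty-DAWG tool is what makes such lookups feasible at pretraining scale. The function and variable names here are illustrative and are not taken from the released code.

```python
def ngram_novelty(generated_tokens, training_ngrams, n):
    """Fraction of n-grams in `generated_tokens` absent from the training corpus.

    `training_ngrams` stands in for a membership oracle over all n-grams in the
    pretraining data (here, simply a set of token tuples).
    """
    ngrams = [
        tuple(generated_tokens[i : i + n])
        for i in range(len(generated_tokens) - n + 1)
    ]
    if not ngrams:
        return 0.0
    novel = sum(1 for g in ngrams if g not in training_ngrams)
    return novel / len(ngrams)


# Toy usage: a tiny "training corpus" and a generated continuation.
train = "the cat sat on the mat".split()
gen = "the cat sat on a red mat".split()
train_bigrams = {tuple(train[i : i + 2]) for i in range(len(train) - 1)}
print(ngram_novelty(gen, train_bigrams, n=2))  # 0.5: half of the generated bigrams are unseen
```

Sweeping `n` from 1 to 100 with a check like this (backed by an efficient index rather than a set) yields the novelty-versus-$n$ curves compared across model sizes and decoding settings in the paper.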
