Evaluating $n$-Gram Novelty of Language Models
William Merrill · Noah Smith · Yanai Elazar
Abstract
How novel is the text generated by language models (LMs) compared to their training corpus? In this work, we evaluate generation novelty by measuring the proportion of generated $n$-grams that did not appear in a model's pretraining data, for $1 \leq n \leq 100$. Focusing on the Pythia models, we perform a set of experiments exploring how variables like model size and decoding techniques affect the novelty of LM-generated text. For instance, we find that $n$-grams generated with vanilla sampling tend to be more novel than $n$-grams in held-out text because held-out text contains duplicated $n$-grams from the training data. Larger LMs are consistently less novel than smaller LMs. Constrained decoding makes small $n$-grams less novel in LM-generated text than in held-out text. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release our search tool Rusty-DAWG to facilitate further research involving pretraining data.
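As a rough illustration of the metric described above (not the paper's Rusty-DAWG implementation, which indexes the pretraining corpus at scale), the following minimal Python sketch computes the fraction of generated $n$-grams absent from a set of training $n$-grams; it assumes the training $n$-grams fit in an in-memory set, and the function names are illustrative.

```python
from typing import List, Sequence, Set, Tuple

Ngram = Tuple[str, ...]

def ngrams(tokens: Sequence[str], n: int) -> List[Ngram]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(generated: Sequence[str], training_ngrams: Set[Ngram], n: int) -> float:
    """Proportion of generated n-grams that do not appear in the training set."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    novel = sum(1 for g in gen if g not in training_ngrams)
    return novel / len(gen)

# Toy usage: a tiny "training corpus" and a generated continuation.
train_tokens = "the cat sat on the mat".split()
train_bigrams = set(ngrams(train_tokens, 2))
generated_tokens = "the cat sat on the rug".split()
print(ngram_novelty(generated_tokens, train_bigrams, n=2))  # 0.2: only "the rug" is novel
```

In practice, storing all training $n$-grams explicitly is infeasible for pretraining-scale corpora, which is why the paper relies on a suffix-automaton-style index rather than a plain set.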