Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
Evaluating $n$-Gram Novelty of Language Models
William Merrill · Noah Smith · Yanai Elazar
Abstract:
How novel is the text generated by language models (LMs) compared to their training corpus? In this work, we evaluate generation novelty by measuring the proportion of generated $n$-grams that did not appear in a model's pretraining data, for $1 \leq n \leq 100$. Focusing on the Pythia models, we perform a set of analyses exploring how variables like model size and decoding techniques affect the novelty of LM-generated text. For instance, we find that $n$-grams generated with vanilla sampling tend to be more novel than $n$-grams in held-out text because held-out text contains duplicated $n$-grams from the training data. Larger LMs are consistently less novel than smaller LMs. Constrained decoding makes small $n$-grams less novel in LM-generated text than in held-out text. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release our search tool Rusty-DAWG to facilitate further research involving pretraining data.
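To make the metric concrete, here is a minimal, illustrative sketch of $n$-gram novelty: the fraction of a generation's $n$-grams that never occur in the training corpus. It uses a naive hash-set index over toy token IDs for clarity; the names and data are hypothetical, and the paper's actual tool, Rusty-DAWG, instead builds a suffix-automaton-style index so that $n$-grams up to $n = 100$ can be matched against pretraining-scale corpora without enumerating them.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[int], n: int) -> Iterable[Tuple[int, ...]]:
    """Yield all contiguous n-grams of a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def novelty(generated: List[int], train_index: Set[Tuple[int, ...]], n: int) -> float:
    """Fraction of the generation's n-grams that do not appear in the training index."""
    grams = list(ngrams(generated, n))
    if not grams:
        return 0.0
    novel = sum(1 for g in grams if g not in train_index)
    return novel / len(grams)


# Hypothetical toy corpora of token IDs; a real evaluation would index the
# full pretraining corpus (e.g., with Rusty-DAWG) rather than a Python set.
train_tokens = [1, 2, 3, 4, 2, 3, 5]
generated_tokens = [2, 3, 5, 6, 1]

for n in (1, 2, 3):
    index = set(ngrams(train_tokens, n))
    print(f"n={n}: novelty={novelty(generated_tokens, index, n):.2f}")
```

In this sketch, novelty rises with $n$, mirroring the paper's setup in which small $n$-grams are almost always seen in training data while long $n$-grams are almost always novel.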