

Poster

Measuring Diversity in Synthetic Datasets

Yuchang Zhu · Huizhe Zhang · Bingzhe Wu · Jintang Li · Zibin Zheng · Peilin Zhao · Liang Chen · Yatao Bian

East Exhibition Hall A-B #E-1802
[ Project Page ]
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets—an aspect crucial for robust model performance—remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing methods. Code is available at: https://github.com/bluewhalelab/dcscore.
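The abstract describes DCScore as casting diversity evaluation as a sample classification task that leverages mutual relationships among samples. The sketch below is one hedged way to read that idea in code, not the authors' exact formulation: embed each synthetic sample, build a pairwise similarity matrix, turn each row into an n-way "which sample is this?" distribution via softmax, and sum the self-classification probabilities. The function name dcscore_sketch, the cosine-similarity kernel, and the temperature parameter are illustrative assumptions; the official implementation is at https://github.com/bluewhalelab/dcscore.

```python
import numpy as np

def dcscore_sketch(embeddings: np.ndarray, temperature: float = 1.0) -> float:
    """Illustrative classification-style diversity score (not the official DCScore).

    Assumes each row of `embeddings` is a vector representation of one synthetic
    sample (e.g., from a sentence encoder). The cosine kernel and temperature are
    assumptions made for this sketch.
    """
    # Normalize rows so pairwise dot products become cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T  # mutual relationships among samples (n x n)

    # Treat each row as logits of an n-way "which sample is this?" classifier.
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # The probability mass each sample assigns to itself reflects how
    # distinguishable it is from the rest; summing over samples gives a
    # dataset-level score ranging from 1 (all samples identical) toward n
    # (all samples fully distinct).
    return float(np.trace(probs))

# Usage: three near-duplicate samples score lower than three distinct ones.
rng = np.random.default_rng(0)
near_dupes = rng.normal(size=(1, 8)).repeat(3, axis=0) + 1e-3 * rng.normal(size=(3, 8))
distinct = rng.normal(size=(3, 8))
print(dcscore_sketch(near_dupes), dcscore_sketch(distinct))
```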

Lay Summary:

Evaluating the diversity of synthetic datasets produced by large language models (LLMs) is a crucial challenge for their effective use. To address this challenge, we introduce DCScore, a novel method for evaluating diversity from a classification perspective.

Our investigation reveals that the fundamental determinant of synthetic dataset diversity is sample discriminability, a property inherently addressed by classification methodologies. Consequently, we formalize diversity evaluation as a sample classification task through our proposed DCScore framework. Surprisingly, we found that DCScore performs well in the diversity evaluation of synthetic datasets while incurring lower computational costs.

With the increasing use of synthetic datasets in next-generation LLM training, evaluating data quality, particularly diversity, emerges as a critical requirement. DCScore provides an effective solution for diversity evaluation, thereby promoting more widespread and reliable use of synthetic data in advanced language modeling applications.
