

Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Understanding Bias in Visual Datasets

Boya Zeng · Yida Yin · Zhuang Liu


Abstract:

A recent study (Liu & He, 2024) has shown that large-scale datasets are biased: they can be easily classified by modern neural networks. However, the concrete forms of visual bias among these datasets remain unclear. In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach involves applying various transformations to extract semantic, structural, boundary, color, and frequency information from datasets and assessing how each type of information contributes to their differences. We further unpack their semantic bias with object-level queries. Finally, we leverage natural language tools to generate detailed, open-ended descriptions of each dataset’s characteristics. Our work aims to help researchers understand existing large-scale datasets and build more diverse and representative ones in the future.
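To make the idea of transformation-based bias probing concrete, below is a minimal, hypothetical sketch (not the authors' code) of the kind of analysis the abstract describes: each image is reduced to a single information channel (color histogram, edge map, or Fourier amplitude), and a simple linear probe then tries to predict which dataset an image came from. The function names, feature choices, and the use of scikit-learn's logistic regression in place of the modern neural networks mentioned above are illustrative assumptions, and the two "datasets" are synthetic stand-ins.

```python
# Hypothetical sketch: probe how well datasets can still be told apart
# after stripping an image down to one type of information.
import numpy as np
from scipy import ndimage
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def color_histogram(img, bins=16):
    """Per-channel color histogram (keeps color statistics, discards spatial structure)."""
    return np.concatenate(
        [np.histogram(img[..., c], bins=bins, range=(0, 1))[0] for c in range(3)]
    ).astype(float)


def edge_map(img, size=32):
    """Sobel gradient magnitude of the grayscale image (keeps object boundaries)."""
    gray = img.mean(axis=-1)
    mag = np.hypot(ndimage.sobel(gray, axis=0), ndimage.sobel(gray, axis=1))
    return mag[:size, :size].ravel()


def fourier_amplitude(img, size=16):
    """Low-frequency Fourier amplitude spectrum (keeps global frequency statistics)."""
    gray = img.mean(axis=-1)
    amp = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(gray))))
    c = gray.shape[0] // 2
    return amp[c - size // 2: c + size // 2, c - size // 2: c + size // 2].ravel()


def dataset_classification_accuracy(images, dataset_labels, transform):
    """Train a linear probe to predict which dataset each transformed image came from."""
    feats = np.stack([transform(img) for img in images])
    x_tr, x_te, y_tr, y_te = train_test_split(
        feats, dataset_labels, test_size=0.3, random_state=0
    )
    clf = LogisticRegression(max_iter=2000).fit(x_tr, y_tr)
    # Accuracy far above chance suggests this information channel carries dataset bias.
    return clf.score(x_te, y_te)


if __name__ == "__main__":
    # Toy stand-in data: two synthetic "datasets" of random RGB images with a slight color shift.
    rng = np.random.default_rng(0)
    imgs_a = rng.random((200, 64, 64, 3)) * 0.9
    imgs_b = rng.random((200, 64, 64, 3)) * 0.9 + 0.1
    images = np.concatenate([imgs_a, imgs_b])
    labels = np.array([0] * 200 + [1] * 200)

    for name, transform in [("color", color_histogram),
                            ("edges", edge_map),
                            ("frequency", fourier_amplitude)]:
        print(name, dataset_classification_accuracy(images, labels, transform))
```

In this toy setup only the color channel should separate the two synthetic datasets well; comparing such accuracies across transformations is one way to attribute dataset differences to specific types of visual information.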
