Workshop
DataWorld: Unifying data curation frameworks across domains
Benjamin Feuer · Neha Hulkund · Sara Beery · Niv Cohen · Thao Nguyen · Ludwig Schmidt · Serena Yeung · Yuhui Zhang
Data-centric research, which has historically taken a backseat to model-centric research, has recently assumed a central role in the machine learning community. Our workshop aims to explore data-centric methods and theory, with a particular emphasis on real-world data curation. By curation, we mean the set of actions taken by one or more curators to move from ideation to a complete dataset. The topic is wide-ranging, with recent work studying everything from sourcing to benchmarks.
One area that remains relatively underexplored is how data-centric methods perform differently depending on the modality and domain of the data and on the downstream application. Which lessons can be shared across domains and modalities, and which cannot? For example, filtration is a common part of the data pipeline. In domains such as medical imaging and wildlife camera traps, filtration faces similar challenges, including long-tailed distributions and natural distribution shifts (between hospitals and between camera locations, respectively). However, the two domains differ in the types of distribution shift encountered (covariate vs. label vs. subpopulation) and in dataset scale (there are generally more camera trap images than medical scans). As another example, most successful filtration methods in the recent DataComp benchmark tend to disproportionately remove images with non-English captions. Such methods not only degrade performance on non-English benchmarks but also fail to generalize to other domains and to most real-world applications.
Our workshop invites novel research that seeks to unify seemingly disparate frameworks for data curation; where unification is impossible, we hope that the necessary trade-offs and domain-specific challenges will be made clearer.