Workshop
Sat Jul 29 12:00 PM -- 08:00 PM (PDT) @ Ballroom C
DMLR Workshop: Data-centric Machine Learning Research
Ce Zhang · Praveen Paritosh · Newsha Ardalani · Nezihe Merve Gürel · William Gaviria Rojas · Yang Liu · Rotem Dror · Manil Maskey · Lilith Bat-Leah · Tzu-Sheng Kuo · Luis Oala · Max Bartolo · Ludwig Schmidt · Alicia Parrish · Daniel Kondermann · Najoung Kim






This is the third edition of a highly successful workshop series focused on data-centric AI, following the Data-Centric AI workshop at NeurIPS 2021 and the DataPerf workshop at ICML 2022. Data, and operations over data (e.g., cleaning, debugging, curation), have been fueling the success of machine learning for decades. While the ML community has historically focused primarily on model development, the importance of data quality has recently attracted intensive interest, as shown by the creation of the NeurIPS Datasets and Benchmarks track, several data-centric AI benchmarks (e.g., DataPerf), and the flourishing of data consortiums such as LAION. The community’s attention has thus turned to the quality of data used for ML training and evaluation. The goal of this workshop is to foster work on these topics under what we call Data-centric Machine Learning Research, which includes not only datasets and benchmarks, but also tooling and governance, as well as fundamental research on topics such as data quality and data acquisition for dataset creation and optimization.

Introduction and Opening (Opening Remarks)
Keynote 1: Andrew Ng (Landing AI) (Keynote)
Data-centric Ecosystem: Croissant and DataPerf - Peter Mattson (Google & MLCommons) (Talk)
Coffee break / networking break (Break)
Keynote 2: Mihaela van der Schaar (University of Cambridge) - Reality-Centric AI (Keynote)
Invited Talk 2: Olga Russakovsky (Princeton University) (Talk)
Invited Talk 3: Masashi Sugiyama (RIKEN & UTokyo) - Data distribution shift (Talk)
Lunch Break / networking break (Break)
Keynote 3: Isabelle Guyon (Google Brain) - Towards Data-Centric AutoML (Keynote)
Invited Talk 1: Dina Machuve (DevData Analytics) - Data for Agriculture (Talk)
Announcement and open discussion on DMLR (Selected members of DMLR Advisory Board) (Discussion Panel)
Panel Discussion (Discussion Panel)
Coffee break / networking break (Break)
Poster Session 1 (Poster Session - In Person)
Poster Session 2 (Virtual) (Poster Session - Virtual)
In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation (Poster)
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data (Poster)
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models (Poster)
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Poster)
On the Reproducibility of Data Valuation under Learning Stochasticity (Poster)
On Memorization and Privacy risks of Sharpness Aware Minimization (Poster)
Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources (Poster)
Internet Explorer: Targeted Representation Learning on the Open Web (Poster)
Mobile Internet Quality Estimation using Self-Tuning Kernel Regression (Poster)
Early Experiments in Scalable Dataset Selection for Self-Supervised Learning in Geospatial Imagery Models (Poster)
Transcending Traditional Boundaries: Leveraging Inter-Annotator Agreement (IAA) for Enhancing Data Management Operations (DMOps) (Poster)
Unitail: A Benchmark for Detecting, Reading, and Matching in Retail Scene (Poster)
Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors (Poster)
Principlism Guided Responsible Data Curation (Poster)
STG-MTL: Scalable Task Grouping for Multi-Task Learning Using Data Maps (Poster)
Identifying Implicit Social Biases in Vision-Language Models (Poster)
Addressing Discrepancies in Semantic and Visual Alignment in Neural Networks (Poster)
Making Scalable Meta Learning Practical (Poster)
Participatory Personalization in Classification (Poster)
Investigating minimizing the training set fill distance in machine learning regression (Poster)
Data Banzhaf: A Robust Data Valuation Framework for Machine Learning (Poster)
TMARS: Improving Visual Representations by Circumventing Text Feature Learning (Poster)
Regularizing Neural Networks with Meta-Learning Generative Models (Poster)
Graphtester: Exploring Theoretical Boundaries of GNNs on Graph Datasets (Poster)
LabelBench: A Comprehensive Framework for Benchmarking Label-Efficient Learning (Poster)
A Skew-Sensitive Evaluation Framework for Imbalanced Data Classification (Poster)
To Aggregate or Not? Learning with Separate Noisy Labels (Poster)
Ensemble Fractional Imputation for Incomplete Categorical Data with a Graphical Model (Poster)
A Privacy-Friendly Approach to Data Valuation (Poster)
Algorithm Selection for Deep Active Learning with Imbalanced Datasets (Poster)
On Estimating the Epistemic Uncertainty of Graph Neural Networks using Stochastic Centering (Poster)
Self-supervised Autoencoder for Correlation-Preserving in Tabular GANs (Poster)
SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Poster)
Birds of an Odd Feather: Guaranteed Out-of-Distribution (OOD) Novel Category Detection (Poster)
Prediction without Preclusion: Recourse Verification with Reachable Sets (Poster)
Active learning for time instant classification (Poster)
Suboptimal Data Can Bottleneck Scaling (Poster)
On Robustness-Accuracy Characterization of Large Language Models using Synthetic Datasets (Poster)
Training on Thin Air: Improve Image Classification with Generated Data (Poster)
On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training (Poster)
Inter-Annotator Agreement in the Wild: Uncovering Its Emerging Roles and Considerations in Real-World Scenarios (Poster)
Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction (Poster)
Understanding Unfairness via Training Concept Influence (Poster)
Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation (Poster)
Prioritized Trajectory Replay: A Replay Memory for Data-driven Reinforcement Learning (Poster)
Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline (Poster)
Contrastive clustering of tabular data (Poster)
Fair Machine Unlearning: Data Removal while Mitigating Disparities (Poster)
Do Machine Learning Models Learn Statistical Rules Inferred from Data? (Poster)
THOS: A Benchmark Dataset for Targeted Hate and Offensive Speech (Poster)
Partial Label Learning meets Active Learning: Enhancing Annotation Efficiency through Binary Questioning (Poster)
Towards Declarative Systems for Data-Centric Machine Learning (Poster)
Point Cloud Classification with ModelNet40: What is left? (Poster)
Does Progress On Object Recognition Benchmarks Improve Real-World Generalization? (Poster)
Localized Data Work as a Precondition for Data-Centric ML: A Case Study of Full Lifecycle Crop Disease Identification in Ghana (Poster)
Bayesian Optimisation Against Climate Change: Applications and Benchmarks (Poster)
On the Usefulness of Synthetic Tabular Data Generation (Poster)
Speech Wikimedia: A 77 Language Multilingual Speech Dataset (Poster)
Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose? (Poster)
D4: Document Deduplication and Diversification (Poster)
On Data Quality and Speed of Training: Bad Data Slows Training (Poster)
Decoupled Graph Label Denoising for Robust Semi-Supervised Node Classification (Poster)
Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value (Poster)
Training with Low-Label-Quality Data: Rank Pruning and Multi-Review (Poster)
Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning (Poster)
PhysicsCAP: Natural Scene Understanding By Semantic Segmentation, CLIP And Physical Models Through Refined and Enriched Captions (Poster)
Towards an Efficient Algorithm for Time Series Forecasting with Anomalies (Poster)
Promises and Pitfalls of Threshold-based Auto-labeling (Poster)
Can Expert Demonstration Guarantee Offline Performance in Sparse Reward Environment? (Poster)
Put on your detective hat: What's wrong in this video? (Poster)
MultiLegalPile: A 689GB Multilingual Legal Corpus (Poster)
Data-Centric Defense: Shaping Loss Landscape with Augmentations to Counter Model Inversion (Poster)
Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning (Poster)
Characterizing Risk Regimes for Safe Deployment of Deep Regression Models (Poster)
No Imputation without Representation (Poster)
Data Integration for Driver Telematics with Selection Biases (Poster)
ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data (Poster)
Is Pre-training Truly Better Than Meta-Learning? (Poster)
How to Improve Imitation Learning Performance with Sub-optimal Supplementary Data? (Poster)
The Matrix Reloaded: A Counterfactual Perspective on Bias in Machine Learning (Poster)
Learning pipeline-invariant representation for robust brain phenotype prediction (Poster)
L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models (Poster)
Adaptive Aggregated Drift Detector (Poster)
Knowledge Graph-Augmented Korean Generative Commonsense Reasoning (Poster)
CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training (Poster)
Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least (Poster)
Offline Reinforcement Learning with Imbalanced Datasets (Poster)
Accelerating Batch Active Learning Using Continual Learning Techniques (Poster)
Estimating label quality and errors in semantic segmentation data via any model (Poster)
Detecting Errors in Numerical Data via any Regression Model (Poster)
Enhancing Time Series Forecasting Models under Concept Drift by Data-centric Online Ensembling (Poster)
Why Do Self-Supervised Models Transfer? On Data Augmentation and Feature Properties (Poster)
DMOps: Data Management Operations and Recipes (Poster)
Characterizing the Impacts of Semi-supervised Learning for Weak Supervision (Poster)
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias (Poster)
Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation (Poster)
DataCI: A Platform for Data-Centric AI on Streaming Data (Poster)
Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data (Poster)
RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting (Poster)
Taming Small-sample Bias in Low-budget Active Learning (Poster)
EPIC: Graph Augmentation with Edit Path Interpolation via Learnable Cost (Poster)
Uncovering Neural Scaling Law in Molecular Representation Learning (Poster)
Predicting Article Time Periods with Text2Time: A Transformer-based Approach (Poster)
How to Cope with Gradual Data Drift? (Poster)
Probing Heterogeneous Pretraining Datasets with Small Curated Datasets (Poster)
Data Similarity is Not Enough to Explain Language Model Performance (Poster)
Improve Model Inference Cost with Image Gridding (Poster)
Improving multimodal datasets with image captioning (Poster)
Programmable Synthetic Tabular Data Generation (Poster)