Workshop
Sat Jul 29 12:00 PM -- 08:00 PM (PDT) @ Ballroom C
DMLR Workshop: Data-centric Machine Learning Research
Ce Zhang · Praveen Paritosh · Newsha Ardalani · Nezihe Merve Gürel · William Gaviria Rojas · Yang Liu · Rotem Dror · Manil Maskey · Lilith Bat-Leah · Tzu-Sheng Kuo · Luis Oala · Max Bartolo · Ludwig Schmidt · Alicia Parrish · Daniel Kondermann · Najoung Kim





This is the third edition of a highly successful workshop series on data-centric AI, following the Data-Centric AI workshop at NeurIPS 2021 and the DataPerf workshop at ICML 2022. Data, and operations over data (e.g., cleaning, debugging, curation), have fueled the success of machine learning for decades. While the ML community has historically focused primarily on model development, the quality of data used for ML training and evaluation has recently attracted intense interest, as shown by the creation of the NeurIPS Datasets and Benchmarks track, several data-centric AI benchmarks (e.g., DataPerf), and the flourishing of data consortiums such as LAION. The goal of this workshop is to facilitate work on these important topics in what we call Data-centric Machine Learning Research, which includes not only datasets and benchmarks but also tooling and governance, as well as fundamental research on topics such as data quality and data acquisition for dataset creation and optimization.

Introduction and Opening (Opening Remarks)
Keynote 1: Andrew Ng (Landing AI) (Keynote)
Data-centric Ecosystem: Croissant and DataPerf - Peter Mattson (Google & MLCommons) (Talk)
Coffee break / networking break (Break)
Keynote 2: Mihaela van der Schaar (University of Cambridge) - Reality-Centric AI (Keynote)
Invited Talk 2: Olga Russakovsky (Princeton University) (Talk)
Invited Talk 3: Masashi Sugiyama (RIKEN & UTokyo) - Data distribution shift (Talk)
Lunch Break / networking break (Break)
Keynote 3: Isabelle Guyon (Google Brain) - Towards Data-Centric AutoML (Keynote)
Invited Talk 1: Dina Machuve (DevData Analytics) - Data for Agriculture (Talk)
Announcement and open discussion on DMLR (Selected members of DMLR Advisory Board) (Discussion Panel)
Panel Discussion (Discussion Panel)
Coffee break / networking break (Break)
Poster Session 1 (Poster Session - In Person)
Poster Session 2 (Virtual) (Poster Session - Virtual)
Adaptive Aggregated Drift Detector (Poster)
Improving multimodal datasets with image captioning (Poster)
Learning pipeline-invariant representation for robust brain phenotype prediction (Poster)
Characterizing the Impacts of Semi-supervised Learning for Weak Supervision (Poster)
Is Pre-training Truly Better Than Meta-Learning? (Poster)
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data (Poster)
Offline Reinforcement Learning with Imbalanced Datasets (Poster)
ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data (Poster)
STG-MTL: Scalable Task Grouping for Multi-Task Learning Using Data Maps (Poster)
Estimating label quality and errors in semantic segmentation data via any model (Poster)
Mobile Internet Quality Estimation using Self-Tuning Kernel Regression (Poster)
Birds of an Odd Feather: Guaranteed Out-of-Distribution (OOD) Novel Category Detection (Poster)
Prediction without Preclusion: Recourse Verification with Reachable Sets (Poster)
Active learning for time instant classification (Poster)
Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least (Poster)
Speech Wikimedia: A 77 Language Multilingual Speech Dataset (Poster)
Suboptimal Data Can Bottleneck Scaling (Poster)
Bayesian Optimisation Against Climate Change: Applications and Benchmarks (Poster)
On the Usefulness of Synthetic Tabular Data Generation (Poster)
On the Reproducibility of Data Valuation under Learning Stochasticity (Poster)
Localized Data Work as a Precondition for Data-Centric ML: A Case Study of Full Lifecycle Crop Disease Identification in Ghana (Poster)
Data Integration for Driver Telematics with Selection Biases (Poster)
Training with Low-Label-Quality Data: Rank Pruning and Multi-Review (Poster)
Taming Small-sample Bias in Low-budget Active Learning (Poster)
Regularizing Neural Networks with Meta-Learning Generative Models (Poster)
Early Experiments in Scalable Dataset Selection for Self-Supervised Learning in Geospatial Imagery Models (Poster)
Detecting Errors in Numerical Data via any Regression Model (Poster)
THOS: A Benchmark Dataset for Targeted Hate and Offensive Speech (Poster)
Data Similarity is Not Enough to Explain Language Model Performance (Poster)
Accelerating Batch Active Learning Using Continual Learning Techniques (Poster)
Contrastive clustering of tabular data (Poster)
CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training (Poster)
Unitail: A Benchmark for Detecting, Reading, and Matching in Retail Scene (Poster)
Programmable Synthetic Tabular Data Generation (Poster)
Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction (Poster)
How to Cope with Gradual Data Drift? (Poster)
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Poster)
How to Improve Imitation Learning Performance with Sub-optimal Supplementary Data? (Poster)
Algorithm Selection for Deep Active Learning with Imbalanced Datasets (Poster)
Inter-Annotator Agreement in the Wild: Uncovering Its Emerging Roles and Considerations in Real-World Scenarios (Poster)
On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training (Poster)
To Aggregate or Not? Learning with Separate Noisy Labels (Poster)
Transcending Traditional Boundaries: Leveraging Inter-Annotator Agreement (IAA) for Enhancing Data Management Operations (DMOps) (Poster)
DMOps: Data Management Operations and Recipes (Poster)
Training on Thin Air: Improve Image Classification with Generated Data (Poster)
Principlism Guided Responsible Data Curation (Poster)
Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning (Poster)
Making Scalable Meta Learning Practical (Poster)
Participatory Personalization in Classification (Poster)
DataCI: A Platform for Data-Centric AI on Streaming Data (Poster)
Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value (Poster)
The Matrix Reloaded: A Counterfactual Perspective on Bias in Machine Learning (Poster)
Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose? (Poster)
On Memorization and Privacy risks of Sharpness Aware Minimization (Poster)
Graphtester: Exploring Theoretical Boundaries of GNNs on Graph Datasets (Poster)
SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Poster)
Identifying Implicit Social Biases in Vision-Language Models (Poster)
A Skew-Sensitive Evaluation Framework for Imbalanced Data Classification (Poster)
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias (Poster)
Characterizing Risk Regimes for Safe Deployment of Deep Regression Models (Poster)
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models (Poster)
In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation (Poster)
Does Progress On Object Recognition Benchmarks Improve Real-World Generalization? (Poster)
Point Cloud Classification with ModelNet40: What is left? (Poster)
L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models (Poster)
No Imputation without Representation (Poster)
Data Banzhaf: A Robust Data Valuation Framework for Machine Learning (Poster)
Towards Declarative Systems for Data-Centric Machine Learning (Poster)
Towards an Efficient Algorithm for Time Series Forecasting with Anomalies (Poster)
Partial Label Learning meets Active Learning: Enhancing Annotation Efficiency through Binary Questioning (Poster)
On Robustness-Accuracy Characterization of Large Language Models using Synthetic Datasets (Poster)
Improve Model Inference Cost with Image Gridding (Poster)
Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources (Poster)
A Privacy-Friendly Approach to Data Valuation (Poster)
Enhancing Time Series Forecasting Models under Concept Drift by Data-centric Online Ensembling (Poster)
RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting (Poster)
Knowledge Graph-Augmented Korean Generative Commonsense Reasoning (Poster)
Predicting Article Time Periods with Text2Time: A Transformer-based Approach (Poster)
Do Machine Learning Models Learn Statistical Rules Inferred from Data? (Poster)
TMARS: Improving Visual Representations by Circumventing Text Feature Learning (Poster)
Fair Machine Unlearning: Data Removal while Mitigating Disparities (Poster)
Addressing Discrepancies in Semantic and Visual Alignment in Neural Networks (Poster)
Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data (Poster)
Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning (Poster)
Investigating minimizing the training set fill distance in machine learning regression (Poster)
Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline (Poster)
EPIC: Graph Augmentation with Edit Path Interpolation via Learnable Cost (Poster)
Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation (Poster)
Probing Heterogeneous Pretraining Datasets with Small Curated Datasets (Poster)
Data-Centric Defense: Shaping Loss Landscape with Augmentations to Counter Model Inversion (Poster)
Prioritized Trajectory Replay: A Replay Memory for Data-driven Reinforcement Learning (Poster)
Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation (Poster)
Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors (Poster)
Understanding Unfairness via Training Concept Influence (Poster)
Promises and Pitfalls of Threshold-based Auto-labeling (Poster)
Why Do Self-Supervised Models Transfer? On Data Augmentation and Feature Properties (Poster)
Self-supervised Autoencoder for Correlation-Preserving in Tabular GANs (Poster)
PhysicsCAP: Natural Scene Understanding By Semantic Segmentation, CLIP And Physical Models Through Refined and Enriched Captions (Poster)
Put on your detective hat: What's wrong in this video? (Poster)
Ensemble Fractional Imputation for Incomplete Categorical Data with a Graphical Model (Poster)
Decoupled Graph Label Denoising for Robust Semi-Supervised Node Classification (Poster)
On Data Quality and Speed of Training: Bad Data Slows Training (Poster)
D4: Document Deduplication and Diversification (Poster)
Can Expert Demonstration Guarantee Offline Performance in Sparse Reward Environment? (Poster)
MultiLegalPile: A 689GB Multilingual Legal Corpus (Poster)
Uncovering Neural Scaling Law in Molecular Representation Learning (Poster)
Internet Explorer: Targeted Representation Learning on the Open Web (Poster)
LabelBench: A Comprehensive Framework for Benchmarking Label-Efficient Learning (Poster)
On Estimating the Epistemic Uncertainty of Graph Neural Networks using Stochastic Centering (Poster)