DataPerf: Benchmarking Data for Data-Centric AI

Workshop


Lora Aroyo · Newsha Ardalani · Colby Banbury · Gregory Diamos · William Gaviria Rojas · Tzu-Sheng Kuo · Mark Mazumder · Peter Mattson · Praveen Paritosh

Ballroom 3

Fri 22 Jul, 5:45 a.m. PDT


This workshop proposal builds on the success of the 1st Data-Centric AI Workshop, organized at NeurIPS 2021, which attracted more than 160 submissions and close to 200 participants. It expands that effort by engaging the deeplearning.ai community with the active, interdisciplinary MLCommons community of practitioners, researchers, and engineers from both academia and industry, presenting the current state of the art, work in progress, and a set of open problems in the field of benchmarking data for ML. Many of these areas are in a nascent stage, and we hope to further their development by knitting them together into a coherent whole. We seek to drive progress on these core problems by promoting the creation of a set of benchmarks for data quality and data-related algorithms. We want to bring together work that pushes forward this new view of data-centric ML benchmarks, for example the initiatives at MLCommons, a non-profit that operates the MLPerf benchmarks that have become standard for AI chip speed, as well as others including Dynabench, OpenML, and the data-centric AI hub. We envision MLCommons as providing a framework and resources for the evolution of benchmarks in this space, and our workshop as showcasing the best innovations revealed by those benchmarks and providing a focus event for the community submitting to them.

A huge amount of innovation, in algorithms, ideas, principles, and tools, is needed to make data-centric AI development efficient and effective. We hope that this workshop will help spark that innovation.


Timezone: America/Los_Angeles

Schedule

Fri 5:45 a.m. - 6:00 a.m.	Welcome
Fri 6:00 a.m. - 6:30 a.m.	The Data-Centric AI Competition (Keynote)	Andrew Ng
Fri 6:30 a.m. - 6:45 a.m.	Open Images: Lessons Learned from Collecting and Annotating 9M images (Invited Talk)	Jordi Pont-Tuset
Fri 6:45 a.m. - 7:00 a.m.	Did We Forget about the Canonical Source of Variance in Machine Learning Pipelines? (Invited Talk)	Xavier Bouthillier
Fri 7:00 a.m. - 7:15 a.m.	Coffee Break
Fri 7:15 a.m. - 7:30 a.m.	Embracing Subjectivity In Machine Learning Benchmarks (Invited Talk)	Kurt Bollacker
Fri 7:30 a.m. - 7:45 a.m.	Responsible Evaluation Framework (Invited Talk)	Mona Diab
Fri 7:45 a.m. - 7:50 a.m.	Data Excellence for Responsible AI (DataPerf Talk)	Lora Aroyo
Fri 7:50 a.m. - 7:55 a.m.	DataPerf Effort (DataPerf Talk)	Praveen Paritosh
Fri 7:55 a.m. - 8:00 a.m.	DataPerf Challenges (DataPerf Talk)	Cody Coleman · Mark Mazumder · Colby Banbury
Fri 8:00 a.m. - 8:05 a.m.	Short Break
Fri 8:05 a.m. - 8:08 a.m.	Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP (Short Talk)	Thao Nguyen
Fri 8:08 a.m. - 8:11 a.m.	Metadata Representations for Queryable ML Model Zoos (Short Talk)	Ziyu Li
Fri 8:11 a.m. - 8:14 a.m.	Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics (Short Talk)	Sara Hooker
Fri 8:14 a.m. - 8:17 a.m.	Interpretable Distribution Shift Detection using Optimal Transport (Short Talk)	Neha Hulkund
Fri 8:17 a.m. - 8:20 a.m.	Data Budgeting for Machine Learning (Short Talk)	Weixin Liang · James Zou
Fri 8:20 a.m. - 8:23 a.m.	Data Sculpting: Interpretable Algorithm for End-to-End Cohort Selection (Short Talk)	Ruishan Liu · James Zou
Fri 8:23 a.m. - 8:26 a.m.	Data Augmentation Techniques for Speech Error Correction (Short Talk)	James Ren
Fri 8:26 a.m. - 8:29 a.m.	Data-Centric AI Infra 2.0 (Short Talk)	Nikon Rasumov-Rahe
Fri 8:29 a.m. - 8:32 a.m.	Beyond Hard Labels: Investigating data label distributions (Short Talk)	Lars Schmarje
Fri 8:32 a.m. - 8:35 a.m.	An Operational Metrics Framework for ML Data (Short Talk)	Anoop Sinha
Fri 8:35 a.m. - 8:38 a.m.	An Empirical Study of Modular Bias Mitigators and Ensembles (Short Talk)	Michael Feffer · Martin Hirzel
Fri 8:38 a.m. - 8:41 a.m.	A Self-Supervised Automatic Post-Editing Data Generation Tool (Short Talk)	Heuiseok Lim
Fri 8:41 a.m. - 8:44 a.m.	Robustar: Interactive Toolbox Supporting Precise Data Annotation for Robust Vision Learning (Short Talk)	Haohan Wang
Fri 8:44 a.m. - 8:47 a.m.	Toolbox for Visualizing Effects of Data Instances on Decision Boundaries (Short Talk)	Danilo Brajovic
Fri 8:47 a.m. - 8:50 a.m.	Not All Poisons are Created Equal: Robust Training against Data Poisoning (Short Talk)	Yu Yang · Baharan Mirzasoleiman
Fri 8:50 a.m. - 8:53 a.m.	An Adaptive Deep Clustering Pipeline to Inform Text Labeling at Scale (Short Talk)	Xinyu Chen · Ian Beaver
Fri 9:00 a.m. - 9:45 a.m.	Panel Discussion with Morning Speakers (Discussion Panel)
Fri 9:45 a.m. - 10:30 a.m.	Lunch Break
Fri 10:30 a.m. - 11:00 a.m.	An Open Conversation about Joint Data and Model Quality: Metrics, Tools, Practices (Keynote)	Besmira Nushi
Fri 11:00 a.m. - 11:15 a.m.	Evaluation of ML in Health/Science (Invited Talk)	James Zou
Fri 11:15 a.m. - 11:30 a.m.	Time Value of Data and AI Strategy (Invited Talk)	Ehsan Valavi
Fri 11:30 a.m. - 11:45 a.m.	Assessing Quality of Information without Ground Truth (Invited Talk)	Yiling Chen
Fri 11:45 a.m. - 12:00 p.m.	Ethical Challenges of Data Collection & Use in Machine Learning Research (Invited Talk)	Deborah Raji
Fri 12:00 p.m. - 12:15 p.m.	Coffee Break
Fri 12:15 p.m. - 12:30 p.m.	Challenges and Opportunities in Handling Data Distributional Shift (Invited Talk)	Sharon Li
Fri 12:30 p.m. - 12:45 p.m.	Less Data Can Be More! (Invited Talk)	Baharan Mirzasoleiman
Fri 12:45 p.m. - 12:55 p.m.	What Can Data-Centric AI Learn from Data Engineering? (Invited Talk)	Matei Zaharia
Fri 12:55 p.m. - 1:00 p.m.	Dynabench <3 DataPerf (DataPerf Talk)	Douwe Kiela
Fri 1:00 p.m. - 1:05 p.m.	Data Valuation (DataPerf Talk)	Newsha Ardalani
Fri 1:05 p.m. - 1:10 p.m.	Datasets 2030 (DataPerf Talk)	Peter Mattson
Fri 1:10 p.m. - 1:15 p.m.	Short Break
Fri 1:15 p.m. - 1:18 p.m.	Model-Agnostic Label Quality Scoring to Detect Real-World Label Errors (Short Talk)	Jonas Mueller
Fri 1:18 p.m. - 1:21 p.m.	Open Coding for Machine Learning Data (Short Talk)	Magdalena Price · Dylan Hadfield-Menell
Fri 1:21 p.m. - 1:24 p.m.	Radically Lower Data-Labeling Costs for Document Extraction Models with Selective Labeling (Short Talk)	Yichao Zhou
Fri 1:24 p.m. - 1:27 p.m.	Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features (Short Talk)	Qingrui Jia · Xuhong Li
Fri 1:27 p.m. - 1:30 p.m.	FORML: Learning to Reweight Data for Fairness (Short Talk)	Skyler Seto
Fri 1:30 p.m. - 1:33 p.m.	GSCLIP: A Framework for Explaining Distribution Shifts in Natural Language (Short Talk)	Zhiying Zhu · Weixin Liang · James Zou
Fri 1:33 p.m. - 1:36 p.m.	Robust Synthetic GNN Benchmarks with GraphWorld (Short Talk)	John Palowitch
Fri 1:36 p.m. - 1:39 p.m.	Stochastic Perturbations of Tabular Features for Non-Deterministic Inference with Automunge (Short Talk)	Nicholas Teague
Fri 1:39 p.m. - 1:42 p.m.	MRCLens: an MRC Dataset Bias Detection Toolkit (Short Talk)	Haohan Wang
Fri 1:42 p.m. - 1:45 p.m.	FairGen: Fair Synthetic Data Generation (Short Talk)	Bhushan Chaudhari · Aakash Agarwal
Fri 1:45 p.m. - 1:48 p.m.	LAVA: Language Audio Vision Alignment for Data-Efficient Contrastive Learning on Video Data (Short Talk)	Sumanth Gurram
Fri 1:48 p.m. - 1:51 p.m.	Infinite Recommendation Networks: A Data-Centric Approach (Short Talk)	Noveen Sachdeva · Carole-Jean Wu · Julian McAuley
Fri 1:51 p.m. - 1:54 p.m.	Revisiting Hotels-50K and Hotel-ID (Short Talk)	Aarash Feizi
Fri 1:54 p.m. - 1:57 p.m.	TMED 2: A Dataset for Semi-Supervised Classification of Echocardiograms (Short Talk)	Michael Hughes
Fri 1:57 p.m. - 2:00 p.m.	GreenDB - A Data Set and Benchmark for Extraction of Sustainability Information of Consumer Goods (Short Talk)	Sebastian Jäger
Fri 2:00 p.m. - 2:03 p.m.	DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations (Short Talk)	Yatao Bian
Fri 2:15 p.m. - 3:00 p.m.	Panel Discussion with Afternoon Speakers (Discussion Panel)