Workshop
DataPerf: Benchmarking Data for Data-Centric AI
Lora Aroyo · Newsha Ardalani · Colby Banbury · Gregory Diamos · William Gaviria Rojas · Tzu-Sheng Kuo · Mark Mazumder · Peter Mattson · Praveen Paritosh
Ballroom 3
Fri 22 Jul, 5:45 a.m. PDT
This workshop builds on the success of the 1st Data-Centric AI Workshop, organized at NeurIPS 2021, which attracted more than 160 submissions and close to 200 participants. It expands that effort by engaging the deeplearning.ai community with the active, interdisciplinary MLCommons community of practitioners, researchers and engineers from both academia and industry, and by presenting the current state of the art, work in progress and a set of open problems in the field of benchmarking data for ML. Many of these areas are at a nascent stage, and we hope to further their development by knitting them together into a coherent whole.
We seek to drive progress on these core problems by promoting the creation of a set of benchmarks for data quality and data-related algorithms. We want to bring together work that pushes forward this new view of data-centric ML benchmarks, for example the initiatives at MLCommons, a non-profit that operates the MLPerf benchmarks, which have become the standard for AI chip speed, as well as others including Dynabench, OpenML and the data-centric AI hub. We envision MLCommons as providing a framework and resources for the evolution of benchmarks in this space, and our workshop as showcasing the best innovations revealed by those benchmarks and providing a focus event for the community submitting to them.
A huge amount of innovation in algorithms, ideas, principles, and tools is needed to make data-centric AI development efficient and effective. We hope that this workshop will help spark that innovation.
Schedule
Fri 5:45 a.m. - 6:00 a.m. | Welcome (Welcome)
Fri 6:00 a.m. - 6:30 a.m. | The Data-Centric AI Competition (Keynote) | Andrew Ng
Fri 6:30 a.m. - 6:45 a.m. | Open Images: Lessons Learned from Collecting and Annotating 9M images (Invited Talk) | Jordi Pont-Tuset
Fri 6:45 a.m. - 7:00 a.m. | Did We Forget about the Canonical Source of Variance in Machine Learning Pipelines? (Invited Talk) | Xavier Bouthillier
Fri 7:00 a.m. - 7:15 a.m. | Coffee Break
Fri 7:15 a.m. - 7:30 a.m. | Embracing Subjectivity In Machine Learning Benchmarks (Invited Talk) | Kurt Bollacker
Fri 7:30 a.m. - 7:45 a.m. | Responsible Evaluation Framework (Invited Talk) | Mona Diab
Fri 7:45 a.m. - 7:50 a.m. | Data Excellence for Responsible AI (DataPerf Talk) | Lora Aroyo
Fri 7:50 a.m. - 7:55 a.m. | DataPerf Effort (DataPerf Talk) | Praveen Paritosh
Fri 7:55 a.m. - 8:00 a.m. | DataPerf Challenges (DataPerf Talk) | Cody Coleman · Mark Mazumder · Colby Banbury
Fri 8:00 a.m. - 8:05 a.m. | Short Break
Fri 8:05 a.m. - 8:08 a.m. | Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP (Short Talk) | Thao Nguyen
Fri 8:08 a.m. - 8:11 a.m. | Metadata Representations for Queryable ML Model Zoos (Short Talk) | Ziyu Li
Fri 8:11 a.m. - 8:14 a.m. | Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics (Short Talk) | Sara Hooker
Fri 8:14 a.m. - 8:17 a.m. | Interpretable Distribution Shift Detection using Optimal Transport (Short Talk) | Neha Hulkund
Fri 8:17 a.m. - 8:20 a.m. | Data Budgeting for Machine Learning (Short Talk) | Weixin Liang · James Zou
Fri 8:20 a.m. - 8:23 a.m. | Data Sculpting: Interpretable Algorithm for End-to-End Cohort Selection (Short Talk) | Ruishan Liu · James Zou
Fri 8:23 a.m. - 8:26 a.m. | Data Augmentation Techniques for Speech Error Correction (Short Talk) | James Ren
Fri 8:26 a.m. - 8:29 a.m. | Data-Centric AI Infra 2.0 (Short Talk) | Nikon Rasumov-Rahe
Fri 8:29 a.m. - 8:32 a.m. | Beyond Hard Labels: Investigating data label distributions (Short Talk) | Lars Schmarje
Fri 8:32 a.m. - 8:35 a.m. | An Operational Metrics Framework for ML Data (Short Talk) | Anoop Sinha
Fri 8:35 a.m. - 8:38 a.m. | An Empirical Study of Modular Bias Mitigators and Ensembles (Short Talk) | Michael Feffer · Martin Hirzel
Fri 8:38 a.m. - 8:41 a.m. | A Self-Supervised Automatic Post-Editing Data Generation Tool (Short Talk) | Heuiseok Lim
Fri 8:41 a.m. - 8:44 a.m. | Robustar: Interactive Toolbox Supporting Precise Data Annotation for Robust Vision Learning (Short Talk) | Haohan Wang
Fri 8:44 a.m. - 8:47 a.m. | Toolbox for Visualizing Effects of Data Instances on Decision Boundaries (Short Talk) | Danilo Brajovic
Fri 8:47 a.m. - 8:50 a.m. | Not All Poisons are Created Equal: Robust Training against Data Poisoning (Short Talk) | Yu Yang · Baharan Mirzasoleiman
Fri 8:50 a.m. - 8:53 a.m. | An Adaptive Deep Clustering Pipeline to Inform Text Labeling at Scale (Short Talk) | Xinyu Chen · Ian Beaver
Fri 9:00 a.m. - 9:45 a.m. | Panel Discussion with Morning Speakers (Discussion Panel)
Fri 9:45 a.m. - 10:30 a.m. | Lunch Break
Fri 10:30 a.m. - 11:00 a.m. | An Open Conversation about Joint Data and Model Quality: Metrics, Tools, Practices (Keynote) | Besmira Nushi
Fri 11:00 a.m. - 11:15 a.m. | Evaluation of ML in Health/Science (Invited Talk) | James Zou
Fri 11:15 a.m. - 11:30 a.m. | Time Value of Data and AI Strategy (Invited Talk) | Ehsan Valavi
Fri 11:30 a.m. - 11:45 a.m. | Assessing Quality of Information without Ground Truth (Invited Talk) | Yiling Chen
Fri 11:45 a.m. - 12:00 p.m. | Ethical Challenges of Data Collection & Use in Machine Learning Research (Invited Talk) | Deborah Raji
Fri 12:00 p.m. - 12:15 p.m. | Coffee Break
Fri 12:15 p.m. - 12:30 p.m. | Challenges and Opportunities in Handling Data Distributional Shift (Invited Talk) | Sharon Li
Fri 12:30 p.m. - 12:45 p.m. | Less Data Can Be More! (Invited Talk) | Baharan Mirzasoleiman
Fri 12:45 p.m. - 12:55 p.m. | What Can Data-Centric AI Learn from Data Engineering? (Invited Talk) | Matei Zaharia
Fri 12:55 p.m. - 1:00 p.m. | Dynabench <3 DataPerf (DataPerf Talk) | Douwe Kiela
Fri 1:00 p.m. - 1:05 p.m. | Data Valuation (DataPerf Talk) | Newsha Ardalani
Fri 1:05 p.m. - 1:10 p.m. | Datasets 2030 (DataPerf Talk) | Peter Mattson
Fri 1:10 p.m. - 1:15 p.m. | Short Break
Fri 1:15 p.m. - 1:18 p.m. | Model-Agnostic Label Quality Scoring to Detect Real-World Label Errors (Short Talk) | Jonas Mueller
Fri 1:18 p.m. - 1:21 p.m. | Open Coding for Machine Learning Data (Short Talk) | Magdalena Price · Dylan Hadfield-Menell
Fri 1:21 p.m. - 1:24 p.m. | Radically Lower Data-Labeling Costs for Document Extraction Models with Selective Labeling (Short Talk) | Yichao Zhou
Fri 1:24 p.m. - 1:27 p.m. | Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features (Short Talk) | Qingrui Jia · Xuhong Li
Fri 1:27 p.m. - 1:30 p.m. | FORML: Learning to Reweight Data for Fairness (Short Talk) | Skyler Seto
Fri 1:30 p.m. - 1:33 p.m. | GSCLIP: A Framework for Explaining Distribution Shifts in Natural Language (Short Talk) | Zhiying Zhu · Weixin Liang · James Zou
Fri 1:33 p.m. - 1:36 p.m. | Robust Synthetic GNN Benchmarks with GraphWorld (Short Talk) | John Palowitch
Fri 1:36 p.m. - 1:39 p.m. | Stochastic Perturbations of Tabular Features for Non-Deterministic Inference with Automunge (Short Talk) | Nicholas Teague
Fri 1:39 p.m. - 1:42 p.m. | MRCLens: an MRC Dataset Bias Detection Toolkit (Short Talk) | Haohan Wang
Fri 1:42 p.m. - 1:45 p.m. | FairGen: Fair Synthetic Data Generation (Short Talk) | Bhushan Chaudhari · Aakash Agarwal
Fri 1:45 p.m. - 1:48 p.m. | LAVA: Language Audio Vision Alignment for Data-Efficient Contrastive Learning on Video Data (Short Talk) | Sumanth Gurram
Fri 1:48 p.m. - 1:51 p.m. | Infinite Recommendation Networks: A Data-Centric Approach (Short Talk) | Noveen Sachdeva · Carole-Jean Wu · Julian McAuley
Fri 1:51 p.m. - 1:54 p.m. | Revisiting Hotels-50K and Hotel-ID (Short Talk) | Aarash Feizi
Fri 1:54 p.m. - 1:57 p.m. | TMED 2: A Dataset for Semi-Supervised Classification of Echocardiograms (Short Talk) | Michael Hughes
Fri 1:57 p.m. - 2:00 p.m. | GreenDB - A Data Set and Benchmark for Extraction of Sustainability Information of Consumer Goods (Short Talk) | Sebastian Jäger
Fri 2:00 p.m. - 2:03 p.m. | DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations (Short Talk) | Yatao Bian
Fri 2:15 p.m. - 3:00 p.m. | Panel Discussion with Afternoon Speakers (Discussion Panel)