Systematic quantification of data quality is critical for consistent model performance. Prior works have focused on out-of-distribution data. Instead, we tackle an understudied yet equally important problem of characterizing incongruous regions of in-distribution (ID) data, which may arise from feature space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE: a data-centric AI framework to identify these regions, independent of a task-specific model. Data-SUITE leverages copula modeling, representation learning, and conformal prediction to build feature-wise confidence interval estimators based on a set of training instances. These estimators can be used to evaluate the congruence of test instances with respect to the training set, to answer two practically useful questions: (1) which test instances will be reliably predicted by a model trained with the training instances? and (2) can we identify incongruous regions of the feature space so that data owners understand the data's limitations or guide future data collection? We empirically validate Data-SUITE's performance and coverage guarantees and demonstrate on cross-site medical data, biased data, and data with concept drift, that Data-SUITE best identifies ID regions where a downstream model may be reliable (independent of said model). We also illustrate how these identified regions can provide insights into datasets and highlight their limitations.
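The feature-wise interval idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the copula and representation-learning stages entirely and swaps in a plain least-squares regressor that predicts each feature from the remaining ones, followed by standard split conformal prediction on a held-out calibration set. The function name `conformal_feature_intervals`, the helper `_design`, the synthetic data, and the miscoverage level `alpha` are all assumptions made for illustration.

```python
import numpy as np

def _design(X):
    # Append an intercept column to the feature matrix.
    return np.column_stack([X, np.ones(len(X))])

def conformal_feature_intervals(train, calib, test, alpha=0.1):
    """Split-conformal intervals for every feature of every test row.

    Each feature j is regressed on the other features (assumption: a
    linear model stands in for whatever estimator one prefers).
    """
    n_cal, d = calib.shape
    lower = np.empty(test.shape)
    upper = np.empty(test.shape)
    # Rank of the calibration residual used as the interval half-width,
    # with the usual finite-sample (n + 1) correction.
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))
    for j in range(d):
        others = [c for c in range(d) if c != j]
        coef, *_ = np.linalg.lstsq(_design(train[:, others]), train[:, j], rcond=None)
        resid = np.sort(np.abs(calib[:, j] - _design(calib[:, others]) @ coef))
        q = resid[k - 1] if k <= n_cal else np.inf
        pred = _design(test[:, others]) @ coef
        lower[:, j], upper[:, j] = pred - q, pred + q
    return lower, upper

# Synthetic in-distribution data: the second feature depends on the first.
rng = np.random.default_rng(0)

def sample(n):
    x0 = rng.normal(size=n)
    x1 = 2.0 * x0 + 0.1 * rng.normal(size=n)
    return np.column_stack([x0, x1])

train, calib, test = sample(500), sample(500), sample(200)
test[0, 1] += 5.0  # make one test row incongruous with the training data

lower, upper = conformal_feature_intervals(train, calib, test, alpha=0.1)
# A feature value outside its interval marks the row as incongruous.
flags = (test < lower) | (test > upper)
print("row 0 flagged:", flags[0].any())
print("fraction flagged among the other rows:", flags[1:].mean())
```

In this toy setting the perturbed row falls far outside its intervals, while the untouched rows are flagged at roughly the chosen rate `alpha`. How the per-feature flags are aggregated into a per-instance decision is left open here.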
Author Information
Nabeel Seedat (University of Cambridge)
Jonathan Crabbé (University of Cambridge)
Mihaela van der Schaar (University of Cambridge and UCLA)

Professor van der Schaar is John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge, a Turing Faculty Fellow at The Alan Turing Institute in London, and Chancellor's Professor at UCLA. She was elected IEEE Fellow in 2009. She has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), an NSF Career Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award. She holds 35 granted USA patents. In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the female researcher based in the UK with the most publications in the field of AI. She was also elected as a 2019 "Star in Computer Networking and Communications". Her current research focus is on machine learning, AI and operations research for healthcare and medicine. For more details, see her website: http://www.vanderschaar-lab.com/
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Spotlight: Data-SUITE: Data-centric identification of in-distribution incongruous examples
  Thu. Jul 21st 08:00 -- 08:05 PM Room 318 - 320
More from the Same Authors
- 2023 Poster: Differentiable and Transportable Structure Learning
  Jeroen Berrevoets · Nabeel Seedat · Fergus Imrie · Mihaela van der Schaar
- 2022 Poster: Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations
  Nabeel Seedat · Fergus Imrie · Alexis Bellot · Zhaozhi Qian · Mihaela van der Schaar
- 2022 Spotlight: Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations
  Nabeel Seedat · Fergus Imrie · Alexis Bellot · Zhaozhi Qian · Mihaela van der Schaar
- 2022 Poster: Label-Free Explainability for Unsupervised Models
  Jonathan Crabbé · Mihaela van der Schaar
- 2022 Spotlight: Label-Free Explainability for Unsupervised Models
  Jonathan Crabbé · Mihaela van der Schaar
- 2021 : Mihaela Van der Schaar: Time-series in healthcare: challenges and solutions
  Mihaela van der Schaar
- 2021 : Quantitative epistemology: conceiving a new human-machine partnership
  Mihaela van der Schaar
- 2021 Poster: Explaining Time Series Predictions with Dynamic Masks
  Jonathan Crabbé · Mihaela van der Schaar
- 2021 Spotlight: Explaining Time Series Predictions with Dynamic Masks
  Jonathan Crabbé · Mihaela van der Schaar
- 2021 : Synthetic Healthcare Data Generation and Assessment: Challenges, Methods, and Impact on Machine Learning
  Ahmed M. Alaa · Mihaela van der Schaar
- 2020 : Panel Discussion
  Neil Lawrence · Mihaela van der Schaar · Alex Smola · Valerio Perrone · Jack Parker-Holder · Zhengying Liu
- 2020 : "Automated ML and its transformative impact on medicine and healthcare" by Mihaela van der Schaar
  Mihaela van der Schaar
- 2020 : Invited Talk: Learning despite the unknown - missing data imputation in healthcare
  Mihaela van der Schaar
- 2019 Poster: Validating Causal Inference Models via Influence Functions
  Ahmed Alaa · Mihaela van der Schaar
- 2019 Oral: Validating Causal Inference Models via Influence Functions
  Ahmed Alaa · Mihaela van der Schaar
- 2018 Poster: AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning
  Ahmed M. Alaa · Mihaela van der Schaar
- 2018 Oral: AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning
  Ahmed M. Alaa · Mihaela van der Schaar
- 2018 Poster: Limits of Estimating Heterogeneous Treatment Effects: Guidelines for Practical Algorithm Design
  Ahmed M. Alaa · Mihaela van der Schaar
- 2018 Oral: Limits of Estimating Heterogeneous Treatment Effects: Guidelines for Practical Algorithm Design
  Ahmed M. Alaa · Mihaela van der Schaar
- 2017 Poster: Learning from Clinical Judgments: Semi-Markov-Modulated Marked Hawkes Processes for Risk Prognosis
  Ahmed M. Alaa · Scott B Hu · Mihaela van der Schaar
- 2017 Talk: Learning from Clinical Judgments: Semi-Markov-Modulated Marked Hawkes Processes for Risk Prognosis
  Ahmed M. Alaa · Scott B Hu · Mihaela van der Schaar