Noise plagues many numerical datasets: recorded values may fail to match the true underlying values because of faulty sensors, data entry or processing mistakes, or imperfect human estimates. Here we consider estimating which values in a numerical column are incorrect. We present a model-agnostic approach that can utilize any regressor (i.e. statistical or machine learning model) fit to predict the values in this column from the other variables in the dataset. By accounting for various uncertainties, our approach distinguishes between genuine anomalies and natural data fluctuations, conditioned on the available information in the dataset. We establish theoretical guarantees for our method and show that other approaches, such as conformal inference, struggle to detect errors. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and in additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.
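The general idea of scoring values with a fitted regressor can be illustrated with a minimal sketch. This is not the authors' scoring rule, which the abstract does not spell out. It assumes an ordinary least-squares model as a stand-in for "any regressor", out-of-fold predictions so that no value is scored by a model that saw it, and a single global noise scale taken from the median absolute deviation of the residuals. The function names `out_of_fold_predictions` and `error_scores` and the synthetic data are hypothetical.

```python
import numpy as np

def out_of_fold_predictions(X, y, n_folds=5, seed=0):
    """Predict each value with a least-squares model that never saw it (stand-in regressor)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    Xb = np.column_stack([np.ones(n), X])  # prepend an intercept column
    preds = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        coef, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
        preds[held_out] = Xb[held_out] @ coef
    return preds

def error_scores(y, preds):
    """Absolute residuals divided by a robust global noise scale (assumed, MAD-based)."""
    residuals = y - preds
    mad = np.median(np.abs(residuals - np.median(residuals)))
    return np.abs(residuals) / (1.4826 * mad + 1e-12)

# Synthetic data (assumption): linear signal with small noise and three injected errors.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y_true = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
y_observed = y_true.copy()
corrupted = np.array([5, 50, 150])
y_observed[corrupted] += 10.0  # gross errors in the recorded values

scores = error_scores(y_observed, out_of_fold_predictions(X, y_observed))
flagged = np.argsort(scores)[-3:]  # indices of the three most suspicious values
```

A single global scale cannot separate an erroneous value from a region where the data are naturally noisier. That distinction is what the abstract attributes to accounting for various uncertainties, so this sketch should be read as a baseline only.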
Author Information
Hang Zhou (UC Davis)
Jonas Mueller (Cleanlab)
Mayank Kumar (Cleanlab)
Jane-Ling Wang (UC Davis)
Jing Lei (Carnegie Mellon University)
More from the Same Authors
- 2021 : Multimodal AutoML on Structured Tables with Text Fields
  Xingjian Shi · Jonas Mueller · Nick Erickson · Mu Li · Alex Smola
- 2021 : Continuous Doubly Constrained Batch Reinforcement Learning
  Rasool Fakoor · Jonas Mueller · Kavosh Asadi · Pratik Chaudhari · Alex Smola
- 2022 : Adaptive Interest for Emphatic Reinforcement Learning
  Martin Klissarov · Rasool Fakoor · Jonas Mueller · Kavosh Asadi · Taesup Kim · Alex Smola
- 2022 : Back to the Basics: Revisiting Out-of-Distribution Detection Baselines
  Johnson Kuan · Jonas Mueller
- 2023 : How to Cope with Gradual Data Drift?
  Rasool Fakoor · Jonas Mueller · Zachary Lipton · Pratik Chaudhari · Alex Smola
- 2023 : Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors
  Jesse Cummings · Jonas Mueller · Elías Snorrason
- 2023 : Estimating label quality and errors in semantic segmentation data via any model
  Vedang Lad · Jonas Mueller
- 2023 : ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data
  Ulyana Tkachenko · Aditya Thyagarajan · Jonas Mueller
- 2022 : Model-Agnostic Label Quality Scoring to Detect Real-World Label Errors
  Jonas Mueller
- 2021 : Q&A Contributed Talk
  Jonas Mueller
- 2021 : Contributed Talk: Multimodal AutoML on Structured Tables with Text Fields
  Jonas Mueller
- 2021 Poster: Deep Learning for Functional Data Analysis with Adaptive Basis Layers
  Junwen Yao · Jonas Mueller · Jane-Ling Wang
- 2021 Spotlight: Deep Learning for Functional Data Analysis with Adaptive Basis Layers
  Junwen Yao · Jonas Mueller · Jane-Ling Wang
- 2020 : 1.2 AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data
  Jonas Mueller
- 2020 Poster: Educating Text Autoencoders: Latent Representation Guidance via Denoising
  Tianxiao Shen · Jonas Mueller · Regina Barzilay · Tommi Jaakkola