ICML DuckDQ: Data Quality Assertions for Machine Learning Pipelines

Contributed Talk in Workshop: Challenges in Deploying and Monitoring Machine Learning Systems

Till Döhmen

Abstract:

Data quality validation plays an important role in ensuring the proper behaviour of production machine learning (ML) applications and services. Observing a lack of existing solutions for quality control in medium-sized production systems, we developed DuckDQ: a lightweight and efficient Python library for data quality validation that seamlessly integrates with existing scikit-learn ML pipelines and does not require a distributed computing environment or ML platform infrastructure, while outperforming existing solutions by a factor of 3 to 40 in terms of runtime. We introduce the notion of data quality assertions, which can stop a pipeline when quality constraints on the input data or the model's output are not met. Furthermore, we employ stateful metric computations, which greatly enhance the possibilities for post-hoc failure analysis and drift detection, even when the serving data is no longer available.
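The two ideas in the abstract, assertions that halt a pipeline and metrics that are persisted for later analysis, can be illustrated with a minimal sketch in plain Python. This does not use DuckDQ's actual API: the names `Constraint`, `DataQualityError`, `completeness`, and `assert_quality` are hypothetical and chosen only for illustration, and the in-memory `history` dictionary stands in for whatever metric store a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


class DataQualityError(Exception):
    """Raised when a constraint is violated, which halts the calling pipeline."""


@dataclass
class Constraint:
    """Hypothetical constraint: a named metric plus a predicate on its value."""
    name: str
    metric: Callable[[Sequence[Optional[float]]], float]
    check: Callable[[float], bool]


def completeness(values: Sequence[Optional[float]]) -> float:
    """Fraction of non-missing entries in a column."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def assert_quality(
    column: Sequence[Optional[float]],
    constraints: List[Constraint],
    history: Dict[str, List[float]],
) -> None:
    """Compute each metric, record it, and raise if its constraint fails."""
    for constraint in constraints:
        value = constraint.metric(column)
        # Record the metric before checking it, so the value survives
        # even when the assertion fails and the raw data is discarded.
        history.setdefault(constraint.name, []).append(value)
        if not constraint.check(value):
            raise DataQualityError(
                f"{constraint.name} = {value:.2f} violates its constraint"
            )


if __name__ == "__main__":
    history: Dict[str, List[float]] = {}
    constraints = [Constraint("completeness", completeness, lambda v: v >= 0.9)]

    assert_quality([1.0, 2.0, 3.0, 4.0], constraints, history)  # passes
    try:
        assert_quality([1.0, None, None, 4.0], constraints, history)
    except DataQualityError as err:
        print("pipeline stopped:", err)
    print(history)
```

Because each metric value is appended to `history` before the check runs, a later drift or failure analysis can work from the stored metrics alone, without access to the original serving data.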

Authors: Till Döhmen (Fraunhofer FIT), Mark Raasveldt (CWI), Hannes Mühleisen (Centrum Wiskunde & Informatica), Sebastian Schelter (University of Amsterdam)
