Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches drawn at random from the same dataset should share the same distribution, we leverage optimal transport (OT) distances to quantify that criterion and turn it into a loss function for imputing missing data values. We propose practical methods to minimize these losses using end-to-end learning, with or without parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
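The batch criterion described in the abstract can be illustrated with a small sketch. The abstract does not say which OT distance or cost is used, so the code below assumes an entropy-regularized Sinkhorn divergence with a squared Euclidean cost and uniform weights. The helper names (`entropic_ot`, `sinkhorn_divergence`, `batch_loss`) and the parameters `eps` and `n_iter` are hypothetical choices made for illustration, not the authors' implementation. The sketch only evaluates the loss for a mean-imputed matrix. Minimizing it over the imputed entries, for example with automatic differentiation, is not shown.

```python
import numpy as np


def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def entropic_ot(x, y, eps=1.0, n_iter=200):
    # Entropy-regularized OT cost between two uniformly weighted point clouds
    # (assumed squared Euclidean cost), via log-domain Sinkhorn iterations.
    n, m = x.shape[0], y.shape[0]
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_a, log_b = -np.log(n), -np.log(m)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        f = -eps * _logsumexp((g[None, :] - cost) / eps + log_b, axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + log_a, axis=0)
    # Dual objective evaluated at the current potentials.
    return f.mean() + g.mean()


def sinkhorn_divergence(x, y, eps=1.0, n_iter=200):
    # Debiased divergence: equals zero when both batches are identical.
    return (entropic_ot(x, y, eps, n_iter)
            - 0.5 * entropic_ot(x, x, eps, n_iter)
            - 0.5 * entropic_ot(y, y, eps, n_iter))


def batch_loss(filled, batch_size, rng, eps=1.0):
    # Draw two disjoint random batches from the imputed data and compare them.
    idx = rng.permutation(filled.shape[0])
    b1, b2 = idx[:batch_size], idx[batch_size:2 * batch_size]
    return sinkhorn_divergence(filled[b1], filled[b2], eps)


# Toy data with an MCAR mask (30% of entries hidden), imputed by column means.
rng = np.random.default_rng(0)
data = rng.normal(size=(128, 3))
mask = rng.random(data.shape) < 0.3
observed = np.where(mask, np.nan, data)
col_means = np.nanmean(observed, axis=0)
filled = np.where(mask, col_means, observed)
loss = batch_loss(filled, batch_size=32, rng=rng)
```

In this sketch a lower `loss` means the two batches look more alike under the assumed divergence. An imputation method in the spirit of the abstract would treat the masked entries of `filled` as free variables and adjust them to reduce this quantity over many random batch pairs.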
Author Information
Boris Muzellec (ENSAE, Institut Polytechnique de Paris)
Julie Josse (Polytechnique)
Claire Boyer (LPSM, Sorbonne Université)
Marco Cuturi (Google)
More from the Same Authors
- 2023 Poster: Conformal Prediction with Missing Values
  Margaux Zaffran · Aymeric Dieuleveut · Julie Josse · Yaniv Romano
- 2023 Poster: Naive imputation implicitly regularizes high-dimensional linear models
  Alexis Ayme · Claire Boyer · Aymeric Dieuleveut · Erwan Scornet
- 2022 Poster: Near-optimal rate of consistency for linear models with missing values
  Alexis Ayme · Claire Boyer · Aymeric Dieuleveut · Erwan Scornet
- 2022 Spotlight: Near-optimal rate of consistency for linear models with missing values
  Alexis Ayme · Claire Boyer · Aymeric Dieuleveut · Erwan Scornet
- 2022 Poster: Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs
  Meyer Scetbon · Gabriel Peyré · Marco Cuturi
- 2022 Spotlight: Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs
  Meyer Scetbon · Gabriel Peyré · Marco Cuturi
- 2021 Poster: Analyzing the tree-layer structure of Deep Forests
  Ludovic Arnould · Claire Boyer · Erwan Scornet
- 2021 Spotlight: Analyzing the tree-layer structure of Deep Forests
  Ludovic Arnould · Claire Boyer · Erwan Scornet
- 2021 Poster: Low-Rank Sinkhorn Factorization
  Meyer Scetbon · Marco Cuturi · Gabriel Peyré
- 2021 Spotlight: Low-Rank Sinkhorn Factorization
  Meyer Scetbon · Marco Cuturi · Gabriel Peyré
- 2020 Workshop: Learning with Missing Values
  Julie Josse · Jes Frellsen · Pierre-Alexandre Mattei · Gael Varoquaux
- 2020 : Opening Session
  Julie Josse · Jes Frellsen · Pierre-Alexandre Mattei · Gael Varoquaux
- 2020 Poster: Regularized Optimal Transport is Ground Cost Adversarial
  François-Pierre Paty · Marco Cuturi
- 2020 Poster: Supervised Quantile Normalization for Low Rank Matrix Factorization
  Marco Cuturi · Olivier Teboul · Jonathan Niles-Weed · Jean-Philippe Vert
- 2020 Poster: Debiased Sinkhorn barycenters
  Hicham Janati · Marco Cuturi · Alexandre Gramfort
- 2019 Poster: Subspace Robust Wasserstein Distances
  François-Pierre Paty · Marco Cuturi
- 2019 Oral: Subspace Robust Wasserstein Distances
  François-Pierre Paty · Marco Cuturi