Learning with Missing Values

Julie Josse, Jes Frellsen, Pierre-Alexandre Mattei, Gael Varoquaux

Keywords: Missing values, Matrix Completion, Record Linkage, Graphical models, Selection Bias


Analysis of large amounts of data offers new opportunities to understand many processes better. Yet data accumulation often implies relaxing acquisition procedures or combining diverse sources, leading to many observations with missing features. From questionnaires to collaborative filtering, from electronic health records to single-cell analysis, missingness is at play everywhere and is the norm rather than the exception. Even “clean” data sets are often barely “cleaned” versions of incomplete data sets, with all the unfortunate biases this cleaning process may have created.

Despite this ubiquity, the handling of missing values is often overlooked. Missing values pose many challenges, and there is a vast literature in the statistical community, with many implementations available. Yet many open issues remain, and there is a need to design new methods or introduce new points of view: for missing values in a supervised-learning setting, for deep learning architectures, for adapting available methods to high-dimensional observed data with different types of missing values, and for dealing with feature mismatch and distribution mismatch. Missing data is one of the eight pillars of causal wisdom for Judea Pearl, who used graphical-model reasoning to tackle some missing-not-at-random settings.

To the best of our knowledge, this is the first workshop at the major machine learning conferences in recent years to focus primarily on missing value problems. The goal of our workshop is to give more momentum and exposure to research on missing values, both theoretical and methodological, and to emphasize the connections with other areas of machine learning (e.g., causal inference, generative modelling, uncertainty quantification, transfer learning, and distributional shift). We will also attach importance to discussing the reproducibility problems that missing data can cause, the danger of overlooking missing-value issues, and the importance of providing sound implementations.

We welcome both academic and industrial practitioners/researchers. In particular, since missing data is a critical issue in many applications, we would like to federate industrial/applied know-how and various academic approaches.



Fri 1:45 a.m. - 2:00 a.m.
Opening Session (Discussion)
Julie Josse, Jes Frellsen, Pierre-Alexandre Mattei, Gael Varoquaux
Fri 2:00 a.m. - 3:00 a.m.

Please do not share or post Zoom links

A Random Matrix Analysis of Learning with α-Dropout
Mohamed El Amine Seddik, Romain Couillet, Mohamed Tamaazousti

Visna---Visualising Multivariate Missing Values
Antony Unwin, Alexander Pilhoefer

Multi-output prediction of global vegetation distribution with incomplete data
Rita Beigaite, Jesse Read, Indre Zliobaite

Path Imputation Strategies for Signature Models
Michael Moor, Max Horn, Christian Bock, Karsten Borgwardt, Bastian Rieck

Lung Segmentation from Chest X-rays using Variational Data Imputation
Raghavendra Selvan, Erik Dam, Nicki Skafte Detlefsen, Sofus Rischel, Kaining Sheng, Mads Nielsen, Akshay Pai

Clustering Data with nonignorable Missingness using Semi-Parametric Mixture Models
Marie Du Roy de Chaumaray, Matthieu Marbac

Estimating conditional density of missing values using deep Gaussian mixture model
Marcin Przewięźlikowski, Marek Śmieja, Łukasz Struski

Missing the Point: Non-Convergence in Iterative Imputation Algorithms
Hanne I. Oberman, Stef van Buuren, Gerko Vink

The Dynamic Latent Block Model for Sparse and Evolving Count Matrices
Giulia Marchello, Marco Corneli, Charles Bouveyron

Predicting Feature Imputability in the Absence of Ground Truth
Niamh McCombe, Xuemei Ding, Girijesh Prasad, David P Finn, Stephen Todd, Paula L McClean, Kongfatt Wong-Lin

Missing rating imputation based on product reviews via deep latent variable models
Dingge Liang, Marco Corneli, Pierre Latouche, Charles Bouveyron

Inferring Causal Dependencies between Chaotic Dynamical Systems from Sporadic Time Series
Edward De Brouwer, Adam Arany, Jaak Simm, Yves Moreau

The impact of incomplete data on quantile regression for longitudinal data
Anneleen Verhasselt, Alvaro José Flórez, Ingrid Van Keilegom, Geert Molenberghs

Multi-label Learning with Missing Values using Combined Facial Action Unit Datasets
Jaspar Pahl, Ines Rieger, Dominik Seuss

A Study on Intentional-Value-Substitution Training for Regression with Incomplete Information
Takuya Fukushima, Tomoharu Nakashima, Taku Hasegawa, Vicenç Torra

How to miss data? Reinforcement learning for environments with high observation cost
Mehmet Koseoglu, Ayca Ozcelikkale

How to deal with missing data in supervised deep learning?
Niels Bruun Ipsen, Pierre-Alexandre Mattei, Jes Frellsen

VAEM: a Deep Generative Model for Heterogeneous Mixed Type Data
Chao Ma, Sebastian Tschiatschek, Richard E. Turner, José Miguel Hernández-Lobato, Cheng Zhang

Working with Deep Generative Models and Tabular Data Imputation
Ramiro Camino, Christian Hammerschmidt, Radu State

Fri 4:30 a.m. - 5:10 a.m.

Mihaela van der Schaar
Fri 5:10 a.m. - 5:50 a.m.

Missing data imputation forms the first critical step of many data analysis pipelines. The challenge is greatest for mixed data sets, including real, Boolean, and ordinal data, where standard techniques for imputation fail basic sanity checks: for example, the imputed values may not follow the same distributions as the data. This talk introduces a new semiparametric algorithm to impute missing values, with no tuning parameters. The algorithm models mixed data as a Gaussian copula. This model can fit arbitrary marginals for continuous variables and can handle ordinal variables with many levels, including Boolean variables as a special case. We develop an efficient approximate EM algorithm to estimate copula parameters from incomplete mixed data. The resulting model reveals the statistical associations among variables. Experimental results on several synthetic and real datasets show the superiority of the proposed algorithm to state-of-the-art imputation algorithms for mixed data.

Madeleine Udell
Fri 5:50 a.m. - 6:30 a.m.
Discussion and Q&A by Gael Varoquaux, Julie Josse and Pierre-Alexandre Mattei (Discussion Panel)
Fri 6:30 a.m. - 7:10 a.m.

In many real-world problems we have to make predictions from feature vectors with missing values. However, we may also be able to observe some of the missing values in the feature vector at a cost. Given the currently observed values, how can we decide which missing values to observe next so that prediction accuracy increases as fast as possible as a function of the observation cost? This problem appears in many different application areas, including medical diagnosis, surveys, recommender systems, and insurance. In this talk, I will describe how to solve the problem using an information-theoretic approach and novel variational autoencoder models that can effectively deal with missing data.
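As a toy illustration of the question posed in this abstract, the following minimal sketch scores candidate features by mutual information with the label per unit of observation cost. It assumes discrete features with known joint probability tables; the talk instead relies on variational autoencoders to estimate such quantities, which is not reproduced here. The function names, the noisy-copy features, and the costs are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_feature = joint.sum(axis=1, keepdims=True)   # marginal over rows
    p_label = joint.sum(axis=0, keepdims=True)     # marginal over columns
    independent = p_feature @ p_label              # product of marginals
    mask = joint > 0                               # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / independent[mask])))

def noisy_copy_joint(flip_prob):
    """Joint table of (feature, label): uniform binary label, feature is the
    label flipped with probability flip_prob (a hypothetical toy feature)."""
    return 0.5 * np.array([[1.0 - flip_prob, flip_prob],
                           [flip_prob, 1.0 - flip_prob]])

def next_feature_to_observe(candidates):
    """Pick the unobserved feature with the largest information gain per unit cost."""
    scores = {name: mutual_information(joint) / cost
              for name, (joint, cost) in candidates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical candidates: (joint table with the label, observation cost).
candidates = {
    "x1": (noisy_copy_joint(0.1), 2.0),  # informative but more expensive
    "x2": (noisy_copy_joint(0.4), 1.0),  # cheap but nearly uninformative
}
best, scores = next_feature_to_observe(candidates)
print(best, scores)
```

In this toy setting the more expensive feature is still selected, because its information gain outweighs the cost difference; other cost ratios could reverse the choice.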

Jose Miguel Hernandez-Lobato
Fri 7:10 a.m. - 7:50 a.m.

Missing values are everywhere, and I’ve been dealing with them one way or another for many years. Recently I’ve been doing research in interpretable machine learning. To my surprise, interpretable machine learning has completely changed how I work with missing values. Interpretable learning provides new methods for detecting, understanding, and modeling missing values. In the presentation I’ll show a few surprises where interpretability reveals the impact that missing values have been having on our machine learning models all along, an impact that is only visible now thanks to interpretable methods.

Rich Caruana
Fri 7:50 a.m. - 8:30 a.m.
Discussion and Q&A by Gael Varoquaux and Jes Frellsen (Discussion)
Fri 8:30 a.m. - 9:10 a.m.

Optimal recovery of missing values for non-negative matrix factorization: A probabilistic error bound
Rebecca Chen, Lav R. Varshney

Causal Discovery in the Presence of Missing Values for Neuropathic Pain Diagnosis
Ruibo Tu, Kun Zhang, Bo Christer Bertilson, Clark Glymour, Hedvig Kjellström, Cheng Zhang

Does imputation matter? Benchmark for real-life classification problems.
Katarzyna Woźnica, Przemyslaw Biecek

VAEs in the Presence of Missing Data
Mark Collier, Alfredo Nazabal, Chris Williams

Variance estimation after Kernel Ridge Regression Imputation
Hengfang Wang, Jae Kwang Kim

Online Mixed Missing Value Imputation Using Gaussian Copula
Eric Landgrebe, Yuxuan Zhao, Madeleine Udell

Imputation of Missing Behavioral Measures in Connectome-based Predictive Modelling
Qinghao Liang, Dustin Scheinost

Handling Missing Data in Decision Trees: A Probabilistic Approach
Pasha Khosravi, Antonio Vergari, YooJung Choi, Yitao Liang, Guy Van den Broeck

Processing of incomplete images by (graph) convolutional neural networks
Tomasz Danel, Marek Śmieja, Łukasz Struski, Przemysław Spurek, Lukasz Maziarka

Conditioning on "and nothing else": Simple Models of Missing Data between Naive Bayes and Logistic Regression
David Poole, Ali Mohammad Mehr, Wan Shing Martin Wang

Multi-Time Attention Networks for Irregularly Sampled Time Series
Satya Narayan Shukla, Benjamin Marlin

Information Theoretic Approaches for Testing Missingness in Predictive Models
Shreyas A Bhave, Rajesh Ranganath, Adler Perotte

Fri 9:10 a.m. - 9:50 a.m.

“Missingness Graphs” (m-graphs) are causal graphical models used for processing missing data. They portray the causal mechanisms responsible for missingness and thus encode knowledge about the underlying process that generates data. Using m-graphs, we develop methods to determine whether there exists a consistent estimator for a given quantity of interest, such as joint distributions, conditional distributions, and causal effects. Our methods apply to all types of missing data, including the notorious and relatively unexplored NMAR (Not Missing At Random) category. We further address the question of testability, i.e., whether and how an assumed model can be subjected to statistical tests, given the missingness in the data. Viewing the missing data problem from a causal perspective has ushered in several surprises, such as recoverability when variables are causes of their own missingness, testability of MAR models, and the indispensability of causal assumptions for handling missing data problems.

Karthika Mohan
Fri 9:50 a.m. - 10:30 a.m.

We study a class of missingness mechanisms, referred to as sequentially additive nonignorable, for modelling multivariate data with item nonresponse. These mechanisms explicitly allow the probability of nonresponse for each variable to depend on the value of that variable, thereby representing nonignorable missingness mechanisms. These missing data models are identified by making use of auxiliary information on marginal distributions, such as marginal probabilities for multivariate categorical variables or moments for numeric variables. We prove identification results and illustrate the use of these mechanisms in an application.
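To make the role of auxiliary marginal information concrete, here is a minimal univariate sketch. It is not the sequentially additive mechanism studied in the talk: it assumes a single binary variable whose response probability depends on its own value, plus a known marginal P(X = 1), and recovers the response rates with Bayes' rule. The function name and all numbers are hypothetical.

```python
import numpy as np

def response_rates_from_marginal(x_obs, n_total, p_x1):
    """Recover P(R=1 | X=0) and P(R=1 | X=1) for a binary X, using the
    respondents' values and a known marginal P(X=1) (auxiliary information)."""
    p_r = len(x_obs) / n_total              # overall response rate P(R=1)
    p_x1_given_r = float(np.mean(x_obs))    # P(X=1 | R=1) among respondents
    # Bayes' rule: P(R=1 | X=x) = P(X=x | R=1) * P(R=1) / P(X=x)
    rate_x1 = p_x1_given_r * p_r / p_x1
    rate_x0 = (1.0 - p_x1_given_r) * p_r / (1.0 - p_x1)
    return rate_x0, rate_x1

rng = np.random.default_rng(0)
n = 200_000
p_x1 = 0.3                                   # assumed known from an external source
x = rng.random(n) < p_x1                     # true binary values
true_rates = np.array([0.9, 0.5])            # P(R=1 | X=0), P(R=1 | X=1)
responded = rng.random(n) < true_rates[x.astype(int)]
x_obs = x[responded].astype(float)

naive = x_obs.mean()                         # biased estimate of P(X=1)
rate_x0, rate_x1 = response_rates_from_marginal(x_obs, n, p_x1)
print(f"naive mean: {naive:.3f}, known marginal: {p_x1}")
print(f"recovered response rates: X=0 -> {rate_x0:.3f}, X=1 -> {rate_x1:.3f}")
```

Without the auxiliary marginal, the respondents alone cannot distinguish a low prevalence of X = 1 from a low response rate among units with X = 1; the known marginal is what pins the mechanism down in this toy case.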


Mauricio Sadinle
Fri 10:30 a.m. - 11:10 a.m.
Discussion and Q&A by Ilya Shpitser - Identifiability of the full law in graphical missing data models (Discussion Panel)
Fri 11:10 a.m. -
Informal gathering with drinks to celebrate (Discussion Panel)