Learning with Missing Values

Workshop

Julie Josse · Jes Frellsen · Pierre-Alexandre Mattei · Gael Varoquaux

Keywords: Graphical Models · Missing Values · Matrix Completion · Record Linkage · Selection Bias

Analysis of large amounts of data offers new opportunities to understand many processes better. Yet, data accumulation often implies relaxing acquisition procedures or compounding diverse sources, leading to many observations with missing features. From questionnaires to collaborative filtering, from electronic health records to single-cell analysis, missingness is at play everywhere and is the norm rather than the exception. Even “clean” data sets are often barely “cleaned” versions of incomplete data sets, with all the unfortunate biases this cleaning process may have created.

Despite this ubiquity, the handling of missing values is often overlooked. Missing values pose many challenges, and there is a vast literature in the statistical community, with many implementations available. Yet many issues remain open, and there is a need to design new methods and to introduce new points of view: for missing values in a supervised-learning setting and in deep learning architectures, for adapting available methods to high-dimensional observed data with different types of missing values, and for dealing with feature mismatch and distribution mismatch. For Judea Pearl, who brought graphical-model reasoning to bear on some missing-not-at-random problems, missing data is one of the eight pillars of causal wisdom.

To the best of our knowledge, this is the first workshop in recent years at a major machine learning conference to focus primarily on missing value problems. The goal of our workshop is to give more momentum and exposure to research on missing values, both theoretical and methodological, and to emphasize the connections with other areas of machine learning (e.g., causal inference, generative modelling, uncertainty quantification, transfer learning, and distributional shift). We will also attach importance to discussing the reproducibility problems that can be caused by missing data, the danger of neglecting missing-value issues, and the importance of providing sound implementations.

We welcome both academic and industrial practitioners/researchers. In particular, since missing data is a critical issue in many applications, we would like to federate industrial/applied know-how and various academic approaches.

Timezone: America/Los_Angeles

Schedule

Fri 1:45 a.m. - 2:00 a.m.	Opening Session ( Discussion )	Julie Josse · Jes Frellsen · Pierre-Alexandre Mattei · Gael Varoquaux
Fri 2:00 a.m. - 3:00 a.m.	Poster session 1 ( Posters ). Please do not share or post Zoom links.
	A Random Matrix Analysis of Learning with α-Dropout	Mohamed El Amine Seddik, Romain Couillet, Mohamed Tamaazousti
	Visna - Visualising Multivariate Missing Values	Antony Unwin, Alexander Pilhoefer
	Multi-output prediction of global vegetation distribution with incomplete data	Rita Beigaite, Jesse Read, Indre Zliobaite
	Path Imputation Strategies for Signature Models	Michael Moor, Max Horn, Christian Bock, Karsten Borgwardt, Bastian Rieck
	Lung Segmentation from Chest X-rays using Variational Data Imputation	Raghavendra Selvan, Erik Dam, Nicki Skafte Detlefsen, Sofus Rischel, Kaining Sheng, Mads Nielsen, Akshay Pai
	Clustering Data with nonignorable Missingness using Semi-Parametric Mixture Models	Marie Du Roy de Chaumaray, Matthieu Marbac
	Estimating conditional density of missing values using deep Gaussian mixture model	Marcin Przewięźlikowski, Marek Śmieja, Łukasz Struski
	Missing the Point: Non-Convergence in Iterative Imputation Algorithms	Hanne I. Oberman, Stef van Buuren, Gerko Vink
	The Dynamic Latent Block Model for Sparse and Evolving Count Matrices	Giulia Marchello, Marco Corneli, Charles Bouveyron
	Predicting Feature Imputability in the Absence of Ground Truth	Niamh McCombe, Xuemei Ding, Girijesh Prasad, David P Finn, Stephen Todd, Paula L McClean, Kongfatt Wong-Lin
	Missing rating imputation based on product reviews via deep latent variable models	Dingge Liang, Marco Corneli, Pierre Latouche, Charles Bouveyron
	Inferring Causal Dependencies between Chaotic Dynamical Systems from Sporadic Time Series	Edward De Brouwer, Adam Arany, Jaak Simm, Yves Moreau
	The impact of incomplete data on quantile regression for longitudinal data	Anneleen Verhasselt, Alvaro José Flórez, Ingrid Van Keilegom, Geert Molenberghs
	Multi-label Learning with Missing Values using Combined Facial Action Unit Datasets	Jaspar Pahl, Ines Rieger, Dominik Seuss
	A Study on Intentional-Value-Substitution Training for Regression with Incomplete Information	Takuya Fukushima, Tomoharu Nakashima, Taku Hasegawa, Vicenç Torra
	How to miss data? Reinforcement learning for environments with high observation cost	Mehmet Koseoglu, Ayca Ozcelikkale
	How to deal with missing data in supervised deep learning?	Niels Bruun Ipsen, Pierre-Alexandre Mattei, Jes Frellsen
	VAEM: a Deep Generative Model for Heterogeneous Mixed Type Data	Chao Ma, Sebastian Tschiatschek, Richard E. Turner, José Miguel Hernández-Lobato, Cheng Zhang
	Working with Deep Generative Models and Tabular Data Imputation	Ramiro Camino, Christian Hammerschmidt, Radu State
Fri 4:30 a.m. - 5:10 a.m.	Invited Talk: Learning despite the unknown - missing data imputation in healthcare ( Talk ) SlidesLive video: https://slideslive.com/38930914/learning-despite-the-unknown-missing-data-imputation-in-healthcare?ref=account-folder-55866-folders	Mihaela van der Schaar
Fri 5:10 a.m. - 5:50 a.m.	Invited Talk: Imputing Missing Data with the Gaussian Copula ( Talk ) Missing data imputation forms the first critical step of many data analysis pipelines. The challenge is greatest for mixed data sets, including real, Boolean, and ordinal data, where standard techniques for imputation fail basic sanity checks: for example, the imputed values may not follow the same distributions as the data. This talk introduces a new semiparametric algorithm to impute missing values, with no tuning parameters. The algorithm models mixed data as a Gaussian copula. This model can fit arbitrary marginals for continuous variables and can handle ordinal variables with many levels, including Boolean variables as a special case. We develop an efficient approximate EM algorithm to estimate copula parameters from incomplete mixed data. The resulting model reveals the statistical associations among variables. Experimental results on several synthetic and real datasets show the superiority of the proposed algorithm to state-of-the-art imputation algorithms for mixed data.	Madeleine Udell
Fri 5:50 a.m. - 6:30 a.m.	Discussion and Q&A by Gael Varoquaux, Julie Josse and Pierre-Alexandre Mattei ( Discussion Panel )
Fri 6:30 a.m. - 7:10 a.m.	Invited Talk: Efficient Missing-value Acquisition with Variational Autoencoders ( Talk ) In many real-world problems we have to make predictions from feature vectors with missing values. However, we may also be able to observe some of the missing values in the feature vector at a cost. Given the currently observed values, how can we decide which missing values to observe next so that prediction accuracy increases as fast as possible as a function of the observation cost? This problem appears in many different application areas, including medical diagnosis, surveys, recommender systems, and insurance. In this talk, I will describe how to solve the problem using an information-theoretic approach and novel variational autoencoder models that can effectively deal with missing data.	Jose Miguel Hernandez-Lobato
Fri 7:10 a.m. - 7:50 a.m.	Invited Talk: What Interpretable Machine Learning Can Tell Us About Missing Values ( Talk ) Missing values are everywhere, and I’ve been dealing with them one way or another for many years. Recently I’ve been doing research in interpretable machine learning. To my surprise, interpretable machine learning has completely changed how I work with missing values. Interpretable learning provides new methods for detecting, understanding, and modeling missing values. In the presentation I’ll show a few surprises where interpretability makes clear the impact that missing values have been having on our machine learning models all along, an impact that is only visible now thanks to interpretable methods.	Rich Caruana
Fri 7:50 a.m. - 8:30 a.m.	Discussion and Q&A by Gael Varoquaux and Jes Frellsen ( Discussion )
Fri 8:30 a.m. - 9:10 a.m.	Poster session 2 ( Poster ). Please do not share or post Zoom links.
	Optimal recovery of missing values for non-negative matrix factorization: A probabilistic error bound	Rebecca Chen, Lav R. Varshney
	Causal Discovery in the Presence of Missing Values for Neuropathic Pain Diagnosis	Ruibo Tu, Kun Zhang, Bo Christer Bertilson, Clark Glymour, Hedvig Kjellström, Cheng Zhang
	Does imputation matter? Benchmark for real-life classification problems.	Katarzyna Woźnica, Przemyslaw Biecek
	VAEs in the Presence of Missing Data	Mark Collier, Alfredo Nazabal, Chris Williams
	Variance estimation after Kernel Ridge Regression Imputation	Hengfang Wang, Jae Kwang Kim
	Online Mixed Missing Value Imputation Using Gaussian Copula	Eric Landgrebe, Yuxuan Zhao, Madeleine Udell
	Imputation of Missing Behavioral Measures in Connectome-based Predictive Modelling	Qinghao Liang, Dustin Scheinost
	Handling Missing Data in Decision Trees: A Probabilistic Approach	Pasha Khosravi, Antonio Vergari, YooJung Choi, Yitao Liang, Guy Van den Broeck
	Processing of incomplete images by (graph) convolutional neural networks	Tomasz Danel, Marek Śmieja, Łukasz Struski, Przemysław Spurek, Lukasz Maziarka
	Conditioning on "and nothing else": Simple Models of Missing Data between Naive Bayes and Logistic Regression	David Poole, Ali Mohammad Mehr, Wan Shing Martin Wang
	Multi-Time Attention Networks for Irregularly Sampled Time Series	Satya Narayan Shukla, Benjamin Marlin
	Information Theoretic Approaches for Testing Missingness in Predictive Models	Shreyas A Bhave, Rajesh Ranganath, Adler Perotte
Fri 9:10 a.m. - 9:50 a.m.	Invited Talk: Graphical Models based Solutions for Missing Data Problems. ( Talk Live ) “Missingness Graphs” (m-graphs) are causal graphical models used for processing missing data. They portray the causal mechanisms responsible for missingness and thus encode knowledge about the underlying process that generates data. Using m-graphs, we develop methods to determine if there exists a consistent estimator for a given quantity of interest such as joint distributions, conditional distributions and causal effects. Our methods apply to all types of missing data, including the notorious and relatively unexplored NMAR (Not Missing At Random) category. We further address the question of testability, i.e., if and how an assumed model can be subjected to statistical tests, considering the missingness in the data. Viewing the missing data problem from a causal perspective has ushered in several surprises, such as recoverability when variables are causes of their own missingness, testability of MAR models, and the indispensability of causal assumptions for handling missing data problems.	Karthika Mohan
Fri 9:50 a.m. - 10:30 a.m.	Invited Talk: Sequentially additive nonignorable missing data modelling using auxiliary marginal information ( Talk Live ) We study a class of missingness mechanisms, referred to as sequentially additive nonignorable, for modelling multivariate data with item nonresponse. These mechanisms explicitly allow the probability of nonresponse for each variable to depend on the value of that variable, thereby representing nonignorable missingness mechanisms. These missing data models are identified by making use of auxiliary information on marginal distributions, such as marginal probabilities for multivariate categorical variables or moments for numeric variables. We prove identification results and illustrate the use of these mechanisms in an application. Paper: https://academic.oup.com/biomet/article-abstract/106/4/889/5607583 In case of any issue with the live talk: https://washington.zoom.us/rec/share/1MJpC7_px35IGYXA9E3Dc4F9QoTMX6a82iYY-qINmhpn5YZJF5wb7duN3Jf-WKpd	Mauricio Sadinle
Fri 10:30 a.m. - 11:10 a.m.	Discussion and Q&A by Ilya Shpitser - Identifiability of the full law in graphical missing data models ( Discussion Panel )
Fri 11:10 a.m. -	Informal gathering with drinks to celebrate ( Discussion Panel )