Workshop
Learning with Missing Values
Julie Josse · Jes Frellsen · Pierre-Alexandre Mattei · Gael Varoquaux
Keywords: Graphical Models · Missing values · Matrix Completion · Record Linkage · Selection Bias
Analysis of large amounts of data offers new opportunities to understand many processes better. Yet, data accumulation often implies relaxing acquisition procedures or compounding diverse sources, leading to many observations with missing features. From questionnaires to collaborative filtering, from electronic health records to single-cell analysis, missingness is at play everywhere and is the norm rather than the exception. Even “clean” data sets are often barely “cleaned” versions of incomplete data sets, with all the unfortunate biases this cleaning process may have created.
Despite this ubiquity, the handling of missing values is often overlooked. Missing values pose many challenges, and the statistical community has produced a vast literature with many available implementations. Yet many issues remain open, calling for new methods and new points of view: handling missing values in supervised learning and in deep learning architectures, adapting available methods to high-dimensional data with different types of missing values, and dealing with feature mismatch and distribution mismatch. Missing data is one of the eight pillars of causal wisdom for Judea Pearl, who brought graphical-model reasoning to bear on some missing-not-at-random problems.
To the best of our knowledge, this is the first workshop in recent years at a major machine learning conference to focus primarily on missing-value problems. The goal of our workshop is to give more momentum and exposure to research on missing values, both theoretical and methodological, and to emphasize the connections with other areas of machine learning (e.g. causal inference, generative modelling, uncertainty quantification, transfer learning, distributional shift). We will also attach importance to discussing the reproducibility problems that missing data can cause, the danger of overlooking missing-value issues, and the importance of providing sound implementations.
We welcome both academic and industrial practitioners/researchers. In particular, since missing data is a critical issue in many applications, we would like to federate industrial/applied know-how and various academic approaches.
Schedule
Fri 1:45 a.m. - 2:00 a.m.
Opening Session (Discussion)
Julie Josse · Jes Frellsen · Pierre-Alexandre Mattei · Gael Varoquaux
Fri 2:00 a.m. - 3:00 a.m.
Poster session 1 (Posters)
Please do not share or post Zoom links.
A Random Matrix Analysis of Learning with α-Dropout
Visna---Visualising Multivariate Missing Values
Multi-output prediction of global vegetation distribution with incomplete data
Path Imputation Strategies for Signature Models
Lung Segmentation from Chest X-rays using Variational Data Imputation
Clustering Data with nonignorable Missingness using Semi-Parametric Mixture Models
Estimating conditional density of missing values using deep Gaussian mixture model
Missing the Point: Non-Convergence in Iterative Imputation Algorithms
The Dynamic Latent Block Model for Sparse and Evolving Count Matrices
Predicting Feature Imputability in the Absence of Ground Truth
Missing rating imputation based on product reviews via deep latent variable models
Inferring Causal Dependencies between Chaotic Dynamical Systems from Sporadic Time Series
The impact of incomplete data on quantile regression for longitudinal data
Multi-label Learning with Missing Values using Combined Facial Action Unit Datasets
A Study on Intentional-Value-Substitution Training for Regression with Incomplete Information
How to miss data? Reinforcement learning for environments with high observation cost
How to deal with missing data in supervised deep learning?
VAEM: a Deep Generative Model for Heterogeneous Mixed Type Data
Working with Deep Generative Models and Tabular Data Imputation
Fri 4:30 a.m. - 5:10 a.m.
Invited Talk: Learning despite the unknown - missing data imputation in healthcare (Talk)
SlidesLive Video: https://slideslive.com/38930914/learning-despite-the-unknown-missing-data-imputation-in-healthcare?ref=account-folder-55866-folders
Mihaela van der Schaar
Fri 5:10 a.m. - 5:50 a.m.
Invited Talk: Imputing Missing Data with the Gaussian Copula (Talk)
Missing data imputation forms the first critical step of many data analysis pipelines. The challenge is greatest for mixed data sets, including real, Boolean, and ordinal data, where standard techniques for imputation fail basic sanity checks: for example, the imputed values may not follow the same distributions as the data. This talk introduces a new semiparametric algorithm to impute missing values, with no tuning parameters. The algorithm models mixed data as a Gaussian copula. This model can fit arbitrary marginals for continuous variables and can handle ordinal variables with many levels, including Boolean variables as a special case. We develop an efficient approximate EM algorithm to estimate copula parameters from incomplete mixed data. The resulting model reveals the statistical associations among variables. Experimental results on several synthetic and real datasets show the superiority of the proposed algorithm to state-of-the-art imputation algorithms for mixed data.
Madeleine Udell
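The Gaussian copula idea in the abstract above can be illustrated on a toy case. The sketch below is not the speaker's algorithm: it assumes only two continuous columns, values missing completely at random in the second column, and a plain correlation estimate on complete cases instead of the approximate EM procedure. All function names (`to_latent`, `from_latent`, `impute_second_column`) are hypothetical. Each column is mapped to Gaussian scores through its empirical CDF, and a missing entry is filled with the conditional median mapped back through the empirical quantiles, so imputed values always stay on the observed marginal.

```python
import bisect
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inv_cdf


def to_latent(x, sorted_ref):
    """Gaussian score of x under the empirical CDF of sorted_ref."""
    n = len(sorted_ref)
    rank = min(max(bisect.bisect_right(sorted_ref, x), 1), n)
    return STD_NORMAL.inv_cdf(rank / (n + 1))  # rank / (n + 1) stays inside (0, 1)


def from_latent(z, sorted_ref):
    """Empirical quantile of sorted_ref at probability level Phi(z)."""
    idx = int(STD_NORMAL.cdf(z) * len(sorted_ref))
    return sorted_ref[min(idx, len(sorted_ref) - 1)]


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def impute_second_column(rows):
    """Fill None entries of column 1 with the conditional median under a Gaussian copula."""
    complete = [r for r in rows if r[1] is not None]
    ref0 = sorted(r[0] for r in rows)      # column 0 is assumed fully observed
    ref1 = sorted(r[1] for r in complete)  # observed values of column 1
    z0 = [to_latent(r[0], ref0) for r in complete]
    z1 = [to_latent(r[1], ref1) for r in complete]
    rho = pearson(z0, z1)                  # latent correlation from complete cases
    filled = [
        (x0, x1 if x1 is not None else from_latent(rho * to_latent(x0, ref0), ref1))
        for x0, x1 in rows
    ]
    return filled, rho


# Toy data (an assumption of this sketch): skewed marginals built on a latent
# Gaussian pair with correlation 0.8; 30% of column 1 is removed at random.
rng = random.Random(0)
rows = []
for _ in range(5000):
    a = rng.gauss(0, 1)
    b = 0.8 * a + 0.6 * rng.gauss(0, 1)
    x0, x1 = math.exp(a), b ** 3
    rows.append((x0, None if rng.random() < 0.3 else x1))

filled, rho_hat = impute_second_column(rows)
print(f"estimated latent correlation: {rho_hat:.3f}")
```

Because the imputed value is read off the empirical quantiles of the observed column, it cannot fall outside the observed support, which is the kind of sanity check the abstract mentions. Ordinal and Boolean variables would need a different treatment that this sketch does not attempt.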
Fri 5:50 a.m. - 6:30 a.m.
Discussion and Q&A by Gael Varoquaux, Julie Josse and Pierre-Alexandre Mattei (Discussion Panel)
Fri 6:30 a.m. - 7:10 a.m.
Invited Talk: Efficient Missing-value Acquisition with Variational Autoencoders (Talk)
Abstract: In many real-world problems we have to make predictions from feature vectors with missing values. However, we may also be able to observe some of the missing values in the feature vector at a cost. Given the currently observed values, how can we decide which missing values to observe next so that prediction accuracy increases as fast as possible as a function of the observation cost? This problem appears in many different application areas, including medical diagnosis, surveys, recommender systems, insurance, etc. In this talk, I will describe how to solve the problem using an information theoretic approach and novel variational autoencoder models that can effectively deal with missing data.
Jose Miguel Hernandez-Lobato
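The acquisition question in the abstract above (which missing value to observe next) can be made concrete with a small information-theoretic sketch. This is not the speaker's method: it replaces the variational autoencoder with exact enumeration over a tiny discrete joint distribution, and it assumes every feature has the same observation cost. The joint table and all function names are hypothetical. The rule is to observe the feature whose value is expected to reduce the entropy of the label the most.

```python
import math
from itertools import product


def build_joint(accuracy=(0.9, 0.6)):
    """Toy p(y, x1, x2): y is a fair coin and each x_j equals y with probability accuracy[j-1]."""
    table = {}
    for y, x1, x2 in product((0, 1), repeat=3):
        p = 0.5
        for x, acc in zip((x1, x2), accuracy):
            p *= acc if x == y else 1.0 - acc
        table[(y, x1, x2)] = p
    return table


def prob(table, observed):
    """Probability of the event in `observed`, a dict {position: value} (0 is y, 1 is x1, 2 is x2)."""
    return sum(p for key, p in table.items()
               if all(key[j] == v for j, v in observed.items()))


def label_entropy(table, observed):
    """Entropy in bits of p(y | observed)."""
    z = prob(table, observed)
    h = 0.0
    for y in (0, 1):
        p = prob(table, {**observed, 0: y}) / z
        if p > 0:
            h -= p * math.log2(p)
    return h


def expected_information_gain(table, observed, candidate):
    """Label entropy now, minus its expected value after also observing `candidate`."""
    z = prob(table, observed)
    after = 0.0
    for v in (0, 1):
        extended = {**observed, candidate: v}
        after += prob(table, extended) / z * label_entropy(table, extended)
    return label_entropy(table, observed) - after


def next_feature(table, observed, candidates):
    """Greedy choice: the candidate with the largest expected information gain."""
    return max(candidates, key=lambda j: expected_information_gain(table, observed, j))


table = build_joint()
gains = {j: expected_information_gain(table, {}, j) for j in (1, 2)}
best = next_feature(table, {}, (1, 2))
print(gains, "-> observe feature", best)
```

With unequal costs, one natural variant is to rank candidates by gain per unit cost. In realistic settings the conditional distributions are not available in closed form, which is where a generative model that copes with missing inputs comes in.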
Fri 7:10 a.m. - 7:50 a.m.
Invited Talk: What Interpretable Machine Learning Can Tell Us About Missing Values (Talk)
Missing values are everywhere, and I’ve been dealing with them one way or another for many years. Recently I’ve been doing research in interpretable machine learning. To my surprise, interpretable machine learning has completely changed how I work with missing values. Interpretable learning provides new methods for detecting, understanding, and modeling missing values. In the presentation I’ll show a few surprises where interpretability makes clear the impact missing values have had on our machine learning models all along, an impact that is only visible now thanks to interpretable methods.
Rich Caruana
Fri 7:50 a.m. - 8:30 a.m.
Discussion and Q&A by Gael Varoquaux and Jes Frellsen (Discussion)
Fri 8:30 a.m. - 9:10 a.m.
Poster session 2 (Poster)
Please do not share or post Zoom links.
Optimal recovery of missing values for non-negative matrix factorization: A probabilistic error bound
Causal Discovery in the Presence of Missing Values for Neuropathic Pain Diagnosis
Does imputation matter? Benchmark for real-life classification problems.
VAEs in the Presence of Missing Data
Variance estimation after Kernel Ridge Regression Imputation
Online Mixed Missing Value Imputation Using Gaussian Copula
Imputation of Missing Behavioral Measures in Connectome-based Predictive Modelling
Handling Missing Data in Decision Trees: A Probabilistic Approach
Processing of incomplete images by (graph) convolutional neural networks
Conditioning on "and nothing else": Simple Models of Missing Data between Naive Bayes and Logistic Regression
Multi-Time Attention Networks for Irregularly Sampled Time Series
Information Theoretic Approaches for Testing Missingness in Predictive Models
Fri 9:10 a.m. - 9:50 a.m.
Invited Talk: Graphical Models based Solutions for Missing Data Problems. (Talk Live)
“Missingness Graphs” (m-graphs) are causal graphical models used for processing missing data. They portray the causal mechanisms responsible for missingness and thus encode knowledge about the underlying process that generates data. Using m-graphs, we develop methods to determine if there exists a consistent estimator for a given quantity of interest such as joint distributions, conditional distributions and causal effects. Our methods apply to all types of missing data including the notorious and relatively unexplored NMAR (Not Missing At Random) category. We further address the question of testability, i.e. if and how an assumed model can be subjected to statistical tests, considering the missingness in the data. Viewing the missing data problem from a causal perspective has ushered in several surprises such as recoverability when variables are causes of their own missingness, testability of MAR models and the indispensability of causal assumptions for handling missing data problems.
Karthika Mohan
Fri 9:50 a.m. - 10:30 a.m.
Invited Talk: Sequentially additive nonignorable missing data modelling using auxiliary marginal information (Talk Live)
We study a class of missingness mechanisms, referred to as sequentially additive nonignorable, for modelling multivariate data with item nonresponse. These mechanisms explicitly allow the probability of nonresponse for each variable to depend on the value of that variable, thereby representing nonignorable missingness mechanisms. These missing data models are identified by making use of auxiliary information on marginal distributions, such as marginal probabilities for multivariate categorical variables or moments for numeric variables. We prove identification results and illustrate the use of these mechanisms in an application.
Paper: https://academic.oup.com/biomet/article-abstract/106/4/889/5607583
In case of any issue with the live talk: https://washington.zoom.us/rec/share/1MJpC7_px35IGYXA9E3Dc4F9QoTMX6a82iYY-qINmhpn5YZJF5wb7duN3Jf-WKpd
Mauricio Sadinle
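A one-variable toy simulation can show why nonresponse that depends on the value itself is nonignorable, and how a known marginal helps. This is only a sketch under stated assumptions, not the sequentially additive model of the talk: a single binary variable, a marginal P(X = 1) taken as known from an auxiliary source, and hypothetical names throughout (`simulate`, `naive_mean`, `response_probs_from_marginal`).

```python
import random


def simulate(n, p_x1, resp_prob, seed=0):
    """Draw binary X, then keep each value with a probability that depends on X itself."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < p_x1 else 0
        responded = rng.random() < resp_prob[x]
        data.append(x if responded else None)  # None marks item nonresponse
    return data


def naive_mean(data):
    """Mean of the respondents only, which ignores the missingness mechanism."""
    observed = [x for x in data if x is not None]
    return sum(observed) / len(observed)


def response_probs_from_marginal(data, p_x1):
    """Recover P(respond | X = x) from the observed joint and the known marginal of X."""
    n = len(data)
    joint_1 = sum(1 for x in data if x == 1) / n  # estimate of P(X = 1, respond)
    joint_0 = sum(1 for x in data if x == 0) / n  # estimate of P(X = 0, respond)
    return {0: joint_0 / (1.0 - p_x1), 1: joint_1 / p_x1}


# Assumed ground truth for the simulation: units with X = 1 respond less often.
TRUE_P = 0.3
TRUE_RESP = {0: 0.9, 1: 0.5}

data = simulate(200_000, TRUE_P, TRUE_RESP)
naive = naive_mean(data)
recovered = response_probs_from_marginal(data, TRUE_P)
print(f"true P(X=1) = {TRUE_P}, respondent-only estimate = {naive:.3f}")
print(f"recovered response probabilities: {recovered}")
```

The respondent-only estimate is pulled toward the group that answers more often, and the observed data alone cannot reveal by how much. Supplying the marginal of X is what pins down the two response probabilities in this toy case.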
Fri 10:30 a.m. - 11:10 a.m.
Discussion and Q&A by Ilya Shpitser - Identifiability of the full law in graphical missing data models (Discussion Panel)
Fri 11:10 a.m. -
Informal gathering with drinks to celebrate (Discussion Panel)