Poster
Optimal Transfer Learning for Missing Not-at-Random Matrix Completion
Akhil Jalan · Yassir Jedra · Arya Mazumdar · Soumendu Sundar Mukherjee · Purnamrita Sarkar
West Exhibition Hall B2-B3 #W-815
The problem that prompted our research is matrix completion: given a noisy and incomplete matrix as input, recover the full matrix as output. Matrix completion arises in many application areas; our motivation comes from missing data in biological settings such as gene sequencing, metabolic network construction, and companion diagnostics. In these settings, entire rows and columns of the data matrix can be missing, which renders traditional matrix completion algorithms ineffective.

We formulate this kind of matrix completion problem as a transfer learning problem, in which we have access to a source matrix P as well as a target matrix Q. The matrix Q typically has more observational noise and more missing entries, for example when Q is the metabolic network of a rarely studied species while P is that of a more commonly studied species. We then present optimal estimation algorithms for Q in two settings: active sampling (where we choose which entries of Q to observe) and passive sampling (where a set of entries of Q has already been observed). The algorithms are optimal in the sense that they achieve the best possible estimation error given the amount of data and the underlying data distribution.

This research matters because it can directly contribute to applied studies in biostatistics and bioinformatics, as we demonstrate in our own experiments. Additionally, we make progress on transfer learning, which is an important area of machine learning in its own right.
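To make the passive-sampling formulation concrete, here is a minimal illustrative sketch, not the authors' estimator: a well-observed source matrix P shares latent low-rank structure with a noisier, sparsely observed target matrix Q, and the subspace estimated from P is reused to impute Q. All dimensions, ranks, noise levels, and the simple least-squares transfer step are assumptions chosen for illustration.

    # Minimal sketch of transfer learning for matrix completion (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    n, m, r = 200, 150, 5                       # rows, columns, shared latent rank

    U = rng.normal(size=(n, r)) / np.sqrt(n)    # shared left factors
    P = U @ rng.normal(size=(r, m))             # source: fully observed, low noise
    Q_true = U @ rng.normal(size=(r, m))        # target: same column space as P
    P_obs = P + 0.01 * rng.normal(size=(n, m))

    # Passive sampling: each target entry observed independently with prob. 0.1,
    # with heavier noise; some columns may end up with very few observations.
    mask = rng.random((n, m)) < 0.1
    Q_obs = np.where(mask, Q_true + 0.5 * rng.normal(size=(n, m)), np.nan)

    # Transfer step: estimate the shared column space from the source ...
    U_hat = np.linalg.svd(P_obs, full_matrices=False)[0][:, :r]

    # ... then, column by column, fit the observed target entries onto that
    # subspace by least squares and fill in the missing ones.
    Q_hat = np.zeros((n, m))
    for j in range(m):
        idx = mask[:, j]
        if idx.sum() >= r:                      # enough observations to fit
            coef, *_ = np.linalg.lstsq(U_hat[idx], Q_obs[idx, j], rcond=None)
            Q_hat[:, j] = U_hat @ coef
        # else: column left at zero; a better method would borrow more from P

    err = np.linalg.norm(Q_hat - Q_true) / np.linalg.norm(Q_true)
    print(f"relative estimation error: {err:.3f}")

The sketch shows why a related source matrix helps: the subspace is learned from the densely observed P, so only a handful of target observations per column are needed to recover Q, even though Q alone would be too sparse and noisy to complete.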