Unveiling the Role of Data Uncertainty in Tabular Deep Learning
Abstract
Recent advances in tabular deep learning (DL) have delivered exceptional practical performance, yet the field often lacks a clear understanding of why these techniques actually succeed. To address this gap, our paper highlights data (aleatoric) uncertainty as a key concept for explaining the effectiveness of recent tabular DL methods. While data uncertainty leads to irreducible prediction errors on test samples, it also introduces stochasticity into the training signal that can impede effective learning. We demonstrate that tabular methods differ significantly in their ability to cope with this optimization challenge. Specifically, we reveal that the success of many effective design choices in tabular DL, such as numerical feature embeddings, advanced ensembling strategies, retrieval-augmented models, and tabular Prior-Fitted Networks, can be partially attributed to their implicit mechanisms for performing well under high data uncertainty. By dissecting these varied mechanisms, we provide a unifying understanding of recent performance improvements. Furthermore, leveraging insights from this perspective, we design a novel, more effective numerical feature embedding method as an immediate practical outcome of our analysis. Overall, our work paves the way toward a principled understanding of the benefits introduced by modern tabular methods, yields concrete improvements to existing techniques, and outlines future research directions for tabular DL.