TabularBERT: Binning-Based Self-Supervised Learning for Tabular Representation
Abstract
Tabular data is one of the most fundamental and widely used formats for representing structured information. Classical machine learning algorithms remain highly successful at extracting predictive patterns and building accurate models from such data, yet representation learning approaches that extend language-model-based methods to the tabular setting have opened new opportunities. Conventional tokenization procedures and token embedding mechanisms, however, are ill-suited to numerical variables: they fail to preserve key numerical properties such as proximity structure and ordinal relationships. To address this limitation, we propose TabularBERT, a Transformer-based model that discretizes numerical variables via binning-based tokenization and, through masked self-supervised pretraining, learns representations that preserve numerical proximity and ordinal information while capturing conditional dependencies among variables. We empirically demonstrate the effectiveness and interpretability of the proposed approach, highlighting the benefits of language-model-based representation learning in the tabular domain.
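As a rough illustration of the binning-based tokenization described above, the sketch below discretizes a numerical column into integer token ids whose ordering mirrors the ordering of the underlying values. The quantile-based bin edges, the `n_bins` parameter, and the `bin_tokenize` helper are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

def fit_bin_edges(values, n_bins=16):
    """Estimate bin edges for one numerical feature.

    Quantile-based edges are an assumption made for illustration; the
    abstract only states that numerical variables are discretized by binning.
    """
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(values, quantiles)
    return np.unique(edges)  # drop duplicate edges for skewed features

def bin_tokenize(values, edges, feature_offset=0):
    """Map numerical values to integer token ids.

    Consecutive ids correspond to adjacent bins, so ordinal and proximity
    structure is reflected in the token index itself. `feature_offset`
    keeps the token ranges of different columns disjoint.
    """
    # Each value is assigned the index of the bin it falls into;
    # out-of-range values are clipped to the lowest or highest bin.
    bin_ids = np.digitize(values, edges[1:-1], right=False)
    return bin_ids + feature_offset

# Toy usage: tokenize a hypothetical "age" column.
ages = np.array([23.0, 35.0, 35.5, 71.0, 18.0])
edges = fit_bin_edges(ages, n_bins=4)
tokens = bin_tokenize(ages, edges)
print(tokens)  # nearby ages map to the same or adjacent token ids
```

In this sketch, numerical proximity survives tokenization because similar values share a bin or land in neighboring bins, which is the property the abstract argues conventional tokenizers fail to preserve.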