Poster
Latent Score-Based Reweighting for Robust Classification on Imbalanced Tabular Data
Yunze Tong · Fengda Zhang · Zihao Tang · Kaifeng Gao · Kai Huang · Pengfei Lyu · Jun Xiao · Kun Kuang
East Exhibition Hall A-B #E-1706
Machine learning models often perform well on tabular data by optimizing average prediction accuracy. However, they may underperform on specific subsets due to inherent biases and spurious correlations in the training data, such as associations with non-causal features like demographic information. These biases lead to critical robustness issues as models may inherit or amplify them, resulting in poor performance where such misleading correlations do not hold. Existing mitigation methods have significant limitations: some require prior group labels, which are often unavailable, while others focus solely on the conditional distribution (P(Y|X)), upweighting misclassified samples without effectively balancing the overall data distribution (P(X)). To address these shortcomings, we propose a latent score-based reweighting framework. It leverages score-based models to capture the joint data distribution (P(X, Y)) without relying on additional prior information. By estimating sample density through the similarity of score vectors with neighboring data points, our method identifies underrepresented regions and upweights samples accordingly. This approach directly tackles inherent data imbalances, enhancing robustness by ensuring a more uniform dataset representation. Experiments on various tabular datasets under distribution shifts demonstrate that our method effectively improves performance on imbalanced data.
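The abstract's reweighting idea can be sketched in a few lines: given per-sample score vectors from a pretrained score model, use the similarity of each sample's score vector to those of its nearest neighbors as a density proxy, then upweight low-density samples. This is a minimal illustrative sketch, not the authors' implementation; the function name, the cosine-similarity proxy, the `k` and `temperature` hyperparameters, and the stand-in Gaussian score model are all assumptions.

```python
import numpy as np

def density_proxy_weights(X, scores, k=5, temperature=1.0):
    """Illustrative sketch: derive sample weights from a score-based
    density proxy. `scores` holds each sample's score vector (e.g. from
    a pretrained score model); similarity with neighbors' score vectors
    stands in for local density. All choices here are assumptions."""
    n = len(X)
    # pairwise Euclidean distances in feature space
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]  # indices of k nearest neighbors

    # cosine similarity of score vectors with neighbors -> density proxy
    s = scores / (np.linalg.norm(scores, axis=1, keepdims=True) + 1e-12)
    proxy = np.array([s[i] @ s[nbrs[i]].T for i in range(n)]).mean(axis=1)

    # low proxy (dissimilar scores, underrepresented region) -> larger weight
    w = np.exp(-proxy / temperature)
    return w / w.sum()  # normalized training weights

# toy usage: a dense cluster plus one distant point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), [[5.0, 5.0]]])
scores = -X  # score of a standard Gaussian, as a stand-in score model
w = density_proxy_weights(X, scores)
```

The resulting weights could then be plugged into any weighted loss; in practice the score model would be learned on the joint distribution P(X, Y) rather than assumed Gaussian.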
Machine learning models are great at making predictions when the data they're trained on closely resembles what they see in practice. But sometimes these models don't perform well for certain groups of people or situations. This happens because the models learn from patterns in the training data, even if those patterns are misleading. For example, they might wrongly assume that someone's background is tied to a particular outcome just because of how the data was collected. Our research introduces a new way to help models do better in these cases, without needing extra labels or prior knowledge about the data. We teach the model to pay more attention to data points that are rare or overlooked, using a score-based model to provide a proxy for how underrepresented they are. This makes the overall dataset more balanced and helps the model learn fairer, more accurate predictions. We tested our approach on several datasets and found that it helps models stay reliable, even when the data shifts or becomes more diverse.