Scalable Bayesian Semi-supervised Clustering with Feature Selection and Adaptive Constraint Weighting
Abstract
Constrained clustering incorporates prior knowledge in the form of pairwise constraints to guide data partitioning. While effective, existing Bayesian approaches often scale poorly to large datasets and offer limited interpretability because they do not model feature relevance explicitly. We propose BASIL, a scalable Bayesian semi-supervised clustering framework that uses stochastic variational inference to jointly infer cluster assignments and feature importance weights. This joint formulation identifies discriminative features that are consistent with the imposed constraints. To handle noisy or inconsistent supervision robustly, BASIL introduces an adaptive constraint-weighting mechanism that down-weights unreliable constraints. Experiments on synthetic and real-world benchmarks show that BASIL achieves competitive clustering performance while improving scalability and interpretability over existing baselines. We further demonstrate its applicability to large-scale health data, including medical imaging and electronic health records.