FedSSM: State Space Model-based Proactive Inference for Heterogeneous Multimodal Federated Learning
Abstract
Multimodal Federated Learning (MMFL) addresses collaborative training across clients with heterogeneous modality configurations, where effective client selection becomes critical under the compounded challenges of modality, distribution, and quantity heterogeneity. Existing selection methods operate within a reactive paradigm, responding to current observations without anticipating how decisions influence future optimization trajectories. This myopic approach leads to suboptimal convergence when training dynamics shift rapidly under severe heterogeneity. We propose FedSSM, which reconceptualizes client selection as a proactive decision-making process by predicting training dynamics through decision-aware state space models. The prediction error yields a \emph{surprise} signal that quantifies uncertainty and governs adaptive participation budgets and exploration-exploitation trade-offs via counterfactual reasoning over candidate actions. For aggregation, we introduce trust-weighted fusion with modality-specific routing, where surprise calibrates sensitivity to client anomalies. Experiments on four multimodal benchmarks demonstrate that FedSSM achieves 2.5--4.5\% accuracy improvements over state-of-the-art methods while reducing communication rounds by over 30\%.
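To make the surprise mechanism concrete, the following is a minimal sketch of the idea described above: a state space model predicts a client's training dynamics, the prediction error serves as a surprise signal, and higher surprise widens the participation budget toward exploration. All names (`surprise`, `participation_budget`), the scalar linear SSM, and the constants are illustrative assumptions, not FedSSM's actual learned model or parameters.

```python
# Hypothetical scalar linear SSM for one client's training dynamics:
#   next_state = A * state + B * action;  observed_loss ~ C * next_state.
# A, B, C are illustrative constants, not FedSSM's learned parameters.
A, B, C = 0.9, 0.05, 1.0

def surprise(state: float, action: float, observed_loss: float) -> float:
    """Prediction error of the SSM acts as the 'surprise' signal."""
    predicted = C * (A * state + B * action)
    return abs(observed_loss - predicted)

def participation_budget(s: float, s_max: float = 1.0,
                         k_min: int = 2, k_max: int = 10) -> int:
    """High surprise -> larger budget (explore); low surprise -> exploit."""
    frac = min(s / s_max, 1.0)
    return int(round(k_min + frac * (k_max - k_min)))

# Toy round: the state summarizes a client's recent dynamics and the
# action encodes whether the client was selected last round.
state, action = 1.0, 1.0
observed = 0.95 * state            # observed loss this round
s = surprise(state, action, observed)
budget = participation_budget(s)   # small surprise -> small budget
```

In this toy round the SSM prediction matches the observation exactly, so surprise is zero and the budget stays at its exploitation floor; a large mismatch would instead push the budget toward `k_max`, mirroring the exploration-exploitation trade-off the abstract describes.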