UniFLoW: Universal Multi-Modal Federated LoRA Fine-Tuning Framework with Analytical Aggregation
Haoyuan Liang ⋅ Zhiyu Ye ⋅ Jielong Tang ⋅ Yang Yang ⋅ Shilei Cao ⋅ Guowen Li ⋅ Fei Hu ⋅ Zhiwei Zhang ⋅ Haohuan Fu ⋅ Juepeng Zheng
Abstract
As Multimodal Large Language Models (MLLMs) continue to be trained, the availability of public data diminishes, limiting the possibility of further training and adaptation. However, private data remains an underutilized yet valuable resource. Federated Learning (FL) enables decentralized training on private data, yet extending it to MLLMs is challenging: heterogeneous client modalities induce architectural incompatibility, and full-parameter fine-tuning of billion-scale models incurs prohibitive communication costs. Parameter-efficient methods like LoRA alleviate these issues but introduce aggregation inconsistency, as averaged low-rank updates fail to recover the true global update faithfully. To address these issues, we propose **UniFLoW** (Universal multi-modal Federated LoRA fine-tuning framework With analytical aggregation), a unified federated framework that combines pre-trained large language models with a multi-modal encoder architecture and our proposed Federated Aggregating Analytical Low-Rank Adaptation ($FedA^2$-$LoRA$). UniFLoW effectively utilizes fragmented client-side multi-modal data, while $FedA^2$-$LoRA$ ensures consistent aggregation. Modality-specific encoders and a two-stage training strategy further enable effective integration of diverse modalities without overfitting. Experiments on text, image, and speech demonstrate that **UniFLoW** enables scalable, communication-efficient, and aggregation-consistent federated fine-tuning, with $FedA^2$-$LoRA$ achieving state-of-the-art performance compared to existing FedLoRA approaches. We envision UniFLoW as a promising solution to the growing scarcity of public data.
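The aggregation inconsistency mentioned above can be seen in a minimal NumPy sketch (the matrix sizes and client count here are illustrative, not from the paper): naively averaging the LoRA factors $B_i$ and $A_i$ across clients, as plain FedAvg would, does not reproduce the average of the per-client products $B_i A_i$, which is the update the server actually wants.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, k = 8, 2, 3  # hidden dimension, LoRA rank, number of clients (illustrative)

# Each client i holds low-rank LoRA factors B_i (d x r) and A_i (r x d);
# its local weight update is the product B_i @ A_i.
Bs = [rng.standard_normal((d, r)) for _ in range(k)]
As = [rng.standard_normal((r, d)) for _ in range(k)]

# True global update: the average of the per-client products.
true_update = sum(B @ A for B, A in zip(Bs, As)) / k

# Naive factor-wise FedAvg: average B and A separately, then multiply.
naive_update = (sum(Bs) / k) @ (sum(As) / k)

# The two generally differ, since the mean of products is not
# the product of means -- this gap is the aggregation inconsistency.
gap = np.linalg.norm(true_update - naive_update)
print(gap)  # strictly positive for generic factors
```

Closing this gap with an analytical aggregation rule is the role the abstract assigns to $FedA^2$-$LoRA$; the sketch only demonstrates why factor-wise averaging alone is insufficient.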