Scaling Vision Transformers for Functional MRI with Flat Maps
Abstract
We propose a simple strategy for training a foundation model on functional MRI (fMRI) data: we adapt the standard Vision Transformer to fMRI by first converting each 3D fMRI volume to a 2D map using a standard cortical flat map projection. We train spatiotemporal masked autoencoders (MAE) on 2.3K hours of fMRI flat map videos. Our model (CortexMAE) outperforms otherwise identical MAE models trained on parcel-averaged or native volume data. We perform the first quantitative scaling analyses for fMRI and observe strict power-law scaling. Finally, we develop the first open evaluation suite for fMRI foundation models and use it to perform a comprehensive comparison. On cognitive state decoding, our model outperforms all other models by a wide margin. On clinical trait prediction, however, we report an important mixed result: all models, including our own, show inconsistent performance. We hope that by introducing reproducible benchmarks and a strong, simple baseline, we can help establish a clear frontier for fMRI foundation models. Code is available at \url{https://anonymous.4open.science/r/cortex_mae}.
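To make the spatiotemporal MAE setup concrete, the sketch below (PyTorch) illustrates the core idea of patchifying flat-map fMRI videos into space-time tubelets and randomly masking most of them before the ViT encoder. All shapes, patch sizes, and the 75% mask ratio are illustrative assumptions rather than the paper's actual configuration.

\begin{verbatim}
import torch

B, T, H, W = 2, 16, 128, 256     # batch, frames, flat-map height/width (assumed)
pt, ph, pw = 2, 16, 16           # tubelet size in time and space (assumed)
mask_ratio = 0.75                # typical MAE masking ratio (assumed)

video = torch.randn(B, T, H, W)  # stand-in for flat-map fMRI videos

# Patchify into space-time tubelets: (B, N, pt*ph*pw)
patches = (
    video.reshape(B, T // pt, pt, H // ph, ph, W // pw, pw)
         .permute(0, 1, 3, 5, 2, 4, 6)
         .reshape(B, -1, pt * ph * pw)
)
N = patches.shape[1]
n_keep = int(N * (1 - mask_ratio))

# Per-sample random masking: only the visible patches go to the encoder.
noise = torch.rand(B, N)
ids_keep = noise.argsort(dim=1)[:, :n_keep]
visible = torch.gather(
    patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
)
print(visible.shape)  # (B, n_keep, pt*ph*pw) -> input to the ViT encoder
\end{verbatim}

The decoder would then reconstruct the masked tubelets from the visible ones, following the standard MAE recipe; the flat-map projection itself is what lets this 2D video machinery be applied to 3D fMRI volumes.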