Caracal: Causal Architecture via Spectral Mixing
Bingzheng Gan ⋅ Tianyi Zhang ⋅ Li Yusu ⋅ Jing Huang ⋅ Wei Shi ⋅ Yangkai Ding ⋅ Tao Yu
Abstract
The scalability of Large Language Models to long sequences is hindered by the quadratic cost of self-attention and the limitations of positional encodings. To address these, we introduce **Caracal**, a novel architecture that replaces self-attention with a parameter-efficient, $\mathcal{O}(L \log L)$ Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequency-domain causal masking technique that enforces autoregressive capabilities via asymmetric padding and truncation, overcoming a critical barrier for Fourier-based generative models. (3) Unlike efficient models relying on hardware-specific implementations (e.g., Mamba), **Caracal** uses standard library operators. This ensures robust portability, eliminating common deployment barriers. Evaluations demonstrate that **Caracal** performs competitively with Transformer and SSM baselines, offering a scalable and simple pathway for efficient long-sequence modeling. Code is available in the supplementary materials.
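The frequency-domain causal masking described in contribution (2) can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the paper's implementation: it shows the standard trick of zero-padding a length-$L$ signal and a causal kernel to length $2L$ before the FFT, so that circular convolution coincides with linear convolution, then truncating to the first $L$ outputs so each position depends only on past inputs. The function name `causal_fft_mix` and the single-channel setup are hypothetical simplifications.

```python
import numpy as np

def causal_fft_mix(x, kernel):
    """Causal sequence mixing in O(L log L) via the FFT.

    Zero-padding to 2L turns circular convolution into linear
    convolution (no wrap-around from future positions); truncating
    back to the first L outputs keeps only the causal responses.
    """
    L = x.shape[-1]
    n = 2 * L  # asymmetric padding: signal occupies the first half only
    X = np.fft.rfft(x, n=n)
    K = np.fft.rfft(kernel, n=n)
    y = np.fft.irfft(X * K, n=n)
    return y[..., :L]  # truncation enforces autoregressive structure

# Sanity check against a direct causal convolution:
# y[t] = sum_{s <= t} kernel[t - s] * x[s]
rng = np.random.default_rng(0)
L = 8
x = rng.standard_normal(L)
k = rng.standard_normal(L)
direct = np.array([sum(k[t - s] * x[s] for s in range(t + 1))
                   for t in range(L)])
assert np.allclose(causal_fft_mix(x, k), direct)
```

Because only `numpy.fft` operators are used, the sketch also reflects contribution (3): no custom kernels or hardware-specific code paths are required.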