

Poster in Workshop: Next Generation of Sequence Modeling Architectures

QSMixer: Connecting SSMs with Mixer Models via Quasi-Separable Matrices

Ali Behrouz · Michele Santacatterina · Ramin Zabih


Abstract: Recent advances in deep learning have relied mainly on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space complexity in the input size, limiting their scalability for long-sequence modeling. Recently, State Space Models (SSMs), and more specifically Selective SSMs (S6) with efficient hardware-aware implementations, have shown promising potential for long causal sequence modeling. They, however, use a separate block for each channel and fail to filter irrelevant channels or capture inter-channel dependencies. Natural attempts to mix information across channels using MLPs, attention, or SSMs result in further training instability for SSMs in large networks and/or nearly double the number of parameters. We present a new non-causal heuristic of the S6 block based on quasi-separable kernels, with a hardware-friendly implementation that is nearly $1.8\times$ faster than the original implementation. Using this formulation, we present the Quasi-Separable Mixer (QSMixer), which repeatedly mixes information along the sequence and model-dimension (channel) axes. As a proof of concept, we design the Vision QSMixer (ViQS) architecture for vision tasks. Our evaluation of QSMixer on image classification, segmentation, and object detection underlines the importance of selectively mixing across both tokens and channels and shows that our methods achieve competitive (resp. superior) performance against well-established vision models (resp. SSM-based models).
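To make the quasi-separable idea concrete, the sketch below builds a dense token-mixing matrix whose strictly lower triangle follows a forward (causal) SSM-style kernel, whose strictly upper triangle follows a backward (anti-causal) kernel, and whose diagonal is a free per-token term, so every output token mixes information from both past and future tokens. This is only a minimal illustration under assumed rank-1 scalar generators; the function name, shapes, and parameterization are not from the paper, and it is not the authors' hardware-aware implementation.

```python
# Toy quasi-separable token mixing: lower triangle = forward SSM kernel,
# upper triangle = backward SSM kernel, plus a diagonal term.
# All names, shapes, and the rank-1 parameterization are illustrative assumptions.
import numpy as np

def quasi_separable_matrix(a_fwd, b_fwd, c_fwd, a_bwd, b_bwd, c_bwd, d):
    """Dense L x L quasi-separable mixing matrix.

    M[i, j] = c_fwd[i] * prod(a_fwd[j+1..i]) * b_fwd[j]   if i > j  (causal part)
    M[i, j] = c_bwd[i] * prod(a_bwd[i+1..j]) * b_bwd[j]   if i < j  (anti-causal part)
    M[i, i] = d[i]                                                  (diagonal)
    """
    L = len(d)
    M = np.zeros((L, L))
    for i in range(L):
        M[i, i] = d[i]
        for j in range(i):                 # strictly lower triangle
            M[i, j] = c_fwd[i] * np.prod(a_fwd[j + 1:i + 1]) * b_fwd[j]
        for j in range(i + 1, L):          # strictly upper triangle
            M[i, j] = c_bwd[i] * np.prod(a_bwd[i + 1:j + 1]) * b_bwd[j]
    return M

if __name__ == "__main__":
    L, D = 6, 4
    rng = np.random.default_rng(0)
    # Per-token scalar parameters; in a selective SSM these would be input-dependent.
    a_fwd = rng.uniform(0.5, 0.99, L); b_fwd = rng.standard_normal(L); c_fwd = rng.standard_normal(L)
    a_bwd = rng.uniform(0.5, 0.99, L); b_bwd = rng.standard_normal(L); c_bwd = rng.standard_normal(L)
    d = rng.standard_normal(L)

    M = quasi_separable_matrix(a_fwd, b_fwd, c_fwd, a_bwd, b_bwd, c_bwd, d)
    x = rng.standard_normal((L, D))        # tokens x channels
    y = M @ x                              # non-causal token mixing over the sequence axis
    print(y.shape)                         # (6, 4)
```

In a mixer-style architecture, a block like this along the sequence axis would alternate with mixing along the channel axis; the materialized dense matrix here is only for clarity, whereas the paper's contribution is a hardware-friendly implementation that avoids forming it explicitly.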
