

Poster

Mechanistic Design and Scaling of Hybrid Architectures

Michael Poli · Armin Thomas · Eric Nguyen · Stefano Massaroli · Pragaash Ponnusamy · Björn Deiseroth · Kristian Kersting · Taiji Suzuki · Brian Hie · Stefano Ermon · Christopher Re · Ce Zhang


Abstract:

The development of improved deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and the high compute costs of at-scale model training and evaluation. We set out to demystify this process by grounding it in an end-to-end mechanistic architecture design (MAD) pipeline, encompassing small-scale capability unit tests and scaling laws. Using a set of synthetic token manipulation tasks, designed to probe specific skills of a model architecture such as compression and recall, we identify new, improved hybrid architectures built from a variety of computational primitives. Underpinning our approach is the concept of state-optimality, which we introduce as a measure of the utilization of finite-dimensional states in models based on recurrences and convolutions. We experimentally validate the new architectures via an extensive compute-optimal scaling law analysis, training over 500 language models from 70M to 7B parameters. Our new architectures found via MAD, based on simple ideas such as hybridization and routing, outperform state-of-the-art Transformer, convolutional, and recurrent architectures (Transformer++, Hyena, Mamba) in compute and state scaling. Overall, these results provide evidence that performance on synthetic tasks can predict performance at scale, and that an optimal architecture should include different specialized layers via hybridization.
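To make the idea of a synthetic token-manipulation unit test concrete, below is a minimal, hypothetical sketch of an associative (in-context) recall probe of the kind the abstract describes: the model sees key-value pairs followed by a query key and must produce the paired value. The function name, vocabulary split, and sequence format here are illustrative assumptions, not the exact task specification used in the MAD pipeline.

```python
import random


def make_recall_example(vocab_size=64, num_pairs=8, seed=None):
    """Generate one associative-recall example (illustrative sketch).

    The context is a sequence of interleaved key-value token pairs,
    followed by a query key; the target is the value paired with it.
    """
    rng = random.Random(seed)
    # Keys and values drawn from disjoint halves of the vocabulary (an assumption).
    keys = rng.sample(range(vocab_size), num_pairs)
    values = [rng.randrange(vocab_size, 2 * vocab_size) for _ in keys]
    # Interleave keys and values to form the context the model must "compress".
    context = [tok for kv in zip(keys, values) for tok in kv]
    query_idx = rng.randrange(num_pairs)
    inputs = context + [keys[query_idx]]   # sequence ends with the queried key
    target = values[query_idx]             # model must recall its paired value
    return inputs, target


if __name__ == "__main__":
    x, y = make_recall_example(seed=0)
    print("input tokens:", x)
    print("target token:", y)
```

A probe like this is cheap to run at small scale, which is what lets such unit tests act as early signals before committing to the large compute-optimal training runs described in the abstract.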
