Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
Towards smaller language models via layer looping
Sabri Eyuboglu · Dylan Zinsley · Jon Saad-Falcon · Simran Arora · Atri Rudra · James Zou · Christopher Re
Abstract:
Language models store a huge amount of knowledge in their parameters. This dominant architecture bears little resemblance to the implementations of optimized data stores (e.g., a database management system like PostgreSQL), which raises the question: are there other architectures that can store and query the same information more efficiently? In this work, we explore two simple modifications to the standard architecture: looping --- sharing parameters across layers --- and mixture-of-experts (MoE). We compare the space complexity of standard and looped-MoE models on a simple task in which the model must memorize a knowledge graph (KG) and answer multi-hop queries over it. We prove that the looped-MoE model can store a KG of size $T$ and answer $q$-hop queries with $\mathcal{O}(T)$ parameters. In contrast, the best known upper bound for the standard model is $\mathcal{O}(qT)$ parameters. We confirm this scaling with experiments on synthetic KGs, finding that looped-MoE models can reliably answer four-hop queries over KGs $9\times$ larger than parameter-matched standard models can handle.
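To make the two architectural modifications concrete, below is a minimal PyTorch sketch, not the authors' implementation: a single attention-plus-MoE block whose parameters are reused across a fixed number of loops. All module names, sizes, and the top-1 routing rule are illustrative assumptions; the point is only that depth (number of applications of the block) can grow with the number of hops while the parameter count stays fixed.

```python
# Hypothetical sketch of a looped-MoE block (illustrative, not the paper's code).
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Token-wise top-1 mixture-of-experts feed-forward layer."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); send each token to its top-1 expert.
        scores = self.router(x)                 # (batch, seq, n_experts)
        expert_idx = scores.argmax(dim=-1)      # (batch, seq)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out


class LoopedMoEBlock(nn.Module):
    """One attention + MoE block applied repeatedly with shared weights."""

    def __init__(self, d_model: int = 128, n_heads: int = 4,
                 n_experts: int = 8, n_loops: int = 4):
        super().__init__()
        self.n_loops = n_loops
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = MoEFeedForward(d_model, 4 * d_model, n_experts)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same parameters are applied n_loops times (looping), so effective
        # depth scales with the loop count, not with the parameter count.
        for _ in range(self.n_loops):
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
            x = x + self.moe(self.norm2(x))
        return x


if __name__ == "__main__":
    block = LoopedMoEBlock()
    tokens = torch.randn(2, 16, 128)    # (batch, seq, d_model)
    print(block(tokens).shape)          # torch.Size([2, 16, 128])
```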