

Poster in Workshop: Neural Compression: From Information Theory to Applications

Transformers are Universal Predictors

Sourya Basu · Moulik Choraria · Lav Varshney


Abstract:

We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We further analyze its performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
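For context, universal prediction in the information-theoretic sense is usually formalized as vanishing normalized log-loss regret against a reference class of predictors. The sketch below states that standard definition using generic notation (the sequence $x^n$, predictor $q$, and class $\mathcal{P}$ are assumptions of this sketch, not symbols taken from the paper or this page):

% Standard log-loss notion of a universal predictor (generic notation,
% not the paper's exact statement): a sequential predictor q is universal
% for a class P if its per-symbol regret vanishes as n grows.
\[
  \frac{1}{n}\sum_{t=1}^{n}\log\frac{1}{q\!\left(x_t \mid x^{t-1}\right)}
  \;-\;
  \min_{p\in\mathcal{P}}\,\frac{1}{n}\sum_{t=1}^{n}\log\frac{1}{p\!\left(x_t \mid x^{t-1}\right)}
  \;\longrightarrow\; 0
  \quad\text{as } n \to \infty .
\]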

Virtual talk: https://drive.google.com/file/d/1wx45om05jQrkFvyVZoWxoGcEr41IeUP/view?usp=drivelink
