Poster in Workshop: Neural Compression: From Information Theory to Applications
Transformers are Universal Predictors
Sourya Basu · Moulik Choraria · Lav Varshney
Abstract:
We find limits to the Transformer architecture for language modeling and show that it has a universal prediction property in an information-theoretic sense. We further analyze its performance in non-asymptotic data regimes to understand the roles of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
Virtual talk: https://drive.google.com/file/d/1wx45om05jQrkFvyVZoWxoGcEr41IeUP/view?usp=drivelink