Poster in Workshop: Neural Compression: From Information Theory to Applications
Transformers are Universal Predictors
Sourya Basu · Moulik Choraria · Lav Varshney
Abstract:
We find limits to the Transformer architecture for language modeling and show that it has a universal prediction property in an information-theoretic sense. We further analyze its performance in non-asymptotic data regimes to understand the roles of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
Virtual talk: https://drive.google.com/file/d/1wx45om05jQrkFvyVZoWxoGcEr41IeUP/view?usp=drivelink