Workshop: ES-FoMo: Efficient Systems for Foundation Models

Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection

Yu Bai · Fan Chen · Huan Wang · Caiming Xiong · Song Mei


This work advances the understanding of the remarkable in-context learning (ICL) abilities of transformers: the ability to perform new tasks when prompted with training and test examples, without any parameter update to the model. We begin by showing that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, convex risk minimization for generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Our transformer constructions admit mild bounds on the number of layers and heads, and can be learned with polynomially many pretraining sequences. Building on these "base" ICL algorithms, intriguingly, we show that transformers can implement more complex ICL procedures involving in-context algorithm selection, akin to what a statistician can do in real life: a single transformer can adaptively select different base ICL algorithms, or even perform qualitatively different tasks, on different input sequences, without any explicit prompting of the right algorithm or task. In theory, we construct two general mechanisms for algorithm selection with concrete examples: (1) pre-ICL testing, where the transformer determines the right task for the given sequence by examining certain summary statistics of the input sequence; (2) post-ICL validation, where the transformer selects, among multiple base ICL algorithms, a near-optimal one for the given sequence using a train-validation split. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures.
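To make the post-ICL validation mechanism concrete, the sketch below illustrates the underlying statistical procedure in plain NumPy (not the paper's transformer construction): fit several candidate base algorithms, here ridge regression with different regularization strengths, on a training split of the in-context examples, then select the candidate with the lowest error on the held-out validation split. All data, dimensions, and regularization values are illustrative choices, not taken from the paper.

```python
import numpy as np

# Synthetic in-context regression data (illustrative only).
rng = np.random.default_rng(0)
d, n = 5, 40
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

# Train-validation split of the in-context examples.
n_train = 30
X_tr, y_tr = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:], y[n_train:]

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Candidate base algorithms: ridge with different lambdas
# (lam = 0 recovers ordinary least squares).
lams = [0.0, 0.1, 1.0, 10.0]
val_errs = []
for lam in lams:
    w = ridge_fit(X_tr, y_tr, lam)
    val_errs.append(np.mean((X_val @ w - y_val) ** 2))

# Post-ICL validation: select the candidate with the lowest validation error.
best_lam = lams[int(np.argmin(val_errs))]
w_best = ridge_fit(X_tr, y_tr, best_lam)
print("selected lambda:", best_lam)
```

The paper's point is that a single transformer can carry out this select-then-predict pipeline internally on its prompt, so the selected algorithm varies per input sequence with no explicit hint about which one is right.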
