Approximation Bounds for Transformer Networks with Application to Regression
Yuling Jiao ⋅ Yanming Lai ⋅ Defeng Sun ⋅ Yang Wang ⋅ Bokai Yan
Abstract
We develop approximation and statistical theory for standard Transformer networks in sequence modeling. Given a sequence-to-sequence target on $[0,1]^{d_x \times n}$ whose entries are $\gamma$-H\"older for $\gamma \in (0,1]$ or belong to a first-order Sobolev class, we establish explicit $L^p$-approximation bounds for all $p \in [1,\infty]$, including the previously elusive endpoint $p=\infty$ under softmax attention. In particular, achieving error $\varepsilon$ in the $L^p$-norm requires $\mathcal{O}(\varepsilon^{-d_x n/\gamma})$ parameters for $\gamma$-H\"older targets and $\mathcal{O}(\varepsilon^{-d_x n})$ parameters for Sobolev targets, matching the best known scalings in the ambient dimension $d_x n$. We further study nonparametric regression with sequential, dependent observations using Transformer networks. Assuming stationary $\beta$-mixing covariates whose temporal dependence decays as the time gap between observations grows, we analyze a sliding-window empirical risk minimization procedure and establish excess-risk guarantees for the resulting Transformer-based estimators. Our analysis clarifies the role of attention and enables extensions beyond softmax attention.
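In symbols, the approximation claim above can be read schematically as follows, where $\mathcal{T}_W$ denotes a Transformer class with at most $W$ parameters; this notation is illustrative only, and the precise architecture, constants, and function classes are specified in the body of the paper:
\[
  \inf_{T \in \mathcal{T}_W} \bigl\| T - f \bigr\|_{L^p([0,1]^{d_x \times n})} \le \varepsilon
  \qquad \text{whenever} \qquad
  W \gtrsim \varepsilon^{-d_x n/\gamma}
\]
for a $\gamma$-H\"older target $f$, and analogously $W \gtrsim \varepsilon^{-d_x n}$ for a first-order Sobolev target.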