Approximation Error Upper and Lower Bounds for Hölder Class with Transformers
Xin He ⋅ Yuling Jiao ⋅ Xiliang Lu ⋅ Jerry Yang
Abstract
We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for the Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most $\mathcal{O}(\varepsilon^{-{d_{0}}/{\alpha}})$ blocks can approximate any bounded Hölder function with $d_{0}$-dimensional input and smoothness $\alpha\in(0,1]$ to any prescribed accuracy $\varepsilon>0$. For the approximation lower bounds, leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers require at least $\mathcal{O}(\varepsilon^{-{d_{0}}/({4\alpha})})$ blocks to achieve approximation accuracy $\varepsilon$. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates, demonstrating Transformers' empirical effectiveness in real-world settings.
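To make the architecture referred to in the abstract concrete, the following is a minimal NumPy sketch of a single Transformer block of the kind described: softmax self-attention, a ReLU feed-forward layer, and residual connections around both sublayers. The single-head form, weight shapes, and initialization are illustrative assumptions only, not the exact parameterization analyzed in the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """One block: softmax self-attention + ReLU feed-forward, each with a residual connection."""
    # Single-head softmax self-attention (assumed single-head for illustration).
    d = Wq.shape[1]
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))  # (n, n) attention weights
    X = X + A @ (X @ Wv)                             # residual connection around attention
    # Position-wise ReLU feed-forward network.
    H = np.maximum(X @ W1 + b1, 0.0)                 # ReLU activation
    X = X + H @ W2 + b2                              # residual connection around the FFN
    return X

# Toy usage: n = 4 tokens, input dimension d0 = 8, hidden width 16.
rng = np.random.default_rng(0)
n, d0, m = 4, 8, 16
X = rng.standard_normal((n, d0))
Wq, Wk, Wv = (rng.standard_normal((d0, d0)) * 0.1 for _ in range(3))
W1, b1 = rng.standard_normal((d0, m)) * 0.1, np.zeros(m)
W2, b2 = rng.standard_normal((m, d0)) * 0.1, np.zeros(d0)
print(transformer_block(X, Wq, Wk, Wv, W1, b1, W2, b2).shape)  # (4, 8)
```

In the approximation results, stacking $\mathcal{O}(\varepsilon^{-d_{0}/\alpha})$ such blocks is what provides the stated accuracy for Hölder targets; the sketch only illustrates the building block, not the construction of the approximant.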