Timezone: »

Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective
Jimmy Ba · Murat Erdogdu · Taiji Suzuki · Zhichao Wang · Denny Wu
We consider the learning of a single-index target function $f_*: \mathbb{R}^d\to\mathbb{R}$ under spiked covariance data: $f_*(\boldsymbol{x}) = \textstyle\sigma_*(\frac{1}{\sqrt{1+\theta}}\langle\boldsymbol{x},\boldsymbol{\mu}\rangle)$, $\boldsymbol{x}\overset{\small\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\boldsymbol{I_d} + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top),$ where the link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ is a degree-$p$ polynomial with information exponent $k$ (defined as the lowest degree in the Hermite expansion of $\sigma_*$), and it depends on the projection of input $\boldsymbol{x}$ onto the spike (signal) direction $\boldsymbol{\mu}\in\mathbb{R}^d$. In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge: $n,d\to\infty, d/n\to\gamma\in(0,\infty)$, we ask the following question: how large should the spike magnitude $\theta$ (i.e., strength of the low-dimensional component) be, in order for $(i)$ kernel methods, $(ii)$ neural network trained with gradient descent, to learn $f_*$? We show that for kernel ridge regression, $\theta = \Omega\big(d^{1-\frac{1}{p}}\big)$ is both sufficient and necessary. Whereas for GD-trained two-layer neural network, $\theta=\Omega\big(d^{1-\frac{1}{k}}\big)$ suffices. Our result demonstrates that both kernel method and neural network benefit from low-dimensional structures in the data; moreover, since $k\le p$ by definition, neural network can adapt to such structure more effectively.