The data manifold under the microscope
Marios Koulakis ⋅ Constantin Seibold
Abstract
A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are often derived for simplified models or are too loose to be informative. Many rely on the manifold hypothesis and on geometric regularity such as intrinsic dimension, curvature, and reach. Progress requires insight into data-manifold geometry and suitable benchmarks, yet existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets whose geometry can only be coarsely estimated. We introduce a benchmarking framework for studying data geometry by repurposing and extending dSprites and COIL-20 with additional transformation dimensions and denser sampling, enabling accurate finite-difference estimates of curvature, reach, and volume, quantities that are otherwise difficult to estimate reliably in practice. As applications, we assess bounds by Genovese et al. and Fefferman et al., and analyze how geometry evolves across network layers in $\beta$-VAEs, highlighting the behavior of current bounds and the value of controlled benchmarks for guiding and validating future theory. Code to reproduce the framework and experiments is included with the submission and will be released as an open-source library upon publication.
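To illustrate the kind of finite-difference geometry estimation the abstract refers to, here is a minimal, hypothetical sketch (not the paper's actual implementation) that estimates the curvature of a densely sampled plane curve from its parameterized samples, using the standard formula $\kappa = |x'y'' - y'x''| / (x'^2 + y'^2)^{3/2}$ with `numpy.gradient` as the finite-difference operator. Dense sampling is what makes such estimates accurate, which is the point of the extended benchmarks.

```python
import numpy as np

def curvature_fd(x, y, t):
    """Finite-difference curvature estimate of a plane curve (x(t), y(t)).

    Assumes the curve is densely and smoothly sampled at parameters t.
    """
    # First and second derivatives via central finite differences.
    dx, dy = np.gradient(x, t), np.gradient(y, t)
    ddx, ddy = np.gradient(dx, t), np.gradient(dy, t)
    # Signed-curvature formula for a parameterized plane curve.
    return np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5

# Sanity check on a unit circle, whose true curvature is 1 everywhere.
t = np.linspace(0.0, 2.0 * np.pi, 2000)
kappa = curvature_fd(np.cos(t), np.sin(t), t)
print(kappa[len(kappa) // 2])  # interior estimates are close to 1
```

The same pattern extends to higher-dimensional transformation grids (e.g. dSprites latent factors), where partial derivatives along each sampled factor feed into second-fundamental-form and reach estimates; accuracy degrades at grid boundaries, where only one-sided differences are available.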