Abstract:
As deep generative models have progressed, recent work has shown that they are capable of memorizing and reproducing training datapoints when deployed. These findings call into question the usability of generative models, especially in light of the legal and privacy risks brought about by memorization. To better understand this phenomenon, we propose a geometric framework which leverages the manifold hypothesis to provide a clear language in which to reason about memorization. Specifically, we analyze memorization in terms of the relationship between the dimensionalities of $(i)$ the ground truth data manifold and $(ii)$ the manifold learned by the model. In preliminary tests on toy examples and Stable Diffusion (Rombach et al., 2022), we show that our theoretical framework accurately describes reality. Furthermore, by analyzing prior work in the context of our geometric framework, we explain and unify assorted observations in the literature and illuminate promising directions for future research on memorization.