On the Effect of Misspecifying the Embedding Dimension in Low-rank Network Models
Abstract
As network data has become ubiquitous in the sciences, there has been growing interest in network models whose structure is driven by latent node-level variables in a (typically low-dimensional) latent geometric space. These "latent positions" are often estimated via embeddings, whereby the nodes of a network are mapped to points in Euclidean space so that "similar" nodes are mapped to nearby points. Under certain model assumptions, these embeddings yield consistent estimates of the latent positions, but most such results require the embedding dimension to be chosen correctly. Methods for choosing the embedding dimension have been studied extensively, but little is known about the behavior of embeddings when the dimension is misspecified. In this work, we provide a theoretical description of the effects of dimension misspecification under the random dot product graph, a class of latent space network models that includes several widely used network models, most notably the stochastic blockmodel, as special cases. We show that when the dimension is chosen too large, consistent estimation still holds, albeit at a slower rate than when the embedding dimension is chosen correctly. On the other hand, when the dimension is chosen too small, there is a fundamental estimation error lower bound that need not go to zero in the large-network limit. A range of synthetic data experiments support our theoretical results. Our main technical result, which may be of independent interest, is a generalization of earlier work in random matrix theory showing that all non-signal eigenvectors of a low-rank matrix subject to additive noise are delocalized.
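The setting described above can be made concrete with a small simulation. The sketch below (illustrative only, not taken from the paper) samples a two-block stochastic blockmodel, a rank-2 special case of the random dot product graph, and computes an adjacency spectral embedding, once with the correct dimension d = 2 and once over-specified at d = 5. The SBM parameters, sample size, and the `ase` helper are choices made here for illustration.

```python
# Illustrative sketch: adjacency spectral embedding (ASE) of a 2-block SBM,
# comparing a correctly specified embedding dimension with an over-specified one.
# All parameter values below are assumptions for the demo, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

# 2-block SBM with n nodes; the edge-probability matrix B has rank 2.
n = 400
z = np.repeat([0, 1], n // 2)          # block labels
B = np.array([[0.5, 0.2],
              [0.2, 0.5]])             # within/between-block edge probabilities
P = B[np.ix_(z, z)]                    # expected adjacency matrix (rank 2)

# Sample a symmetric, hollow (no self-loop) adjacency matrix A.
upper = np.triu(rng.random((n, n)) < P, 1)
A = (upper + upper.T).astype(float)

def ase(A, d):
    """Adjacency spectral embedding: top-d eigenpairs by magnitude,
    with each eigenvector scaled by the square root of |eigenvalue|."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(-np.abs(vals))[:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

X_correct = ase(A, 2)   # dimension chosen correctly
X_over = ase(A, 5)      # dimension chosen too large: 3 extra noise columns

# The two signal eigenvalues (approx. 140 and 60 here) stand well clear of the
# noise eigenvalues (order sqrt(n)), which is what makes rank selection possible.
vals = np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1]
print(vals[1] > 2 * vals[2])
```

In line with the abstract, the over-specified embedding `X_over` still contains the two signal columns of `X_correct`; the extra columns are comparatively small-scale noise directions, which is why consistency survives over-specification at a slower rate, whereas truncating to d = 1 would discard signal irrecoverably.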