Semi-Supervised Learning with Noisy Covariates: Generalization Bounds and Distribution Regression
Abstract
In modern machine learning pipelines, abundant pretrained representations act as noisy proxy covariates while task-specific labels remain scarce. We study semi-supervised regression in this noisy-covariate setting and propose a simple two-stage estimator. We derive finite-sample generalization bounds showing that sufficiently many unlabeled proxy covariates can yield fast labeled-sample rates in both well-specified and misspecified regression settings. We further show that distribution regression is a special case of our framework, in which each covariate is a latent distribution observed only through a finite bag of samples, and that the same guarantees hold once the bag size is large enough. Numerical experiments demonstrate consistent improvements over competitive supervised and semi-supervised baselines, especially in low-label regimes.
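The abstract does not spell out the two-stage estimator, so the following is only an illustrative sketch of one plausible reading: stage one uses the abundant unlabeled proxy covariates to estimate a low-dimensional signal subspace, and stage two fits a ridge regression on the few labeled points projected into that subspace. The synthetic data-generating process (latent dimension `k`, map `A`, coefficients `beta`, noise levels) is entirely hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic setup: the latent signal lives in a k-dim subspace,
# proxies are noisy ambient observations of it, and labels are scarce.
d, k = 20, 5
A = rng.normal(size=(k, d))        # latent-to-ambient map (assumed, for illustration)
beta = rng.normal(size=k)          # true regression coefficients (assumed)

def sample(n, noise=0.5):
    Z = rng.normal(size=(n, k))                   # latent covariates
    X = Z @ A + noise * rng.normal(size=(n, d))   # noisy proxy covariates
    y = Z @ beta + 0.1 * rng.normal(size=n)       # labels
    return X, y

X_u, _ = sample(2000)              # large unlabeled pool (labels discarded)
X_l, y_l = sample(50)              # small labeled set
X_test, y_test = sample(500)

# Stage 1: estimate the top-k principal subspace from the unlabeled proxies.
mu = X_u.mean(axis=0)
_, _, Vt = np.linalg.svd(X_u - mu, full_matrices=False)
P = Vt[:k].T                       # d x k orthonormal projection basis

# Stage 2: ridge regression on the labeled data projected into the subspace.
lam = 1e-2
Zl = (X_l - mu) @ P
w = np.linalg.solve(Zl.T @ Zl + lam * np.eye(k), Zl.T @ y_l)

# Predict on held-out data through the same projection.
pred = (X_test - mu) @ P @ w
mse = np.mean((pred - y_test) ** 2)
```

In this sketch the unlabeled proxies only enter through the subspace estimate, which is what lets a large unlabeled pool compensate for covariate noise before the small labeled sample is spent on the regression itself.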