
Workshop: Subset Selection in Machine Learning: From Theory to Applications

Tighter m-DPP Coreset Sample Complexity Bounds

Gantavya Bhatt · Jeff Bilmes


Many machine learning problems involve processing large datasets, making them computationally expensive. One mitigation strategy is to select a representative subset of the data, and one way to achieve this is via coresets: weighted subsets of the given data such that the cost function under consideration is approximated additively and/or multiplicatively. While much previous work chooses these subsets using independent importance-based sampling, such methods can be suboptimal when the data contain redundancy. Recently, Tremblay et al. (2019) considered sampling from fixed-size determinantal point processes (m-DPPs) and gave bounds on the coreset sample complexity. While DPPs often empirically outperform importance-based sampling methods, their coreset size bound is worse than that of independent sampling. In this work, we extend their theoretical results by improving the sample complexity bound for coreset approximation by a log(1/epsilon) factor using chaining, a technique from the empirical process literature. This moves a step further toward bridging the gap between theoretical justification and empirical observation. We provide additive approximation guarantees for the associated cost function and allow the possibility of negative weights, which has a direct application in layer-wise pruning of deep neural networks. Beyond additive approximation guarantees, we also extend our bounds to settings with multiplicative approximations.
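To make the coreset notion above concrete, the following minimal sketch (not the paper's m-DPP sampler or its proof technique) builds a weighted subset of a toy dataset and checks how well it approximates a squared-distance cost. It uses i.i.d. uniform subsampling with n/m reweighting purely as an illustrative stand-in; all variable names and the specific cost function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in R^2.
n, d = 1000, 2
X = rng.normal(size=(n, d))

def cost(points, weights, center):
    # Weighted sum of squared distances to a query point.
    return float(np.sum(weights * np.sum((points - center) ** 2, axis=1)))

center = np.array([0.5, -0.3])
full = cost(X, np.ones(n), center)  # full-data cost, unit weights

# Coreset of size m: sample m points uniformly without replacement
# (a stand-in for m-DPP sampling) and reweight each by n/m so the
# weighted coreset cost is an unbiased estimate of the full cost.
m = 100
idx = rng.choice(n, size=m, replace=False)
coreset_cost = cost(X[idx], np.full(m, n / m), center)

# A coreset guarantee would bound this error for ALL query centers
# simultaneously; here we check a single center.
rel_err = abs(coreset_cost - full) / full
print(f"full={full:.1f} coreset={coreset_cost:.1f} rel_err={rel_err:.3f}")
```

The point of the DPP-based schemes discussed in the abstract is that correlated, diversity-promoting samples can achieve a target approximation error with fewer points than such independent sampling when the data are redundant.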