Given data drawn from an unknown distribution, D, to what extent is it possible to "amplify" this dataset and faithfully output an even larger set of samples that appear to have been drawn from D? We formalize this question as follows: an (n, m) amplification procedure takes as input n independent draws from an unknown distribution D, and outputs a set of m > n
"samples" which must be indistinguishable from m samples drawn i.i.d. from D. We consider this sample amplification problem in two fundamental settings: the case where D is an arbitrary discrete distribution supported on k elements, and the case where D is a d-dimensional Gaussian with unknown mean and fixed covariance matrix. Perhaps surprisingly, we show that a valid amplification procedure exists for both of these settings, even in the regime where the size of the input dataset, n, is significantly smaller than what would be necessary to learn the distribution D to nontrivial accuracy. We also show that our procedures are optimal up to constant factors. Beyond these results, we describe potential applications of such sample amplification, and formalize a number of curious directions for future research along this vein.
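To make the problem setup concrete, here is a minimal sketch of a naive (n, m) amplification procedure for a discrete distribution: output the n original samples together with m − n bootstrap draws from the empirical distribution, shuffled so the originals are not identifiable by position. This is only an illustration of the formalism, not the paper's (optimal) procedure; the function name `amplify` is our own.

```python
import random

def amplify(samples, m):
    """Naive (n, m) amplification sketch (illustrative only, not the
    paper's optimal procedure): return the n input samples plus m - n
    draws from the empirical distribution, in shuffled order."""
    n = len(samples)
    assert m > n, "an (n, m) amplifier must output m > n samples"
    # Bootstrap: resample the extra m - n points from the inputs.
    extra = [random.choice(samples) for _ in range(m - n)]
    out = list(samples) + extra
    random.shuffle(out)  # hide which samples were the originals
    return out
```

The interesting question the paper answers is how much better one can do: a verifier who sees the output should be unable to distinguish it from m genuinely fresh i.i.d. draws, and this naive bootstrap fails that test for large m − n because it repeats values too often.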
Author Information
Brian Axelrod (Stanford)
Shivam Garg (Stanford University)
Vatsal Sharan (Stanford University)
Gregory Valiant (Stanford University)
More from the Same Authors

2021 : Subpopulation Guarantees for Importance Weights and KL-Divergence Estimation »
Parikshit Gopalan · Nina Narodytska · Omer Reingold · Vatsal Sharan · Udi Wieder 
2021 Poster: Sinkhorn Label Allocation: Semi-Supervised Classification via Annealed Self-Training »
Kai Sheng Tai · Peter Bailis · Gregory Valiant 
2021 Spotlight: Sinkhorn Label Allocation: Semi-Supervised Classification via Annealed Self-Training »
Kai Sheng Tai · Peter Bailis · Gregory Valiant 
2021 Poster: Reward Identification in Inverse Reinforcement Learning »
Kuno Kim · Shivam Garg · Kirankumar Shiragur · Stefano Ermon 
2021 Spotlight: Reward Identification in Inverse Reinforcement Learning »
Kuno Kim · Shivam Garg · Kirankumar Shiragur · Stefano Ermon 
2020 Poster: Causal Strategic Linear Regression »
Yonadav Shavit · Benjamin Edelman · Brian Axelrod 
2019 Poster: Compressed Factorization: Fast and Accurate Low-Rank Factorization of Compressively-Sensed Data »
Vatsal Sharan · Kai Sheng Tai · Peter Bailis · Gregory Valiant 
2019 Poster: Equivariant Transformer Networks »
Kai Sheng Tai · Peter Bailis · Gregory Valiant 
2019 Oral: Equivariant Transformer Networks »
Kai Sheng Tai · Peter Bailis · Gregory Valiant 
2019 Oral: Compressed Factorization: Fast and Accurate Low-Rank Factorization of Compressively-Sensed Data »
Vatsal Sharan · Kai Sheng Tai · Peter Bailis · Gregory Valiant 
2019 Poster: Maximum Likelihood Estimation for Learning Populations of Parameters »
Ramya Korlakai Vinayak · Weihao Kong · Gregory Valiant · Sham Kakade 
2019 Oral: Maximum Likelihood Estimation for Learning Populations of Parameters »
Ramya Korlakai Vinayak · Weihao Kong · Gregory Valiant · Sham Kakade 
2017 Poster: Estimating the unseen from multiple populations »
Aditi Raghunathan · Greg Valiant · James Zou 
2017 Poster: Orthogonalized ALS: A Theoretically Principled Tensor Decomposition Algorithm for Practical Use »
Vatsal Sharan · Gregory Valiant 
2017 Talk: Estimating the unseen from multiple populations »
Aditi Raghunathan · Greg Valiant · James Zou 
2017 Talk: Orthogonalized ALS: A Theoretically Principled Tensor Decomposition Algorithm for Practical Use »
Vatsal Sharan · Gregory Valiant