Session
Statistical Learning Theory 3
Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization
Ibrahim Alabdulmohsin
In this paper, we derive bounds on the mutual information of the empirical risk minimization (ERM) procedure for both 0-1 and strongly-convex loss classes. We prove that under the Axiom of Choice, the existence of an ERM learning rule with a vanishing mutual information is equivalent to the assertion that the loss class has a finite VC dimension, thus bridging information theory with statistical learning theory. Similarly, an asymptotic bound on the mutual information is established for strongly-convex loss classes in terms of the number of model parameters. The latter result rests on a central limit theorem (CLT) that we derive in this paper. In addition, we use our results to analyze the excess risk in stochastic convex optimization and unify previous works. Finally, we present two important applications. First, we show that the ERM of strongly-convex loss classes can be trivially scaled to big data using a naive parallelization algorithm with provable guarantees. Second, we propose a simple information criterion for model selection and demonstrate experimentally that it outperforms the popular Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC).
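The naive parallelization mentioned in the abstract can be pictured with a minimal sketch: each shard solves its own strongly convex ERM problem exactly and the shard solutions are averaged. The ridge objective, shard count, and helper names below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's algorithm): naive parallelization of ERM for a
# strongly convex loss. Each shard solves ridge regression exactly on its own
# data; the global estimate is the plain average of the shard solutions.
import numpy as np

def ridge_erm(X, y, lam=1.0):
    """Exact ERM for the strongly convex ridge objective on one shard."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def averaged_erm(X, y, n_shards=4, lam=1.0):
    """Split the data, solve ERM per shard, and average the estimates."""
    shards = np.array_split(np.arange(len(y)), n_shards)
    ws = [ridge_erm(X[idx], y[idx], lam) for idx in shards]
    return np.mean(ws, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=4000)
# Compare the averaged estimate with the full-data ERM solution.
print(np.linalg.norm(averaged_erm(X, y) - ridge_erm(X, y)))
```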
The Generalization Error of Dictionary Learning with Moreau Envelopes
Alexandros Georgogiannis
This is a theoretical study on the sample complexity of dictionary learning with a general type of reconstruction loss. The goal is to estimate an $m \times d$ matrix $D$ of unit-norm columns when the only available information is a set of training samples. Points $x$ in $\mathbb{R}^m$ are subsequently approximated by the linear combination $Da$ after solving the problem $\min_{a \in \mathbb{R}^d}~ \Phi(x - Da) + g(a)$; the function $g:\mathbb{R}^d \to [0,+\infty)$ is either an indicator function or a sparsity-promoting regularizer. The case considered here is $\Phi(x) = \inf_{z \in \mathbb{R}^m}~\{ \|x-z\|_2^2 + h(\|z\|_2)\}$, where $h$ is an even, univariate function on the real line. Connections are drawn between $\Phi$ and the Moreau envelope of $h$. A new sample complexity result concerning the $k$-sparse dictionary problem removes the spurious condition on the coherence of $D$ appearing in previous works. Finally, comments are made on the approximation error of certain families of losses. The derived generalization bounds are of order $\mathcal{O}(\sqrt{\log n /n})$ and valid without any further restrictions on the set of dictionaries with unit-norm columns.
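For intuition about the reconstruction loss above: the inner problem is radially symmetric, so $\Phi(x)$ reduces to a one-dimensional, Moreau-envelope-style minimization over $r = \|z\|_2$. The sketch below evaluates it numerically; the penalty $h(r) = 0.5|r|$ and the solver choice are assumptions for illustration only, not from the paper.

```python
# Illustrative sketch: evaluating Phi(x) = inf_z ||x - z||_2^2 + h(||z||_2).
# Because the inner problem is radially symmetric, it reduces to the
# one-dimensional problem inf_{r >= 0} (||x|| - r)^2 + h(r).
import numpy as np
from scipy.optimize import minimize_scalar

def phi(x, h, r_max=1e3):
    """Numerically evaluate Phi at x for an even, univariate penalty h."""
    norm_x = np.linalg.norm(x)
    inner = lambda r: (norm_x - r) ** 2 + h(r)
    res = minimize_scalar(inner, bounds=(0.0, r_max), method="bounded")
    return res.fun

h = lambda r: 0.5 * abs(r)            # example penalty, chosen for illustration
print(phi(np.array([3.0, -4.0]), h))  # Huber-like value at ||x|| = 5
```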
On Learning Sparsely Used Dictionaries from Incomplete Samples
Thanh Nguyen · Akshay Soni · Chinmay Hegde
Existing algorithms for dictionary learning assume that the entries of the (high-dimensional) input data are fully observed. However, in several practical applications, only an incomplete fraction of the data entries may be available. For incomplete settings, no provably correct and polynomial-time algorithm has been reported in the dictionary learning literature. In this paper, we provide provable approaches for learning -- from incomplete samples -- a family of dictionaries whose atoms have sufficiently ``spread-out'' mass. First, we propose a descent-style iterative algorithm that linearly converges to the true dictionary when provided a sufficiently coarse initial estimate. Second, we propose an initialization algorithm that utilizes a small number of extra fully observed samples to produce such a coarse initial estimate. Finally, we theoretically analyze their performance and provide asymptotic statistical and computational guarantees.
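A rough sketch of what a masked, descent-style dictionary update might look like is given below; it is not the authors' algorithm. The per-sample supports are assumed known, the sparse codes are fit by least squares on the observed coordinates only, and the step size and column re-normalization are illustrative choices.

```python
# Hedged sketch: one descent-style dictionary update when only a binary mask M
# of the data entries is observed. All names and constants are illustrative.
import numpy as np

def masked_descent_step(D, X, M, supports, step=0.1):
    """One gradient-style step on D using only entries where M == 1."""
    m, d = D.shape
    grad = np.zeros_like(D)
    for x, mask, S in zip(X.T, M.T, supports):
        if len(S) == 0:
            continue
        obs = mask.astype(bool)
        a = np.zeros(d)
        # Fit the code on the observed coordinates over the known support S.
        a[S] = np.linalg.lstsq(D[obs][:, S], x[obs], rcond=None)[0]
        resid = mask * (D @ a - x)            # residual on observed entries only
        grad += np.outer(resid, a)
    D_new = D - step * grad / X.shape[1]
    return D_new / np.linalg.norm(D_new, axis=0)  # keep columns unit-norm

# Tiny synthetic usage with 70% of the entries observed.
rng = np.random.default_rng(1)
D0 = rng.normal(size=(20, 5)); D0 /= np.linalg.norm(D0, axis=0)
A = rng.normal(size=(5, 50)) * (rng.random((5, 50)) < 0.4)
X = D0 @ A
M = (rng.random(X.shape) < 0.7).astype(float)
supports = [np.flatnonzero(A[:, j]) for j in range(50)]
D1 = masked_descent_step(D0 + 0.01 * rng.normal(size=D0.shape), X * M, M, supports)
```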
We study the complexity of the entire regularization path for least squares regression with 1-norm penalty, known as the Lasso. Every regression parameter in the Lasso changes linearly as a function of the regularization value. The number of changes is regarded as the Lasso's complexity. Experimental results using exact path following exhibit polynomial complexity of the Lasso in the problem size. Alas, the path complexity of the Lasso on artificially designed regression problems is exponential. We use smoothed analysis as a mechanism for bridging the gap between worst-case settings and the de facto low complexity. Our analysis assumes that the observed data has a tiny amount of intrinsic noise. We then prove that the Lasso's complexity is polynomial in the problem size.
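As an illustration of what "path complexity" measures, the sketch below counts the breakpoints of the exact Lasso path, computed with scikit-learn's LARS-Lasso solver, before and after adding a tiny amount of noise to the responses, in the spirit of the smoothed-analysis setting; the problem sizes and noise level are arbitrary choices.

```python
# Illustrative sketch: counting the linear segments (kinks) of the exact Lasso
# regularization path via LARS, with and without a tiny amount of added noise.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5)

alphas, _, _ = lars_path(X, y, method="lasso")
print("path breakpoints (noiseless):", len(alphas))

y_noisy = y + 1e-3 * rng.normal(size=n)      # tiny intrinsic noise
alphas_noisy, _, _ = lars_path(X, y_noisy, method="lasso")
print("path breakpoints (noisy):", len(alphas_noisy))
```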
Differentially Private Identity and Equivalence Testing of Discrete Distributions
Maryam Aliakbarpour · Ilias Diakonikolas · Ronitt Rubinfeld
We study the fundamental problems of identity and equivalence testing over a discrete population from random samples. Our goal is to develop efficient testers while guaranteeing differential privacy to the individuals of the population. We provide sample-efficient differentially private testers for these problems. Our theoretical results significantly improve over the best known algorithms for identity testing, and are the first results for private equivalence testing. The conceptual message of our work is that there exist private hypothesis testers that are nearly as sample-efficient as their non-private counterparts. We perform an experimental evaluation of our algorithms on synthetic data. Our experiments illustrate that our private testers achieve small type I and type II errors with sample size {\em sublinear} in the domain size of the underlying distributions.
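For intuition, the sketch below privatizes a simple collision-style identity-testing statistic with the Laplace mechanism. This is a generic construction under an assumed sensitivity bound, not the testers proposed in the paper; the threshold and sensitivity values are placeholders.

```python
# Hedged sketch (not the paper's tester): an eps-differentially private identity
# test obtained by adding Laplace noise, calibrated to an assumed sensitivity
# bound, to a standard collision-style test statistic.
import numpy as np

def private_identity_test(samples, q, eps, threshold, sensitivity):
    """Accept/reject H0: samples drawn from q, with eps-differential privacy."""
    k = len(q)
    counts = np.bincount(samples, minlength=k)
    n = counts.sum()
    stat = np.sum((counts - n * q) ** 2 - counts) / n     # collision-style statistic
    noisy_stat = stat + np.random.laplace(scale=sensitivity / eps)
    return noisy_stat <= threshold

q = np.full(10, 0.1)                                      # uniform reference distribution
samples = np.random.choice(10, size=2000, p=q)
print(private_identity_test(samples, q, eps=1.0, threshold=5.0, sensitivity=4.0))
```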