The empirical success of stateoftheart machine learning (ML) techniques has outpaced their theoretical understanding. Deep learning models, for example, perform far better than classical statistical learning theory predicts, leading to its widespread use by Industry and Government. At the same time, the deployment of ML systems that are not fully understood often leads to unexpected and detrimental individuallevel impact. Finally, the largescale adoption of ML means that ML systems are now critical infrastructure on which millions rely. In the face of these challenges, there is a critical need for theory that provides rigorous performance guarantees for practical ML models; guides the responsible deployment of ML in applications of social consequence; and enables the design of reliable ML systems in largescale, distributed environments.
For decades, information theory has provided a mathematical foundation for the systems and algorithms that fuel the current data science revolution. Recent advances in privacy, fairness, and generalization bounds demonstrate that information theory will also play a pivotal role in the next decade of ML applications: informationtheoretic methods can sharpen generalization bounds for deep learning, provide rigorous guarantees for compression of neural networks, promote fairness and privacy in ML training and deployment, and shed light on the limits of learning from noisy data.
We propose a workshop that brings together researchers and practitioners in ML and information theory to encourage knowledge transfer and collaboration between the sister fields. For information theorists, the workshop will highlight novel and sociallycritical research directions that promote reliable, responsible, and rigorous development of ML. Moreover, the workshop will expose ICML attendees to emerging informationtheoretic tools that may play a critical role in the next decade of ML applications.
Sat 7:00 a.m.  7:15 a.m.

Opening Remarks
(Intro & Welcome)
SlidesLive Video » 
Ahmad Beirami 🔗 
Sat 7:15 a.m.  8:00 a.m.

Virtual Poster Session #1 (Poster Session) link »  🔗 
Sat 8:00 a.m.  8:30 a.m.

Invited Talk: Maxim Raginsky
(Invited Talk)
SlidesLive Video » 
Maxim Raginsky 🔗 
Sat 8:30 a.m.  8:45 a.m.

Q&A: Maxim Raginsky
(Q&A)

🔗 
Sat 8:45 a.m.  9:15 a.m.

Invited Talk: Alex Dimakis
(Invited Talk)
SlidesLive Video » 
Alexandros Dimakis 🔗 
Sat 9:15 a.m.  9:30 a.m.

Q&A: Alex Dimakis
(Q&A)

🔗 
Sat 9:30 a.m.  10:00 a.m.

Small Break
(Break)

🔗 
Sat 10:00 a.m.  10:30 a.m.

Invited Talk: Kamalika Chaudhuri
(Invited Talk)
SlidesLive Video » 
Kamalika Chaudhuri 🔗 
Sat 10:30 a.m.  10:45 a.m.

Q&A: Kamalika Chaudhuri
(Q&A)

🔗 
Sat 10:45 a.m.  11:15 a.m.

Invited Talk: Todd Coleman
(Invited Talk)
SlidesLive Video » 
Todd Coleman 🔗 
Sat 11:15 a.m.  11:30 a.m.

Q&A: Todd Coleman
(Q&A)

🔗 
Sat 11:30 a.m.  11:45 a.m.

Contributed Talk #1
(Contributed Talk)
SlidesLive Video » 
Eric Lei · Hamed Hassani · Shirin Bidokhti 🔗 
Sat 11:45 a.m.  12:00 p.m.

Contributed Talk #2
(Contributed Talk)
SlidesLive Video » 
Borja Rodríguez Gálvez · Mikael Skoglund · Ragnar Thobaben · German Bassi 🔗 
Sat 12:00 p.m.  1:00 p.m.

Big Break
(Break)

🔗 
Sat 1:00 p.m.  1:45 p.m.

Panel Discussion
(Panel)
SlidesLive Video » 
🔗 
Sat 1:45 p.m.  2:15 p.m.

Invited Talk: Kush Varshney
(Invited Talk)
SlidesLive Video » 
Kush Varshney 🔗 
Sat 2:15 p.m.  2:30 p.m.

Q&A: Kush Varshney
(Q&A)

🔗 
Sat 2:30 p.m.  3:00 p.m.

Invited Talk: Thomas Steinke
(Invited Talk)
SlidesLive Video » 
Thomas Steinke 🔗 
Sat 3:00 p.m.  3:15 p.m.

Q&A: Thomas Steinke
(Q&A)

🔗 
Sat 3:15 p.m.  4:00 p.m.

Virtual Poster Session #2 (Poster Session) link »  🔗 
Sat 4:00 p.m.  4:30 p.m.

Invited Talk: Lalitha Sankar
(Invited Talk)
SlidesLive Video » 
Lalitha Sankar 🔗 
Sat 4:30 p.m.  4:45 p.m.

Q&A: Lalitha Sankar
(Q&A)

🔗 
Sat 4:45 p.m.  5:00 p.m.

Contributed Talk #3
(Contributed Talk)
SlidesLive Video » 
Sharu Jose · Osvaldo Simeone 🔗 
Sat 5:00 p.m.  5:15 p.m.

Contributed Talk #4
(Contributed Talk)
SlidesLive Video » 
Mohammad Samragh · Hossein Hosseini · Kambiz Azarian · Farinaz Koushanfar 🔗 
Sat 5:15 p.m.  5:45 p.m.

Invited Talk: David Tse
(Invited Talk)
SlidesLive Video » 
David Tse 🔗 
Sat 5:45 p.m.  6:00 p.m.

Q&A: David Tse
(Q&A)

🔗 
Sat 6:00 p.m.  6:15 p.m.

Concluding Remarks
SlidesLive Video » 
🔗 
Sat 6:15 p.m.  7:00 p.m.

Social Hour link »  🔗 


SingleShot Compression for Hypothesis Testing
(Poster)
[ Visit Poster at Spot D4 in Virtual World ]
Enhanced processing power in the cloud allows constrained devices to offload costly computations: these remote executions call for a new communication paradigm that optimizes performance on the computational task within a rate constraint. We consider a simple binary hypothesis testing scenario where the transmitter performs fixedlength singleshot compression on data sampled from one of two distributions; the receiver performs a hypothesis test on multiple received samples to determine the correct source distribution. We formulate the taskaware compression problem as finding the optimal source coder that maximizes the asymptotic performance of the hypothesis test on the receiver side under a rate constraint. We propose a new source coding strategy that outperforms universal fixedlength singleshot coding scheme for a range of rate constraints. 
Fabrizio Carpi · Siddharth Garg · Elza Erkip 🔗 


When Optimizing fdivergence is Robust with Label Noise
(Poster)
[ Visit Poster at Spot D3 in Virtual World ]
We show when maximizing a properly defined fdivergence measure with respect to a classifier's predictions and the supervised labels is robust with label noise. Leveraging its variational form, we derive a nice decoupling property for a family of fdivergence measures when label noise presents, where the divergence is shown to be a linear combination of the variational difference defined on the clean distribution and a bias term introduced due to the noise. The above derivation helps us analyze the robustness of different fdivergence functions. With established robustness, this family of fdivergence functions arises as useful metrics for the problem of learning with noisy labels, which do not require the specification of the labels' noise rate. When they are possibly not robust, we propose fixes to make them so. In addition to the analytical results, we present thorough experimental evidence. Our code is available at https://github.com/UCSCREAL/Robustfdivergencemeasures. 
Jiaheng Wei · Yang Liu 🔗 


Active privacyutility tradeoff against a hypothesis testing adversary
(Poster)
[ Visit Poster at Spot D2 in Virtual World ]
We consider privacyutility tradeoff for a timeseries data sharing scenario, where a user releases her data containing some personal information in return of a service. We model user's personal information as two correlated random variables, one of them, called the 
Ecenaz Erdemir · Pier Luigi Dragotti · Deniz Gunduz 🔗 


A unified PACBayesian framework for machine unlearning via information risk minimization
(Poster)
[ Visit Poster at Spot D1 in Virtual World ]
Machine unlearning refers to mechanisms that can remove the influence of a subset of training data upon request from a trained model without incurring the cost of retraining from scratch. This paper develops a unified PACBayesian framework for machine unlearning that recovers the two recent design principles  variational unlearning (Nguyen et al., 2020) and forgetting Lagrangian (Golatkar et al., 2020)  as information risk minimization problems (Zhang, 2006). Accordingly, both criteria can be interpreted as PACBayesian upper bounds on the test loss of the unlearned model that take the form of free energy metrics. 
Sharu Jose · Osvaldo Simeone 🔗 


Neural Networkbased Estimation of the MMSE
(Poster)
[ Visit Poster at Spot D0 in Virtual World ]
The minimum meansquare error (MMSE) achievable by optimal estimation of a random variable $S$ given another random variable $T$ is of much interest in a variety of statistical contexts. Motivated by a growing interest in auditing machine learning models for unintended information leakage, we propose a neural networkbased estimator of this MMSE. We derive a lower bound for the MMSE based on the proposed estimator and the Barron constant associated with the conditional expectation of $S$ given $T$. Since the latter is typically unknown in practice, we derive a general bound for the Barron constant that produces order optimal estimates for canonical distribution models.

Mario Diaz · Peter Kairouz · Lalitha Sankar 🔗 


Realizing GANs via a Tunable Loss Function
(Poster)
[ Visit Poster at Spot C6 in Virtual World ]
We introduce a tunable GAN, called $\alpha$GAN, parameterized by $\alpha \in (0,\infty]$, which interpolates between various $f$GANs and Integral Probability Metric based GANs (under constrained discriminator set). We construct $\alpha$GAN using a supervised loss function, namely, $\alpha$loss, which is a tunable loss function capturing several canonical losses. We show that $\alpha$GAN is intimately related to the Arimoto divergence, which was first proposed by \"{O}sterriecher (1996), and later studied by Liese and Vajda (2006). We posit that the holistic understanding that $\alpha$GAN introduces will have practical benefits of addressing both the issues of vanishing gradients and mode collapse.

Gowtham Raghunath Kurri · Tyler Sypherd · Lalitha Sankar 🔗 


True FewShot Learning with Language Models
(Poster)
[ Visit Poster at Spot C5 in Virtual World ]
Pretrained language models (LMs) perform well on many tasks even when learning from a few examples, but prior work uses many heldout examples to tune various aspects of learning, such as hyperparameters, training objectives, and natural language templates ("prompts"). Here, we evaluate the fewshot ability of LMs when such heldout examples are unavailable, a setting we call true fewshot learning. We test two model selection criteria, crossvalidation and minimum description length, for choosing LM prompts and hyperparameters in the true fewshot setting. On average, both marginally outperform random selection and greatly underperform selection based on heldout examples. Moreover, selection criteria often prefer models that perform significantly worse than randomlyselected ones. We find similar results even when varying number of examples used for selection, the amount of computation, and the selection criterion itself. Overall, our findings suggest that prior work significantly overestimated the true fewshot ability of LMs given the difficulty of fewshot model selection. 
Ethan Perez · Douwe Kiela · Kyunghyun Cho 🔗 


Learning under Distribution Mismatch and Model Misspecification
(Poster)
[ Visit Poster at Spot C4 in Virtual World ]
We study learning algorithms when there is a mismatch between the distributions of the training and test datasets of a learning algorithm. The effect of this mismatch on the generalization error and model misspecification are quantified. Moreover, we provide a connection between the generalization error and the ratedistortion theory, which allows one to use bounds from the ratedistortion theory to derive new bounds on the generalization error and vice versa. In particular, the ratedistortionbased bound strictly improves over the earlier bound by Xu and Raginsky even when there is no mismatch. 
Mohammad Saeed Masiha · Mohammad Reza Aref 🔗 


Soft BIBD and Product Gradient Codes: Coding Theoretic Constructions to Mitigate Stragglers in Distributed Learning
(Poster)
[ Visit Poster at Spot C3 in Virtual World ]
Conventional approaches to mitigate stragglers in distributed synchronous gradient descent include predicting and avoiding stragglers or replicating tasks across workers. However, predicting stragglers require substantial data about the computing nodes in the distributed system, and coding theoretic approaches provide more efficient use of redundant computations to mitigate stragglers. In this paper, we consider a coding theoretic framework, called gradient coding, to mitigate stragglers. Recently, Kadhe et al. proposed a gradient code based on a combinatorial design, called balanced incomplete block design (BIBD), which is shown to outperform many existing gradient codes, but exist for a very limited set of parameters. In this paper, we aim to overcome such limitations and construct gradient codes which exist for a wide range of parameters while retaining the superior performance of BIBD gradient codes.Two such constructions are proposed, one based on a probabilistic construction that relax the stringent BIBD gradient code constraints, and the other based on taking the Kronecker product of existing gradient codes.Theoretical error bounds for worstcase adversarial straggling scenarios are derived. Simulations show that the proposed constructions outperform existing gradient codes with similar redundancy per data piece. 
Animesh Sakorikar · Lele Wang 🔗 


InformationGuided Sampling for LowRank Matrix Completion
(Poster)
[ Visit Poster at Spot C2 in Virtual World ]
The matrix completion problem, which aims to recover a lowrank matrix X from partial, noisy observations of its entries, arises in many machine learning applications. In this work, we present a novel informationtheoretic framework for initial and sequential sampling of matrix entries for noisy matrix completion, based on the maximum entropy sampling principle in Shewry & Wynn (1987). The key novelty in our approach is that it makes use of uncertainty quantification (UQ) – a measure of uncertainty for unobserved entries – to guide the sampling procedure. Our framework reveals new insights on the role of coherence and coding design on sampling for matrix completion. 
Simon Mak · Shaowu Yuchi · Yao Xie 🔗 


Sliced Mutual Information: A Scalable Measure of Statistical Dependence
(Poster)
[ Visit Poster at Spot C1 in Virtual World ]
Mutual information (MI) is a fundamental measure of statistical dependence, with a myriad of applications to information theory, statistics, and machine learning. While it possesses many desirable structural properties, the estimation of highdimensional MI from samples suffers from the curse of dimensionality. Motivated by scalability to high dimensions, this paper proposes \emph{sliced} MI (SMI) as a surrogate measure of dependence. SMI is defined as an average of MI terms between onedimensional random projections. We show that it preserves many of the structural properties of classic MI, while gaining scalable computation and estimation from samples. Furthermore, and in contrast to classic MI, SMI can be created from computation. This violation of the data processing inequality is consistent with modern feature extraction techniques that often result in postprocessed representations that are more useful for inference than the raw data. Our theory is supported by a numerical study of independence testing and feature extraction, which demonstrate the potential gains SMI offers over classic MI for highdimensional inference. 
Ziv Goldfeld · Kristjan Greenewald 🔗 


Minimax Bounds for Generalized Pairwise Comparisons
(Poster)
[ Visit Poster at Spot C0 in Virtual World ]
We introduce the General Pairwise Model (GPM), a general parametric framework for pairwise comparison. Under the umbrella of the exponential family, the GPM unifies many popular models with discrete observations, including the Thurstone (Case V), BerryTerryLuce (BTL) and Ordinal Models, along with models with continuous observations, such as the Gaussian Pairwise Cardinal Model. We establish minimax lower bounds with tight topological dependence. When applied as a special case to the Ordinal Model, our results uniformly improve upon previously known lower bounds and confirms one direction of a conjecture put forth by Shah et al., (2016). Performance guarantees of the MLE for a broad class of GPMs with subgaussian assumptions are given and compared against our lower bounds, showing that in many natural settings the MLE is optimal up to constants. Matching lower and upper bounds (up to constants) are achieved by the Gaussian Pairwise Cardinal Model, suggesting that our lower bounds are bestpossible under the few assumptions we adopt. 
KuanYun Lee · Thomas Courtade 🔗 


Characterizing the Generalization Error of Gibbs Algorithm with Symmetrized KL information
(Poster)
[ Visit Poster at Spot B6 in Virtual World ]
Bounding the generalization error of a supervised learning algorithm is one of the most important problems in learning theory, and various approaches have been developed. However, existing bounds are often loose and lack of guarantees. As a result, they may fail to characterize the exact generalization ability of a learning algorithm. Our main contribution is an exact characterization of the expected generalization error of the wellknown Gibbs algorithm in terms of symmetrized KL information between the input training samples and the output hypothesis. Such a result can be applied to tighten existing expected generalization error bound. Our analysis provides more insight on the fundamental role the symmetrized KL information plays in controlling the generalization error of the Gibbs algorithm. 
Gholamali Aminian · Yuheng Bu · Laura Toni · Miguel Rodrigues · Gregory Wornell 🔗 


Predictionfocused Mixture Models
(Poster)
[ Visit Poster at Spot B5 in Virtual World ]
In several applications, besides getting a generative model of the data, we also want the model to be useful for specific downstream tasks. Mixture models are useful for identifying discrete components in the data, but may not identify components useful for downstream tasks if misspecified; further, current inference techniques often fail to overcome misspecification even when a supervisory signal is provided. We introduce the predictionfocused mixture model, which selects and models input features relevant to predicting the targets. We demonstrate that our approach identifies relevant signal from inputs even when the model is highly misspecified. 
Abhishek Sharma · Sanjana Narayanan · Catherine Zeng · Finale DoshiVelez 🔗 


Tighter Expected Generalization Error Bounds via Wasserstein Distance
(Poster)
[ Visit Poster at Spot B4 in Virtual World ]
This work presents several expected generalization error bounds based on the Wasserstein distance. More specifically, it introduces fulldataset, singleletter, and randomsubset bounds, and their analogous in the randomized subsample setting from Steinke and Zakynthinou (2020).
Moreover, when the loss function is bounded and the geometry of the space is ignored by the choice of the metric in the Wasserstein distance, these bounds recover from below (and thus, are tighter than) current bounds based on the relative entropy. In particular, they generate new, nonvacuous bounds based on the relative entropy.
Therefore, these results can be seen as a bridge between works that account for the geometry of the hypothesis space and those based on the relative entropy, which are agnostic to such geometry.
Furthermore, it is shown how to produce various new bounds based on different information measures (e.g., the lautum information or several $f$divergences) based on these bounds and how to derive similar bounds with respect to the backward channel using the presented proof techniques.

Borja Rodríguez Gálvez · German Bassi · Ragnar Thobaben · Mikael Skoglund 🔗 


DataDependent PACBayesian Bounds in the RandomSubset Setting with Applications to Neural Networks
(Poster)
[ Visit Poster at Spot B3 in Virtual World ]
The PACBayesian framework has proven to be a useful tool to obtain nonvacuous generalization bounds for modern learning algorithms, such as overparameterized neural networks. A known heuristic to tighten such bounds is to use datadependent priors. In this paper, we show how the informationtheoretically motivated randomsubset setting introduced by Steinke & Zakynthinou (2020) enables the derivation of PACBayesian bounds that naturally involve a datadependent prior. We evaluate these bounds for neural networks trained on MNIST and FashionMNIST, and study their dependence on the training set size, the achieved training accuracy, and the effect of randomized labels. 
Fredrik Hellström · Giuseppe Durisi 🔗 


Active Sampling for Binary Gaussian Model Testing in High Dimensions
(Poster)
[ Visit Poster at Spot B2 in Virtual World ]
A sample $X\in\R^{p}$ is generated according to one of the two known $p$dimensional Gaussian models. This paper establishes the informationtheoretic limits of {\bf actively} sampling the onedimensional coordinates of $X$ for detecting its true model as $p\to\infty$. We address three questions pertinent to active model detecting. First, we establish conditions under which the problem is feasible under controlled error rates. Secondly, we characterize the informationtheoretic lower bound on the coordinatelevel sample complexity for model detection. Finally, we provide an active sampling algorithm that progressively identifies the coordinates to be sampled, and prove that this algorithm's sample complexity meets the informationtheoretic limit in the asymptote of large $p$.

Javad Heydari · Ali Tajer 🔗 


Unsupervised Information Obfuscation for Split Inference of Neural Networks
(Poster)
[ Visit Poster at Spot B1 in Virtual World ]
Splitting network computations between the edge device and a server enables low edgecompute inference of neural networks but might expose sensitive information about the test query to the server. To address this problem, existing techniques train the model to minimize information leakage for a given set of sensitive attributes. In practice, however, the test queries might contain attributes that are not foreseen during training. We propose instead an unsupervised obfuscation method to discard the information irrelevant to the main task. We formulate the problem via an information theoretical framework and derive an analytical solution for a given distortion to the model output. In our method, the edge device runs the model up to a split layer determined based on its computational capacity. It then obfuscates the obtained feature vector by removing the components in the null space of the next layer of the model as well as the lowenergy components of the remaining signal. Our experimental results show that our method outperforms existing techniques in removing the information of the irrelevant attributes, reduces the communication cost, maintains the accuracy, and incurs only a small computational overhead. 
Mohammad Samragh · Hossein Hosseini · Aleksei Triastcyn · Kambiz Azarian · Joseph B Soriaga · Farinaz Koushanfar 🔗 


Withinlayer Diversity Reduces Generalization Gap
(Poster)
[ Visit Poster at Spot B0 in Virtual World ]
Neural networks are composed of multiple layers arranged in a hierarchical structure jointly trained with a gradientbased optimization. At each optimization step, neurons at a given layer receive feedback from neurons belonging to higher layers of the hierarchy. In this paper, we propose to complement this traditional 'betweenlayer' feedback with additional 'withinlayer' feedback to encourage diversity of the activations within the same layer. To this end, we measure the pairwise similarity between the outputs of the neurons and use it to model the layer's overall diversity. By penalizing similarities and promoting diversity, we encourage each neuron to learn a distinctive representation and, thus, to enrich the data representation learned within the layer and to increase the total capacity of the model. We theoretically and empirically study how the withinlayer activation diversity affects the generalization performance of a neural network and prove that increasing the diversity of hidden activations reduces the generalization gap. 
Firas Laakom · Jenni Raitoharju · Alexandros Iosifidis · Moncef Gabbouj 🔗 


Entropic Causal Inference: Identifiability for Trees and Complete Graphs
(Poster)
[ Visit Poster at Spot A6 in Virtual World ]
Entropic causal inference is a recent framework for learning the causal graph between two variables from observational data for models with small entropy. In this paper, we first extend the causal graph identifiability result in the twovariable setting under relaxed assumptions. Next, we show the first identifiability result using the entropic approach for learning causal graphs with more than two nodes. We provide a sequential peeling algorithm that is provably correct for trees and complete graphs. We also propose a heuristic algorithm for small graphs. We conduct rigorous experiments that demonstrate performance of our algorithms under synthetic data with different generative models, as well as with realworld data. 
Spencer Compton · Murat Kocaoglu · Kristjan Greenewald · Dmitriy Katz 🔗 


Towards a Unified InformationTheoretic Framework for Generalization
(Poster)
[ Visit Poster at Spot A5 in Virtual World ]
In this work, we investigate the expressiveness of the ``conditional mutual information'' (CMI) framework of Steinke and Zakynthinou (2020) and the prospect of using it to provide a unified framework for proving generalization bounds, with a focus on the realizable setting. We first demonstrate that one can use this framework to express nontrivial (but suboptimal) bounds for any learning algorithm that outputs hypotheses from a class of bounded VC dimension. Then, we prove that the CMI framework yields the optimal bound on the expected risk of Support Vector Machines (SVMs) for learning halfspaces. This result is an application of our general result showing that stable compression schemes Bousquetet al. (2020) of size k have uniformly bounded CMI of order O(k). We further show that an inherent limitation of proper learning of VC classes contradicts the existence of a proper learner with constant CMI, and it implies a negative resolution to an open problem of Steinke and Zakynthinou (2020). We further study the CMI of empirical risk minimizers (ERMs) of class H and show that it is possible to output all consistent classifiers (version space) with bounded CMI if and only if H has a bounded star number (Hanneke and Yang (2015)). 
Mahdi Haghifam · Gintare Karolina Dziugaite · Shay Moran 🔗 


Subpopulation Guarantees for Importance Weights and KLDivergence Estimation
(Poster)
[ Visit Poster at Spot A4 in Virtual World ]
The ratio between the probability that two distributions give to points in the domain are called importance weights or density ratios and they play a fundamental role in machine learning and information theory. Estimating them from samples is the subject of extensive research in multiple communities. However, there are strong lower bounds known for pointwise accurate estimation, and most theoretical guarantees require strong assumptions about the distributions. In this work we sidestep the lower bound by providing accuracy guarantees for the estimation of importance weights over \emph{subpopulations} belonging to a family $\mC$ of subsets of the domain. We argue that they capture intuitive expectations about importance weights, and could be viewed as a notion of a multigroup fairness guarantee. We formulate {\em sandwiching bounds} for sets: upper and lower bounds on the expected importance weight conditioned on a set. Additionally we show upper and lower bound on the expected log of the importance weight, often referred as the KullbackLeibler (KL) divergence between the distributions, a problem which received a lot of attention on its own. Our techniques are inspired by the notion of a multicalibrated predictor \cite{hkrr2018}, originally perceived as a fairness guarantee for predictors. We demonstrate the effectiveness of our approach in experiments.

Parikshit Gopalan · Nina Narodytska · Omer Reingold · Vatsal Sharan · Udi Wieder 🔗 


Excess Risk Analysis of Learning Problems via Entropy Continuity
(Poster)
[ Visit Poster at Spot A3 in Virtual World ]
This paper gives an overview of a recently developed framework of excess risk analysis of learning problems using the distributional continuity of the generalized entropy. The three paradigms of learning problems under consideration include: 1) frequentist learning, 2) Bayesian learning, and 3) learning by first fitting the empirical distribution to a predefined family of distributions and then designing the decision rule under the fitted distribution. It is shown that the excess risks in the first two paradigms can be studied via the continuity of various forms of generalized unconditional entropy, while the third paradigm can be studied via the continuity of generalized conditional entropy. Representative bounds derived for each learning problem are presented and discussed. 
Aolin Xu 🔗 


OutofDistribution Robustness in Deep Learning Compression
(Poster)
[ Visit Poster at Spot A2 in Virtual World ]
In recent years, deep neural network (DNN) compression systems have proved to be highly effective for designing source codes for many natural sources. However, like many other machine learning systems, these compressors suffer from vulnerabilities to distribution shifts as well as outofdistribution (OOD) data, which reduces their realworld applications. In this paper, we initiate the study of OOD robust compression. Considering robustness to distributions within a Wasserstein ball around a base distribution, we propose algorithmic and architectural frameworks built on two principled methods: one that trains DNN compressors using distributionallyrobust optimization (DRO), and the other which uses a structured latent code. Our results demonstrate that both methods enforce robustness compared to a standard DNN compressor, and that using a structured code can be superior to the DRO compressor. We observe tradeoffs between robustness and distortion and corroborate these findings theoretically for a specific class of sources. 
Eric Lei · Hamed Hassani 🔗 


Coded PrivacyPreserving Computation at Edge Networks
(Poster)
[ Visit Poster at Spot A1 in Virtual World ]
Multiparty computation (MPC) is promising for privacy preserving machine learning algorithms at edge networks, like federated learning. Despite their potential, existing MPC algorithms fail short of adapting to the limited resources of edge devices. A promising solution, and the focus of this work, is coded computation, which advocates the use of errorcorrecting codes to improve the performance of distributed computing through “smart” data redundancy. In this paper, we design novel coded privacypreserving computation mechanisms; MatDot coded MPC (MatDotCMPC) and PolyDot coded MPC (PolyDotCMPC) by employing recently proposed coded computation algorithms; MatDot and PolyDot. We take advantage of the “garbage terms” that naturally arise when polynomials are constructed in the design of MatDotCMPC and PolyDotCMPC to reduce the number of workers needed for privacypreserving computation. 
Elahe Vedadi · Yasaman Keshtkarjahromi · Hulya Seferoglu 🔗 


aVAEs : Optimising variational inference by learning datadependent divergence skew
(Poster)
[ Visit Poster at Spot A0 in Virtual World ]
The {\em skewgeometric JensenShannon divergence} $\left(\textrm{JS}^{\textrm{G}_{\alpha}}\right)$ allows for an intuitive interpolation between forward and reverse KullbackLeibler (KL) divergence based on the skew parameter $\alpha$. While the benefits of the skew in $\textrm{JS}^{\textrm{G}_{\alpha}}$ are clearbalancing forward/reverse KL in a comprehensible mannerthe choice of optimal skew remains opaque and requires an expensive grid search. In this paper we introduce $\alpha$VAEs, which extend the $\textrm{JS}^{\textrm{G}_{\alpha}}$ variational autoencoder by allowing for learnable, and therefore datadependent, skew. We motivate the use of a parameterised skew in the dual divergence by analysing trends dependent on data complexity in synthetic examples. We also prove and discuss the dependency of the divergence minimum on the input data and encoder parameters, before empirically demonstrating that this dependency does not reduce to either direction of KL divergence for benchmark datasets. Finally, we demonstrate that optimised skew values consistently converge across a range of initial values and provide improved denoising and reconstruction properties. These render $\alpha$VAEs an efficient and practical modelling choice across a range of tasks, datasets, and domains.

Jacob Deasy · Tom McIver · Nikola Simidjievski · Pietro Lió 🔗 