Information-Theoretic Methods for Rigorous, Responsible, and Reliable Machine Learning (ITR3)

Ahmad Beirami, Flavio Calmon, Berivan Isik, Haewon Jeong, Matthew Nokleby, Cynthia Rush


The empirical success of state-of-the-art machine learning (ML) techniques has outpaced their theoretical understanding. Deep learning models, for example, perform far better than classical statistical learning theory predicts, leading to its widespread use by Industry and Government. At the same time, the deployment of ML systems that are not fully understood often leads to unexpected and detrimental individual-level impact. Finally, the large-scale adoption of ML means that ML systems are now critical infrastructure on which millions rely. In the face of these challenges, there is a critical need for theory that provides rigorous performance guarantees for practical ML models; guides the responsible deployment of ML in applications of social consequence; and enables the design of reliable ML systems in large-scale, distributed environments.

For decades, information theory has provided a mathematical foundation for the systems and algorithms that fuel the current data science revolution. Recent advances in privacy, fairness, and generalization bounds demonstrate that information theory will also play a pivotal role in the next decade of ML applications: information-theoretic methods can sharpen generalization bounds for deep learning, provide rigorous guarantees for compression of neural networks, promote fairness and privacy in ML training and deployment, and shed light on the limits of learning from noisy data.

We propose a workshop that brings together researchers and practitioners in ML and information theory to encourage knowledge transfer and collaboration between the sister fields. For information theorists, the workshop will highlight novel and socially-critical research directions that promote reliable, responsible, and rigorous development of ML. Moreover, the workshop will expose ICML attendees to emerging information-theoretic tools that may play a critical role in the next decade of ML applications.

Chat is not available.

Timezone: »


Sat 7:00 a.m. - 7:15 a.m.
Opening Remarks (Intro & Welcome)   
Ahmad Beirami
Sat 7:15 a.m. - 8:00 a.m.
Virtual Poster Session #1 (Poster Session)  link »
Sat 8:00 a.m. - 8:30 a.m.
Invited Talk: Maxim Raginsky (Invited Talk)   
Maxim Raginsky
Sat 8:30 a.m. - 8:45 a.m.
Q&A: Maxim Raginsky (Q&A)
Sat 8:45 a.m. - 9:15 a.m.
Invited Talk: Alex Dimakis (Invited Talk)   
Alex Dimakis
Sat 9:15 a.m. - 9:30 a.m.
Q&A: Alex Dimakis (Q&A)
Sat 9:30 a.m. - 10:00 a.m.
Small Break (Break)
Sat 10:00 a.m. - 10:30 a.m.
Invited Talk: Kamalika Chaudhuri (Invited Talk)   
Kamalika Chaudhuri
Sat 10:30 a.m. - 10:45 a.m.
Q&A: Kamalika Chaudhuri (Q&A)
Sat 10:45 a.m. - 11:15 a.m.
Invited Talk: Todd Coleman (Invited Talk)   
Todd Coleman
Sat 11:15 a.m. - 11:30 a.m.
Q&A: Todd Coleman (Q&A)
Sat 11:30 a.m. - 11:45 a.m.
Contributed Talk #1 (Contributed Talk)   
Eric Lei, Hamed Hassani, Shirin Bidokhti
Sat 11:45 a.m. - 12:00 p.m.
Contributed Talk #2 (Contributed Talk)   
Borja Rodríguez Gálvez, Mikael Skoglund, Ragnar Thobaben, German Bassi
Sat 12:00 p.m. - 1:00 p.m.
Big Break (Break)
Sat 1:00 p.m. - 1:45 p.m.
Panel Discussion (Panel)   
Sat 1:45 p.m. - 2:15 p.m.
Invited Talk: Kush Varshney (Invited Talk)   
Kush Varshney
Sat 2:15 p.m. - 2:30 p.m.
Q&A: Kush Varshney (Q&A)
Sat 2:30 p.m. - 3:00 p.m.
Invited Talk: Thomas Steinke (Invited Talk)   
Thomas Steinke
Sat 3:00 p.m. - 3:15 p.m.
Q&A: Thomas Steinke (Q&A)
Sat 3:15 p.m. - 4:00 p.m.
Virtual Poster Session #2 (Poster Session)  link »
Sat 4:00 p.m. - 4:30 p.m.
Invited Talk: Lalitha Sankar (Invited Talk)   
Lalitha Sankar
Sat 4:30 p.m. - 4:45 p.m.
Q&A: Lalitha Sankar (Q&A)
Sat 4:45 p.m. - 5:00 p.m.
Contributed Talk #3 (Contributed Talk)   
Sharu Jose, Osvaldo Simeone
Sat 5:00 p.m. - 5:15 p.m.
Contributed Talk #4 (Contributed Talk)   
Mohammad Samragh, Hossein Hosseini, Kambiz Azarian, Farinaz Koushanfar
Sat 5:15 p.m. - 5:45 p.m.
Invited Talk: David Tse (Invited Talk)   
David Tse
Sat 5:45 p.m. - 6:00 p.m.
Q&A: David Tse (Q&A)
Sat 6:00 p.m. - 6:15 p.m.
Concluding Remarks   
Sat 6:15 p.m. - 7:00 p.m.
Social Hour  link »
[ Visit Poster at Spot D4 in Virtual World ]

Enhanced processing power in the cloud allows constrained devices to offload costly computations: these remote executions call for a new communication paradigm that optimizes performance on the computational task within a rate constraint. We consider a simple binary hypothesis testing scenario where the transmitter performs fixed-length single-shot compression on data sampled from one of two distributions; the receiver performs a hypothesis test on multiple received samples to determine the correct source distribution. We formulate the task-aware compression problem as finding the optimal source coder that maximizes the asymptotic performance of the hypothesis test on the receiver side under a rate constraint. We propose a new source coding strategy that outperforms universal fixed-length single-shot coding scheme for a range of rate constraints.

Fabrizio Carpi, Siddharth Garg, Elza Erkip
[ Visit Poster at Spot D3 in Virtual World ]

We show when maximizing a properly defined f-divergence measure with respect to a classifier's predictions and the supervised labels is robust with label noise. Leveraging its variational form, we derive a nice decoupling property for a family of f-divergence measures when label noise presents, where the divergence is shown to be a linear combination of the variational difference defined on the clean distribution and a bias term introduced due to the noise. The above derivation helps us analyze the robustness of different f-divergence functions. With established robustness, this family of f-divergence functions arises as useful metrics for the problem of learning with noisy labels, which do not require the specification of the labels' noise rate. When they are possibly not robust, we propose fixes to make them so. In addition to the analytical results, we present thorough experimental evidence. Our code is available at

Jiaheng Wei, Yang Liu
[ Visit Poster at Spot D2 in Virtual World ]

We consider privacy-utility trade-off for a time-series data sharing scenario, where a user releases her data containing some personal information in return of a service. We model user's personal information as two correlated random variables, one of them, called the secret variable'', is to be kept private, while the other, called theuseful variable'', is to be disclosed for utility. We consider active sequential data release, where the user chooses from among a finite set of release mechanisms with different statistics in such a way that the latent useful hypothesis is maximally revealed to the legitimate receiver while the confidence in the true secret variable is kept below a predefined level. For the utility, we consider both the probability of correct detection of the useful variable and the mutual information (MI) between the useful variable and released data. We formulate both problems as a Markov decision process (MDP), and numerically solve them by advantage actor-critic (A2C) deep reinforcement learning (RL).

Ecenaz Erdemir, Pier Luigi Dragotti, Deniz Gunduz
[ Visit Poster at Spot D1 in Virtual World ]

Machine unlearning refers to mechanisms that can remove the influence of a subset of training data upon request from a trained model without incurring the cost of re-training from scratch. This paper develops a unified PAC-Bayesian framework for machine unlearning that recovers the two recent design principles -- variational unlearning (Nguyen et al., 2020) and forgetting Lagrangian (Golatkar et al., 2020) -- as information risk minimization problems (Zhang, 2006). Accordingly, both criteria can be interpreted as PAC-Bayesian upper bounds on the test loss of the unlearned model that take the form of free energy metrics.

Sharu Jose, Osvaldo Simeone
[ Visit Poster at Spot D0 in Virtual World ]
The minimum mean-square error (MMSE) achievable by optimal estimation of a random variable $S$ given another random variable $T$ is of much interest in a variety of statistical contexts. Motivated by a growing interest in auditing machine learning models for unintended information leakage, we propose a neural network-based estimator of this MMSE. We derive a lower bound for the MMSE based on the proposed estimator and the Barron constant associated with the conditional expectation of $S$ given $T$. Since the latter is typically unknown in practice, we derive a general bound for the Barron constant that produces order optimal estimates for canonical distribution models.
Mario Diaz, Peter Kairouz, Lalitha Sankar
[ Visit Poster at Spot C6 in Virtual World ]
We introduce a tunable GAN, called $\alpha$-GAN, parameterized by $\alpha \in (0,\infty]$, which interpolates between various $f$-GANs and Integral Probability Metric based GANs (under constrained discriminator set). We construct $\alpha$-GAN using a supervised loss function, namely, $\alpha$-loss, which is a tunable loss function capturing several canonical losses. We show that $\alpha$-GAN is intimately related to the Arimoto divergence, which was first proposed by \"{O}sterriecher (1996), and later studied by Liese and Vajda (2006). We posit that the holistic understanding that $\alpha$-GAN introduces will have practical benefits of addressing both the issues of vanishing gradients and mode collapse.
Gowtham Raghunath Kurri, Tyler Sypherd, Lalitha Sankar
[ Visit Poster at Spot C5 in Virtual World ]

Pretrained language models (LMs) perform well on many tasks even when learning from a few examples, but prior work uses many held-out examples to tune various aspects of learning, such as hyperparameters, training objectives, and natural language templates ("prompts"). Here, we evaluate the few-shot ability of LMs when such held-out examples are unavailable, a setting we call true few-shot learning. We test two model selection criteria, cross-validation and minimum description length, for choosing LM prompts and hyperparameters in the true few-shot setting. On average, both marginally outperform random selection and greatly underperform selection based on held-out examples. Moreover, selection criteria often prefer models that perform significantly worse than randomly-selected ones. We find similar results even when varying number of examples used for selection, the amount of computation, and the selection criterion itself. Overall, our findings suggest that prior work significantly overestimated the true few-shot ability of LMs given the difficulty of few-shot model selection.

Ethan Perez, Douwe Kiela, Kyunghyun Cho
[ Visit Poster at Spot C4 in Virtual World ]

We study learning algorithms when there is a mismatch between the distributions of the training and test datasets of a learning algorithm. The effect of this mismatch on the generalization error and model misspecification are quantified. Moreover, we provide a connection between the generalization error and the rate-distortion theory, which allows one to use bounds from the rate-distortion theory to derive new bounds on the generalization error and vice versa. In particular, the rate-distortion-based bound strictly improves over the earlier bound by Xu and Raginsky even when there is no mismatch.

Mohammad Saeed Masiha, Mohammad Reza Aref
[ Visit Poster at Spot C3 in Virtual World ]

Conventional approaches to mitigate stragglers in distributed synchronous gradient descent include predicting and avoiding stragglers or replicating tasks across workers. However, predicting stragglers require substantial data about the computing nodes in the distributed system, and coding theoretic approaches provide more efficient use of redundant computations to mitigate stragglers. In this paper, we consider a coding theoretic framework, called gradient coding, to mitigate stragglers. Recently, Kadhe et al. proposed a gradient code based on a combinatorial design, called balanced incomplete block design (BIBD), which is shown to outperform many existing gradient codes, but exist for a very limited set of parameters. In this paper, we aim to overcome such limitations and construct gradient codes which exist for a wide range of parameters while retaining the superior performance of BIBD gradient codes.Two such constructions are proposed, one based on a probabilistic construction that relax the stringent BIBD gradient code constraints, and the other based on taking the Kronecker product of existing gradient codes.Theoretical error bounds for worst-case adversarial straggling scenarios are derived. Simulations show that the proposed constructions outperform existing gradient codes with similar redundancy per data piece.

Animesh Sakorikar, Lele Wang
[ Visit Poster at Spot C2 in Virtual World ]

The matrix completion problem, which aims to recover a low-rank matrix X from partial, noisy observations of its entries, arises in many machine learning applications. In this work, we present a novel information-theoretic framework for initial and sequential sampling of matrix entries for noisy matrix completion, based on the maximum entropy sampling principle in Shewry & Wynn (1987). The key novelty in our approach is that it makes use of uncertainty quantification (UQ) – a measure of uncertainty for unobserved entries – to guide the sampling procedure. Our framework reveals new insights on the role of coherence and coding design on sampling for matrix completion.

Simon Mak, Shaowu Yuchi, Yao Xie
[ Visit Poster at Spot C1 in Virtual World ]

Mutual information (MI) is a fundamental measure of statistical dependence, with a myriad of applications to information theory, statistics, and machine learning. While it possesses many desirable structural properties, the estimation of high-dimensional MI from samples suffers from the curse of dimensionality. Motivated by scalability to high dimensions, this paper proposes \emph{sliced} MI (SMI) as a surrogate measure of dependence. SMI is defined as an average of MI terms between one-dimensional random projections. We show that it preserves many of the structural properties of classic MI, while gaining scalable computation and estimation from samples. Furthermore, and in contrast to classic MI, SMI can be created from computation. This violation of the data processing inequality is consistent with modern feature extraction techniques that often result in post-processed representations that are more useful for inference than the raw data. Our theory is supported by a numerical study of independence testing and feature extraction, which demonstrate the potential gains SMI offers over classic MI for high-dimensional inference.

Ziv Goldfeld, Kristjan Greenewald
[ Visit Poster at Spot C0 in Virtual World ]

We introduce the General Pairwise Model (GPM), a general parametric framework for pairwise comparison. Under the umbrella of the exponential family, the GPM unifies many popular models with discrete observations, including the Thurstone (Case V), Berry-Terry-Luce (BTL) and Ordinal Models, along with models with continuous observations, such as the Gaussian Pairwise Cardinal Model. We establish minimax lower bounds with tight topological dependence. When applied as a special case to the Ordinal Model, our results uniformly improve upon previously known lower bounds and confirms one direction of a conjecture put forth by Shah et al., (2016). Performance guarantees of the MLE for a broad class of GPMs with subgaussian assumptions are given and compared against our lower bounds, showing that in many natural settings the MLE is optimal up to constants. Matching lower and upper bounds (up to constants) are achieved by the Gaussian Pairwise Cardinal Model, suggesting that our lower bounds are best-possible under the few assumptions we adopt.

Kuan-Yun Lee, Thomas Courtade
[ Visit Poster at Spot B6 in Virtual World ]

Bounding the generalization error of a supervised learning algorithm is one of the most important problems in learning theory, and various approaches have been developed. However, existing bounds are often loose and lack of guarantees. As a result, they may fail to characterize the exact generalization ability of a learning algorithm. Our main contribution is an exact characterization of the expected generalization error of the well-known Gibbs algorithm in terms of symmetrized KL information between the input training samples and the output hypothesis. Such a result can be applied to tighten existing expected generalization error bound. Our analysis provides more insight on the fundamental role the symmetrized KL information plays in controlling the generalization error of the Gibbs algorithm.

Gholamali Aminian, Yuheng Bu, Laura Toni, Miguel Rodrigues, Gregory Wornell
[ Visit Poster at Spot B5 in Virtual World ]

In several applications, besides getting a generative model of the data, we also want the model to be useful for specific downstream tasks. Mixture models are useful for identifying discrete components in the data, but may not identify components useful for downstream tasks if misspecified; further, current inference techniques often fail to overcome misspecification even when a supervisory signal is provided. We introduce the prediction-focused mixture model, which selects and models input features relevant to predicting the targets. We demonstrate that our approach identifies relevant signal from inputs even when the model is highly misspecified.

Abhishek Sharma, Sanjana Narayanan, Catherine Zeng, Finale Doshi-Velez
[ Visit Poster at Spot B4 in Virtual World ]
This work presents several expected generalization error bounds based on the Wasserstein distance. More specifically, it introduces full-dataset, single-letter, and random-subset bounds, and their analogous in the randomized subsample setting from Steinke and Zakynthinou (2020). Moreover, when the loss function is bounded and the geometry of the space is ignored by the choice of the metric in the Wasserstein distance, these bounds recover from below (and thus, are tighter than) current bounds based on the relative entropy. In particular, they generate new, non-vacuous bounds based on the relative entropy. Therefore, these results can be seen as a bridge between works that account for the geometry of the hypothesis space and those based on the relative entropy, which are agnostic to such geometry. Furthermore, it is shown how to produce various new bounds based on different information measures (e.g., the lautum information or several $f$-divergences) based on these bounds and how to derive similar bounds with respect to the backward channel using the presented proof techniques.
Borja Rodríguez Gálvez, German Bassi, Ragnar Thobaben, Mikael Skoglund
[ Visit Poster at Spot B3 in Virtual World ]

The PAC-Bayesian framework has proven to be a useful tool to obtain nonvacuous generalization bounds for modern learning algorithms, such as overparameterized neural networks. A known heuristic to tighten such bounds is to use data-dependent priors. In this paper, we show how the information-theoretically motivated random-subset setting introduced by Steinke & Zakynthinou (2020) enables the derivation of PAC-Bayesian bounds that naturally involve a data-dependent prior. We evaluate these bounds for neural networks trained on MNIST and Fashion-MNIST, and study their dependence on the training set size, the achieved training accuracy, and the effect of randomized labels.

Fredrik Hellström, Giuseppe Durisi
[ Visit Poster at Spot B2 in Virtual World ]
A sample $X\in\R^{p}$ is generated according to one of the two known $p$-dimensional Gaussian models. This paper establishes the information-theoretic limits of {\bf actively} sampling the one-dimensional coordinates of $X$ for detecting its true model as $p\to\infty$. We address three questions pertinent to active model detecting. First, we establish conditions under which the problem is feasible under controlled error rates. Secondly, we characterize the information-theoretic lower bound on the coordinate-level sample complexity for model detection. Finally, we provide an active sampling algorithm that progressively identifies the coordinates to be sampled, and prove that this algorithm's sample complexity meets the information-theoretic limit in the asymptote of large $p$.
Javad Heydari, Ali Tajer
[ Visit Poster at Spot B1 in Virtual World ]

Splitting network computations between the edge device and a server enables low edge-compute inference of neural networks but might expose sensitive information about the test query to the server. To address this problem, existing techniques train the model to minimize information leakage for a given set of sensitive attributes. In practice, however, the test queries might contain attributes that are not foreseen during training. We propose instead an unsupervised obfuscation method to discard the information irrelevant to the main task. We formulate the problem via an information theoretical framework and derive an analytical solution for a given distortion to the model output. In our method, the edge device runs the model up to a split layer determined based on its computational capacity. It then obfuscates the obtained feature vector by removing the components in the null space of the next layer of the model as well as the low-energy components of the remaining signal. Our experimental results show that our method outperforms existing techniques in removing the information of the irrelevant attributes, reduces the communication cost, maintains the accuracy, and incurs only a small computational overhead.

Mohammad Samragh, Hossein Hosseini, Aleksei Triastcyn, Kambiz Azarian, Joseph B Soriaga, Farinaz Koushanfar
[ Visit Poster at Spot B0 in Virtual World ]

Neural networks are composed of multiple layers arranged in a hierarchical structure jointly trained with a gradient-based optimization. At each optimization step, neurons at a given layer receive feedback from neurons belonging to higher layers of the hierarchy. In this paper, we propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage diversity of the activations within the same layer. To this end, we measure the pairwise similarity between the outputs of the neurons and use it to model the layer's overall diversity. By penalizing similarities and promoting diversity, we encourage each neuron to learn a distinctive representation and, thus, to enrich the data representation learned within the layer and to increase the total capacity of the model. We theoretically and empirically study how the within-layer activation diversity affects the generalization performance of a neural network and prove that increasing the diversity of hidden activations reduces the generalization gap.

Firas Laakom, Jenni Raitoharju, Alexandros Iosifidis, Moncef Gabbouj
[ Visit Poster at Spot A6 in Virtual World ]

Entropic causal inference is a recent framework for learning the causal graph between two variables from observational data for models with small entropy. In this paper, we first extend the causal graph identifiability result in the two-variable setting under relaxed assumptions. Next, we show the first identifiability result using the entropic approach for learning causal graphs with more than two nodes. We provide a sequential peeling algorithm that is provably correct for trees and complete graphs. We also propose a heuristic algorithm for small graphs. We conduct rigorous experiments that demonstrate performance of our algorithms under synthetic data with different generative models, as well as with real-world data.

Spencer Compton, Murat Kocaoglu, Kristjan Greenewald, Dmitriy Katz
[ Visit Poster at Spot A5 in Virtual World ]

In this work, we investigate the expressiveness of the ``conditional mutual information'' (CMI) framework of Steinke and Zakynthinou (2020) and the prospect of using it to provide a unified framework for proving generalization bounds, with a focus on the realizable setting. We first demonstrate that one can use this framework to express non-trivial (but sub-optimal) bounds for any learning algorithm that outputs hypotheses from a class of bounded VC dimension. Then, we prove that the CMI framework yields the optimal bound on the expected risk of Support Vector Machines (SVMs) for learning halfspaces. This result is an application of our general result showing that stable compression schemes Bousquetet al. (2020) of size k have uniformly bounded CMI of order O(k). We further show that an inherent limitation of proper learning of VC classes contradicts the existence of a proper learner with constant CMI, and it implies a negative resolution to an open problem of Steinke and Zakynthinou (2020). We further study the CMI of empirical risk minimizers (ERMs) of class H and show that it is possible to output all consistent classifiers (version space) with bounded CMI if and only if H has a bounded star number (Hanneke and Yang (2015)).

Mahdi Haghifam, Gintare Karolina Dziugaite, Shay Moran
[ Visit Poster at Spot A4 in Virtual World ]
The ratio between the probability that two distributions give to points in the domain are called importance weights or density ratios and they play a fundamental role in machine learning and information theory. Estimating them from samples is the subject of extensive research in multiple communities. However, there are strong lower bounds known for point-wise accurate estimation, and most theoretical guarantees require strong assumptions about the distributions. In this work we sidestep the lower bound by providing accuracy guarantees for the estimation of importance weights over \emph{sub-populations} belonging to a family $\mC$ of subsets of the domain. We argue that they capture intuitive expectations about importance weights, and could be viewed as a notion of a multi-group fairness guarantee. We formulate {\em sandwiching bounds} for sets: upper and lower bounds on the expected importance weight conditioned on a set. Additionally we show upper and lower bound on the expected log of the importance weight, often referred as the Kullback-Leibler (KL) divergence between the distributions, a problem which received a lot of attention on its own. Our techniques are inspired by the notion of a multicalibrated predictor \cite{hkrr2018}, originally perceived as a fairness guarantee for predictors. We demonstrate the effectiveness of our approach in experiments.
Parikshit Gopalan, Nina Narodytska, Omer Reingold, Vatsal Sharan, Udi Wieder
[ Visit Poster at Spot A3 in Virtual World ]

This paper gives an overview of a recently developed framework of excess risk analysis of learning problems using the distributional continuity of the generalized entropy. The three paradigms of learning problems under consideration include: 1) frequentist learning, 2) Bayesian learning, and 3) learning by first fitting the empirical distribution to a predefined family of distributions and then designing the decision rule under the fitted distribution. It is shown that the excess risks in the first two paradigms can be studied via the continuity of various forms of generalized unconditional entropy, while the third paradigm can be studied via the continuity of generalized conditional entropy. Representative bounds derived for each learning problem are presented and discussed.

Aolin Xu
[ Visit Poster at Spot A2 in Virtual World ]

In recent years, deep neural network (DNN) compression systems have proved to be highly effective for designing source codes for many natural sources. However, like many other machine learning systems, these compressors suffer from vulnerabilities to distribution shifts as well as out-of-distribution (OOD) data, which reduces their real-world applications. In this paper, we initiate the study of OOD robust compression. Considering robustness to distributions within a Wasserstein ball around a base distribution, we propose algorithmic and architectural frameworks built on two principled methods: one that trains DNN compressors using distributionally-robust optimization (DRO), and the other which uses a structured latent code. Our results demonstrate that both methods enforce robustness compared to a standard DNN compressor, and that using a structured code can be superior to the DRO compressor. We observe tradeoffs between robustness and distortion and corroborate these findings theoretically for a specific class of sources.

Eric Lei, Hamed Hassani
[ Visit Poster at Spot A1 in Virtual World ]

Multi-party computation (MPC) is promising for privacy preserving machine learning algorithms at edge networks, like federated learning. Despite their potential, existing MPC algorithms fail short of adapting to the limited resources of edge devices. A promising solution, and the focus of this work, is coded computation, which advocates the use of error-correcting codes to improve the performance of distributed computing through “smart” data redundancy. In this paper, we design novel coded privacy-preserving computation mechanisms; MatDot coded MPC (MatDot-CMPC) and PolyDot coded MPC (PolyDot-CMPC) by employing recently proposed coded computation algorithms; MatDot and PolyDot. We take advantage of the “garbage terms” that naturally arise when polynomials are constructed in the design of MatDot-CMPC and PolyDot-CMPC to reduce the number of workers needed for privacy-preserving computation.

Elahe Vedadi, Yasaman Keshtkarjahromi, Hulya Seferoglu
[ Visit Poster at Spot A0 in Virtual World ]
The {\em skew-geometric Jensen-Shannon divergence} $\left(\textrm{JS}^{\textrm{G}_{\alpha}}\right)$ allows for an intuitive interpolation between forward and reverse Kullback-Leibler (KL) divergence based on the skew parameter $\alpha$. While the benefits of the skew in $\textrm{JS}^{\textrm{G}_{\alpha}}$ are clear---balancing forward/reverse KL in a comprehensible manner---the choice of optimal skew remains opaque and requires an expensive grid search. In this paper we introduce $\alpha$-VAEs, which extend the $\textrm{JS}^{\textrm{G}_{\alpha}}$ variational autoencoder by allowing for learnable, and therefore data-dependent, skew. We motivate the use of a parameterised skew in the dual divergence by analysing trends dependent on data complexity in synthetic examples. We also prove and discuss the dependency of the divergence minimum on the input data and encoder parameters, before empirically demonstrating that this dependency does not reduce to either direction of KL divergence for benchmark datasets. Finally, we demonstrate that optimised skew values consistently converge across a range of initial values and provide improved denoising and reconstruction properties. These render $\alpha$-VAEs an efficient and practical modelling choice across a range of tasks, datasets, and domains.
Jacob Deasy, Tom McIver, Nikola Simidjievski, Pietro Lió