Workshop
Duality Principles for Modern Machine Learning
Thomas Moellenhoff · Zelda Mariet · Mathieu Blondel · Khan Emtiyaz
Meeting Room 320
Duality is a pervasive and important principle in mathematics. Not only has it fascinated researchers in many different fields, but it has also been used extensively in optimization, statistics, and machine learning (ML), giving rise to powerful tools such as Fenchel duality in convex optimization, representer theorems in kernel methods and Bayesian nonparametrics, and dually flat spaces in information geometry. Such applications have played an important role in the past, but lately we do not see much work on duality principles, especially in deep learning. For example, Lagrange duality can be useful for model explanation because it allows us to measure the sensitivity to certain perturbations, but this is not yet fully exploited. This slowdown is perhaps due to a growing focus on nonconvex and nonlinear problems, where duality does not seem to be directly applicable. There have not been any workshops on duality in recent years. With this workshop, we aim to revive the interest of the ML community in duality principles. The goal of the workshop is to bring together researchers working on various duality concepts from many different fields and to discuss new applications for modern machine learning, focusing especially on topics such as model understanding, explanation, and adaptation in deep learning and reinforcement learning.
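Since the overview invokes Fenchel duality and the sensitivity interpretation of Lagrange duality, it may help to recall both in symbols; the following are textbook statements, not results from the workshop:

```latex
% Fenchel conjugate and the Fenchel--Young inequality (weak duality):
f^*(y) = \sup_{x} \big( \langle x, y \rangle - f(x) \big),
\qquad
f(x) + f^*(y) \ge \langle x, y \rangle .

% Sensitivity: for the perturbed value p(u) = \min_x \{ f(x) : g(x) \le u \}
% with optimal multiplier \lambda^*, strong duality and differentiability give
\nabla p(0) = -\lambda^* .
```

The optimal multipliers thus quantify how the optimal value responds to constraint perturbations, which is the sensitivity property mentioned above.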
Schedule
Sat 12:00 p.m. – 12:30 p.m.
Duality: Opening remarks (Opening remarks)
Sat 12:30 p.m. – 12:55 p.m.
Ronny Bergman: Fenchel Duality Theory on Riemannian Manifolds and the Riemannian Chambolle-Pock Algorithm (Invited talk)
Sat 12:55 p.m. – 1:05 p.m.
Coffee Break (Break)
Sat 1:05 p.m. – 1:17 p.m.
Jaeyeon Kim: Time-Reversed Dissipation Induces Duality Between Minimizing Gradient Norm and Function Value (Contributed talk)
Sat 1:17 p.m. – 1:29 p.m.
Sina Baharlouei: RIFLE: Imputation and Robust Inference from Low-Order Marginals (Contributed talk)
Sat 1:30 p.m. – 1:42 p.m.
Joseph Shenouda: A Representer Theorem for Vector-Valued Neural Networks: Insights on Weight Decay Training and Widths of Deep Neural Networks (Contributed talk)
Sat 1:45 p.m. – 2:10 p.m.
Jia-Jie Zhu: Duality from Gradient Flow Force-Balance to Distributionally Robust Learning (Invited talk)
Sat 2:10 p.m. – 2:35 p.m.
Taiji Suzuki: Convergence of mean field Langevin dynamics: Duality viewpoint and neural network optimization (Invited talk)
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the gradient Langevin dynamics (GLD) that minimizes an entropy-regularized convex function defined on the space of probability distributions, and it naturally arises from the optimization of two-layer neural networks via (noisy) gradient descent. In this talk, I will present the convergence result of MFLD and explain how the convergence of MFLD is connected to its dual objective. Indeed, its convergence is characterized by the log-Sobolev inequality of the so-called proximal Gibbs measure corresponding to the current solution. Based on this duality principle, we can construct several optimization methods with convergence guarantees, including the particle dual averaging method and the particle stochastic dual coordinate ascent method. Finally, I will provide a general framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time discretization, and stochastic gradient approximation.
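For readers meeting these objects for the first time, the entropy-regularized objective and the proximal Gibbs measure in this abstract can be written as follows (standard formulation; the choice of symbols is ours):

```latex
% Objective over probability measures, with convex F and temperature \lambda:
\min_{\mu \in \mathcal{P}(\mathbb{R}^d)} \; F(\mu) + \lambda \int \mu(x) \log \mu(x)\, dx .

% Proximal Gibbs measure of the current iterate \mu
% (\delta F / \delta \mu denotes the first variation of F):
p_{\mu}(x) \;\propto\; \exp\!\left( -\frac{1}{\lambda}\, \frac{\delta F}{\delta \mu}(\mu)(x) \right).
```

The talk's duality statement is that the convergence rate of MFLD is controlled by the log-Sobolev constant of this proximal Gibbs measure.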
Sat 2:35 p.m. – 3:00 p.m.
Len Spek: Duality for Neural Networks through Reproducing Kernel Banach Spaces (Invited talk)
Sat 2:55 p.m. – 4:00 p.m.
Lunch Break (Break)
Sat 4:00 p.m. – 5:30 p.m.
Poster session (Poster session)
Sat 5:30 p.m. – 5:55 p.m.
Amy Zhang: Dual RL: Unification and New Methods for Reinforcement and Imitation Learning (Invited talk)
Sat 5:55 p.m. – 6:35 p.m.
Coffee Break / Poster session (Break)
Sat 6:35 p.m. – 7:00 p.m.
Ehsan Amid: A Dualistic View of Activations in Deep Neural Networks (Invited talk)
Sat 7:00 p.m. – 8:00 p.m.
Panel discussion (Panel discussion)


Sparse Function-space Representation of Neural Networks (Poster)
Deep neural networks are known to lack uncertainty estimates, struggle to incorporate new data, and suffer from catastrophic forgetting. We present a method that mitigates these issues by converting neural networks from a weight-space to a low-rank function-space representation, via the so-called dual parameters. In contrast to previous work, our sparse representation captures the joint distribution over the entire data set, rather than only over a subset. This offers a compact and principled way of capturing uncertainty and enables us to incorporate new data without retraining whilst retaining predictive performance. We provide proof-of-concept demonstrations of the proposed approach for quantifying uncertainty in supervised learning on UCI benchmark tasks.
Aidan Scannell · Riccardo Mereu · Paul Chang · Ella Tamir · Joni Pajarinen · Arno Solin


Memory Maps to Understand Models (Poster)
What do models know, and how? Answering this question requires exploratory analyses comparing many models, but existing techniques are specialized to specific models and analyses. We present memory maps as a general tool to understand a wide range of models by visualizing their sensitivity to data. Memory maps are extensions of residual-leverage plots where the two criteria are modified by easy-to-compute dual parameters obtained using a Bayesian framework. The new criteria are used to understand a model's memory through a 2D scatter plot whose tail regions often contain examples with high prediction error and variance. All sorts of models can be analyzed this way, including not only those arising in kernel methods, Bayesian methods, and deep learning, but also the ones obtained during training. We show use cases of memory maps to diagnose overfitting, compare various models, and analyze training trajectories.
Dharmesh Tailor · Paul Chang · Siddharth Swaroop · Eric Nalisnick · Arno Solin · Khan Emtiyaz


Implicit Interpretation of Importance Weight Aware Updates (Poster)
Due to its speed and simplicity, subgradient descent is one of the most widely used optimization algorithms in convex machine learning. However, tuning its learning rate is probably its most severe bottleneck for achieving consistently good performance. A common way to reduce the dependency on the learning rate is to use implicit/proximal updates. One such variant is the Importance Weight Aware (IWA) updates, which consist of infinitely many infinitesimal updates on each loss function. However, the empirical success of IWA updates is not completely explained by their theory. In this paper, we show for the first time that IWA updates have a strictly better regret upper bound than plain gradient updates in the online learning setting. Our analysis is based on the new framework of Chen and Orabona (2023) for analyzing generalized implicit updates using a dual formulation. In particular, our results imply that IWA updates can be considered approximate implicit/proximal updates.
Keyi Chen · Francesco Orabona
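The three update rules being compared can be sketched side by side (a standard presentation with learning rate $\eta$; notation ours, not the authors'):

```latex
% Explicit (sub)gradient step:
w_{t+1} = w_t - \eta \,\nabla \ell_t(w_t) .

% Implicit / proximal step:
w_{t+1} = \arg\min_{w} \; \eta\, \ell_t(w) + \tfrac{1}{2}\,\|w - w_t\|^2 .

% IWA step: integrate infinitely many infinitesimal updates on \ell_t:
w'(s) = -\eta \,\nabla \ell_t\big(w(s)\big), \quad w(0) = w_t, \quad w_{t+1} = w(1) .
```

The abstract's result can then be read as: the third rule inherits the favorable regret behavior of the second.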


Learning with Primal-Dual Spectral Risk Measures: a Fast Incremental Algorithm (Poster)
We consider learning with a generalization of rank-weighted objectives known as spectral risk measures (SRMs). SRMs, like other distributionally robust learning objectives, explicitly capture the worst-case performance over a set of possible deviations from the observed training distribution by introducing dual variables that maximize an adversarial objective. Even when the underlying (regularized) losses are smooth and strongly convex, direct stochastic gradient methods fail to converge to the SRM minimizer, due in part to bias. In this work, we introduce a fast incremental optimization algorithm for SRMs that maintains a running estimate of the optimal dual variables by efficiently solving an approximation of the dual problem. Unlike related methods, our approach converges linearly for any smooth SRM and requires tuning a single hyperparameter: the (constant) primal learning rate. Empirically, our optimizer can achieve convergence within 2-3x fewer passes through the training set than recent baselines on distribution shift and fairness benchmarks.
Ronak Mehta · Vincent Roulet · Krishna Pillutla · Zaid Harchaoui
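As a small illustration of this objective class: a spectral risk measure reweights the sorted losses with nondecreasing weights summing to one. The toy sketch below is our own, not the authors' algorithm; it shows how uniform weights recover the mean and top-heavy weights recover a conditional value-at-risk.

```python
def spectral_risk(losses, sigma):
    """Spectral risk measure: nondecreasing weights sigma (summing to 1)
    applied to the losses sorted in ascending order."""
    assert all(b >= a for a, b in zip(sigma, sigma[1:])), "weights must be nondecreasing"
    assert abs(sum(sigma) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(s * l for s, l in zip(sigma, sorted(losses)))

losses = [3.0, 1.0, 4.0, 2.0]
# Uniform weights recover the empirical mean.
mean_risk = spectral_risk(losses, [0.25, 0.25, 0.25, 0.25])
# All weight on the largest half recovers CVaR at level 1/2:
# the average of the two largest losses.
cvar_half = spectral_risk(losses, [0.0, 0.0, 0.5, 0.5])
print(mean_risk, cvar_half)  # 2.5 3.5
```

The dual variables in the abstract correspond to the adversarial reweighting that the sorted-weight form implicitly maximizes over.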


Estimating Joint Interventional Distributions from Marginal Interventional Data (Poster)
We consider settings where only marginal datasets of a target variable and some treatment variables are available, but no joint observations of the target and all treatments together: the so-called Causal Marginal Problem setting. In this paper we show how to exploit interventional data to acquire the joint conditional distribution of all the variables using the Maximum Entropy principle. To this end, we extend the Causal Maximum Entropy method to make use of interventional data in addition to observational data. Using Lagrange duality, we prove that the solution to the Causal Maximum Entropy problem with interventional constraints lies in the exponential family, as in the Maximum Entropy solution. Our method allows us to perform two tasks of interest when marginal interventional distributions are provided for all, or only some, of the variables. First, we show how to perform causal feature selection from a mixture of observational and single-variable interventional data, and, second, how to infer joint interventional distributions. For the former task, we show on synthetically generated data that our proposed method outperforms the state-of-the-art method for merging datasets and yields results comparable to the KCI test, which requires access to joint observations of all variables.
Sergio Garrido Mejia · Elke Kirschbaum · Armin Kekić · Atalanti Mastakouri


RIFLE: Imputation and Robust Inference from Low-Order Marginals (Poster)
The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms has been developed for data imputation, the overwhelming majority perform poorly when there are many missing values and low sample sizes, which are unfortunately common characteristics of empirical data. Such low-accuracy estimates adversely affect the performance of downstream statistical models. We develop a statistical inference framework for predicting the target variable in the presence of missing data, without imputation. Our framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments of the underlying data distribution, with corresponding confidence intervals, to learn a distributionally robust model. We specialize our framework to ridge linear regression, where the resulting min-max problem is efficiently solved by applying the alternating direction method of multipliers (ADMM) to the dual problem. This framework can also be adapted to impute missing data. We compare RIFLE with state-of-the-art approaches (including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) in numerical experiments. Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
Sina Baharlouei · Kelechi Ogudu · Peng Dai · Sze-Chuan Suen · Meisam Razaviyayn
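To make the min-max concrete, the ridge-regression specialization can be sketched as follows; this is our reading of the setup, with hypothetical notation $\widehat{A} \approx X^\top X / n$ and $\widehat{b} \approx X^\top y / n$ for the low-order moment estimates and $\Delta_A, \Delta_b$ for confidence radii (the exact uncertainty set in the paper may differ):

```latex
\min_{\theta} \; \max_{(A,\,b) \in \mathcal{U}} \;
\theta^{\top}\! A\, \theta \;-\; 2\, b^{\top} \theta \;+\; \lambda\, \|\theta\|_2^2,
\qquad
\mathcal{U} = \big\{ (A, b) : |A - \widehat{A}| \le \Delta_A, \;\; |b - \widehat{b}| \le \Delta_b \big\} .
```

The inner maximization is handled through the dual problem, which is where ADMM applies.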


A Representer Theorem for Vector-Valued Neural Networks: Insights on Weight Decay Training and Widths of Deep Neural Networks (Poster)
This paper characterizes the kinds of functions learned by multi-output (vector-valued) ReLU neural networks trained with weight decay. This extends previous results that were limited to single-output networks, and is crucial to understanding the effects of weight decay on deep neural networks (DNNs). The new characterization requires the definition of a new class of neural function spaces that we call vector-valued variation (VV) spaces. By exploiting the (Banach) duality between the space of vector-valued measures and the space of vector-valued continuous functions, we prove that neural networks (NNs) are optimal solutions to learning problems posed over VV spaces, via a novel representer theorem. Our representer theorem shows that solutions to these learning problems exist as vector-valued NNs with widths bounded in terms of the number of training samples. Next, via a novel connection to the multi-task lasso problem, we derive data-dependent bounds on the widths of homogeneous layers in DNNs. The bounds are determined by the effective dimensions of the training data embeddings into and out of the layers. These results shed new light on the regularity of DNN functions trained with weight decay, as well as the kinds of architectures weight decay induces.
Joseph Shenouda · Rahul Parhi · Kangwook Lee · Robert Nowak


Kernel Mirror Prox and RKHS Gradient Flow for Mixed Functional Nash Equilibrium (Poster)
The theoretical analysis of machine learning algorithms, such as deep generative modeling, motivates multiple recent works on the Mixed Nash Equilibrium (MNE) problem. Differing from MNE, this paper formulates the Mixed Functional Nash Equilibrium (MFNE), which replaces one of the measure optimization problems with optimization over a class of dual functions, e.g., the reproducing kernel Hilbert space (RKHS) in the case of the Mixed Kernel Nash Equilibrium (MKNE). We show that our MFNE and MKNE frameworks form the backbone of several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. To model the infinite-dimensional continuous-limit optimization dynamics, we propose the Interacting Wasserstein-Kernel Gradient Flow, which includes the RKHS flow, much less common than the Wasserstein gradient flow but enjoying a much simpler convexity structure. Time-discretizing this gradient flow, we propose a primal-dual kernel mirror prox algorithm, which alternates between a dual step in the RKHS and a primal step in the space of probability measures. We then provide the first unified convergence analysis of our algorithm for this class of MKNE problems, establishing a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case. As a case study, we apply our analysis to DRO, providing the first primal-dual convergence analysis for DRO with probability-metric constraints.
Pavel Dvurechenskii · Jia-Jie Zhu


Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel's Spectrum (Poster)
Wide neural networks are biased towards learning certain functions, influencing both the rate of convergence of gradient descent (GD) and the functions that are reachable with GD in finite training time. As such, there is a great need for methods that can modify this bias according to the task at hand. To that end, we introduce Modified Spectrum Kernels (MSKs), a novel family of constructed kernels that can be used to approximate kernels with desired eigenvalues for which no closed form is known. We leverage the duality between wide neural networks and Neural Tangent Kernels and propose a preconditioned gradient descent method that alters the trajectory of GD. This allows for a polynomial and, in some cases, exponential training speedup without changing the final solution. Our method is both computationally efficient and simple to implement.
Amnon Geifman · Daniel Barzilai · Ronen Basri · Meirav Galun
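The duality being leveraged here is the correspondence between wide-network training and kernel regression; under the standard NTK linearization (general background, not taken from the poster), the residual decays eigenmode by eigenmode:

```latex
% Linearized GD with kernel K = \sum_i \lambda_i u_i u_i^\top acting on
% the residual r_t = f_t - y:
r_t = e^{-\eta K t}\, r_0
\quad\Longrightarrow\quad
\langle u_i, r_t \rangle = e^{-\eta \lambda_i t}\, \langle u_i, r_0 \rangle .
```

Preconditioning GD effectively replaces the spectrum $\{\lambda_i\}$, which is how altering the kernel's spectrum changes which functions are learned quickly without changing the final solution.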


Energy-Based Non-Negative Tensor Factorization via Multi-Body Modeling (Poster)
Tensor factorization has fundamental difficulties in rank tuning and optimization. To avoid these difficulties, we develop a rank-free energy-based tensor factorization, called many-body approximation, that allows intuitive modeling of tensors and global optimization. Our approach models tensors as distributions via an energy function that describes interactions between modes, and a dually flat statistical manifold is induced. We visualize these interactions as tensor networks and reveal a nontrivial relationship between many-body approximation and low-rank approximation.
Kazu Ghalamkari · Mahito Sugiyama


A Dual Formulation for Probabilistic Principal Component Analysis (Poster)
In this paper, we reformulate the Probabilistic Principal Component Analysis framework in Hilbert spaces and demonstrate how the optimal solution admits a representation in dual space. This allows us to develop a generative framework for kernel methods. Furthermore, we show how it encompasses Kernel Principal Component Analysis and illustrate its working on a toy and a real dataset.
Henri De Plaen · Johan Suykens


Duality in Multi-View Restricted Kernel Machines (Poster)
We propose a unifying setting that combines existing restricted kernel machine methods into a single primal-dual multi-view framework for kernel principal component analysis in both supervised and unsupervised settings. We derive the primal and dual representations of the framework and relate different training and inference algorithms from a theoretical perspective. We show how to achieve full equivalence between the primal and dual formulations by rescaling the primal variables. Finally, we experimentally validate the equivalence and provide insight into the relationships between different methods on a number of time series data sets, by recursively forecasting unseen test data and visualizing the learned features.
Sonny Achten · Arun Pandey · Hannes De Meulemeester · Bart Moor · Johan Suykens


On the Fisher-Rao Gradient of the Evidence Lower Bound (Poster)
This article studies the Fisher-Rao gradient, also referred to as the natural gradient, of the evidence lower bound known from machine learning. Based on invariance properties of gradients within information geometry, it derives conditions on the underlying model that ensure the exactness of this bound. Information geometry is naturally based on duality concepts from differential geometry.
Jesse van Oostrum · Nihat Ay


Duality Principle and Biologically Plausible Learning: Connecting the Representer Theorem and Hebbian Learning (Poster)
A normative approach called Similarity Matching was recently introduced for deriving and understanding the algorithmic basis of neural computation, focused on unsupervised problems. It involves deriving algorithms from computational objectives and evaluating their compatibility with anatomical and physiological observations. In particular, it introduces neural architectures by considering dual alternatives to the primal formulations of popular models such as PCA. However, its connection to the representer theorem remains unexplored. In this work, we propose to use the teachings of this approach to explore supervised learning algorithms and to clarify the notion of Hebbian learning. We examine regularized supervised learning and elucidate the emergence of neural architectures and of additive versus multiplicative update rules. We focus not on developing new algorithms but on showing that the representer theorem offers the perfect lens through which to study biologically plausible learning algorithms. We argue that many past and current advances in the field rely on some form of dual formulation to introduce biological plausibility. In short, as long as a dual formulation exists, a biologically plausible algorithm can be derived. Our work sheds light on the pivotal role of the representer theorem in advancing our comprehension of neural computation.
Yanis Bahroun · Dmitri Chklovskii · Anirvan Sengupta


A max-affine spline approximation of neural networks using the Legendre transform of a convex-concave representation (Poster)
This work presents a novel algorithm for transforming a neural network into a spline representation. Unlike previous work, which required convex and piecewise-affine network operators to create a max-affine spline alternate form, this work relaxes that constraint. The only requirement is that the function be bounded and possess a well-defined second derivative, although this was shown experimentally not to be strictly necessary. The transformation can also be performed over the whole network, rather than on each layer independently. As in previous work, this bridges the gap between neural networks and approximation theory, but it also enables the visualisation of network feature maps. A mathematical proof and an experimental investigation of the technique are provided, with approximation errors and feature maps extracted from a range of architectures, including convolutional neural networks.
Adam Perrett · Danny Wood · Gavin Brown


Reward-Based Reinforcement Learning with Risk Constraints (Poster)
Constrained optimization provides a common framework and solvers for dealing with conflicting objectives in the context of reinforcement learning (RL). In most of these settings, the objectives (and constraints) are expressed through the expected accumulated reward. However, this formulation neglects possibly catastrophic events at the tails of the distribution, and it is often insufficient for many high-stakes applications in which the risk involved in outliers is critical. In this work, we propose a new framework for risk-aware reinforcement learning that handles reward-based risk measures in both the objective and the constraints, and that exhibits robustness properties jointly in time and space. The derivation makes connections between the convex-analytic duality of return-based and reward-based formulations of RL and Lagrangian duality, and it can easily be implemented on top of existing RL algorithms. Furthermore, unlike the Lagrangian relaxations used in prior risk-constrained works, our framework admits an exact equivalence to the primal problem through a parameterized strong duality. Finally, we provide convergence guarantees for the proposed algorithm under common assumptions on the objective.
Jane Lee · Konstantinos Nikolakakis · Dionysios Kalogerias · Amin Karbasi
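The generic template behind this line of work can be written as follows (our sketch, with $\rho$ a reward-based risk measure and $c$ a risk tolerance):

```latex
\max_{\pi} \; J(\pi)
\quad \text{s.t.} \quad \rho(\pi) \le c,
\qquad
L(\pi, \lambda) = J(\pi) - \lambda\,\big(\rho(\pi) - c\big), \;\; \lambda \ge 0 .
```

Prior work solves the Lagrangian relaxation $\min_{\lambda \ge 0} \max_{\pi} L(\pi, \lambda)$, whereas this abstract claims an exact equivalence to the constrained primal via a parameterized strong duality.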


The Power of Duality Principle in Offline Average-Reward Reinforcement Learning (Poster)
Offline reinforcement learning (RL) is widely used to find an optimal policy using a pre-collected dataset, without further interaction with the environment. Recent RL theory has made significant progress in developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage, focusing on either infinite-horizon discounted or finite-horizon episodic Markov decision processes (MDPs). In this work, we revisit the LP framework and the induced duality principle for offline RL, specifically for *infinite-horizon average-reward* MDPs. By virtue of this LP formulation and the duality principle, our result achieves a $\tilde O(1/\sqrt{n})$ near-optimal rate under partial data coverage assumptions. Our key enabler is to *relax* the equality *constraint* and introduce proper new *inequality constraints* in the dual formulation of the LP. We hope our insights can shed new light on the use of LP formulations and the induced duality principle in offline RL.
Asuman Ozdaglar · Sarath Pattathil · Jiawei Zhang · Kaiqing Zhang
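The LP formulation referred to is the classical occupancy-measure program for average-reward MDPs (standard background, stated in our notation):

```latex
\max_{\mu \ge 0} \;\; \sum_{s,a} \mu(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} \mu(s', a) \;=\; \sum_{s,a} P(s' \mid s, a)\, \mu(s, a) \;\; \forall s',
\qquad
\sum_{s,a} \mu(s,a) = 1 .
```

The abstract's key step is to relax the equality (flow-conservation) constraint and introduce suitable inequality constraints in the dual, which is what enables the near-optimal rate under partial coverage.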


Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees (Poster)
Actor-critic (AC) methods are widely used in reinforcement learning (RL) and benefit from the flexibility of using any policy gradient method as the actor and any value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated from the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of policy and critic parameterization. Instantiating the generic algorithm results in an actor that maximizes a sequence of surrogate functions (similar to TRPO and PPO) and a critic that minimizes a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
Sharan Vaswani · Amirreza Kazemi · Reza Babanezhad · Nicolas Le Roux


Time-Reversed Dissipation Induces Duality Between Minimizing Gradient Norm and Function Value (Poster)
In convex optimization, first-order optimization methods minimizing function values have been a central subject of study since Nesterov's seminal work of 1983. Recently, however, Kim and Fessler's OGM-G and Lee et al.'s FISTA-G have been presented as alternatives that minimize the gradient magnitude instead. We present H-duality, a surprising one-to-one correspondence between methods that efficiently minimize function values and methods that efficiently minimize gradient magnitude. In continuous-time formulations, H-duality corresponds to reversing the time dependence of the dissipation/friction term. To the best of our knowledge, H-duality is different from Lagrange/Fenchel duality and is distinct from any previously known duality or symmetry relation. Using H-duality, we obtain a clearer understanding of the symmetry between Nesterov's method and OGM-G, derive a new class of methods that efficiently reduce gradient magnitudes of smooth convex functions, and find a new composite minimization method that is simpler and faster than FISTA-G.
Jaeyeon Kim · Asuman Ozdaglar · Chanwoo Park · Ernest Ryu