Workshop
Duality Principles for Modern Machine Learning
Thomas Moellenhoff · Zelda Mariet · Mathieu Blondel · Khan Emtiyaz
Meeting Room 320
Duality is a pervasive and important principle in mathematics. Not only has it fascinated researchers in many different fields, but it has also been used extensively in optimization, statistics, and machine learning (ML), giving rise to powerful tools such as Fenchel duality in convex optimization, representer theorems in kernel methods and Bayesian nonparametrics, and dually flat spaces in information geometry. Such applications have played an important role in the past, but lately there has been little work on duality principles, especially in deep learning. For example, Lagrange duality can be useful for model explanation because it allows us to measure the sensitivity of a solution to certain perturbations, but this is not yet fully exploited. This slowdown is perhaps due to a growing focus on nonconvex and nonlinear problems, where duality does not seem to be directly applicable. There have not been any workshops on duality in recent years. With this workshop, we aim to revive the ML community's interest in duality principles.

The goal of the workshop is to bring together researchers working on various duality concepts from many different fields and to discuss new applications for modern machine learning, with a particular focus on model understanding, explanation, and adaptation in deep learning and reinforcement learning.
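As a quick, self-contained illustration of the sensitivity point above (our own toy example, not taken from any talk at the workshop): for a convex problem with an inequality constraint, the optimal dual variable equals the derivative of the optimal value with respect to a perturbation of the constraint.

```python
# Toy example of Lagrange duality as a sensitivity measure.
# Problem: minimize (1/2) x^2  subject to  x >= c, with c > 0.
# Then x* = c, the optimal value is v(c) = c^2 / 2, and the optimal dual
# variable is lambda* = c.  Strong duality gives dv/dc = lambda*: the dual
# variable measures how the optimal value reacts to perturbing the constraint.

def optimal_value(c):
    x_star = max(c, 0.0)  # project the unconstrained minimum 0 onto {x >= c}
    return 0.5 * x_star ** 2

c, eps = 2.0, 1e-6
lam_star = c  # optimal dual variable of the active constraint x >= c
sensitivity = (optimal_value(c + eps) - optimal_value(c)) / eps
print(abs(sensitivity - lam_star) < 1e-4)  # finite difference matches lambda*
```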
Schedule
Sat 12:00 p.m. - 12:30 p.m. | Duality: Opening remarks (Opening remarks) | SlidesLive Video
Sat 12:30 p.m. - 12:55 p.m. | Ronny Bergman: Fenchel Duality Theory on Riemannian Manifolds and the Riemannian Chambolle-Pock Algorithm (Invited talk) | SlidesLive Video
Sat 12:55 p.m. - 1:05 p.m. | Coffee Break (Break)
Sat 1:05 p.m. - 1:17 p.m. | Jaeyeon Kim: Time-Reversed Dissipation Induces Duality Between Minimizing Gradient Norm and Function Value (Contributed talk) | SlidesLive Video
Sat 1:17 p.m. - 1:29 p.m. | Sina Baharlouei: RIFLE: Imputation and Robust Inference from Low Order Marginals (Contributed talk) | SlidesLive Video
Sat 1:30 p.m. - 1:42 p.m. | Joseph Shenouda: A Representer Theorem for Vector-Valued Neural Networks: Insights on Weight Decay Training and Widths of Deep Neural Networks (Contributed talk) | SlidesLive Video
Sat 1:45 p.m. - 2:10 p.m. | Jia-Jie Zhu: Duality from Gradient Flow Force-Balance to Distributionally Robust Learning (Invited talk) | SlidesLive Video
Sat 2:10 p.m. - 2:35 p.m. | Taiji Suzuki: Convergence of mean field Langevin dynamics: Duality viewpoint and neural network optimization (Invited talk) | SlidesLive Video
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the gradient Langevin dynamics (GLD) that minimizes an entropy-regularized convex function defined on the space of probability distributions, and it arises naturally from the optimization of two-layer neural networks via (noisy) gradient descent. In this talk, I will present the convergence result of MFLD and explain how the convergence of MFLD is connected to its dual objective. Indeed, its convergence is characterized by the log-Sobolev inequality of the so-called proximal Gibbs measure corresponding to the current solution. Based on this duality principle, we can construct several optimization methods with convergence guarantees, including the particle dual averaging method and the particle stochastic dual coordinate ascent method. Finally, I will provide a general framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time discretization, and stochastic gradient approximation.
Sat 2:35 p.m. - 3:00 p.m. | Len Spek: Duality for Neural Networks through Reproducing Kernel Banach Spaces (Invited talk) | SlidesLive Video
Sat 2:55 p.m. - 4:00 p.m. | Lunch Break (Break)
Sat 4:00 p.m. - 5:30 p.m. | Poster session (Poster session)
Sat 5:30 p.m. - 5:55 p.m. | Amy Zhang: Dual RL: Unification and New Methods for Reinforcement and Imitation Learning (Invited talk) | SlidesLive Video
Sat 5:55 p.m. - 6:35 p.m. | Coffee Break / Poster session (Break)
Sat 6:35 p.m. - 7:00 p.m. | Ehsan Amid: A Dualistic View of Activations in Deep Neural Networks (Invited talk) | SlidesLive Video
Sat 7:00 p.m. - 8:00 p.m. | Panel discussion (Panel discussion) | SlidesLive Video
Sparse Function-space Representation of Neural Networks (Poster)
Deep neural networks are known to lack uncertainty estimates, struggle to incorporate new data, and suffer from catastrophic forgetting. We present a method that mitigates these issues by converting neural networks from weight-space to a low-rank function-space representation, via the so-called dual parameters. In contrast to previous work, our sparse representation captures the joint distribution over the entire data set, rather than only over a subset. This offers a compact and principled way of capturing uncertainty and enables us to incorporate new data without retraining whilst retaining predictive performance. We provide proof-of-concept demonstrations with the proposed approach for quantifying uncertainty in supervised learning on UCI benchmark tasks.
Aidan Scannell · Riccardo Mereu · Paul Chang · Ella Tamir · Joni Pajarinen · Arno Solin
Memory Maps to Understand Models (Poster)
What do models know and how? Answering this question requires exploratory analyses comparing many models, but existing techniques are specialized to specific models and analyses. We present memory maps as a general tool to understand a wide range of models by visualizing their sensitivity to data. Memory maps are extensions of residual-leverage plots where the two criteria are modified by easy-to-compute dual parameters obtained by using a Bayesian framework. The new criteria are used to understand a model's memory through a 2D scatter plot where tail regions often contain examples with high prediction error and variance. All sorts of models can be analyzed this way, including not only those arising in kernel methods, Bayesian methods, and deep learning, but also the ones obtained during training. We show use cases of memory maps to diagnose overfitting, compare various models, and analyze training trajectories.
Dharmesh Tailor · Paul Chang · Siddharth Swaroop · Eric Nalisnick · Arno Solin · Khan Emtiyaz
Implicit Interpretation of Importance Weight Aware Updates (Poster)
Due to its speed and simplicity, subgradient descent is one of the most widely used optimization algorithms in convex machine learning. However, tuning its learning rate is probably its most severe bottleneck to achieving consistently good performance. A common way to reduce the dependency on the learning rate is to use implicit/proximal updates. One such variant is the Importance Weight Aware (IWA) updates, which consist of infinitely many infinitesimal updates on each loss function. However, the empirical success of IWA updates is not completely explained by their theory. In this paper, we show for the first time that IWA updates have a strictly better regret upper bound than plain gradient updates in the online learning setting. Our analysis is based on the new framework by Chen and Orabona (2023) to analyze generalized implicit updates using a dual formulation. In particular, our results imply that IWA updates can be considered as approximate implicit/proximal updates.
Keyi Chen · Francesco Orabona
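For intuition about the implicit/proximal updates the abstract contrasts with plain gradient steps, here is a minimal scalar sketch (our own illustration, not the IWA update itself): on the squared loss, the implicit step has a closed form and cannot overshoot, no matter how large the learning rate.

```python
# Implicit (proximal) gradient step on the scalar loss l(w) = (w - y)^2 / 2.
# An explicit step uses the gradient at the current point; an implicit step
# uses the gradient at the *next* point:
#     w_next = w - eta * l'(w_next)
# For this loss the implicit step solves to (w + eta * y) / (1 + eta), which is
# the proximal operator of eta * l evaluated at w.

def implicit_step(w, y, eta):
    return (w + eta * y) / (1.0 + eta)  # closed-form solution of w' = w - eta*(w' - y)

w, y, eta = 0.0, 1.0, 10.0
w_next = implicit_step(w, y, eta)
residual = w_next - (w - eta * (w_next - y))  # fixed-point equation holds
print(abs(residual) < 1e-12)
# an explicit step with eta = 10 would jump to w = 10, far past the minimum y = 1;
# the implicit step always lands between w and y:
print(0.0 <= w_next <= 1.0)
```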
Learning with Primal-Dual Spectral Risk Measures: a Fast Incremental Algorithm (Poster)
We consider learning with a generalization of rank-weighted objectives known as spectral risk measures (SRMs). SRMs, like other distributionally robust learning objectives, explicitly capture the worst-case performance over a set of possible deviations from the observed training distribution by introducing dual variables maximizing an adversarial objective. Even when the underlying (regularized) losses are smooth and strongly convex, direct stochastic gradient methods fail to converge to the SRM minimizer due in part to bias. In this work, we introduce a fast incremental optimization algorithm for SRMs that maintains a running estimate of the optimal dual variables by efficiently solving an approximation of the dual problem. Unlike related methods, our approach converges linearly for any smooth SRM and requires tuning a single hyperparameter: the (constant) primal learning rate. Empirically, our optimizer can achieve convergence within 2-3x fewer passes through the training set than recent baselines on distribution shift and fairness benchmarks.
Ronak Mehta · Vincent Roulet · Krishna Pillutla · Zaid Harchaoui
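To make the "rank-weighted objective" concrete, here is a tiny sketch of a spectral risk measure (a generic textbook construction, not the paper's incremental algorithm): a weighted average of the sorted losses, with nondecreasing weights that put more mass on the largest losses.

```python
# A spectral risk measure (SRM) is sum_i sigma_i * l_(i), where l_(1) <= ... <= l_(n)
# are the sorted losses and the spectrum sigma is a nondecreasing probability vector.
# Uniform weights recover the mean; weights (0, 0, 0.5, 0.5) recover the
# superquantile/CVaR at level 0.5 (average of the worst half of the losses).

def spectral_risk(losses, spectrum):
    assert len(losses) == len(spectrum)
    assert abs(sum(spectrum) - 1.0) < 1e-12  # spectrum sums to one
    assert all(a <= b for a, b in zip(spectrum, spectrum[1:]))  # nondecreasing
    return sum(w * l for w, l in zip(spectrum, sorted(losses)))

losses = [3.0, 1.0, 4.0, 2.0]
print(spectral_risk(losses, [0.25, 0.25, 0.25, 0.25]))  # mean: 2.5
print(spectral_risk(losses, [0.0, 0.0, 0.5, 0.5]))      # CVaR_0.5: (3 + 4) / 2 = 3.5
```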
Estimating Joint Interventional Distributions from Marginal Interventional Data (Poster)
We consider settings where only marginal datasets of a target variable and some treatment variable are available, but no joint observations of the target and all treatments together -- the so-called Causal Marginal Problem setting. In this paper we show how to exploit interventional data to acquire the joint conditional distribution of all the variables using the Maximum Entropy principle. To this end, we extend the Causal Maximum Entropy method to make use of interventional data in addition to observational data. Using Lagrange duality, we prove that the solution to the Causal Maximum Entropy problem with interventional constraints lies in the exponential family, as in the Maximum Entropy solution. Our method allows us to perform two tasks of interest when marginal interventional distributions are provided for all, or only some, of the variables. First, we show how to perform causal feature selection from a mixture of observational and single-variable interventional data, and, second, how to infer joint interventional distributions. For the former task, we show, on synthetically generated data, that our proposed method outperforms the state-of-the-art method on merging datasets, and yields comparable results to the KCI-test, which requires access to joint observations of all variables.
Sergio Garrido Mejia · Elke Kirschbaum · Armin Kekić · Atalanti Mastakouri
RIFLE: Imputation and Robust Inference from Low Order Marginals (Poster)
The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms have been developed for data imputation, the overwhelming majority perform poorly if there are many missing values and low sample sizes, which are unfortunately common characteristics in empirical data. Such low-accuracy estimations adversely affect the performance of downstream statistical models. We develop a statistical inference framework for predicting the target variable in the presence of missing data without imputation. Our framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model. We specialize our framework to ridge linear regression, where the resulting min-max problem is efficiently solved by applying the alternating direction method of multipliers (ADMM) on the dual problem. This framework can also be adapted to impute missing data. We compare RIFLE with state-of-the-art approaches (including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) in numerical experiments. Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
Sina Baharlouei · Kelechi Ogudu · Peng Dai · Sze-Chuan Suen · Meisam Razaviyayn
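As background for the dual approach to ridge regression mentioned in the abstract (this is the standard primal-dual identity, not RIFLE's ADMM algorithm): the primal ridge solution can be recovered from dual variables obtained by solving an n x n system instead of a d x d one.

```python
import numpy as np

# Ridge-regression duality: the primal solution
#     w = (X^T X + lam*I)^{-1} X^T y        (d x d system)
# equals X^T alpha with dual variables
#     alpha = (X X^T + lam*I)^{-1} y        (n x n system)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))  # n=5 samples, d=3 features
y = rng.standard_normal(5)
lam = 0.1

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
alpha = np.linalg.solve(X @ X.T + lam * np.eye(5), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))  # True: both routes give the same weights
```

The dual route is the basis of kernel ridge regression: only inner products between samples are needed.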
A Representer Theorem for Vector-Valued Neural Networks: Insights on Weight Decay Training and Widths of Deep Neural Networks (Poster)
This paper characterizes the kinds of functions learned by multi-output (vector-valued) ReLU neural networks trained with weight decay. This extends previous results that were limited to single-output networks, which is crucial to understanding the effects of weight decay on deep neural networks (DNNs). The new characterization requires the definition of a new class of neural function spaces that we call vector-valued variation (VV) spaces. By exploiting the (Banach) duality between the space of vector-valued measures and the space of vector-valued continuous functions, we prove that neural networks (NNs) are optimal solutions to learning problems posed over VV spaces via a novel representer theorem. Our representer theorem shows that solutions to these learning problems exist as vector-valued NNs with widths bounded in terms of the number of training samples. Next, via a novel connection to the multi-task lasso problem, we derive data-dependent bounds on the widths of homogeneous layers in DNNs. The bounds are determined by the effective dimensions of the training data embeddings in and out of the layers. These results shed new light on the regularity of DNN functions trained with weight decay as well as the kinds of architectures weight decay induces.
Joseph Shenouda · Rahul Parhi · Kangwook Lee · Robert Nowak
Kernel Mirror Prox and RKHS Gradient Flow for Mixed Functional Nash Equilibrium (Poster)
The theoretical analysis of machine learning algorithms, such as deep generative modeling, motivates multiple recent works on the Mixed Nash Equilibrium (MNE) problem. Different from MNE, this paper formulates the Mixed Functional Nash Equilibrium (MFNE), which replaces one of the measure optimization problems with optimization over a class of dual functions, e.g., the reproducing kernel Hilbert space (RKHS) in the case of Mixed Kernel Nash Equilibrium (MKNE). We show that our MFNE and MKNE framework form the backbones that govern several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. To model the infinite-dimensional continuous-limit optimization dynamics, we propose the Interacting Wasserstein-Kernel Gradient Flow, which includes the RKHS flow that is much less common than the Wasserstein gradient flow but enjoys a much simpler convexity structure. Time-discretizing this gradient flow, we propose a primal-dual kernel mirror prox algorithm, which alternates between a dual step in the RKHS and a primal step in the space of probability measures. We then provide the first unified convergence analysis of our algorithm for this class of MKNE problems, which establishes a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case. As a case study, we apply our analysis to DRO, providing the first primal-dual convergence analysis for DRO with probability-metric constraints.
Pavel Dvurechenskii · Jia-Jie Zhu
Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel's Spectrum (Poster)
Wide neural networks are biased towards learning certain functions, influencing both the rate of convergence of gradient descent (GD) and the functions that are reachable with GD in finite training time. As such, there is a great need for methods that can modify this bias according to the task at hand. To that end, we introduce Modified Spectrum Kernels (MSKs), a novel family of constructed kernels that can be used to approximate kernels with desired eigenvalues for which no closed form is known. We leverage the duality between wide neural networks and Neural Tangent Kernels and propose a preconditioned gradient descent method, which alters the trajectory of GD. As a result, this allows for a polynomial and, in some cases, exponential training speedup without changing the final solution. Our method is both computationally efficient and simple to implement.
Amnon Geifman · Daniel Barzilai · Ronen Basri · Meirav Galun
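A toy version of the "speedup without changing the final solution" claim (a generic preconditioning demonstration on least squares, not the paper's MSK construction): preconditioning reshapes the trajectory and convergence rate of gradient descent while leaving its fixed point untouched.

```python
import numpy as np

# Preconditioned GD on the ill-conditioned quadratic 0.5*x^T A x - b^T x.
# Both variants converge to x* = A^{-1} b, but the preconditioned run gets
# there in far fewer steps because it equalizes the spectrum.

A = np.diag([1.0, 100.0])      # condition number 100
b = np.array([1.0, 1.0])
P = np.diag(1.0 / np.diag(A))  # ideal diagonal preconditioner ~ A^{-1}

def run(precond, steps, lr):
    x = np.zeros(2)
    M = P if precond else np.eye(2)
    for _ in range(steps):
        x = x - lr * M @ (A @ x - b)
    return x

x_star = np.linalg.solve(A, b)
x_pgd = run(True, 200, 0.5)        # preconditioned: converged
x_gd = run(False, 200, 0.5 / 100)  # plain GD: lr limited by largest eigenvalue
print(np.allclose(x_pgd, x_star))
print(np.linalg.norm(x_gd - x_star) > 1e-3)  # plain GD still far after 200 steps
```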
Energy-Based Non-Negative Tensor Factorization via Multi-Body Modeling (Poster)
Tensor factorization has fundamental difficulties in rank tuning and optimization. To avoid these difficulties, we develop a rank-free energy-based tensor factorization, called many-body approximation, that allows intuitive modeling of tensors and global optimization. Our approach models tensors as distributions via the energy function, which describes interactions between modes, and a dually flat statistical manifold is induced. We visualize these interactions as tensor networks and reveal a nontrivial relationship between many-body approximation and low-rank approximation.
Kazu Ghalamkari · Mahito Sugiyama
A Dual Formulation for Probabilistic Principal Component Analysis (Poster)
In this paper, we reformulate the Probabilistic Principal Component Analysis framework in Hilbert spaces and demonstrate how the optimal solution admits a representation in dual space. This allows us to develop a generative framework for kernel methods. Furthermore, we show how it encompasses Kernel Principal Component Analysis and illustrate the method on a toy dataset and a real dataset.
Henri De Plaen · Johan Suykens
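The primal-dual relationship behind kernel PCA that this line of work builds on can be seen in a few lines (a standard linear-algebra fact, not the paper's probabilistic framework): the d x d covariance-style matrix and the n x n Gram matrix share their nonzero eigenvalues, so principal components can be computed entirely in the dual space of sample inner products.

```python
import numpy as np

# Duality behind (kernel) PCA: for a data matrix X (n samples, d features),
# X^T X (primal, d x d) and X X^T (dual/Gram, n x n) have the same nonzero
# eigenvalues.  With n < d, the n dual eigenvalues match the top n primal ones.

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 6))  # n=4 samples, d=6 features

primal_eigs = np.linalg.eigvalsh(X.T @ X)  # 6 eigenvalues, 2 of them ~ 0
dual_eigs = np.linalg.eigvalsh(X @ X.T)    # 4 eigenvalues

print(np.allclose(np.sort(primal_eigs)[-4:], np.sort(dual_eigs)))  # True
```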
Duality in Multi-View Restricted Kernel Machines (Poster)
We propose a unifying setting that combines existing restricted kernel machine methods into a single primal-dual multi-view framework for kernel principal component analysis in both supervised and unsupervised settings. We derive the primal and dual representations of the framework and relate different training and inference algorithms from a theoretical perspective. We show how to achieve full equivalence in primal and dual formulations by rescaling primal variables. Finally, we experimentally validate the equivalence and provide insight into the relationships between different methods on a number of time series data sets by recursively forecasting unseen test data and visualizing the learned features.
Sonny Achten · Arun Pandey · Hannes De Meulemeester · Bart Moor · Johan Suykens
On the Fisher-Rao Gradient of the Evidence Lower Bound (Poster)
This article studies the Fisher-Rao gradient, also referred to as the natural gradient, of the evidence lower bound, known from machine learning. Based on invariance properties of gradients within information geometry, it derives conditions on the underlying model that ensure exactness for this bound. Information geometry is naturally based on duality concepts from differential geometry.
Jesse van Oostrum · Nihat Ay
Duality Principle and Biologically Plausible Learning: Connecting the Representer Theorem and Hebbian Learning (Poster)
A normative approach called Similarity Matching was recently introduced for deriving and understanding the algorithmic basis of neural computation focused on unsupervised problems. It involves deriving algorithms from computational objectives and evaluating their compatibility with anatomical and physiological observations. In particular, it introduces neural architectures by considering dual alternatives instead of primal formulations of popular models such as PCA. However, its connection to the Representer theorem remains unexplored. In this work, we propose to use teachings from this approach to explore supervised learning algorithms and clarify the notion of Hebbian learning. We examine regularized supervised learning and elucidate the emergence of neural architecture and additive versus multiplicative update rules. In this work, we focus not on developing new algorithms but on showing that the Representer theorem offers the perfect lens to study biologically plausible learning algorithms. We argue that many past and current advancements in the field rely on some form of dual formulation to introduce biological plausibility. In short, as long as a dual formulation exists, it is possible to derive biologically plausible algorithms. Our work sheds light on the pivotal role of the Representer theorem in advancing our comprehension of neural computation.
Yanis Bahroun · Dmitri Chklovskii · Anirvan Sengupta
A max-affine spline approximation of neural networks using the Legendre transform of a convex-concave representation (Poster)
This work presents a novel algorithm for transforming a neural network into a spline representation. Unlike previous work that required convex and piecewise-affine network operators to create a max-affine spline alternate form, this work relaxes this constraint. The only constraint is that the function be bounded and possess a well-defined second derivative, although this was shown experimentally to not be strictly necessary. It can also be performed over the whole network rather than on each layer independently. As in previous work, this bridges the gap between neural networks and approximation theory but also enables the visualisation of network feature maps. Mathematical proof and experimental investigation of the technique are presented, with approximation error and feature maps being extracted from a range of architectures, including convolutional neural networks.
Adam Perrett · Danny Wood · Gavin Brown
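For readers unfamiliar with the Legendre transform in the title, a tiny numeric illustration (the generic textbook transform, not the paper's algorithm): the transform of the convex function f(x) = x^2/2 is itself, i.e., this function is self-dual.

```python
# Discrete Legendre transform  f*(s) = max_x [ s*x - f(x) ]  over a grid.
# For f(x) = x^2/2 the supremum is attained at x = s, giving f*(s) = s^2/2.

def legendre(f, s, xs):
    return max(s * x - f(x) for x in xs)  # maximize over grid points

xs = [i / 1000.0 for i in range(-3000, 3001)]  # grid on [-3, 3], step 0.001
f = lambda x: 0.5 * x * x
s = 1.5
print(abs(legendre(f, s, xs) - 0.5 * s * s) < 1e-5)  # f*(1.5) ~= 1.125
```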
Reward-Based Reinforcement Learning with Risk Constraints (Poster)
Constrained optimization provides a common framework and solvers for dealing with conflicting objectives in the context of reinforcement learning (RL). In most of these settings, the objectives (and constraints) are expressed through the expected accumulated reward. However, this formulation neglects possibly catastrophic events at the tails of the distribution, and it is often insufficient for many high-stakes applications in which the risk involved in outliers is critical. In this work, we propose a new framework for risk-aware reinforcement learning that handles reward-based risk measures in both the objective and the constraints, and that exhibits robustness properties jointly in time and space. The derivation draws connections between the convex-analytic duality of return-based and reward-based formulations of RL and Lagrangian duality, and can easily be implemented on top of existing RL algorithms. Furthermore, unlike the Lagrangian relaxations used in prior risk-constrained works, our framework admits an exact equivalence to the primal problem through a parameterized strong duality. Finally, we provide convergence guarantees for the proposed algorithm under common assumptions on the objective.
Jane Lee · Konstantinos Nikolakakis · Dionysios Kalogerias · Amin Karbasi
The Power of Duality Principle in Offline Average-Reward Reinforcement Learning (Poster)
Offline reinforcement learning (RL) is widely used to find an optimal policy using a pre-collected dataset, without further interaction with the environment. Recent RL theory has made significant progress in developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage, with specific focuses on either infinite-horizon discounted or finite-horizon episodic Markov decision processes (MDPs). In this work, we revisit the LP framework and the induced duality principle for offline RL, specifically for *infinite-horizon average-reward* MDPs. By virtue of this LP formulation and the duality principle, our result achieves the $\tilde O(1/\sqrt{n})$ near-optimal rate under partial data coverage assumptions. Our key enabler is to *relax* the equality *constraint* and introduce proper new *inequality constraints* in the dual formulation of the LP. We hope our insights can shed new light on the use of LP formulations and the induced duality principle in offline RL.
Asuman Ozdaglar · Sarath Pattathil · Jiawei Zhang · Kaiqing Zhang
Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees (Poster)
Actor-critic (AC) methods are widely used in reinforcement learning (RL), and benefit from the flexibility of using any policy gradient method as the actor and any value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially uncorrelated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of the policy and critic parameterization. Instantiating the generic algorithm results in an actor that involves maximizing a sequence of surrogate functions (similar to TRPO, PPO), and a critic that involves minimizing a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
Sharan Vaswani · Amirreza Kazemi · Reza Babanezhad · Nicolas Le Roux
Time-Reversed Dissipation Induces Duality Between Minimizing Gradient Norm and Function Value (Poster)
In convex optimization, first-order methods minimizing function values have been a central subject of study since Nesterov's seminal work of 1983. Recently, however, Kim and Fessler's OGM-G and Lee et al.'s FISTA-G have been presented as alternatives that minimize the gradient magnitude instead. We present H-duality, which represents a surprising one-to-one correspondence between methods efficiently minimizing function values and methods efficiently minimizing gradient magnitude. In continuous-time formulations, H-duality corresponds to reversing the time dependence of the dissipation/friction term. To the best of our knowledge, H-duality is different from Lagrange/Fenchel duality and is distinct from any previously known duality or symmetry relations. Using H-duality, we obtain a clearer understanding of the symmetry between Nesterov's method and OGM-G, derive a new class of methods efficiently reducing gradient magnitudes of smooth convex functions, and find a new composite minimization method that is simpler and faster than FISTA-G.
Jaeyeon Kim · Asuman Ozdaglar · Chanwoo Park · Ernest Ryu