Gradients and derivatives are integral to machine learning, as they enable gradient-based optimization. In many real applications, however, models rest on algorithmic components that implement discrete decisions, or rely on discrete intermediate representations and structures. These discrete steps are intrinsically non-differentiable and therefore break the flow of gradients. Learning the parameters of such models with gradient-based approaches requires making these non-differentiable components differentiable. This can be done with care, most notably by using smoothing or relaxations to construct differentiable proxies for these components. With the advent of modular deep learning frameworks, these ideas have become more popular than ever in many fields of machine learning, generating in a short time span a multitude of "differentiable everything" approaches, impacting topics as varied as rendering, sorting and ranking, convex optimization, shortest paths, dynamic programming, physics simulations, neural architecture search, top-k, graph algorithms, weakly- and self-supervised learning, and many more.
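To make the smoothing-and-relaxation idea concrete, here is a minimal, self-contained sketch (not taken from any of the works below) of relaxing a hard argmax with a temperature-controlled softmax and a straight-through estimator, so the forward pass stays discrete while gradients flow through the soft proxy; the function name and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def straight_through_argmax(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Discrete one-hot argmax in the forward pass, softmax gradients in the backward pass.

    A generic sketch of the smoothing/relaxation idea: the hard one-hot decision is
    non-differentiable, so gradients are routed through a softened softmax proxy.
    """
    soft = F.softmax(logits / tau, dim=-1)                   # differentiable relaxation
    index = soft.argmax(dim=-1, keepdim=True)
    hard = torch.zeros_like(soft).scatter_(-1, index, 1.0)   # discrete one-hot decision
    # Straight-through: forward value equals `hard`, gradient equals d(soft)/d(logits).
    return hard + (soft - soft.detach())

# Example: the discrete choice participates in a loss yet still yields gradients.
logits = torch.randn(4, 10, requires_grad=True)
y = straight_through_argmax(logits, tau=0.5)
loss = (y * torch.arange(10.0)).sum()
loss.backward()
print(logits.grad.shape)  # torch.Size([4, 10])
```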
Fri 12:00 p.m. - 12:10 p.m. | Opening Remarks (Remarks) | Felix Petersen
Fri 12:10 p.m. - 12:45 p.m. | Invited Talk 1: Perturbed Optimizers for Learning (Invited Talk) | Quentin Berthet
Fri 12:45 p.m. - 1:20 p.m. | Invited Talk 2: Generalizing the Gumbel-Softmax with Stochastic Softmax Tricks (Invited Talk) | Dami Choi
Fri 1:20 p.m. - 1:40 p.m. | Coffee Break
Fri 1:40 p.m. - 2:15 p.m. | Invited Talk 3: Differentiable Learning modulo Formal Verification (Invited Talk) | Swarat Chaudhuri
Fri 2:15 p.m. - 2:30 p.m. | Short Poster Talks 1 (Short Poster Talks)
Fri 2:30 p.m. - 3:30 p.m. | Poster Session 1 (Poster Session)
Fri 3:30 p.m. - 4:30 p.m. | Lunch Break
Fri 4:30 p.m. - 5:05 p.m. | Invited Talk 4: Blackbox Differentiation: the story so far (Invited Talk) | Marin Vlastelica
Fri 5:05 p.m. - 5:40 p.m. | Invited Talk 5: On Differentiable Top-k Operators (Invited Talk) | Mathieu Blondel
Fri 5:40 p.m. - 6:00 p.m. | Coffee Break
Fri 6:00 p.m. - 6:15 p.m. | Short Poster Talks 2 (Short Poster Talks)
Fri 6:15 p.m. - 6:50 p.m. | Invited Talk 6: Differentiable Rendering and Beyond (Invited Talk) | Tzu-Mao Li
Fri 6:50 p.m. - 7:00 p.m. | Closing Remarks (Remarks) | Felix Petersen
Fri 7:00 p.m. - 8:00 p.m. | Poster Session 2 (Poster Session)
End-to-end Differentiable Clustering with Associative Memories (Poster)
Clustering is a widely used unsupervised learning technique involving an intensive discrete optimization problem. Associative Memory models, or AMs, are differentiable neural networks defining a recursive dynamical system, which have been integrated with various deep learning architectures. We uncover a novel connection between the AM dynamics and the inherent discrete assignment necessary in clustering to propose a novel unconstrained continuous relaxation of the discrete clustering problem, enabling end-to-end differentiable clustering with AMs, dubbed ClAM. Leveraging the pattern completion ability of AMs, we further develop a novel self-supervised clustering loss. Our evaluations on varied datasets demonstrate that ClAM benefits from the self-supervision, and significantly improves upon both the traditional Lloyd's k-means algorithm and more recent continuous clustering relaxations (by up to 60\% in terms of the Silhouette Coefficient).
Bishwajit Saha · Dmitry Krotov · Mohammed Zaki · Parikshit Ram

Optimizing probability of barrier crossing with differentiable simulators (Poster)
Simulating events that involve some energy barrier often requires us to promote the barrier crossing in order to increase the probability of the event. One example of such a system can be a chemical reaction which we propose to explore using differentiable simulations. Transition path discovery and estimation of the reaction barrier are merged into a single end-to-end problem that is solved by path-integral optimization. We show how the probability of transition can be formulated in a differentiable way and increase it by introducing a trainable position dependent bias function. We also introduce improvements over standard methods making DiffSim training stable and efficient. |
Martin Šípka · Johannes Dietschreit · Michal Pavelka · Lukáš Grajciar · Rafael Gomez-Bombarelli

From Perception to Programs: Regularize, Overparameterize, and Amortize (Poster)
We develop techniques for synthesizing neurosymbolic programs. Such programs mix discrete symbolic processing with continuous neural computation. We relax this mixed discrete/continuous problem and jointly learn all modules with gradient descent, and also incorporate amortized inference, overparameterization, and a differentiable strategy for penalizing lengthy programs. Collectively, this toolbox improves the stability of gradient-guided program search, and suggests ways of learning both how to parse continuous input into discrete abstractions, and how to process those abstractions via symbolic code.
Hao Tang · Kevin Ellis

Efficient Surrogate Gradients for Training Spiking Neural Networks (Poster)
Spiking Neural Networks (SNNs) are widely regarded as one of the next-generation neural network infrastructures, yet they suffer from an inherent non-differentiability problem that makes the traditional backpropagation (BP) method infeasible. Surrogate gradients (SG), which approximate the shape of the Dirac $\delta$-function, can help alleviate this issue to some extent. To our knowledge, however, the majority of research keeps a fixed surrogate gradient for all layers, ignoring the fact that there exists a trade-off between the approximation to the delta function and the effective domain of gradients under the given dataset, hence limiting the efficiency of surrogate gradients and impairing the overall model performance. To guide the shape optimization when applying surrogate gradients for training SNNs, we propose an indicator $k$, which represents the proportion of membrane potentials with non-zero gradients in backpropagation. Further, we present a novel $k$-based training pipeline that adaptively makes trade-offs between the surrogate gradients' shapes and their effective domain, followed by a series of ablation experiments for verification. Our algorithm achieves 68.93\% accuracy on the ImageNet dataset using SEW-ResNet34. Moreover, our method only requires extremely low external cost and can be simply integrated into the existing training procedure.
Hao Lin · Shikuang Deng · Shi Gu
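As a reference point for the abstract above, the following is a minimal sketch of the generic surrogate-gradient pattern it builds on: a hard Heaviside spike in the forward pass and a smooth stand-in for the Dirac delta in the backward pass. The rectangular surrogate and its width are illustrative choices, not the paper's $k$-based pipeline.

```python
import torch

class SpikeWithSurrogate(torch.autograd.Function):
    """Heaviside spike forward; rectangular surrogate of the Dirac delta backward."""

    @staticmethod
    def forward(ctx, membrane_potential, width=1.0):
        ctx.save_for_backward(membrane_potential)
        ctx.width = width
        return (membrane_potential > 0).float()          # non-differentiable spike

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Boxcar approximation of delta(v): gradient is non-zero only near the threshold.
        surrogate = (v.abs() < ctx.width / 2).float() / ctx.width
        return grad_output * surrogate, None

v = torch.randn(8, requires_grad=True)
spikes = SpikeWithSurrogate.apply(v)
spikes.sum().backward()
print(v.grad)   # zero away from the threshold, 1/width near it
```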
Differentiable Tree Operations Promote Compositional Generalization (Poster)
In the context of structure-to-structure transformation tasks, learning sequences of discrete symbolic operations poses significant challenges due to their non-differentiability. To facilitate the learning of these symbolic sequences, we introduce a differentiable tree interpreter that compiles high-level symbolic tree operations into subsymbolic matrix operations on tensors. We present a novel Differentiable Tree Machine (DTM) architecture that integrates our interpreter with an external memory and an agent that learns to sequentially select tree operations to execute the target transformation in an end-to-end manner. With respect to out-of-distribution compositional generalization on synthetic semantic parsing and language generation tasks, DTM achieves 100% while existing baselines such as Transformer, Tree Transformer, LSTM, and Tree2Tree LSTM achieve less than 30%. DTM remains highly interpretable in addition to its perfect performance. |
Paul Soulos · Edward Hu · Kate McCurdy · Yunmo Chen · Roland Fernandez · Paul Smolensky · Jianfeng Gao

Plateau-Reduced Differentiable Path Tracing (Poster)
Current differentiable renderers provide light transport gradients with respect to arbitrary scene parameters. However, the mere existence of these gradients does not guarantee useful update steps in an optimization. Instead, inverse rendering might not converge due to inherent plateaus, i.e., regions of zero gradient, in the objective function. We propose to alleviate this by convolving the high-dimensional rendering function, that maps scene parameters to images, with an additional kernel that blurs the parameter space. We describe two Monte Carlo estimators to compute plateau-reduced gradients efficiently, i.e., with low variance, and show that these translate into net-gains in optimization error and runtime performance. Our approach is a straightforward extension to both black-box and differentiable renderers and enables optimization of problems with intricate light transport, such as caustics or global illumination, that existing differentiable renderers do not converge on. |
Michael Fischer · Tobias Ritschel

Differentiable Clustering and Partial Fenchel-Young Losses (Poster)
We introduce a differentiable clustering method based on stochastic perturbations of minimum-weight spanning forests. This allows us to include clustering in end-to-end trainable pipelines, with efficient gradients. We show that our method performs well even in difficult settings, such as data sets with high noise and challenging geometries. We also formulate an ad hoc loss to efficiently learn from partial clustering data using this operation. We demonstrate its performance on several data sets for supervised and semi-supervised tasks. |
Lawrence Stewart · Francis Bach · Felipe Llinares-Lopez · Quentin Berthet

GeoPhy: Differentiable Phylogenetic Inference via Geometric Gradients of Tree Topologies (Poster)
Phylogenetic inference, grounded in molecular evolution models, is essential for understanding evolutionary relationships in biological data. While Variational Bayesian methods offer scalable models for biological analysis, reliable inference for latent tree topology and branch lengths remains challenging due to the vast possibilities for topological candidates. In response, we introduce GeoPhy, a novel approach that employs a fully differentiable formulation of phylogenetic inference, representing topological distributions in continuous geometric spaces without limiting topological candidates. In experiments using real benchmark datasets, GeoPhy significantly outperformed other approximate Bayesian methods that considered whole topologies. |
Takahiro Mimori · Michiaki Hamada

Differentiable sorting for censored time-to-event data (Poster)
Survival analysis is a crucial semi-supervised task in machine learning with significant real-world applications, especially in healthcare. It is known that survival analysis can be reduced to a ranking task and be learnt with ordering supervision. Differentiable sorting methods have been shown to be effective in this area but are unable to handle censored orderings. To combat this, we propose Diffsurv, which predicts matrices of \emph{possible} permutations that accommodate the label uncertainty introduced by censored samples. Our experiments reveal that Diffsurv matches or outperforms established baselines in various semi-simulated and real-world risk prediction scenarios. |
Andre Vauvelle · Benjamin Wild · Roland Eils · Spiros Denaxas

Latent Random Steps as Relaxations of Max-Cut, Min-Cut, and More (Poster)
Algorithms for node clustering typically focus on finding homophilous structure in graphs. That is, they find sets of similar nodes with many edges within, rather than across, the clusters. However, graphs often also exhibit heterophilous structure, as exemplified by (nearly) bipartite and tripartite graphs, where most edges occur across the clusters. Grappling with such structure is typically left to the task of graph simplification. We present a probabilistic model based on non-negative matrix factorization which unifies clustering and simplification, and provides a framework for modeling arbitrary graph structure. Our model factorizes the process of taking a random walk on the graph, and it permits an unconstrained parametrization, allowing for optimization via simple gradient descent. By relaxing the hard clustering to a soft clustering, our algorithm relaxes potentially hard clustering problems to tractable ones. We illustrate our model and algorithm's capabilities on a synthetic graph, as well as simple unsupervised learning tasks involving bipartite and tripartite clustering of orthographic and phonological data.
Sudhanshu Chanpuriya · Cameron Musco

Distributions for Compositionally Differentiating Parametric Discontinuities (Poster)
Computations in computer graphics, robotics, and probabilistic inference often require differentiating integrals with discontinuous integrands. Popular differentiable programming languages do not support the differentiation of these integrals. To address this problem, we extend distribution theory to provide semantic definitions for a broad class of programs in a programming language, Potto. Potto can differentiate parametric discontinuities under integration, and it also supports first-order functions and compositional evaluation. We formalize the meaning of programs using denotational semantics and the evaluation of programs using operational semantics. We prove correctness theorems about the semantics and prove that the operational semantics are compositional, enabling separate compilation of programs and overcoming compile-time bottlenecks. Using Potto, we implement a prototype differentiable renderer with separately compiled shaders. |
Jesse Michel · Kevin Mu · Xuanda Yang · Sai Praveen Bangaru · Elias Rojas Collins · Gilbert Bernstein · Jonathan Ragan-Kelley · Michael Carbin · Tzu-Mao Li

Stochastic Gradient Bayesian Optimal Experimental Designs for Simulation Based Inference (Poster)
Simulation-based inference (SBI) methods tackle complex scientific models with challenging inverse problems. However, SBI models often face a significant hurdle due to their non-differentiable nature, which hampers the use of gradient-based optimization techniques. Bayesian Optimal Experimental Design (BOED) is a powerful approach that aims to make the most efficient use of experimental resources for improved inferences. While stochastic gradient BOED methods have shown promising results in high-dimensional design problems, they have mostly neglected the integration of BOED with SBI due to the difficult non-differentiable property of many SBI simulators. In this work, we establish a crucial connection between ratio-based SBI inference algorithms and stochastic gradient-based variational inference by leveraging mutual information bounds. This connection allows us to extend BOED to SBI applications, enabling the simultaneous optimization of experimental designs and amortized inference functions. We demonstrate our approach on a simple linear model and offer implementation details for practitioners. |
Vincent Zaballa · Elliot Hui

PDP: Parameter-free Differentiable Pruning is All You Need (Poster)
In this paper, we propose an efficient yet effective train-time pruning scheme, Parameter-free Differentiable Pruning (PDP), which offers state-of-the-art qualities in model size, accuracy, and training cost. PDP uses a dynamic function of weights during training to generate soft pruning masks for the weights in a parameter-free manner for a given pruning target. While differentiable, the simplicity and efficiency of PDP make it universal enough to deliver state-of-the-art random/structured/channel pruning results on various vision models. For example, for MobileNet-v1, PDP can achieve 68.2% top-1 ImageNet1k accuracy at 86.6% sparsity, which is 1.7% higher accuracy than those from the state-of-the-art algorithms. PDP also improved the top-1 ImageNet1k accuracy of ResNet18 by over 3.6% and reduced the top-1 ImageNet1k accuracy of ResNet50 by 0.6% from the state-of-the-art. |
Minsik Cho · Saurabh Adya · Devang Naik
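For intuition about soft, parameter-free pruning masks, the sketch below gates weights with a sigmoid around a magnitude threshold taken from the target-sparsity quantile of the current weights. This is one plausible instantiation of the general idea under assumptions of ours (quantile threshold, fixed temperature); it is not PDP's actual mask function.

```python
import torch

def soft_pruning_mask(weights: torch.Tensor, target_sparsity: float, temperature: float = 0.01) -> torch.Tensor:
    """Generic parameter-free soft mask: a sigmoid gate around a magnitude threshold.

    The threshold is the |w| quantile matching the target sparsity, so the mask is a
    function of the current weights only (no extra trainable mask parameters).
    """
    magnitudes = weights.abs().flatten()
    # Treat the threshold as a constant of the current weights.
    threshold = torch.quantile(magnitudes, target_sparsity).detach()
    return torch.sigmoid((weights.abs() - threshold) / temperature)   # ~0 below, ~1 above

w = torch.randn(64, 64, requires_grad=True)
masked_w = w * soft_pruning_mask(w, target_sparsity=0.9)
masked_w.pow(2).sum().backward()      # gradients flow to all weights through the soft mask
```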
EH-DNAS: End-to-End Hardware-aware Differentiable Neural Architecture Search (Poster)
In hardware-aware Differentiable Neural Architecture Search (DNAS), it is challenging to integrate hardware metrics into network architecture search. To handle hardware metrics, such as inference latency, existing works mainly rely on linear approximations and lack support for various customized hardware. In this work, we propose End-to-end Hardware-aware DNAS (EH-DNAS), a seamless integration of an end-to-end hardware performance differentiable approximation and a fully automated DNAS, to deliver hardware-efficient deep neural networks on various hardware, including Edge GPUs, Edge TPUs, Mobile CPUs, and customized accelerators. Given a targeted hardware platform, we propose to learn a differentiable model predicting the end-to-end hardware performance of the neural network architectures during DNAS. We also propose E2E-Perf, a benchmarking tool to expand our design to support customized accelerators. Experiments on CIFAR10 and ImageNet show that EH-DNAS improves hardware performance by an average of 1.5 times over state-of-the-art efficient networks on customized accelerators and existing hardware processors, while maintaining highly competitive model inference accuracy.
Qian Jiang · Xiaofan Zhang · Deming Chen · Minh Do · Raymond A. Yeh

Differentiable Forward Projector for X-ray Computed Tomography (Poster)
Data-driven deep learning has been successfully applied to various computed tomographic reconstruction problems. The deep inference models may outperform existing analytical and iterative algorithms, especially in ill-posed CT reconstruction. However, those methods often predict images that do not agree with the measured projection data. This paper presents an accurate differentiable forward and back projection software library to ensure the consistency between the predicted images and the original measurements. The software library efficiently supports various projection geometry types while minimizing the GPU memory footprint requirement, which facilitates seamless integration with existing deep learning training and inference pipelines. The proposed software is available as open source: https://github.com/LLNL/LEAP. |
Hyojin Kim · Kyle Champley

Differentiable Search of Evolutionary Trees from Leaves (Poster)
Inferring the most probable evolutionary tree given leaf nodes is an important problem in computational biology that reveals the evolutionary relationships between species. Due to the exponential growth of possible tree topologies, finding the best tree in polynomial time becomes computationally infeasible. In this work, we propose a novel differentiable approach as an alternative to traditional heuristic-based combinatorial tree search methods in phylogeny. The optimization objective of interest in this work is to find the most parsimonious tree (i.e., to minimize the total number of evolutionary changes in the tree). We empirically evaluate our method using randomly generated trees of up to 128 leaves, with each node represented by a 256-length protein sequence. Our method exhibits promising convergence ($<1$% error for trees up to 32 leaves, $<8$% error up to 128 leaves, given only leaf node information), illustrating its potential in much broader phylogenetic inference problems and possible integration with end-to-end differentiable models. The code to reproduce the experiments in this paper can be found at https://github.ramith.io/diff-evol-tree-search.
Ramith Hettiarachchi · Sergey Ovchinnikov

Koopman Constrained Policy Optimization: A Koopman operator theoretic method for differentiable optimal control in robotics (Poster)
We introduce Koopman Constrained Policy Optimization (KCPO), combining implicitly differentiable model predictive control with a deep Koopman autoencoder for robot learning in unknown and nonlinear dynamical systems. KCPO is a new policy optimization algorithm that trains neural policies end-to-end with hard box constraints on controls. Guaranteed satisfaction of hard constraints helps ensure the performance and safety of robots. We perform imitation learning with KCPO to recover expert policies on the Simple Pendulum, Cartpole Swing-Up, Reacher, and Differential Drive environments, outperforming baseline methods in generalizing to out-of-distribution constraints in most environments after training. |
Matthew Retchin · Brandon Amos · Steven Brunton · Shuran Song

Sample-efficient learning of auditory object representations using differentiable impulse response synthesis (Poster)
Many of the sounds we hear in daily life are generated by contact between objects. Rigid objects are often well approximated as linear systems, such that impulse responses can be used to predict their vibrational behavior. Impulse responses carry information about material and shape. Previous research has shown that impulse responses measured from objects can be used to generate realistic impact, scraping and rolling sounds. However, it has been unclear how to efficiently synthesize impulse responses for objects of a particular material and size. Here we present an analysis-by-synthesis technique that uses a differentiable impulse response synthesis model to infer generative parameters of a measured impulse response. Then, we introduce a way of representing auditory material as distributions in the generative parameter space. Object impulse responses can be sampled from these distributions to render convincingly realistic contact sounds. |
Vinayak Agarwal · James Traer · Josh Mcdermott

TaskMet: Task-Driven Metric Learning for Model Learning (Poster)
Deep learning models are often used with some downstream task. Models solely trained to achieve accurate predictions may struggle to perform well on the desired downstream tasks. We propose using the task's loss to learn a metric which parameterizes a loss to train the model. This approach does not alter the optimal prediction model itself, but rather changes the model learning to emphasize the information important for the downstream task. This enables us to achieve the best of both worlds: a prediction model trained in the original prediction space while also being valuable for the desired downstream task. We validate our approach through experiments conducted in two main settings: 1) decision-focused model learning scenarios involving portfolio optimization and budget allocation, and 2) reinforcement learning in noisy environments with distracting states.
Dishank Bansal · Ricky T. Q. Chen · Mustafa Mukadam · Brandon Amos

Lagrangian Proximal Gradient Descent for Learning Convex Optimization Models (Poster)
We propose Lagrangian Proximal Gradient Descent (LPGD), a flexible framework for learning convex optimization models. Like traditional proximal gradient methods, LPGD can be interpreted as optimizing a smoothed envelope of the possibly non-differentiable loss. The smoothing allows training models that do not provide informative gradients, such as discrete optimization models. We show that the LPGD update can be computed efficiently by rerunning the forward solver on a perturbed input. Moreover, we prove that the LPGD update converges to the gradient as the smoothing parameter approaches zero. Finally, we experimentally investigate the potential benefits of applying LPGD even in a fully differentiable setting.
Anselm Paulus · Vit Musil · Georg Martius
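The "rerun the forward solver on a perturbed input" pattern mentioned in the abstract can be illustrated with the earlier blackbox-solver differentiation scheme, sketched below for a solver that minimizes a linear objective over a feasible set. The interpolation parameter, sign convention, and toy solver are assumptions of this sketch, not the LPGD update itself.

```python
import torch

class PerturbedSolverGrad(torch.autograd.Function):
    """Backward pass obtained by rerunning a (possibly non-differentiable) solver
    on an input perturbed in the direction of the incoming gradient.

    Assumes solver(w) = argmin_y <w, y> over some feasible set (linear objective),
    as in blackbox combinatorial-solver differentiation; a sketch, not LPGD itself.
    """

    @staticmethod
    def forward(ctx, w, solver, lam):
        y = solver(w)
        ctx.save_for_backward(w, y)
        ctx.solver, ctx.lam = solver, lam
        return y

    @staticmethod
    def backward(ctx, grad_y):
        w, y = ctx.saved_tensors
        with torch.no_grad():
            y_perturbed = ctx.solver(w + ctx.lam * grad_y)   # second solver call
        grad_w = -(y_perturbed - y) / ctx.lam                # finite-difference surrogate gradient
        return grad_w, None, None

# Toy solver: pick the single smallest-cost item (argmin over one-hot vectors).
def one_hot_argmin(w):
    out = torch.zeros_like(w)
    out[w.argmin()] = 1.0
    return out

w = torch.tensor([0.3, 0.1, 0.7], requires_grad=True)
y = PerturbedSolverGrad.apply(w, one_hot_argmin, 10.0)
loss = ((y - torch.tensor([0.0, 0.0, 1.0])) ** 2).sum()
loss.backward()
print(w.grad)   # pushes w so the solver prefers the target item
```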
Some challenges of calibrating differentiable agent-based models (Poster)
Agent-based models (ABMs) are a promising approach to modelling and reasoning about complex systems, yet their application in practice is impeded by their complexity, discrete nature, and the difficulty of performing parameter inference and optimisation tasks. This in turn has sparked interest in the construction of differentiable ABMs as a strategy for combatting these difficulties, yet a number of challenges remain. In this paper, we discuss and present experiments that highlight some of these challenges, along with potential solutions. |
Arnau Quera Bofarull · Joel Dyer · Anisoara Calinescu · Michael Wooldridge

Differentiable MaxSAT Message Passing (Poster)
The message-passing principle is used in the most popular neural networks for graph-structured data. However, current message-passing approaches use black-box neural models that transform features over a continuous domain, thus limiting the descriptive capability of GNNs. In this work, we explore a novel type of message passing based on a differentiable satisfiability solver. Our model learns logical rules that encode which messages are passed from one node to another and how. The rules are learned in a relaxed continuous space, which renders the training process end-to-end differentiable and thus enables standard gradient-based training. In our experiments we show that MaxSAT-MP learns arithmetic operations and is on par with state-of-the-art GNNs on graph-structured data.
Francesco Alesiani · Cristóbal Corvalán Morbiducci · Markus Zopf

SIMPLE: A Gradient Estimator for $k$-subset Sampling (Poster)
$k$-subset sampling is ubiquitous in machine learning, enabling regularization and interpretability through sparsity. The challenge lies in rendering $k$-subset sampling amenable to end-to-end learning. This has typically involved relaxing the reparameterized samples to allow for backpropagation, with the risk of introducing high bias and high variance. In this work, we fall back to discrete $k$-subset sampling on the forward pass. This is coupled with using the gradient with respect to the exact marginals, computed efficiently, as a proxy for the true gradient. We show that our gradient estimator, SIMPLE, exhibits lower bias and variance compared to state-of-the-art estimators, including the straight-through Gumbel estimator when $k=1$. Empirical results show improved performance on learning to explain and sparse linear regression. We give an algorithm computing the exact ELBO for the $k$-subset distribution, obtaining significantly lower loss than the state of the art.
Kareem Ahmed · Zhe Zeng · Mathias Niepert · Guy Van den Broeck
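For contrast with SIMPLE, here is the kind of baseline it improves upon: a straight-through $k$-subset sampler that draws a hard Gumbel-top-$k$ sample on the forward pass and backpropagates through a crude softmax relaxation. The soft proxy below is an illustrative choice; SIMPLE instead uses gradients of the exact marginals, which this sketch does not implement.

```python
import torch
import torch.nn.functional as F

def straight_through_k_subset(logits: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Sample a hard k-hot vector (Gumbel-top-k) but backpropagate through a soft relaxation."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))       # Gumbel(0, 1) noise
    perturbed = logits + gumbel
    topk = perturbed.topk(k, dim=-1).indices
    hard = torch.zeros_like(logits).scatter_(-1, topk, 1.0)        # discrete k-hot sample
    soft = F.softmax(perturbed / tau, dim=-1) * k                  # crude differentiable proxy
    return hard + (soft - soft.detach())                           # straight-through estimator

logits = torch.randn(2, 8, requires_grad=True)
subset = straight_through_k_subset(logits, k=3)                    # each row sums to 3 in the forward pass
(subset * torch.randn(2, 8)).sum().backward()
print(logits.grad.shape)
```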
Interpretable Neural-Symbolic Concept Reasoning (Poster)
Deep learning methods are highly accurate, yet their opaque decision process prevents them from earning full human trust. Concept-based models aim to address this issue by learning tasks based on a set of human-understandable concepts. However, state-of-the-art concept-based models rely on high-dimensional concept embedding representations which lack a clear semantic meaning, thus questioning the interpretability of their decision process. To overcome this limitation, we propose the Deep Concept Reasoner (DCR), the first interpretable concept-based model that builds upon concept embeddings. In DCR, neural networks do not make task predictions directly, but they build syntactic rule structures using concept embeddings. DCR then executes these rules on meaningful concept truth degrees to provide a final interpretable and semantically-consistent prediction in a differentiable manner. Our experiments show that DCR improves up to +25% w.r.t. state-of-the-art interpretable concept-based models on challenging benchmarks, and discovers meaningful logic rules matching known ground truths even in the absence of concept supervision during training. |
Pietro Barbiero · Gabriele Ciravegna · Francesco Giannini · Mateo Espinosa Zarlenga · Lucie Charlotte Magister · Alberto Tonda · Pietro Lió · Frederic Precioso · Mateja Jamnik · Giuseppe Marra

Dilated Convolution with Learnable Spacings: beyond bilinear interpolation (Poster)
Dilated Convolution with Learnable Spacings (DCLS) is a recently proposed variation of the dilated convolution in which the spacings between the non-zero elements in the kernel, or equivalently their positions, are learnable. Non-integer positions are handled via interpolation. Thanks to this trick, positions have well-defined gradients. The original DCLS used bilinear interpolation, and thus only considered the four nearest pixels. Yet here we show that longer range interpolations, and in particular a Gaussian interpolation, allow improving performance on ImageNet1k classification on two state-of-the-art convolutional architectures (ConvNeXt and Conv-Former), without increasing the number of parameters. The method code is based on PyTorch and is available at https://github.com/K-H-Ismail/Dilated-Convolution-with-Learnable-Spacings-PyTorch. |
Ismail Khalfaoui Hassani · Thomas Pellegrini · Timothée Masquelier
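A toy 1-D sketch of the interpolation idea behind this line of work: kernel weights live at learnable real-valued positions and are spread onto the integer grid with normalized Gaussian bumps, so the positions themselves receive well-defined gradients. The sizes, the normalization, and the fixed sigma are illustrative assumptions, not the library's implementation.

```python
import torch

def build_kernel_from_positions(weights, positions, kernel_size, sigma=1.0):
    """Place learnable weights at continuous 1-D positions via Gaussian interpolation.

    Each weight is spread over the integer grid with a normalized Gaussian bump,
    so gradients reach both the weights and their (real-valued) positions.
    """
    grid = torch.arange(kernel_size, dtype=weights.dtype)              # integer taps 0..K-1
    bumps = torch.exp(-(grid[None, :] - positions[:, None]) ** 2 / (2 * sigma ** 2))
    bumps = bumps / bumps.sum(dim=-1, keepdim=True)                    # each weight sums to 1
    return (weights[:, None] * bumps).sum(dim=0)                       # dense kernel of size K

weights = torch.randn(4, requires_grad=True)                 # 4 non-zero kernel elements
positions = torch.tensor([0.5, 2.3, 4.9, 6.1], requires_grad=True)
kernel = build_kernel_from_positions(weights, positions, kernel_size=8)
kernel.pow(2).sum().backward()
print(positions.grad)                                        # positions get gradients too
```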
Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning (Poster)
Many real-world optimization problems contain unknown parameters that must be predicted prior to solving. To train the predictive machine learning (ML) models involved, the commonly adopted approach focuses on maximizing predictive accuracy. However, this approach does not always lead to the minimization of the downstream task loss. Decision-focused learning (DFL) is a recently proposed paradigm whose goal is to train the ML model by directly minimizing the task loss. However, state-of-the-art DFL methods are limited by the assumptions they make about the structure of the optimization problem (e.g., that the problem is linear) and by the fact that they can only predict parameters that appear in the objective function. In this work, we address these limitations by instead predicting \textit{distributions} over parameters and adopting score function gradient estimation (SFGE) to compute decision-focused updates to the predictive model, thereby widening the applicability of DFL. Our experiments show that by using SFGE we can: (1) deal with predictions that occur both in the objective function and in the constraints; and (2) effectively tackle two-stage stochastic optimization problems.
Mattia Silvestri · Senne Berden · Jayanta Mandi · Ali Mahmutoğulları · Maxime Mulamba Ke Tchomba · Allegra De Filippo · Tias Guns · Michele Lombardi

Probabilistic Task-Adaptive Graph Rewiring (Poster)
Message-passing graph neural networks (MPNNs) emerged as powerful tools for processing graph-structured input. However, they operate on a fixed graph structure, ignoring potential noise and missing information. In addition, due to their purely local aggregation mechanism, they are susceptible to phenomena such as over-smoothing, over-squashing, or under-reaching. Hence, devising principled approaches for learning to focus on graph structure relevant to the given prediction task remains an open challenge. In this work, leveraging recent progress in differentiable $k$-subset sampling, we devise a novel task-adaptive graph rewiring approach, which learns to add relevant edges while omitting less beneficial ones. We empirically demonstrate on synthetic datasets that our approach effectively alleviates the issues of over-squashing and under-reaching. In addition, on established real-world datasets, we demonstrate that our method is competitive or superior to conventional MPNN models and graph transformer architectures regarding predictive performance and computational efficiency.
Chendi Qian · Andrei Manolache · Kareem Ahmed · Zhe Zeng · Guy Van den Broeck · Mathias Niepert · Christopher Morris

Differentiable Sampling of Categorical Distributions Using the CatLog-Derivative Trick (Poster)
Categorical random variables can faithfully represent the discrete and uncertain aspects of data as part of a discrete latent variable model. Learning in such models necessitates taking gradients with respect to the parameters of the categorical probability distributions, which is often intractable due to their combinatorial nature. A popular technique to estimate these otherwise intractable gradients is the Log-Derivative trick. This trick forms the basis of the well-known REINFORCE gradient estimator and its many extensions. While the Log-Derivative trick allows us to differentiate through samples drawn from categorical distributions, it does not take into account the discrete nature of the distribution itself. Our first contribution addresses this shortcoming by introducing the CatLog-Derivative trick -- a variation of the Log-Derivative trick tailored towards categorical distributions. Secondly, we use the CatLog-Derivative trick to introduce IndeCateR, a novel and unbiased gradient estimator for the important case of products of independent categorical distributions with provably lower variance than REINFORCE. Thirdly, we empirically show that IndeCateR can be efficiently implemented and that its gradient estimates have significantly lower bias and variance for the same number of samples compared to the state of the art. |
Lennert De Smet · Emanuele Sansone · Pedro Zuidberg Dos Martires
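For reference, the plain Log-Derivative (REINFORCE) estimator that the abstract starts from can be written in a few lines for a single categorical variable; it is the high-variance baseline that the CatLog-Derivative trick and IndeCateR improve upon. The reward function and sample count below are illustrative.

```python
import torch

def reinforce_gradient(logits: torch.Tensor, reward_fn, num_samples: int = 1000) -> torch.Tensor:
    """Log-Derivative trick: grad E_x[f(x)] = E_x[f(x) * grad log p(x)] for x ~ Categorical(logits)."""
    probs = torch.softmax(logits, dim=-1)
    dist = torch.distributions.Categorical(probs=probs)
    samples = dist.sample((num_samples,))                      # discrete, non-differentiable draws
    rewards = reward_fn(samples).detach()                      # f(x); treated as a black box
    log_probs = dist.log_prob(samples)                         # differentiable w.r.t. logits
    surrogate = (rewards * log_probs).mean()
    return torch.autograd.grad(surrogate, logits)[0]

logits = torch.zeros(5, requires_grad=True)
reward = lambda x: (x == 3).float()                            # reward only for category 3
# Estimate of grad_logits E[f]; ascending it raises the probability of category 3.
print(reinforce_gradient(logits, reward))
```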
SelMix: Selective Mixup Fine Tuning for Optimizing Non-Decomposable Metrics (Poster)
Natural data often has class imbalance. This can make it difficult for machine learning models to learn to classify minority classes accurately. Industrial machine-learning applications often have objectives beyond just accuracy. For example, models may be required to meet certain fairness criteria, such as not being biased against the classes with fewer samples. These objectives are often non-decomposable in nature. SelMix is a fine-tuning technique that can be used to improve the performance of machine learning models on imbalanced data. The core idea of our framework is to determine a sampling distribution to perform a mixup of features between samples from particular classes such that it optimizes the given objective. We evaluate our technique against existing empirical methods on standard benchmark datasets for imbalanced classification.
shrinivas ramasubramanian · Harsh Rangwani · Sho Takemori · Kunal Samanta · Yuhei Umeda · Venkatesh Babu Radhakrishnan

Dynamic Control of Queuing Networks via Differentiable Discrete-Event Simulation (Poster)
Queuing network control is a problem that arises in many applications such as manufacturing, communications networks, call centers, hospital systems, etc. Reinforcement Learning (RL) offers a broad set of tools for training controllers for general queuing networks, but standard model-free approaches suffer from high variance of trajectories, large state and action spaces, and instability. In this work, we develop a modeling framework for queuing networks based on discrete-event simulation. This model allows us to leverage tools from the gradient estimation literature to compute approximate first-order gradients of sample-path performance metrics through auto-differentiation, despite discrete dynamics of the system. Using this framework, we derive gradient-based RL algorithms for policy optimization and planning. We observe that these methods improve sample efficiency, stabilize the system even when starting from a random initialization, and are capable of handling non-stationary, large-scale instances. |
Ethan Che · Hongseok Namkoong · Jing Dong

A Unified Approach to Count-Based Weakly-Supervised Learning (Poster)
High-quality labels are often very scarce, whereas unlabeled data with inferred weak labels occurs more naturally. In many cases, these weak labels dictate the frequency of each respective class over a set of instances. In this paper, we develop a unified approach to learning from such weakly-labeled data, which we call *count-based weakly-supervised learning*. At the heart of our approach is the ability to compute the probability of exactly $k$ out of $n$ outputs being set to true. This computation is differentiable, exact, and efficient. Building upon the previous computation, we derive a *count loss* penalizing the model for deviations in its distribution from an arithmetic constraint defined over label counts. We evaluate our approach on three common weakly-supervised learning paradigms and observe that our proposed approach achieves state-of-the-art or highly competitive results across all three of the paradigms.
Vinay Shukla · Zhe Zeng · Kareem Ahmed · Guy Van den Broeck
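The core primitive described above, the probability that exactly $k$ of $n$ independent Bernoulli outputs are true, can be computed exactly with a short dynamic program that stays differentiable in the per-output probabilities. The sketch below shows that computation in isolation (a standard Poisson-binomial recursion), not the paper's full count loss.

```python
import torch

def prob_exactly_k(probs: torch.Tensor, k: int) -> torch.Tensor:
    """Exact, differentiable P(sum_i Bernoulli(p_i) == k), computed by dynamic programming.

    dp[j] is P(j of the outputs processed so far are true); O(n * k) time, no sampling.
    """
    dp = torch.cat([torch.ones(1, dtype=probs.dtype), torch.zeros(k, dtype=probs.dtype)])
    for p in probs:
        stay = dp * (1 - p)                                                   # this output is false
        shift = torch.cat([torch.zeros(1, dtype=probs.dtype), dp[:-1] * p])   # this output is true
        dp = stay + shift
    return dp[k]

logits = torch.randn(10, requires_grad=True)
probs = torch.sigmoid(logits)
loss = -torch.log(prob_exactly_k(probs, k=4))    # count-style loss: exactly 4 positives expected
loss.backward()
print(logits.grad)
```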
Data Models for Dataset Drift Controls in Machine Learning With Optical Images (Poster)
This study addresses robustness concerns in machine learning due to dataset drift by integrating physical optics with machine learning to create explicit, differentiable data models. These models illuminate the impact of data generation on model performance and facilitate drift synthesis, precise tolerancing of model sensitivity (drift forensics), and beneficial drift creation (drift optimization). Accompanying the study are two datasets, Raw-Microscopy and Raw-Drone, available at https://github.com/aiaudit-org/raw2logit. Note: The full-length archival version of this manuscript can be found in the Transactions on Machine Learning Research (TMLR) at https://openreview.net/forum?id=I4IkGmgFJz.
Luis Oala · Marco Aversa · Gabriel Nobis · Kurt Willis · Yoan Neuenschwander · Michèle Buck · Christian Matek · Jerome Extermann · Enrico Pomarico · Wojciech Samek · Roderick Murray-Smith · Christoph Clausen · Bruno Sanguinetti

A Gradient Flow Modification to Improve Learning from Differentiable Quantum Simulators (Poster)
Propagating gradients through differentiable simulators allows us to improve the training of deep learning architectures. We study an example from quantum physics that, at first glance, seems not to benefit from such gradients. Our analysis shows the problem is rooted in a mismatch between the specific form of loss functions used in quantum physics and their gradients; the gradient can vanish for non-equal states. We propose to add a scaling term to fix this problematic gradient flow and regain the benefits of gradient-based optimization. We chose two experiments on the Schroedinger equation, a prediction task and a control task, to demonstrate the potential of our method.
Patrick Schnell · Nils Thuerey

Differentiating Metropolis-Hastings to Optimize Intractable Densities (Poster)
We develop an algorithm for automatic differentiation of Metropolis-Hastings samplers, allowing us to differentiate through probabilistic inference, even if the model has discrete components within it. Our approach fuses recent advances in stochastic automatic differentiation with traditional Markov chain coupling schemes, providing an unbiased and low-variance gradient estimator. This allows us to apply gradient-based optimization to objectives expressed as expectations over intractable target densities. We demonstrate our approach by finding an ambiguous observation in a Gaussian mixture model and by maximizing the specific heat in an Ising model. |
Gaurav Arya · Ruben Seyer · Frank Schäfer · Kartik Chandra · Alexander Lew · Mathieu Huot · Vikash Mansinghka · Jonathan Ragan-Kelley · Christopher Rackauckas · Moritz Schauer

A Short Review of Automatic Differentiation Pitfalls in Scientific Computing (Poster)
Automatic differentiation, also known as backpropagation, AD, autodiff, or algorithmic differentiation, is a popular technique for computing derivatives of computer programs. While AD has been successfully used in countless engineering, science and machine learning applications, it can sometimes nevertheless produce surprising results. In this paper we categorize problematic usages of AD and illustrate each category with examples such as chaos, time-averages, discretizations, fixed-point loops, lookup tables, linear solvers, and probabilistic programs, in the hope that readers may more easily avoid or detect such pitfalls. |
Jan Hueckelheim · Harshitha Menon · William Moses · Bruce Christianson · Paul Hovland · Laurent Hascoet
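One of the simplest pitfalls in this family concerns branches: a program can return the correct value while AD returns the wrong derivative. The toy example below (illustrative, not from the paper) patches the 0/0 case of x*x/x with a branch, so the value at zero is right but the reported derivative is 0 instead of 1.

```python
import torch

def safe_identity(x: torch.Tensor) -> torch.Tensor:
    """Mathematically equal to f(x) = x, but written as x*x/x with the 0/0 case patched by a branch."""
    if x.item() == 0.0:
        return 0.0 * x          # value is correct (0), but AD records a zero derivative here
    return x * x / x

x = torch.tensor(0.0, requires_grad=True)
y = safe_identity(x)
y.backward()
print(y.item(), x.grad.item())  # prints 0.0 0.0 -- the true derivative of f(x) = x is 1
```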
Lossless hardening with $\partial\mathbb{B}$ nets (Poster)
$\partial\mathbb{B}$ nets are differentiable neural networks that learn discrete boolean-valued functions by gradient descent. $\partial\mathbb{B}$ nets have two semantically equivalent aspects: a differentiable soft-net, with real weights, and a non-differentiable hard-net, with boolean weights. We train the soft-net by backpropagation and then "harden" the learned weights to yield boolean weights that bind with the hard-net. The result is a learned discrete function. Unlike existing approaches to neural network binarization the "hardening" operation involves no loss of accuracy. Preliminary experiments demonstrate that $\partial\mathbb{B}$ nets achieve comparable performance on standard machine learning problems yet are compact (due to 1-bit weights) and interpretable (due to the logical nature of the learnt functions).
Ian Wright

Learning Observation Models with Incremental Non-Differentiable Graph Optimizers in the Loop for Robotics State Estimation (Poster)
We consider the problem of learning observation models for robot state estimation with incremental non-differentiable optimizers in the loop. Convergence to the correct belief over the robot state is heavily dependent on a proper tuning of observation models which serve as input to the optimizer. We propose a gradient-based learning method which converges much quicker to model estimates that lead to solutions of much better quality compared to an existing state-of-the-art method as measured by the tracking accuracy over unseen robot test trajectories. |
Mohamad Qadri · Michael Kaess

Differentiable Set Partitioning (Poster)
Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We overcome this limitation by proposing a novel two-step method for inferring partitions, which allows its usage in variational inference tasks. This new approach enables reparameterized gradients with respect to the parameters of the new random partition model. Our method works by first inferring the number of elements per subset and, second, by filling these subsets in a learned order. We highlight the versatility of our general-purpose approach on two different challenging experiments: multitask learning and inference of shared and independent generative factors under weak supervision.
Thomas Sutter · Alain Ryser · Joram Liebeskind · Julia Vogt

Landscape Surrogate: Learning Decision Losses for Mathematical Optimization Under Partial Information (Poster)
Recent works in learning-integrated optimization have shown promise in settings where the optimization problem is only partially observed or where general-purpose optimizers perform poorly without expert tuning. By learning an optimizer $\mathbf{g}$ to tackle these challenging problems with $f$ as the objective, the optimization process can be substantially accelerated by leveraging past experience. Training the optimizer can be done with supervision from known optimal solutions (not always available) or implicitly by optimizing the compound function $f\circ \mathbf{g}$, but the implicit approach is slow and challenging due to frequent calls to the optimizer and sparse gradients, particularly for combinatorial solvers. To address these challenges, we propose using a smooth and learnable **Landscape Surrogate** $\mathcal{M}$ instead of $f\circ \mathbf{g}$. This surrogate can be computed faster than $\mathbf{g}$, provides dense and smooth gradients during training, can generalize to unseen optimization problems, and is efficiently learned via alternating optimization. We test our approach on both synthetic problems and real-world problems, achieving comparable or superior objective values compared to state-of-the-art baselines while reducing the number of calls to $\mathbf{g}$. Notably, our approach outperforms existing methods for computationally expensive high-dimensional problems.
Arman Zharmagambetov · Brandon Amos · Aaron Ferber · Taoan Huang · Bistra Dilkina · Yuandong Tian

Investigating Axis-Aligned Differentiable Trees through Neural Tangent Kernels (Poster)
Axis-aligned rules are known to induce an important inductive bias in machine learning models such as typical hard decision tree ensembles. However, theoretical understanding of the learning behavior is largely unrevealed due to the discrete nature of rules. To address this issue, we impose the axis-aligned constraint on differentiable decision trees, or soft trees, which relax the splitting process of decision trees and are trained using the gradient method. The differentiable property enables us to derive their Neural Tangent Kernel (NTK) that can analytically describe the training behavior. Two cases are realized: imposing the axis-aligned constraint throughout the entire training process, or only at the initial state. Moreover, we extend the NTK framework to handle various tree architectures simultaneously, and prove that any axis-aligned non-oblivious tree ensemble can be transformed into an axis-aligned oblivious tree ensemble with the same limiting NTK. By excluding non-oblivious trees from the search space, the cost of trial-and-error procedures required for model selection can be massively reduced. |
Ryuichi Kanoh · Mahito Sugiyama

PMaF: Deep Declarative Layers for Principal Matrix Features (Poster)
We explore two differentiable deep declarative layers, namely least squares on sphere (LESS) and implicit eigen decomposition (IED), for learning the principal matrix features (PMaF). It can be used to represent data features with a low-dimensional vector containing dominant information from a high-dimensional matrix. We first solve the problems with iterative optimization in the forward pass and then backpropagate the solution for implicit gradients under a bi-level optimization framework. Particularly, adaptive descent steps with the backtracking line search method and descent decay in the tangent space are studied to improve the forward pass efficiency of LESS. Meanwhile, exploited data structures are used to greatly reduce the computational complexity in the backward pass of LESS and IED. Empirically, we demonstrate the superiority of our layers over the off-the-shelf baselines by comparing the solution optimality and computational requirements. |
Zhiwei Xu · Hao Wang · Yanbin Liu · Stephen Gould

JAX FDM: A differentiable solver for inverse form-finding (Poster)
We introduce JAX FDM, a differentiable solver to design mechanically efficient shapes for 3D structures, such as domes, cable nets and towers, conditioned on target architectural, fabrication and structural properties. JAX FDM solves these inverse form-finding problems by combining the force density method, differentiable sparsity and gradient-based optimization. JAX FDM can be paired with optimization and neural network libraries in the JAX ecosystem to facilitate the integration of form-finding simulations into neural networks. We showcase the features of JAX FDM in two structural design examples. JAX FDM is available as an open-source library. |
Rafael Pastrana · Deniz Oktay · Ryan P. Adams · Sigrid Adriaenssens

Fine-Tuning Language Models with Just Forward Passes (Poster)
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
Sadhika Malladi · Tianyu Gao · Eshaan Nichani · Jason Lee · Danqi Chen · Sanjeev Arora
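The two-forward-pass zeroth-order estimate that MeZO builds on (classical SPSA/ZO-SGD) can be sketched in a few lines for a generic parameter vector; the in-place, seed-based resampling that gives MeZO its inference-level memory footprint is deliberately omitted, and the toy objective and step sizes are illustrative.

```python
import torch

def zo_sgd_step(params: torch.Tensor, loss_fn, lr: float = 1e-3, eps: float = 1e-3) -> torch.Tensor:
    """One zeroth-order SGD step using only two forward passes (no backpropagation).

    The gradient is estimated as g ~ [L(theta + eps*z) - L(theta - eps*z)] / (2*eps) * z
    for a single Gaussian probe z (SPSA). MeZO additionally regenerates z from a saved
    seed to keep memory at inference level, which this sketch skips.
    """
    z = torch.randn_like(params)
    with torch.no_grad():
        loss_plus = loss_fn(params + eps * z)
        loss_minus = loss_fn(params - eps * z)
        grad_estimate = (loss_plus - loss_minus) / (2 * eps) * z
        return params - lr * grad_estimate

# Toy objective: works even though loss_fn could be non-differentiable (e.g., accuracy).
target = torch.ones(10)
loss_fn = lambda p: ((p - target) ** 2).sum()
theta = torch.zeros(10)
for _ in range(200):
    theta = zo_sgd_step(theta, loss_fn, lr=0.05)
print(loss_fn(theta))   # decreases toward 0
```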
Differentiable Causal Discovery with Smooth Acyclic Orientations (Poster)
Most differentiable causal discovery approaches constrain or regularize an optimization problem using a continuous relaxation of the acyclicity property. The cost of computing the relaxation is cubic in the number of nodes and thus affects the scalability of such techniques. In this work, we introduce COSMO, the first quadratic and constraint-free continuous optimization scheme. COSMO represents a directed acyclic graph as a priority vector on the nodes and an adjacency matrix. We prove that the priority vector represents a differentiable approximation of the acyclic orientation of the graph, and we demonstrate the existence of an upper bound on the orientation acyclicity. In addition to being asymptotically faster, our empirical analysis highlights how COSMO performs comparably to constrained methods for graph discovery.
Riccardo Massidda · Francesco Landolfi · Martina Cinquini · Davide Bacciu

DNArch: Learning Convolutional Neural Architectures by Backpropagation (Poster)
We present Differentiable Neural Architectures (DNArch), a method that learns the weights and the architecture of CNNs jointly by backpropagation. DNArch enables learning (i) the size of convolutional kernels, (ii) the width of all layers, (iii) the position and value of downsampling layers, and (iv) the depth of the network. DNArch treats neural architectures as continuous entities and uses learnable differentiable masks to control their size. Unlike existing methods, DNArch is not limited to a (small) predefined set of possible components, but instead it is able to discover CNN architectures across all feasible combinations of kernel sizes, widths, depths and downsampling. Empirically, DNArch finds effective architectures for classification and dense prediction tasks on sequential and image data. By adding a loss term that controls the network complexity, DNArch constrains its search to architectures that respect a predefined computational budget during training. |
David Romero · Neil Zeghidour

Towards Understanding Gradient Approximation in Equality Constrained Deep Declarative Networks (Poster)
We explore conditions for when the gradient of a deep declarative node can be approximated by ignoring constraint terms and still result in a descent direction for the global loss function. This has important practical application when training deep learning models since the approximation is often computationally much more efficient than the true gradient calculation. We provide theoretical analysis for problems with linear equality constraints and normalization constraints, and show examples where the approximation works well in practice as well as some cautionary tales for when it fails. |
Stephen Gould · Ming Xu · Zhiwei Xu · Yanbin Liu