Workshop
Continuous Time Perspectives in Machine Learning
Mihaela Rosca · Chongli Qin · Julien Mairal · Marc Deisenroth
Room 321 - 323
In machine learning, discrete time approaches such as gradient descent algorithms and discrete building layers for neural architectures have traditionally dominated. Recently, we have seen that by bridging these discrete systems with their continuous counterparts we can not only develop new insights but also construct novel and competitive ML approaches. By leveraging time, we can tap into centuries of research in areas such as dynamical systems, numerical integration and differential equations, and continue enhancing what is possible in ML.

The workshop aims to disseminate knowledge about the use of continuous time methods in ML; to create a discussion forum and a vibrant community around the topic; to provide a preview of what dynamical system methods might further bring to ML; to find the biggest hurdles in using continuous time systems in ML and steps to alleviate them; and to showcase how continuous time methods can enable ML to have large impact in certain application domains, such as climate prediction and physical sciences.

Recent work has shown that continuous time approaches can be useful in ML, but their applicability can be extended by increasing the visibility of these methods, fostering collaboration and an interdisciplinary approach to ensure their long-lasting impact. We thus encourage submissions with a varied set of topics: the intersection of machine learning and continuous-time methods; the incorporation of knowledge of continuous systems to analyse and improve on discrete approaches; the exploration of approaches from dynamical systems and related fields to machine learning; the software tools from the numerical analysis community.

We have a diverse set of confirmed speakers and panellists with expertise in architectures, optimisation, RL, generative models, numerical analysis, gradient flows and climate. We hope this will foster an interdisciplinary and collaborative environment conducive to the development of new research ideas.
Schedule
Sat 6:00 a.m. - 6:40 a.m.

Deep neural network approximations for PDEs
(Invited Talk)
Most of the numerical approximation methods for PDEs in the scientific literature suffer from the so-called curse of dimensionality (CoD) in the sense that the number of computational operations and/or the number of parameters employed in the corresponding approximation scheme grows exponentially in the PDE dimension and/or the reciprocal of the desired approximation precision. Recently, certain deep learning-based approximation methods for PDEs have been proposed and various numerical simulations for such methods suggest that deep neural network (DNN) approximations might have the capacity to indeed overcome the CoD in the sense that the number of real parameters used to describe the approximating DNNs grows at most polynomially in both the PDE dimension and the reciprocal of the prescribed approximation accuracy. In this talk, we show that solutions of suitable Kolmogorov PDEs can be approximated by DNNs without the CoD.
Diyora Salimova
Sat 6:40 a.m. - 7:20 a.m.

Reinforcement learning in continuous time and space
(Invited talk)
In this talk, we will introduce a continuous-time reinforcement learning (CTRL) framework. Our talk starts with a categorization of RL problems and naturally motivates a continuous-time perspective to RL. We then introduce a model-based CTRL approach, which solves physical control tasks using neural ordinary differential equations as a subroutine. We conclude by briefly introducing recent approaches to CTRL.
Cagatay Yildiz
Sat 7:20 a.m. - 7:40 a.m.

Break
(Break)
Sat 7:40 a.m. - 8:20 a.m.

Generative Modeling with Stochastic Differential Equations
(Invited talk)
Generative models are typically based on explicit representations of probability distributions (e.g., autoregressive models or VAEs) or implicit sampling procedures (e.g., GANs). We propose an alternative approach based on directly modeling the vector field of gradients of the data distribution (scores). Our framework allows flexible architectures and requires neither sampling during training nor adversarial training methods. Additionally, score-based generative models enable exact likelihood evaluation through connections with continuous-time normalizing flows and stochastic differential equations. We produce samples comparable to GANs, achieving new state-of-the-art inception scores, and excellent likelihoods on image datasets.
Stefano Ermon
Sat 8:20 a.m. - 8:30 a.m.

Continuous-time event-based GRU for activity-sparse inference and learning
(Contributed talk)
The scalability of recurrent neural networks (RNNs) is hindered by the sequential dependence of each time step’s computation on the previous time step’s output. Therefore, one way to speed up and scale RNNs is to reduce the computation required at each time step independent of model size and task. In this paper, we propose a time-continuous event-based model (EGRU) that extends Gated Recurrent Units (GRU) with an event-generation mechanism. This mechanism enforces activity-sparsity in time, and allows our model’s units to compute updates only on receipt of input events from other units. The combination of activity-sparsity and event-based computation has the potential to be computationally vastly more efficient than current RNNs. Notably, activity-sparsity in our model also translates into sparse parameter updates during gradient descent, extending this compute efficiency to the training phase. This sets the stage for the next generation of recurrent networks that are more scalable and efficient.
Mark Schoene · Anand Subramoney · David Kappel · Khaleelulla Khan Nazeer · Christian Mayr
Sat 8:30 a.m. - 8:40 a.m.

Irregularly-Sampled Time Series Modeling with Spline Networks
(Contributed talk)
Observations made in continuous time are often irregular and contain missing values across different channels. One approach to handling missing data is to impute it using splines, by fitting piecewise polynomials to the observed values. We propose using the splines as an input to a neural network, in particular, applying the transformations on the interpolating function directly, instead of sampling the points on a grid. To do that, we design layers that can operate on splines and which are analogous to their discrete counterparts. This allows us to represent the irregular sequence compactly and use this representation in downstream tasks such as classification and forecasting. Our model offers competitive performance compared to existing methods in terms of both accuracy and computational efficiency.
Marin Biloš · Emanuel Ramneantu · Stephan Günnemann
Sat 8:40 a.m. - 8:50 a.m.

Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent
(Contributed talk)
As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a characterization of this phenomenon under a notion termed commuting parametrization, which encompasses all the previous results in this setting. It is shown that gradient flow with any commuting parametrization is equivalent to continuous mirror descent with a related mirror map. Conversely, continuous mirror descent with any mirror map can be viewed as gradient flow with a related commuting parametrization. The latter result relies upon Nash's embedding theorem.
Zhiyuan Li · Tianhao Wang · Jason Lee · Sanjeev Arora
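As a small worked illustration of the kind of equivalence described in this abstract, consider the one-dimensional reparametrization $w = u^2$ with $u > 0$ and a differentiable loss $L(w)$. This is a textbook-style example chosen here for illustration, not one taken from the paper; the symbols $u$, $w$, $L$ and $\phi$ are assumptions of this sketch.

```latex
\begin{align*}
\text{Gradient flow on } u:\quad
  \dot u &= -\frac{d}{du} L(u^2) = -2u\, L'(w), \\
\text{induced dynamics on } w:\quad
  \dot w &= 2u\,\dot u = -4u^2 L'(w) = -4w\, L'(w). \\[4pt]
\text{Mirror flow with } \phi(w) &= \tfrac14\,(w \log w - w),\qquad
  \phi'(w) = \tfrac14 \log w,\qquad \phi''(w) = \frac{1}{4w}, \\
\frac{d}{dt}\,\phi'(w) &= \phi''(w)\,\dot w = -L'(w)
  \;\Longrightarrow\; \dot w = -4w\, L'(w).
\end{align*}
```

Both routes give the same ODE for $w$, so in this toy case gradient flow on the reparametrized variable coincides with a continuous mirror descent on $w$ with an entropy-like mirror map.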
Sat 8:50 a.m. - 9:00 a.m.

Heat Diffusion Based Recurrent Neural Differential Equations
(Contributed talk)
Recurrent neural networks (RNNs) are the primary choice for modelling sequential data; however, they are less suitable for modelling irregular time-series data. Continuous-time variants of RNNs using neural ordinary differential equations (NODE) were shown to perform well on irregular time-series data. They learn a better representation of the data using the continuous transformation of hidden states over time, taking into account the time interval between the observations. However, they are still limited in their capability as they use a discrete number of layers (depth) over an input in the sequence to produce the output observation. We intend to address this limitation by proposing an RNN model designed on the principle of the heat equation. Our heat diffusion based recurrent neural differential equations (HDRNDE) model generalizes RNN models by continuously evolving the hidden states in the temporal and depth dimensions. The HDRNDE model is based on partial differential equations and treats the computation of hidden states as solving a heat equation over time. We demonstrate the effectiveness of the proposed model by comparing against state-of-the-art RNN models on real-world sequence modeling data sets.
srinivas anumasa · geetakrishnasai gunapati · Srijith Prabhakaran nair kusumam
Sat 9:00 a.m. - 10:30 a.m.

Lunch break
(Break)
Sat 10:30 a.m. - 11:10 a.m.

ResNet after all? How (not) to design continuous neural network architectures
(Invited talk)
Can Neural ODE architectures provide a continuous-time extension of residual neural networks? I will show that this depends on the specific numerical solver chosen for training Neural ODE models. If the trained model is supposed to be a flow generated from an ODE, it should be possible to choose another numerical solver with equal or smaller numerical error without loss of performance. But if training relies on a solver with overly coarse discretization, then testing with another solver of equal or smaller numerical error results in a sharp drop in accuracy. In such cases, the combination of vector field and numerical method cannot be interpreted as a flow generated from an ODE, which arguably poses a fatal breakdown of the continuous-in-time concept. I will examine the specific effects which lead to this breakdown and discuss how to ensure that the model maintains continuous-time properties.
Katharina Ott
Sat 11:10 a.m. - 11:50 a.m.

Continuous vs. Discrete Optimization of Deep Neural Networks
(Invited talk)
Existing analyses of optimization in deep learning are either continuous, focusing on variants of gradient flow (GF), or discrete, directly treating variants of gradient descent (GD). GF is amenable to theoretical analysis, but is stylized and disregards computational efficiency. The extent to which it represents GD is an open question in deep learning theory. My talk will present a recent study of this question. Viewing GD as an approximate numerical solution to the initial value problem of GF, I will show that the degree of approximation depends on the curvature around the GF trajectory, and that over deep neural networks (NNs) with homogeneous activations, GF trajectories enjoy favorable curvature, suggesting they are well approximated by GD. I will then use this finding to translate an analysis of GF over deep linear NNs into a guarantee that GD efficiently converges to a global minimum almost surely under random initialization. Finally, I will present experiments suggesting that over simple deep NNs, GD with conventional step size is indeed close to GF. An underlying theme of the talk will be the possibility of GF (or modifications thereof) to unravel mysteries behind deep learning.
Nadav Cohen
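The view of GD as a numerical solution of the GF initial value problem can be made concrete on a toy problem. The sketch below is an illustration only, not the analysis from the talk: it assumes a hypothetical diagonal quadratic objective (the array `curvatures`) for which the gradient flow has a closed form, and compares it with GD run for the same total time at several step sizes. All names are chosen for this example.

```python
import numpy as np

def gradient_descent(grad, theta0, step_size, num_steps):
    """Explicit Euler discretisation of gradient flow d(theta)/dt = -grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - step_size * grad(theta)
    return theta

# Hypothetical toy objective: f(theta) = 0.5 * sum(curvatures * theta**2).
curvatures = np.array([1.0, 10.0])

def grad(theta):
    return curvatures * theta

def gradient_flow(theta0, t):
    """Closed-form gradient flow solution for the diagonal quadratic above."""
    return np.asarray(theta0, dtype=float) * np.exp(-curvatures * t)

theta0 = np.array([1.0, 1.0])
total_time = 1.0
errors = {}
for step_size in (0.1, 0.01, 0.001):
    num_steps = int(round(total_time / step_size))
    gd_point = gradient_descent(grad, theta0, step_size, num_steps)
    gf_point = gradient_flow(theta0, total_time)
    # Largest coordinate-wise gap between GD and GF at the same time.
    errors[step_size] = float(np.max(np.abs(gd_point - gf_point)))
    print(f"step size {step_size}: max |GD - GF| = {errors[step_size]:.2e}")
```

On this example the gap between GD and GF shrinks as the step size decreases, which is the expected behaviour of a first-order (Euler) discretisation.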
Sat 11:50 a.m. - 12:00 p.m.

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
(Contributed talk)
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
Sadhika Malladi · Kaifeng Lyu · Abhishek Panigrahi · Sanjeev Arora
Sat 12:00 p.m. - 12:30 p.m.

Tea Break
(Break)
Sat 12:30 p.m. - 1:30 p.m.

Panel
(Discussion Panel)
A great panel discussion on continuous time methods in ML.
Panel moderator: Michael N. Arbel, Research Fellow at the THOTH team of INRIA Grenoble.
Panelists:
Tatjana Chavdarova, Postdoctoral Fellow, UC Berkeley
Ricky Chen, Research Scientist, Meta
Priya Donti, PhD student, CMU
Adil Salim, Research Scientist, Microsoft Research
Sat 1:30 p.m. - 3:00 p.m.

Social and Poster session
(Social and poster)


Markovian Gaussian Process Autoencoders
(Spotlight)
Deep generative models are widely used for modelling high-dimensional time series, such as video animations, audio and climate data. Sequential variational autoencoders have been successfully considered for many applications, with many variant models relying on discrete-time methods and recurrent neural networks (RNNs). On the other hand, continuous-time methods have recently gained traction, especially in the context of irregularly-sampled time series, where they can better handle the data than discrete-time methods. One such class is Gaussian process variational autoencoders (GPVAEs), where the VAE prior is set as a Gaussian process (GP), allowing inductive biases to be explicitly encoded via the kernel function and improving interpretability of the latent space. However, a major limitation of GPVAEs is that they inherit the same cubic computational cost as GPs. In this work, we leverage the equivalent discrete state space representation of Markovian GPs to enable a linear-time GP solver via Kalman filtering and smoothing. We show via corrupt and missing frames tasks that our method performs favourably, especially on the latter where it outperforms RNN-based models.
Harrison Zhu · Carles Balsells Rodas · Yingzhen Li
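To make the "state space representation of Markovian GPs" concrete, here is a minimal sketch for the simplest Markovian case, a scalar GP with the exponential (Ornstein-Uhlenbeck, Matérn-1/2) kernel. It is an illustration under stated assumptions, not the authors' model: the kernel choice, the function names and the toy data are all picked for this example, and only filtering (not smoothing, and no VAE) is shown. The Kalman filter costs linear time in the number of observations, and its final filtered mean should agree with the cubic-cost dense GP posterior mean at the last time point.

```python
import numpy as np

def ou_kalman_filter(times, observations, lengthscale, variance, noise_variance):
    """Linear-time filtering for a GP with kernel variance * exp(-|t - t'| / lengthscale).

    `times` must be sorted in increasing order.
    """
    mean, cov = 0.0, variance            # stationary prior at the first time point
    filtered_means = []
    previous_time = None
    for t, y in zip(times, observations):
        if previous_time is not None:
            # Exact discretisation of the OU process over the (irregular) gap.
            a = np.exp(-(t - previous_time) / lengthscale)
            mean = a * mean
            cov = a * a * cov + variance * (1.0 - a * a)
        # Kalman update with a noisy observation y = x + noise.
        gain = cov / (cov + noise_variance)
        mean = mean + gain * (y - mean)
        cov = (1.0 - gain) * cov
        filtered_means.append(mean)
        previous_time = t
    return np.array(filtered_means)

def dense_gp_mean_at_last_point(times, observations, lengthscale, variance, noise_variance):
    """Reference cubic-cost GP regression, evaluated at the last observation time."""
    times = np.asarray(times, dtype=float)
    gram = variance * np.exp(-np.abs(times[:, None] - times[None, :]) / lengthscale)
    weights = np.linalg.solve(gram + noise_variance * np.eye(len(times)), observations)
    return float(gram[-1] @ weights)

# Hypothetical irregularly-sampled toy data.
times = np.array([0.0, 0.3, 0.35, 1.2, 2.0, 2.1])
observations = np.array([0.5, 0.7, 0.6, -0.2, -0.9, -1.0])
hyper = dict(lengthscale=0.8, variance=1.5, noise_variance=0.1)

filtered = ou_kalman_filter(times, observations, **hyper)
reference = dense_gp_mean_at_last_point(times, observations, **hyper)
print(filtered[-1], reference)
```

The two printed numbers should match up to floating-point error, since the filtering distribution at the final time conditions on all the data.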


Contrasting Discrete and Continuous Time Methods for Bayesian System Identification
(Spotlight)
In recent years, there has been considerable interest in embedding continuous time methods in machine learning algorithms. In system identification, the task is to learn a dynamical model from incomplete observation data, and when prior knowledge is in continuous time (for example, mechanistic differential equation models) it seems natural to use continuous time models for learning. Yet when learning flexible, nonlinear, probabilistic dynamics models, most previous work has focused on discrete time models to avoid computational, numerical, and mathematical difficulties. In this work we show, with the aid of small-scale examples, that this mismatch between model and data generating process can be consequential under certain circumstances, and we discuss possible modifications to discrete time models which may better suit them to handling data generated by continuous time processes.
Talay Cheema · Carl E Rasmussen


A Multistep Frank-Wolfe Method
(Spotlight)
The Frank-Wolfe algorithm has regained much interest for its use in structurally constrained machine learning applications. However, one major limitation of the Frank-Wolfe algorithm is its slow local convergence due to zigzagging behavior. We observe the zigzagging phenomenon in the Frank-Wolfe method as an artifact of discretization, and propose multistep Frank-Wolfe variants where the truncation errors decay as $O(\Delta^p)$, where $p$ is the method's order. This strategy "stabilizes" the method, and allows tools like line search and momentum to have more benefit. However, our results suggest that the worst case convergence rate of Runge-Kutta-type discretization schemes cannot improve upon that of the vanilla Frank-Wolfe method for a rate depending on $k$. Still, we believe that this analysis adds to the growing knowledge of flow analysis for optimization methods, and is a cautionary tale on the ultimate usefulness of multistep methods.
zhaoyue chen · Yifan Sun
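For readers unfamiliar with the baseline, the following is a minimal sketch of the vanilla Frank-Wolfe method that the abstract refers to, not the multistep variants proposed in the paper. It assumes a hypothetical toy problem (projecting a point `target` onto the probability simplex) and the standard step size $2/(k+2)$; all names are chosen for this example.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters):
    """Vanilla Frank-Wolfe over the probability simplex with step size 2 / (k + 2)."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad(x)
        # Linear minimisation oracle: <g, s> over the simplex is minimised at a vertex.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        gamma = 2.0 / (k + 2.0)
        # Convex combination keeps the iterate inside the simplex (projection-free).
        x = x + gamma * (s - x)
    return x

# Hypothetical toy objective: squared distance to a point inside the simplex.
target = np.array([0.2, 0.5, 0.3])

def objective(x):
    return 0.5 * np.sum((x - target) ** 2)

def grad(x):
    return x - target

x0 = np.array([1.0, 0.0, 0.0])
x_final = frank_wolfe_simplex(grad, x0, num_iters=2000)
print(x_final, objective(x_final))
```

Because each update moves towards a single vertex, the iterates approach an interior solution by alternating between vertices, which is the zigzagging behaviour the abstract attributes to discretization.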


Everyone Matters: Customizing the Dynamics of Decision Boundary for Adversarial Robustness
(Spotlight)
The adversarial robustness of a deep classifier can be measured by the robust radii: the decision boundary's distances to natural data points. However, it is unclear whether current adversarial training (AT) methods effectively improve the robust radius for each individual vulnerable point. To understand this, we propose a continuous-time framework that studies the relative speed of the decision boundary with respect to each individual point. Through visualizing the speed, a surprising conflicting moving behavior is revealed: the decision boundary under AT moves away from some vulnerable points but simultaneously moves closer to other vulnerable ones. To alleviate these conflicting dynamics of the decision boundary, we propose Dynamical Customized Adversarial Training (DynaCAT) which directly controls the decision boundary to move away from the training data points. Moreover, in order to further encourage the robustness improvement for more vulnerable points, DynaCAT controls the decision boundary to move faster away from points with smaller robust radii, achieving customized manipulation of the decision boundary. As a result, DynaCAT achieves fairer robustness to individuals, leading to better overall robustness under limited model capacity. Experiments verify that DynaCAT alleviates the conflicting dynamics and obtains improved robustness compared with the state-of-the-art defenses.
Yuancheng Xu · Yanchao Sun · Furong Huang


Accelerated Methods for Distributed Optimization Problems using Fixed-time Stability of Continuous-time Dynamical Systems
(Spotlight)
In this workshop paper, we present recent developments in accelerated methods for solving constrained optimization problems using the notion of Fixed-time Stability (FxTS), utilizing the paradigm of continuous-time dynamical systems. The notion of FxTS was first introduced in the field of control theory for studying fast convergence of trajectories of dynamical systems to their equilibrium point. We discuss how this concept can be used for optimization problems to solve them faster than the SOTA algorithms in the distributed setting.
Kunal Garg · Mayank Baranwal


Faster Training of Neural ODEs Using Gauß–Legendre Quadrature
(Spotlight)
Neural ODEs demonstrate strong performance in generative and time-series modelling. However, training them via the adjoint method is slow compared to discrete models due to the requirement of numerically solving ODEs. To speed neural ODEs up, a common approach is to regularise the solutions. However, this approach may affect the expressivity of the model; when the trajectory itself matters, this is particularly important. In this paper, we propose an alternative way to speed up the training of neural ODEs. The key idea is to speed up the adjoint method by using Gauß–Legendre quadrature to solve integrals faster than ODE-based methods while remaining memory efficient. Our approach leads to faster training of neural ODEs, especially for large models.
Alexander Norcliffe · Marc Deisenroth
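As background for the quadrature rule named in this abstract, the sketch below shows generic Gauß–Legendre integration of a scalar function on an interval using NumPy's `numpy.polynomial.legendre.leggauss`. It is not the authors' training procedure; the helper name `gauss_legendre_integrate` and the test integrands are assumptions made for this example.

```python
import numpy as np

def gauss_legendre_integrate(f, a, b, num_nodes):
    """Approximate the integral of f over [a, b] with a Gauss-Legendre rule."""
    # Nodes and weights for the reference interval [-1, 1].
    nodes, weights = np.polynomial.legendre.leggauss(num_nodes)
    # Affine change of variables from [-1, 1] to [a, b].
    t = 0.5 * (b - a) * nodes + 0.5 * (b + a)
    return 0.5 * (b - a) * np.sum(weights * f(t))

# An n-node rule integrates polynomials up to degree 2n - 1 exactly,
# so two nodes suffice for a cubic.
cubic_integral = gauss_legendre_integrate(lambda t: t ** 3, 0.0, 2.0, num_nodes=2)

# A smooth non-polynomial integrand needs only a handful of nodes.
exp_integral = gauss_legendre_integrate(np.exp, 0.0, 1.0, num_nodes=5)

print(cubic_integral, exp_integral)
```

The appeal of such rules for smooth integrands is that a few function evaluations at fixed nodes can replace many small steps of an ODE solver used to accumulate the same integral.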


Nonconvex online learning via algorithmic equivalence
(Spotlight)
We study an algorithmic equivalence technique between nonconvex gradient descent and convex mirror descent. We start by looking at a harder problem of regret minimization in online nonconvex optimization. We show that under certain geometric and smoothness conditions, online gradient descent applied to nonconvex functions is an approximation of online mirror descent applied to convex functions under reparameterization. In continuous time, the gradient flow with this reparameterization was shown to be exactly equivalent to continuous-time mirror descent by Amid and Warmuth, but theory for the analogous discrete time algorithms is left as an open problem. We prove an $O(T^{\frac{2}{3}})$ regret bound for nonconvex online gradient descent in this setting, answering this open problem. Our analysis is based on a new and simple algorithmic equivalence method.
Udaya Ghai · Zhou Lu · Elad Hazan


Gradient Flows for L2 Support Vector Machine Training
(Spotlight)
We explore the merits of training support vector machines for binary classification by means of solving systems of ordinary differential equations. We thus assume a continuous time perspective on a machine learning problem which may be of interest for implementations on (re)emerging hardware platforms such as analog or quantum computers.
Christian Bauckhage · Rafet Sifa · Helen Schneider · Benjamin Wulff


Recovering Stochastic Dynamics via Gaussian Schrödinger Bridges
(Spotlight)
We propose a new framework to reconstruct a stochastic process $\left\{\mathbb{P}_{t}: t \in[0, T]\right\}$ using only samples from its marginal distributions, observed at start and end times $0$ and $T$. This reconstruction is useful to infer population dynamics, a crucial challenge, e.g., when modeling the time evolution of cell populations from single-cell sequencing data. Our general framework encompasses the more specific Schrödinger bridge (SB) problem, where $\mathbb{P}_{t}$ represents the evolution of a thermodynamic system at almost equilibrium. Estimating such bridges from scratch is notoriously difficult, motivating our proposal for a novel adaptive scheme called the GSBflow. Our approach is to first perform a Gaussian approximation of the general SB via matching the moments of the data, which proves to significantly stabilize the training of SB. To that end, we solve the SB problem with Gaussian marginals, for which we provide, as a central contribution, a closed-form solution and SDE representation. We use these formulas to define the reference process used to estimate more complex SBs, and obtain notable numerical improvements when reconstructing both synthetic processes and single-cell genomics.
Ya-Ping Hsieh · Charlotte Bunne · Marco Cuturi · Andreas Krause


Modeling Solutions to Ordinary and Partial Differential Equations with Continuous Initial Value Networks
(Spotlight)
Differential equations play an important role in many different domains as they are used to describe the change in various real-world systems. Previous works combined neural networks with differential equations to specify the dynamics or learn the solution. In this paper, we propose a general framework for modeling the solutions to ordinary and partial differential equations which relies on satisfying certain requirements so that the learned model always corresponds to the solution of the target equation. In particular, we propose novel flow models based on an efficient matrix exponential transformation to model ODE solutions. We extend this to stochastic differential equations and discuss suitable training strategies. Finally, we design models that are solutions to PDEs while respecting the initial and boundary conditions. Our models can be used in physics-informed learning, as well as to learn mappings between function spaces by defining a neural operator. Throughout the experiments, we demonstrate the benefits of using our method both in terms of predictive and computational performance.
Marin Biloš · Andrei Smirdin · Stephan Günnemann


Epsilon-Greedy Reinforcement Learning Policy in Continuous-Time Systems
(Spotlight)
This work studies theoretical performance guarantees of a ubiquitous reinforcement learning policy for a canonical continuous-time model. We show that epsilon-Greedy addresses the exploration-exploitation dilemma for minimizing quadratic costs in linear dynamical systems that evolve according to stochastic differential equations. More precisely, we establish square-root-of-time regret bounds, indicating that epsilon-Greedy learns optimal control actions fast from a single state trajectory. Further, linear scaling of the regret with the number of parameters is shown. The presented analysis introduces novel and useful technical approaches, and sheds light on fundamental challenges of continuous-time reinforcement learning.
Mohamad Kazem Shirani Faradonbeh


Temporal Graph Neural Networks with Time-Continuous Latent States
(Spotlight)
We propose a temporal graph neural network model for graph-structured irregular time series. The model is designed to handle both irregular time steps and partial graph observations. This is achieved by introducing a time-continuous latent state in each node of the graph. The latent dynamics are defined using a state-dependent decay mechanism. Observations in the graph neighborhood are taken into account by integrating graph neural network layers in both the state update and predictive model. Experiments on a traffic forecasting task validate the usefulness of both the graph structure and time-continuous dynamics in this setting.
Joel Oskarsson · Per Sidén · Fredrik Lindsten


Continuous Methods: Adaptively intrusive reduced order model closure
(Spotlight)
Reduced order modeling methods are often used as a means to reduce simulation costs in industrial applications. Despite their computational advantages, reduced order models (ROMs) often fail to accurately reproduce complex dynamics encountered in real-life applications. To address this challenge, we leverage Neural ODEs to propose a novel ROM correction approach based on a time-continuous memory formulation. Finally, experimental results show that our proposed method provides a high level of accuracy while retaining the low computational costs inherent to reduced models.
Emmanuel Menier · Michele Alessandro Bucci · Mouadh Yagoubi · Lionel Mathelin · Raphael Meunier · Thibault Dairay · Marc Schoenauer


Continuous Methods: Hamiltonian Domain Translation
(Spotlight)
This paper proposes a novel approach to domain translation. Leveraging established parallels between generative models and dynamical systems, we propose a reformulation of the CycleGAN architecture. By embedding our model with a Hamiltonian structure, we obtain a continuous, expressive and, most importantly, invertible generative model for domain translation.
Emmanuel Menier · Michele Alessandro Bucci · Mouadh Yagoubi · Lionel Mathelin · Marc Schoenauer


When Neural ODE Meets Adaptive Moment Estimation: Boosting Efficiency, Stability and Accuracy of Neural ODEs Together
(Spotlight)
Recent work by Xia et al. leveraged the continuous limit of the classical momentum accelerated gradient descent and proposed heavy-ball neural ODEs. While this model offers computational efficiency and high utility over vanilla neural ODEs, this approach often causes the overshooting of internal dynamics, leading to unstable training of a model. Prior work addresses this issue by using ad hoc approaches, e.g., bounding the internal dynamics using specific activation functions, but the resulting models do not satisfy the exact heavy-ball ODE. In this work, we propose adaptive momentum estimation neural ODEs (AdamNODEs) that adaptively control the acceleration of the classical momentum-based approach. We find that its adjoint states also satisfy AdamODE and do not require the ad hoc solutions that the prior work employs. In evaluation, we show that AdamNODEs achieve the lowest training loss and improved efficacy over existing neural ODEs. We also show that AdamNODEs have better training stability than classical momentum-based neural ODEs. This result sheds some light on adapting the techniques proposed in the optimization community to further improve the training and inference of neural ODEs.
Seunghyeon Cho · Sanghyun Hong · Kookjin Lee · Noseong Park


Two-Timescale Stochastic Approximation for Bilevel Optimisation Problems in Continuous-Time Models
(Spotlight)
We analyse the asymptotic properties of a continuous-time, two-timescale stochastic approximation algorithm designed for stochastic bilevel optimisation problems in continuous-time models. We obtain the weak convergence rate of this algorithm in the form of a central limit theorem. We also demonstrate how this algorithm can be applied to several continuous-time bilevel optimisation problems.
Louis Sharrock


A New Look on Diffusion Times for Score-based Generative Models
(Spotlight)
Score-based diffusion models map noise into data using stochastic differential equations. While current practice advocates for a large $T$ to ensure closeness to steady state, a smaller value of $T$ should be preferred for a better approximation of the score-matching objective and computational efficiency. We conjecture, contrary to current belief and corroborated by numerical evidence, that the optimal diffusion times are smaller than current practice.
Giulio Franzese · Simone Rossi · Lixuan YANG · alessandro finamore · Dario Rossi · Maurizio Filippone · Pietro Michiardi


Towards a General Purpose CNN for Long Range Dependencies in $N$D
(Spotlight)
The use of Convolutional Neural Networks (CNNs) is widespread in Deep Learning due to a range of desirable model properties which result in an efficient and effective machine learning framework. However, performant CNN architectures must be tailored to specific tasks in order to incorporate considerations such as the input length, resolution, and dimensionality. In this work, we overcome the need for problem-specific CNN architectures with our Continuous Convolutional Neural Network (CCNN): a single CNN architecture equipped with continuous convolutional kernels that can be used for tasks on data of arbitrary resolution, dimensionality and length without structural changes. Continuous convolutional kernels model long range dependencies at every layer, and remove the need for downsampling layers and task-dependent depths needed in current CNN architectures. We show the generality of our approach by applying the same CCNN to a wide set of tasks on sequential ($1$D) and visual data ($2$D). Our CCNN performs competitively and often outperforms the current state-of-the-art across all tasks considered.
David Romero · David Knigge · Albert Gu · Erik Bekkers · Efstratios Gavves · Jakub Tomczak · Mark Hoogendoorn


Learning to Discretize for Continuous-time Sequence Compression
(Spotlight)
Neural compression offers a domain-agnostic approach to creating codecs for lossy or lossless compression via deep generative models. For sequence compression, however, most deep sequence models have costs that scale with the sequence length rather than the sequence complexity. In this work, we instead treat data sequences as observations from an underlying continuous-time process and learn how to efficiently discretize while retaining information about the full sequence. As a consequence of decoupling sequential information from its temporal discretization, our approach allows for greater compression rates and smaller computational complexity. Moreover, the continuous-time approach naturally allows us to decode at different time intervals and is amenable to randomly missing data, an important property for streaming applications. We empirically verify our approach on multiple domains involving compression of video and motion capture sequences, showing that our approaches can automatically achieve significant reductions in bit rates.
Ricky T. Q. Chen · Maximilian Nickel · Matthew Le · Matthew Muckley · Karen Ullrich


The Gap Between Continuous and Discrete Gradient Descent
(
Spotlight
)
While it is possible to obtain valuable insights by analyzing gradient descent (GD) in its continuous form, we argue that a complete understanding of the mechanics leading to GD's success may indeed require considering the effects of using a large step size in the discrete regime. To support this claim, we demonstrate the difference in trajectories for small and large learning rates when GD is applied to a neural network, observing the effects of an escape from a local minimum with a large step size. Furthermore, it has been widely observed in neural network training that when applying stochastic gradient descent (SGD), a large step size is essential for obtaining superior models. In this work, through a novel set of experiments, we show that even though stochastic noise is beneficial, it is not enough to explain the success of SGD, and a large learning rate is essential for obtaining the best performance even in stochastic settings. Finally, we prove on a certain class of functions that GD with a large step size follows a different trajectory than GD with a small step size, which can facilitate convergence to the global minimum.
Amirkeivan Mohtashami · Martin Jaggi · Sebastian Stich 🔗 
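The gap this abstract refers to can be seen on a toy problem. The sketch below is not the authors' experiment: it assumes a one-dimensional quadratic $f(x) = \frac{1}{2} a x^2$, for which the gradient flow has a closed-form solution, and the function names (`gd_iterates`, `gradient_flow`, `gap`) are chosen here for illustration only.

```python
import math

def gd_iterates(x0, a, eta, steps):
    """Discrete GD on f(x) = 0.5 * a * x**2, i.e. x_{k+1} = x_k - eta * a * x_k."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * a * xs[-1])
    return xs

def gradient_flow(x0, a, t):
    """Exact solution at time t of the gradient flow dx/dt = -a * x."""
    return x0 * math.exp(-a * t)

def gap(x0, a, eta, steps):
    """Distance between the last GD iterate and the flow at matching time eta * steps."""
    return abs(gd_iterates(x0, a, eta, steps)[-1] - gradient_flow(x0, a, eta * steps))

if __name__ == "__main__":
    # Small step size: GD stays close to the continuous trajectory.
    print("eta=0.01:", gap(1.0, 1.0, 0.01, 100))
    # Step size above 2 / a: GD oscillates and diverges while the flow still decays.
    print("eta=2.5: ", gap(1.0, 1.0, 2.5, 10))
```

On this quadratic the discrete iterate is $(1 - \eta a)^k x_0$, so the two trajectories agree only when $\eta a$ is small. The toy problem is convex and has no local minima, so it shows the discrepancy itself rather than the escape behaviour the abstract describes.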


Principle of Least Action Approach to Accelerate Neural Ordinary Differential Equations
(
Spotlight
)
Neural ordinary differential equations (NODE) generalize discrete ResNet models by continuously transforming the hidden representations. NODE treats the computation of hidden states as computing the trajectory of an ordinary differential equation (ODE) parameterized by a neural network, which is expensive in terms of the number of function evaluations. In this work, we propose a regularisation technique to decrease the number of function evaluations, built on the framework of the principle of least action (PLA). In dynamics, the path chosen by an object to move from one point to another is such that the action is minimum. Action is defined as the integral of the Lagrangian along the path. In our proposed approach, the trajectory computed by the NODE is controlled by a regularizer that is analogous to minimizing the action. We experimentally show that our proposed regularizer indeed requires fewer function evaluations.
srinivas anumasa · Srijith Prabhakaran nair kusumam 🔗 
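For reference, the action mentioned in the abstract above has a standard form in classical mechanics. The symbols $q$, $L$, $t_0$ and $t_1$ are notation chosen for this note, not taken from the paper, and the abstract does not state which Lagrangian the proposed regularizer uses.

$$
S[q] = \int_{t_0}^{t_1} L\big(q(t), \dot{q}(t), t\big)\,dt, \qquad \delta S = 0 \ \text{along the physical path.}
$$

Strictly, the principle requires the action to be stationary; a minimum is the common special case. As one illustrative example, a free particle with $L = \frac{1}{2}\lVert \dot{q} \rVert^2$ has straight, constant-speed paths as its stationary trajectories.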


Estimating Treatment Effects in Continuous Time with Hidden Confounders
(
Spotlight
)
Estimating individual treatment effects (ITEs) plays a crucial role in many real-world applications involving policy analysis and decision making. Nevertheless, estimating treatment effects in the longitudinal setting in the presence of hidden confounders remains an extremely challenging problem. Recently, there has been a growing body of work attempting to obtain unbiased ITE estimates from time-dynamic observational data by ignoring the possible existence of hidden confounders. Additionally, many existing works handling hidden confounders are not applicable to continuous-time settings. In this paper, we extend the line of work focusing on deconfounding in the dynamic time setting in the presence of hidden confounders. We leverage recent advancements in neural differential equations to build a latent factor model using a stochastic controlled differential equation and a Lipschitz-constrained convolutional operation in order to continuously incorporate information about ongoing interventions and irregularly sampled observations. Experiments on both synthetic and real-world datasets highlight the promise of continuous-time methods for estimating treatment effects in the presence of hidden confounders.
Defu Cao · James Enouen · Yan Liu 🔗 


Continuous-time Analysis for Variational Inequalities: An Overview & Desiderata
(
Spotlight
)
The optimization of zero-sum games, multi-objective agent training, or in general, the optimization of variational inequality (VI) problems is currently notoriously unstable on general problems. Owing to the increased need for training such models in machine learning, the above observation has attracted significant research attention over the past years. Substantial progress has been made towards understanding the qualitative differences with single-objective minimization by casting the optimization method in its corresponding continuous-time dynamics, as well as obtaining convergence guarantees and rates for some instances of VIs, because such guarantees often guide the corresponding proof for the discrete counterpart. Most notably, continuous-time tools allowed for analyzing complex nonconvex problems, which in some cases cannot be carried out using standard discrete-time tools. This paper aims to provide an overview of these ideas specifically for the broad VI problem class, and of the insights originating from applying continuous-time tools to VI problems. We conclude by describing various desiderata of fundamental open questions towards developing optimization methods that work for general VIs and argue that tackling these requires understanding the associated continuous-time dynamics.
Tatjana Chavdarova · Ya-Ping Hsieh 🔗 


MQTransformer: Context Dependent Attention and Bregman Volatility
(
Spotlight
)
In many forecasting applications (e.g. retail demand, electricity load, weather, finance), the forecasts must obey certain properties such as having certain context-dependent and time-varying seasonality patterns and avoiding excessive revision as new information becomes available. Here we propose a new forecasting neural net architecture that addresses some of these issues, MQTransformer, by incorporating three architectural improvements to the current state-of-the-art: 1) a novel decoder-encoder attention that aligns the historical and future time periods, 2) a novel positional encoding that learns seasonality from the historical time series, and 3) a novel decoder-self attention that allows the network to minimize the forecast volatility. We then define a new measure of forecast volatility, Bregman Volatility, to understand one major source of the improvement from our model. Bregman Volatility allows us to compute the optimal volatility of a sequence of forecasts in terms of the improvement in forecast accuracy over that time period. We show both theoretically and empirically that the decoder-self attention module optimizes Bregman Volatility and thereby improves forecast accuracy as well.
Carson Eisenach · Dhruv Madeka · Kevin Chen · Lee Dicker 🔗 


Physics-Informed Neural Operator for Learning Partial Differential Equations
(
Spotlight
)
Machine learning methods have recently shown promise in solving partial differential equations (PDEs). They can be classified into two broad categories: approximating the solution function and learning the solution operator. The Physics-Informed Neural Network (PINN) is an example of the former while the Fourier neural operator (FNO) is an example of the latter. Both these approaches have shortcomings. The optimization in PINN is challenging and prone to failure, especially on multiscale dynamic systems. FNO does not suffer from this optimization issue since it carries out supervised learning on a given dataset, but obtaining such data may be too expensive or infeasible. In this work, we propose the physics-informed neural operator (PINO), where we combine the operator-learning and function-optimization frameworks. This integrated approach improves convergence rates and accuracy over both PINN and FNO models. In the operator-learning phase, PINO learns the solution operator over multiple instances of the parametric PDE family. In the test-time optimization phase, PINO optimizes the pretrained operator ansatz for the querying instance of the PDE. Experiments show PINO outperforms previous ML methods on many popular PDE families while retaining the extraordinary speedup of FNO compared to solvers. In particular, PINO accurately solves long temporal transient flows and Kolmogorov flows where other baseline methods fail to converge.
Zongyi Li · Hongkai Zheng · Nikola Kovachki · David Jin · Haoxuan Chen · Burigede Liu · Kamyar Azizzadenesheli · Animashree Anandkumar 🔗 


Riemannian Diffusion Schrödinger Bridge
(
Spotlight
)
Score-based generative models exhibit state-of-the-art performance on density estimation and generative modeling tasks. These models typically assume that the data geometry is flat, yet recent extensions have been developed to model data living on Riemannian manifolds. Existing methods to accelerate sampling of diffusion models are typically not applicable in the Riemannian setting, and Riemannian score-based methods have not yet been adapted to the important task of interpolation of datasets. To overcome these issues, we introduce Riemannian Diffusion Schrödinger Bridge (RDSB). Our proposed method generalizes the Diffusion Schrödinger Bridge introduced by De Bortoli et al. (2021) to the non-Euclidean setting and as such generalizes Riemannian score-based models beyond the first time reversal. We validate our proposed method on synthetic data and real Earth and climate data.
James Thornton · Valentin De Bortoli · Michael Hutchinson · Emile Mathieu · Yee Whye Teh · Arnaud Doucet 🔗 


Data Assimilation and Neural ODEs for learning latent dynamics
(
Spotlight
)
The development of data-informed predictive models for dynamical systems is of widespread interest in many disciplines. We present a unifying framework for blending mechanistic and machine-learning approaches to identify dynamical systems from noisily and partially observed time-series data. Our formulation is agnostic to the chosen machine learning model, is presented in both continuous- and discrete-time settings, and is compatible both with systems that exhibit substantial memory and systems that are memoryless. We conclude with a series of numerical results that a) illustrate trade-offs when learning dynamics in continuous and discrete time, and b) demonstrate the inference power of our methodology in a partially observed Lorenz '63 system.
Matthew Levine · Andrew Stuart 🔗 


Connections between Kernel Analog Forecasting and Gaussian Process Regression
(
Spotlight
)
In this short communication we expose connections between two data-driven machine learning methods, kernel analog forecasting (KAF) and Gaussian process regression (GPR). In particular, it is shown that there are three major points in which KAF differs from GPR: the use of a specific kernel, a normalization that guarantees that the spectrum lies in $(0, 1]$, and spectral truncation, which acts both as a computational speedup and as regularization.

Dmitry Burov 🔗 


Identification of Hidden Clusters of Time Series with Hybrid Neural Networks Integrating Expert Models
(
Spotlight
)
Deep learning-based approaches for time series analysis notoriously suffer from interpretability and robustness issues due to their black-box nature. In this work, we propose a hybrid neural network model with embedded expert knowledge. We assume the time series are generated by a finite set of dynamics with known functional form. Our experiments show that our approach is more interpretable, and better at reconstruction than its black-box counterparts.
András Formanek · Edward De Brouwer · Péter Antal · Yves Moreau · Adam Arany 🔗 


Should You Follow the Gradient Flow? Insights from Runge-Kutta Gradient Descent
(
Spotlight
)
Recently, it has become popular in the machine learning community to model gradient-based optimization algorithms as ordinary differential equations (ODEs). Moreover, state-of-the-art optimizers such as SGD and Momentum can be recovered from the corresponding ODE using first-order numerical integrators such as explicit and symplectic Euler methods. In contrast, very little theoretical and experimental investigation has been carried out on the properties of higher-order integrators in optimization. In this paper, we analyze the properties of high-order Runge-Kutta (RK) integrators on gradient flows, in the context of both convex optimization and deep learning. We show that, while RK provides a close approximation to the gradient flow, this induces an increase in sharpness (maximum Hessian eigenvalue) at the solution – a feature which is believed to be negatively correlated with generalization. In addition, we show that, while high-order RK descent methods are stable for a broad range of stepsizes, convergence speed (in terms of training loss) is usually negatively affected by the method order.
Xiang Li · Antonio Orvieto 🔗 
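To make the contrast between first-order and higher-order integrators in this abstract concrete, the sketch below applies one explicit Euler step (plain gradient descent) and one classical fourth-order Runge-Kutta step to the gradient flow $\dot{x} = -\nabla f(x)$. It is a minimal illustration under assumptions made here, not the authors' method: the objective is a scalar quadratic $f(x) = \frac{1}{2} a x^2$, and the names `grad`, `euler_step`, `rk4_step` and `flow_exact` are hypothetical.

```python
import math

def grad(x, a=1.0):
    """Gradient of the assumed toy objective f(x) = 0.5 * a * x**2."""
    return a * x

def euler_step(x, h, g=grad):
    """Explicit Euler on dx/dt = -g(x); this is ordinary gradient descent with step h."""
    return x - h * g(x)

def rk4_step(x, h, g=grad):
    """Classical fourth-order Runge-Kutta step on dx/dt = -g(x)."""
    k1 = -g(x)
    k2 = -g(x + 0.5 * h * k1)
    k3 = -g(x + 0.5 * h * k2)
    k4 = -g(x + h * k3)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def flow_exact(x, h, a=1.0):
    """Exact gradient-flow solution after time h for the quadratic objective."""
    return x * math.exp(-a * h)

if __name__ == "__main__":
    x0, h = 1.0, 0.5
    print("Euler:", euler_step(x0, h))
    print("RK4:  ", rk4_step(x0, h))
    print("flow: ", flow_exact(x0, h))
```

With `a = 1` and `h = 0.5`, the RK4 step lands within `1e-3` of the exact flow value `exp(-0.5)`, while the Euler step is off by more than `0.1`. Each RK4 step costs four gradient evaluations, which is one reason tracking the flow more closely does not automatically translate into faster optimization.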