Workshop
Sampling and Optimization in Discrete Space
Haoran Sun · Hanjun Dai · Priyank Jaini · Ruqi Zhang · Ellen Vitercik
Meeting Room 312
Recent years have seen new research trends in efficient sampling and optimization in discrete space. We are organizing this workshop with the goals of 1) syncing up on the latest research progress in discrete sampling and optimization, 2) discussing the limitations of current methods and brainstorming new algorithm paradigms, and 3) connecting to applications in domains such as language/protein modeling, physics simulation, and bio/chemical engineering---where improved techniques for sampling/optimization in discrete space could help---and exploring the gaps between these applications' needs and the capabilities of existing methods. We hope this workshop will be an excellent opportunity for presenting and discussing new algorithms and applications with researchers and practitioners within or outside the domain of discrete sampling/optimization.
Schedule
Sat 12:00 p.m. - 12:15 p.m. | Opening Remarks
Sat 12:15 p.m. - 12:45 p.m. | Yoshua Bengio: GFlowNets for Bayesian Inference (Invited Talk)
Generative flow networks (GFlowNets) are generative policies trained to sample proportionally to a given reward function. If the reward function is a prior distribution times a likelihood, then the GFlowNet learns to sample from the corresponding posterior. Unlike MCMC, a GFlowNet does not suffer from the problem of mixing between modes, but like RL methods, it needs an exploratory training policy in order to discover modes. This can be done conveniently without any kind of importance weighting, because the training objectives for GFlowNets can all be correctly applied in an off-policy fashion without reweighting. GFlowNets can also be viewed as extensions of amortized variational inference with this off-policy advantage. We show how training the GFlowNet sampler also learns how to marginalize over the target distribution, or part of it, at the same time as it learns to sample from it, which makes it possible to train amortized posterior predictives. Finally, we show examples of applying GFlowNets to Bayesian inference over causal graphs, discuss open problems, and explain how scaling up such methodologies opens the door to system-2 deep learning that discovers explanatory theories and forms Bayesian predictors, with the approximation error asymptotically going to zero as the size and training time of the neural network increase.
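For readers new to GFlowNets, the "sample proportionally to a reward" objective in the abstract is often made concrete through the trajectory balance loss, a standard training objective from the GFlowNet literature (stated here generically, not as the specific formulation used in the talk). For a complete trajectory $\tau = (s_0 \to \cdots \to s_n = x)$ with forward policy $P_F$, backward policy $P_B$, learned normalizer $Z_\theta$, and reward $R(x)$:

$$\mathcal{L}_{\mathrm{TB}}(\tau) = \left(\log \frac{Z_\theta \prod_{t=0}^{n-1} P_F(s_{t+1}\mid s_t;\theta)}{R(x)\,\prod_{t=0}^{n-1} P_B(s_t\mid s_{t+1};\theta)}\right)^{2}.$$

Because the loss is zero exactly when the forward and backward flows balance, it can be minimized on trajectories drawn from any exploratory policy, which is the off-policy property the abstract highlights.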
Sat 12:45 p.m. - 1:15 p.m. | Giacomo Zanella (Invited Talk)
Sat 1:15 p.m. - 1:45 p.m. | Contributed Talk 1
- SurCo: Learning Linear SURrogates for COmbinatorial Nonlinear Optimization Problems (Outstanding)
- Accelerating Diffusion-based Combinatorial Optimization Solvers by Progressive Distillation (Oral)
- Tackling Provably Hard Representative Selection via Graph Neural Networks (Oral)
Sat 1:45 p.m. - 2:15 p.m. | Stefanie Jegelka: Learning discrete optimization: Loss functions and graph neural networks (Invited Talk)
Sat 2:15 p.m. - 2:30 p.m. | Coffee Break
Sat 2:30 p.m. - 4:00 p.m. | Poster Session 1
Sat 4:00 p.m. - 4:30 p.m. | Will Grathwohl: Recent Applications of Gradients in Discrete Sampling (Invited Talk)
Discrete sampling is a challenging and important problem. Despite much research, we still lack generic methods for sampling from discrete distributions without considerable knowledge of the structure of the target distribution. In continuous settings, by contrast, much more successful generic methods exist. These methods exploit the gradients of the distribution's log-likelihood function to approximate its local structure, which is used to parameterize fast-mixing Markov transition kernels. A number of approaches have attempted to apply these methods to discrete problems with varying levels of success. Typically, one creates a related continuous distribution, samples from it using continuous methods, and maps the continuous samples back into the original discrete space. Recently, a new class of approaches has emerged that uses gradient information differently: these approaches stay entirely in the original discrete space but use gradient information to define Markov transition kernels that propose discrete transitions. They have been shown to scale better and to be widely applicable. In this talk I will discuss the development of these methods, starting with Gibbs-With-Gradients, further work improving or expanding upon these ideas, and new directions for further research.
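As background for the talk, here is a minimal sketch of a Gibbs-With-Gradients-style update. It assumes a binary state vector and a differentiable `log_prob` function (both placeholders); the published method includes refinements not shown here.

```python
import torch

def gwg_step(x, log_prob, temp=2.0):
    # One Gibbs-With-Gradients-style step for a binary vector x in {0,1}^D.
    # `log_prob` is an assumed differentiable function returning a scalar
    # unnormalized log-probability; this is a hedged sketch, not the paper's code.
    x = x.detach().float().requires_grad_(True)
    grad_x = torch.autograd.grad(log_prob(x), x)[0]
    d_x = -(2.0 * x - 1.0) * grad_x               # first-order estimate of the gain from flipping each bit
    fwd = torch.distributions.Categorical(logits=d_x.detach() / temp)
    i = fwd.sample()                              # propose flipping coordinate i
    x_new = x.detach().clone()
    x_new[i] = 1.0 - x_new[i]

    x_tmp = x_new.clone().requires_grad_(True)
    grad_new = torch.autograd.grad(log_prob(x_tmp), x_tmp)[0]
    d_new = -(2.0 * x_new - 1.0) * grad_new
    rev = torch.distributions.Categorical(logits=d_new.detach() / temp)

    # Metropolis-Hastings correction so the chain targets exp(log_prob).
    log_alpha = (log_prob(x_new) - log_prob(x.detach())
                 + rev.log_prob(i) - fwd.log_prob(i))
    return x_new if torch.rand(()) < log_alpha.exp() else x.detach()
```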
Sat 4:30 p.m. - 5:00 p.m. | Lianhui Qin: Differentiable and structured text reasoning (Invited Talk)
Text reasoning and generation in practice often need to meet complex objectives, integrate diverse contextual constraints, and be grounded in logical structures for consistency. Current large LMs can produce fluent text and follow human instructions, but they still struggle to optimize effectively toward specific objectives. The discrete nature of text poses one of the key challenges to this optimization. In this talk, I will present our work on optimizing text reasoning and generation with continuous and discrete methods. I will first introduce COLD, a unified energy-based framework that empowers any off-the-shelf LM to reason toward any objective in a continuous space. This approach brings forward differentiable reasoning over discrete text, thus improving efficiency. Following this, I will discuss Maieutic prompting, a method that enhances the logical consistency of neural reasoning in a discrete space by integrating it with logical structures.
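To illustrate the continuous-relaxation idea behind COLD, here is a heavily simplified sketch of Langevin-style decoding on soft token logits; `energy_fn` is an assumed placeholder combining fluency and constraint terms, and the actual COLD parameterization, energy terms, and schedules differ.

```python
import torch

def langevin_decode(energy_fn, seq_len, vocab_size, steps=200, step_size=0.1, noise=0.01):
    # Hedged sketch: maintain soft token logits y and run Langevin dynamics on E(y),
    # where energy_fn (assumed) scores a soft sequence; discretize at the end.
    y = torch.randn(seq_len, vocab_size, requires_grad=True)
    for _ in range(steps):
        energy = energy_fn(torch.softmax(y, dim=-1))   # differentiable energy of the soft sequence
        grad = torch.autograd.grad(energy, y)[0]
        with torch.no_grad():
            y -= step_size * grad                      # gradient step toward low energy
            y += noise * torch.randn_like(y)           # Langevin noise for exploration
    return y.argmax(dim=-1)                            # map the relaxed sequence to discrete tokens
```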
Sat 5:00 p.m. - 5:30 p.m. | Contributed Talk 2
- Understanding prompt engineering does not require rethinking generalization (Outstanding)
- Topological Neural Discrete Representation Learning à la Kohonen (Oral)
- Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs (Oral)
Sat 5:30 p.m. - 6:00 p.m. | Petar Veličković: The Melting Pot of Neural Algorithmic Reasoning (Invited Talk)
With the eyes of the AI world pointed at the alignment of large language models, another revolution has been more silently---yet intensely---taking place: the algorithmic alignment of neural networks. After briefly surveying how we got here, I'll present some of the interesting 2023 works I've had the pleasure to co-author, many of which were presented at this year's ICML.
Sat 6:00 p.m. - 6:15 p.m. | Closing Remarks
Sat 6:15 p.m. - 7:45 p.m. | Poster Session 2
Differentiable Search of Evolutionary Trees (Poster)
Inferring the most probable evolutionary tree given leaf nodes is an important problem in computational biology that reveals the evolutionary relationships between species. Due to the exponential growth in the number of possible tree topologies, finding the best tree in polynomial time becomes computationally infeasible. In this work, we propose a novel differentiable approach as an alternative to traditional heuristic-based combinatorial tree search methods in phylogeny. Our method provides a general framework that is applicable to many phylogenetic inference problems where the optimization can be performed based on a desired objective. We empirically evaluate our method using randomly generated trees of up to 128 leaves, with each node represented by a 256-length protein sequence. The optimization objective of our method is to find the most parsimonious tree (i.e., to minimize the total number of evolutionary changes in the tree). Our method exhibits promising convergence ($<1\%$ error for trees up to 32 leaves, $<8\%$ error up to 128 leaves, given only leaf node information), illustrating its potential in much broader phylogenetic inference problems and possible integration with end-to-end differentiable models.
Ramith Hettiarachchi · Sergey Ovchinnikov
Diffusion on the Probability Simplex (Poster)
Diffusion models learn to reverse the progressive noising of a data distribution to create a generative model. However, the desired continuous nature of the noising process can be at odds with discrete data. To deal with this tension between continuous and discrete objects, we propose a method of performing diffusion on the probability simplex. Using the probability simplex naturally creates an interpretation where points correspond to categorical probability distributions. Our method uses the softmax function applied to an Ornstein-Uhlenbeck process, a well-known stochastic differential equation. We find that our methodology also naturally extends to include diffusion on the unit cube, which has applications in bounded image generation.
Griffin Floto · Thorsteinn Jonsson · Mihai Nica · Scott Sanner · Eric Zhu
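A minimal numerical sketch of the forward process described above (softmax of an Ornstein-Uhlenbeck process), with generic parameters; the paper's exact drift, scaling, and discretization may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ou_softmax_forward(x0_logits, T=1.0, n_steps=1000, theta=1.0, sigma=1.0, seed=0):
    # Simulate an Ornstein-Uhlenbeck process on logits with Euler-Maruyama and
    # map each state onto the probability simplex via softmax (hedged sketch).
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    z = np.array(x0_logits, dtype=float)
    path = [softmax(z)]
    for _ in range(n_steps):
        z = z - theta * z * dt + sigma * np.sqrt(dt) * rng.standard_normal(z.shape)
        path.append(softmax(z))
    return path  # sequence of points on the simplex, increasingly noised
```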
DISCS: A Benchmark for Discrete Sampling (Poster)
Sampling in discrete space, with critical applications in simulation and optimization, has recently attracted considerable attention thanks to significant advances in gradient-based approaches that exploit modern accelerators such as GPUs. However, two key challenges seriously hinder further research on discrete sampling. First, since there is no consensus on the experimental setting, the empirical results in different research papers are often not comparable. Second, implementing samplers and target distributions often requires a nontrivial amount of effort in terms of calibration, parallelism, and evaluation. To tackle these challenges, we propose \emph{DISCS} (DISCrete Sampling), a tailored package and benchmark that supports unified and efficient implementation and evaluation of discrete sampling across three types of tasks: sampling for classical graphical models, combinatorial optimization, and energy-based generative models. Through the comprehensive evaluations in \emph{DISCS}, we also gain new insights into scalability, the design principles of proposal distributions, and lessons for adaptive sampler design. \emph{DISCS} implements representative discrete samplers from existing research as baselines and offers a simple interface with which researchers can conveniently design new discrete samplers and compare them with baselines in a calibrated setup.
Katayoon Goshvadi · Haoran Sun · Xingchao Liu · Azade Nova · Ruqi Zhang · Will Grathwohl · Dale Schuurmans · Hanjun Dai
Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs (Oral)
Even after fine-tuning and reinforcement learning, large language models (LLMs) can be difficult, if not impossible, to control reliably with prompts alone. We propose a new inference-time approach to enforcing syntactic and semantic constraints on the outputs of LLMs, called sequential Monte Carlo (SMC) steering. The key idea is to specify language generation tasks as posterior inference problems in a class of discrete probabilistic sequence models, and replace standard decoding with sequential Monte Carlo inference. For a computational cost similar to that of beam search, SMC can steer LLMs to solve diverse tasks, including infilling, generation under syntactic constraints, and prompt intersection. To facilitate experimentation with SMC steering, we present a probabilistic programming library, LLaMPPL, for concisely specifying new generation tasks as language model probabilistic programs, and automating steering of LLaMA-family Transformers.
Alexander Lew · Tan Zhi-Xuan · Gabriel Grand · Vikash Mansinghka
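The following sketch illustrates the generic SMC idea behind this approach: propose next tokens from the base LM, reweight partial sequences by how well they satisfy constraints, and resample. The interfaces `next_token_logprobs` and `constraint_logweight` are assumed placeholders; this is not the LLaMPPL API.

```python
import numpy as np

def smc_steer(next_token_logprobs, constraint_logweight, n_particles=8, max_len=20, seed=0):
    # next_token_logprobs(seq) -> (tokens, logprobs arrays) under the base LM (assumed interface).
    # constraint_logweight(seq) -> incremental log-weight for syntactic/semantic constraints.
    rng = np.random.default_rng(seed)
    particles = [[] for _ in range(n_particles)]
    for _ in range(max_len):
        logw = np.zeros(n_particles)
        for k, seq in enumerate(particles):
            tokens, lps = next_token_logprobs(seq)
            p = np.exp(lps - np.max(lps))
            p /= p.sum()
            seq.append(tokens[rng.choice(len(tokens), p=p)])  # extend particle with an LM proposal
            logw[k] = constraint_logweight(seq)               # weight by constraint satisfaction
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
        particles = [list(particles[j]) for j in idx]
    return particles
```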
Constrained Sampling of Discrete Geometric Manifolds using Denoising Diffusion Probabilistic Models (Poster)
Understanding the macroscopic characteristics of biological complexes demands precision and specificity in statistical ensemble modeling. One of the primary challenges in this domain lies in sampling from particular discrete subsets of the state-space, driven either by existing structural knowledge or specific areas of interest within the state-space. We propose a method that enables sampling from distributions that rigorously adhere to arbitrary sets of geometric constraints in Euclidean spaces. This is achieved by integrating a constraint projection operator within the well-regarded architecture of Denoising Diffusion Probabilistic Models, a framework founded in generative modeling and probabilistic inference. The significance of this work becomes apparent, for instance, in the context of deep learning-based drug design, where it is imperative to sample from the discrete structures of the solution space.
Justin Diamond · Markus Lill
Accelerating Diffusion-based Combinatorial Optimization Solvers by Progressive Distillation (Oral)
Graph-based diffusion models have shown promising results in terms of generating high-quality solutions to NP-complete (NPC) combinatorial optimization (CO) problems. However, those models are often inefficient in inference, due to the iterative evaluation nature of the denoising diffusion process. This paper proposes to use $\textit{progressive}$ distillation to speed up the inference by taking fewer steps (e.g., forecasting two steps ahead within a single step) during the denoising process. Our experimental results show that the progressively distilled model can perform inference $\textbf{16}$ times faster with only $\textbf{0.019}$% degradation in performance on the TSP-50 dataset.
Junwei Huang · Zhiqing Sun · Yiming Yang
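A rough sketch of the progressive-distillation idea (one student step matches two consecutive teacher steps); the `denoise_step` interface and `stride` argument are illustrative assumptions rather than the paper's actual code.

```python
import torch

def progressive_distillation_loss(student, teacher, x_t, t, cond=None):
    # Hedged sketch: train the student to reproduce, in a single denoising step,
    # the result of two consecutive teacher steps (interfaces are assumed).
    with torch.no_grad():
        x_mid = teacher.denoise_step(x_t, t, cond)          # teacher step t -> t-1
        target = teacher.denoise_step(x_mid, t - 1, cond)   # teacher step t-1 -> t-2
    pred = student.denoise_step(x_t, t, cond, stride=2)     # student jumps t -> t-2 directly
    return torch.mean((pred - target) ** 2)
```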
SurCo: Learning Linear SURrogates for COmbinatorial Nonlinear Optimization Problems (Oral)
Optimization problems with nonlinear cost functions and combinatorial constraints appear in many real-world applications but remain challenging to solve efficiently compared to their linear counterparts. To bridge this gap, we propose $\textbf{\emph{\texttt{SurCo}}}$ that learns linear $\underline{\text{Sur}}$rogate costs which can be used in existing $\underline{\text{Co}}$mbinatorial solvers to output good solutions to the original nonlinear combinatorial optimization problem. The surrogate costs are learned end-to-end with nonlinear loss by differentiating through the linear surrogate solver, combining the flexibility of gradient-based methods with the structure of linear combinatorial optimization. We propose three $\texttt{SurCo}$ variants: $\texttt{SurCo}-\texttt{zero}$ for individual nonlinear problems, $\texttt{SurCo}-\texttt{prior}$ for problem distributions, and $\texttt{SurCo}-\texttt{hybrid}$ to combine both distribution and problem-specific information. We give theoretical intuition motivating $\texttt{SurCo}$, and evaluate it empirically. Experiments show that $\texttt{SurCo}$ finds better solutions faster than state-of-the-art and domain expert approaches in real-world optimization problems such as embedding table sharding, inverse photonic design, and nonlinear route planning.
Aaron Ferber · Taoan Huang · Daochen Zha · Martin Schubert · Benoit Steiner · Bistra Dilkina · Yuandong Tian
Can LLMs Generate Random Numbers? Evaluating LLM Sampling in Controlled Domains (Poster)
Practitioners frequently take multiple samples from large language models (LLMs) to explore the distribution of completions induced by a given prompt. While individual samples can give high-quality results for given tasks, collectively there are no guarantees on the distribution over these samples induced by the generating LLM. In this paper, we empirically evaluate LLMs' capabilities as distribution samplers. We identify core concepts and metrics underlying LLM-based sampling, including different sampling methodologies and prompting strategies. Using a set of controlled domains, we evaluate the error and variance of the distributions induced by the LLM. We find that LLMs struggle to induce reasonable distributions over generated elements, suggesting that practitioners should more carefully consider the semantics and methodologies of sampling from LLMs.
Aspen Hopkins · Alex Renda · Michael Carbin
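One simple way to quantify the kind of sampling error studied here is the total variation distance between the empirical distribution of model samples and a target distribution; the snippet below is an illustrative sketch, not the paper's exact metric.

```python
from collections import Counter

def total_variation_error(samples, target_probs):
    # Compare the empirical distribution of LLM samples with a target distribution
    # (e.g., uniform over {1,...,10}) via total variation distance.
    counts = Counter(samples)
    n = len(samples)
    support = set(target_probs) | set(counts)
    return 0.5 * sum(abs(counts.get(x, 0) / n - target_probs.get(x, 0.0)) for x in support)

# Example: ten "random numbers" obtained by prompting a model repeatedly (hypothetical data).
print(total_variation_error([7, 7, 3, 7, 1, 7, 9, 7, 3, 7], {k: 0.1 for k in range(1, 11)}))
```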
Protein Design with Guided Discrete Diffusion (Poster)
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. The generative model samples plausible sequences while the discriminative model guides a search for sequences with high fitness. Given its broad success in conditional sampling, classifier-guided diffusion modeling is a promising foundation for protein design, leading many to develop guided diffusion models for structure with inverse folding to recover sequences. In this work, we propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods, including scarce data and challenging inverse design. Moreover, we use NOS to generalize LaMBO, a Bayesian optimization procedure for sequence design that facilitates multiple objectives and edit-based constraints. The resulting method, LaMBO-2, enables discrete diffusions and stronger performance with limited edits through a novel application of saliency maps. We apply LaMBO-2 to a real-world protein design task, optimizing antibodies for higher expression yield and binding affinity to a therapeutic target under locality and liability constraints, with 97% expression rate and 25% binding rate in exploratory in vitro experiments.
Nate Gruver · Samuel Stanton · Nathan Frey · Tim G. J. Rudner · Isidro Hotzel · Julien Lafrance-Vanasse · Arvind Rajpal · Kyunghyun Cho · Andrew Wilson
Efficient Location Sampling Algorithms for Road Networks (Poster)
Many geographic information systems applications rely on data provided by user devices in the road network, including traffic monitoring, driving navigation, and road closure detection. The underlying signal is generally collected by sampling locations from user trajectories. The sampling process, though critical for various applications, has not been studied sufficiently in the literature. While the most natural way to sample a trajectory may be to use a frequency based algorithm, e.g., sampling locations every $x$ seconds, such a sampling strategy can be quite wasteful in resources (e.g., server-side processing, user battery) as well as stored user data. In this work, we conduct a horizontal study of various location sampling algorithms (based on frequency, road geography, reservoir sampling, etc.) on the road network of New York City and assess their trade-offs in terms of various metrics of interest, such as the size of the stored data and the induced quality of training for prediction tasks (e.g., predicting speeds).
Sara Ahmadian · Kostas Kollias · Ameya Velingker · Sreenivas Gollapudi · Vivek Kumar · Santhoshini Velusamy
Towards Accelerating Benders Decomposition via Reinforcement Learning Surrogate Models (Poster)
Stochastic optimization (SO) attempts to offer optimal decisions in the presence of uncertainty. Often, the classical formulation of these problems becomes intractable due to (a) the number of scenarios required to capture the uncertainty and (b) the discrete nature of real-world planning problems. To overcome these tractability issues, practitioners turn to decomposition methods that divide the problem into smaller, more tractable sub-problems. The focal decomposition method of this paper is Benders decomposition (BD), which decomposes stochastic optimization problems on the basis of scenario independence. In this paper we propose a method of accelerating BD with the aid of a surrogate model in place of an $\mathcal{NP}$-hard integer master problem. Through the acceleration method we observe $30\%$ faster average convergence when compared to other accelerated BD implementations. We introduce a reinforcement learning agent as a surrogate and demonstrate how it can be used to solve a stochastic inventory management problem.
Stephen Mak · Kyle Mana · Parisa Zehtabi · Michael Cashmore · Daniele Magazzeni · Manuela Veloso
Understanding prompt engineering does not require rethinking generalization (Oral)
Zero-shot learning in prompted visual-language models, the practice of crafting prompts to build classifiers without an explicit training process, shows impressive performance in many settings. A seemingly surprising fact also emerges: this method suffers relatively little from overfitting; i.e., when a prompt is manually engineered to achieve low error on a given training set (thus rendering the method no longer zero-shot), the approach still performs relatively well on held-out test data. In this paper, we show that we can explain such performance remarkably well via recourse to classical PAC-Bayes bounds. Specifically, we show that the discrete nature of prompts, combined with a PAC-Bayes prior given by a language model, results in generalization bounds that are \emph{remarkably} tight by the standards of the literature: for instance, the generalization bound of an ImageNet classifier is often within a few percentage points of the true test error. Indeed, we show that we can therefore \emph{greedily} search over the prompt space in such a framework, improving upon training performance while retaining the same bound. Furthermore, the bound is remarkably suitable for model selection: the models with the best bound typically also have the best test performance. This work thus provides a substantial justification for the widespread use of ``prompt engineering,'' even if it seems as though such methods could overfit a training set.
Victor Akinwande · Yiding Jiang · Dylan Sam · Zico Kolter
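For context, a classical Occam-style bound of the kind the abstract invokes: for a single discrete prompt $h$ with prior probability $P(h)$ under the language model, with probability at least $1-\delta$ over $n$ i.i.d. training examples,

$$\mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h) + \sqrt{\frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{2n}},$$

so prompts that are probable under the LM prior carry a small complexity term $\ln\frac{1}{P(h)}$ and hence tight guarantees; the paper's exact bounds may take a tighter (e.g., kl-based) form.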
GFlowNets for Causal Discovery: an Overview (Poster)
Causal relationships underpin modern science and our ability to reason. Automatically discovering useful causal relationships can greatly accelerate scientific progress and facilitate the creation of machines that can reason like we do. Traditionally, the dominant approaches to causal discovery have been statistical, such as the PC algorithm. A newer area of research integrates recent advances in machine learning with causal discovery. We focus on a series of recent works that leverage new algorithms in deep learning for causal discovery -- notably, generative flow networks (GFlowNets). We discuss the unique perspectives GFlowNets bring to causal discovery.
Dragos Cristian Manta · Edward Hu · Yoshua Bengio
Topological Neural Discrete Representation Learning à la Kohonen (Oral)
Unsupervised learning of discrete representations in neural networks (NNs) from continuous ones is essential for many modern applications. Vector Quantisation (VQ) has become popular for this, in particular in the context of generative models such as Variational Auto-Encoders (VAEs), where the exponential moving average-based VQ (EMA-VQ) algorithm is often used. Here we study an alternative VQ algorithm based on Kohonen's learning rule for the Self-Organising Map (KSOM; 1982), a classic VQ algorithm known to offer two potential benefits over its special case EMA-VQ: empirically, KSOM converges faster than EMA-VQ, and KSOM-generated discrete representations form a topological structure on the grid whose nodes are the discrete symbols, resulting in an artificial version of the brain's topographic map. We revisit these properties by using KSOM in VQ-VAEs for image processing. In our experiments, the speed-up compared to well-configured EMA-VQ is only observable at the beginning of training, but KSOM is generally much more robust, e.g., w.r.t. the choice of initialisation schemes.
Kazuki Irie · Robert Cordas · Jürgen Schmidhuber
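For reference, a bare-bones NumPy sketch of one Kohonen/SOM vector-quantisation update (the winning code and its grid neighbours move toward the input); learning-rate and radius schedules, and the VQ-VAE integration, are omitted, and the variable names are illustrative.

```python
import numpy as np

def ksom_update(codebook, grid_pos, x, lr=0.1, radius=1.0):
    # codebook: (K, D) code vectors; grid_pos: (K, 2) grid coordinates of each code; x: (D,) input.
    dists = np.linalg.norm(codebook - x, axis=1)
    winner = np.argmin(dists)                                  # nearest code = discrete symbol
    grid_d2 = np.sum((grid_pos - grid_pos[winner]) ** 2, axis=1)
    neighbourhood = np.exp(-grid_d2 / (2 * radius ** 2))       # Gaussian neighbourhood on the grid
    codebook += lr * neighbourhood[:, None] * (x - codebook)   # pull winner and neighbours toward x
    return winner, codebook
```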
Landscape Surrogate: Learning Decision Losses for Mathematical Optimization Under Partial Information (Poster)
Recent works in learning-integrated optimization have shown promise in settings where the optimization problem is only partially observed or where general-purpose optimizers perform poorly without expert tuning. By learning an optimizer $\mathbf{g}$ to tackle these challenging problems with $f$ as the objective, the optimization process can be substantially accelerated by leveraging past experience. The optimizer can be trained with supervision from known optimal solutions or implicitly by optimizing the compound function $f\circ \mathbf{g}$. The implicit approach may not require optimal solutions as labels and is capable of handling problem uncertainty; however, it is slow to train and deploy due to frequent calls to optimizer $\mathbf{g}$ during both training and testing. The training is further challenged by sparse gradients of $\mathbf{g}$, especially for combinatorial solvers. To address these challenges, we propose using a smooth and learnable **Landscape Surrogate** $\mathcal{M}$ as a replacement for $f\circ \mathbf{g}$. This surrogate, learnable by neural networks, can be computed faster than the solver $\mathbf{g}$, provides dense and smooth gradients during training, can generalize to unseen optimization problems, and is efficiently learned via alternating optimization. We test our approach on both synthetic problems, including shortest path and multidimensional knapsack, and real-world problems such as portfolio optimization, achieving comparable or superior objective values compared to state-of-the-art baselines while reducing the number of calls to $\mathbf{g}$. Notably, our approach outperforms existing methods for computationally expensive high-dimensional problems.
Arman Zharmagambetov · Brandon Amos · Aaron Ferber · Taoan Huang · Bistra Dilkina · Yuandong Tian
Efficient data selection employing Semantic Similarity-based Graph Structures for model training (Poster)
Recent developments in Natural Language Processing (NLP) have highlighted the need for substantial amounts of data for models to capture textual information accurately. This raises concerns regarding the computational resources and time required for training such models. The paper introduces SEmantics for data SAliency in Model performance Estimation (SeSaME). It is an efficient data sampling mechanism that is solely based on textual information without passing the data through a compute-heavy model or other intensive pre-processing transformations. The application of this approach is demonstrated in the use case of low-resource automated speech recognition (ASR) models, which excessively rely on Text-to-Speech (TTS) calls when using augmented data. SeSaME learns to categorize new incoming data points into speech recognition difficulty buckets by employing semantic similarity-based graph structures and discrete ASR information from homophilous neighbourhoods through message passing. The results indicate reliable projections of ASR performance, with a $93\%$ accuracy increase when using the proposed method compared to random predictions, bringing non-trivial information on the impact of textual representations in speech models. Furthermore, a series of experiments show both the benefits and challenges of using the ASR information on incoming data to fine-tune the model. We report a $7\%$ drop in validation loss compared to random sampling, a $7\%$ WER drop with non-local aggregation when evaluating against a highly difficult dataset, and a $1.8\%$ WER drop with local aggregation and high semantic similarity between datasets.
Roxana Petcu · Subhadeep Maji
Tackling Provably Hard Representative Selection via Graph Neural Networks (Oral)
Representative Selection (RS) is the problem of finding a small subset of exemplars from a dataset that is representative of the dataset. In this paper, we study RS for unlabeled datasets and focus on finding representatives that optimize the accuracy of a model trained on the selected representatives. Theoretically, we establish a new hardness result for RS by proving that a particular, highly practical variant of it (RS for Learning) is hard to approximate in polynomial time within any reasonable factor, which implies a significant potential gap between the optimum solution of widely-used surrogate functions and the actual accuracy of the model. We then study a setting where additional information in the form of a (homophilous) graph structure is available, or can be constructed, between the data points. We show that with an appropriate modeling approach, the presence of such a structure can turn a hard RS (for learning) problem into one that can be effectively solved. To this end, we develop RSGNN, a representation learning-based RS model based on Graph Neural Networks. Empirically, we demonstrate the effectiveness of RSGNN on problems with predefined graph structures as well as problems with graphs induced from node feature similarities, by showing that RSGNN achieves significant improvements over established baselines on a suite of eight benchmarks.
Mehran Kazemi · Anton Tsitsulin · Hossein Esfandiari · Mohammad Hossein Bateni · Deepak Ramachandran · Bryan Perozzi · Vahab Mirrokni
Training Discrete EBMs with Energy Discrepancy (Poster)
Training energy-based models (EBMs) on discrete spaces is challenging because sampling over such spaces can be difficult. We propose to train discrete EBMs with energy discrepancy (ED), a novel type of contrastive loss functional which only requires the evaluation of the energy function at data points and their perturbed counterparts, thus not relying on sampling strategies like Markov chain Monte Carlo (MCMC). Energy discrepancy offers theoretical guarantees for a broad class of perturbation processes, of which we investigate three types: perturbations based on Bernoulli noise, based on deterministic transforms, and based on neighbourhood structures. We demonstrate their relative performance on lattice Ising models, binary synthetic data, and discrete image data sets.
Tobias Schröder · Zijing Ou · Yingzhen Li · Andrew Duncan
Finite-state Offline Reinforcement Learning with Moment-based Bayesian Epistemic and Aleatoric Uncertainties (Poster)
Reinforcement learning (RL) agents can learn complex sequential decision-making and control strategies, often above human expert performance levels. In real-world deployment, it becomes essential from a risk, safety-critical, and human interaction perspective for agents to communicate the degree of confidence or uncertainty they have in the outcomes of their actions and account for it in their decision-making. We assemble here a complete pipeline for modelling uncertainty in the finite, discrete-state setting of offline RL. First, we use methods from Bayesian RL to capture the posterior uncertainty in environment model parameters given the available data. Next, we determine exact values for the return distribution's standard deviation, taken as the measure of uncertainty, for given samples from the environment posterior (without requiring quantile-based or similar approximations of conventional distributional RL) to more efficiently decompose the agent's uncertainty into epistemic and aleatoric uncertainties compared to previous approaches. This allows us to build an RL agent that quantifies both types of uncertainty and utilises its epistemic uncertainty belief to inform its optimal policy through a novel stochastic gradient-based optimisation process. We illustrate the improved uncertainty quantification and Bayesian value optimisation performance of our agent in simple, interpretable gridworlds and confirm its scalability by applying it to a clinical decision support system (AI Clinician) which makes real-time recommendations for sepsis treatment in intensive care units, and address the limitations that arise with inference for larger-scale MDPs by proposing a sparse, conservative dynamics model.
Filippo Valdettaro · Aldo Faisal
Annealed Biological Sequence Optimization (Poster)
Designing biological sequences with desired properties is an impactful research problem with various application scenarios such as protein engineering, antibody design, and drug discovery. Machine learning algorithms can be applied either to fit the property landscape with supervised learning or to generatively propose reasonable candidates and thereby reduce wet-lab effort. From the learning perspective, the key challenges lie in the sharp property landscape (i.e., a few mutations can dramatically change a protein's properties) and the large biological sequence space. In this paper, we propose annealed sequence optimization (ANSO), which aims to address both challenges simultaneously through a paired surrogate-model training paradigm and sequence sampling procedure. Extensive experiments on a series of protein sequence design tasks demonstrate its effectiveness over several strong baselines.
Yuxuan Song · Botian Wang · Hao Zhou · Wei-Ying Ma
Hierarchical Decomposition Framework for Feasibility-hard Combinatorial Optimization (Poster)
Combinatorial optimization (CO) is a widely applied method for addressing a variety of real-world optimization problems. However, due to the NP-hard nature of these problems, complex problem-specific heuristics are often required to tackle them at real-world scales. Neural combinatorial optimization has emerged as an effective approach to tackling CO problems, but it often requires a pre-computed optimal solution or a hand-designed process to ensure that the model generates a feasible solution, which may not be available in many real-world CO problems. We propose the hierarchical combinatorial optimizer (HCO), which does not rely on such restrictive assumptions. HCO decomposes the given CO problem into multiple sub-problems at different scales with smaller search spaces, where each sub-problem can be optimized separately and the solutions can be combined to compose the entire solution. Our experiments demonstrate that this hierarchical decomposition facilitates more efficient learning and stronger generalization, outperforming traditional heuristic and mathematical optimization algorithms.
Hanbum Ko · Minu Kim · Han-Seul Jeong · Sunghoon Hong · Deunsol Yoon · Youngjoon Park · Woohyung Lim · Honglak Lee · Moontae Lee · Kanghoon Lee · Sungbin Lim · Sungryull Sohn
Global Optimality in Bivariate Gradient-based DAG Learning (Poster)
Recently, a new class of non-convex optimization problems motivated by the statistical problem of learning an acyclic directed graphical model from data has attracted significant interest. While existing work uses standard first-order optimization schemes to solve this problem, proving the global optimality of such approaches has proven elusive. The difficulty lies in the fact that unlike other non-convex problems in the literature, this problem is not "benign", and possesses multiple spurious solutions that standard approaches can easily get trapped in. In this paper, we prove that a simple path-following optimization scheme globally converges to the global minimum of the population loss in the bivariate setting.
Chang Deng · Kevin Bello · Pradeep Ravikumar · Bryon Aragam
Strictly Low Rank Constraint Optimization --- An Asymptotically $\mathcal{O}(\frac{1}{t^2})$ Method (Poster)
We study a class of non-convex and non-smooth problems with rank regularization to induce sparsity in solutions. We propose to apply the proximal gradient descent method to solve the problem and to accelerate the process with a novel support-set projection operation applied to the singular values of the obtained solution. We show that our algorithms achieve a convergence rate of $\mathcal{O}(\frac{1}{t^2})$, which is Nesterov's optimal convergence rate for first-order methods on smooth convex problems, and matches the convergence rate of the regular accelerated proximal gradient method on convex problems. Moreover, the obtained solutions are highly sparse, and the support set of singular values of each solution shrinks during the updates, which improves the interpretability of the algorithms.
Mengyuan Zhang · Kai Liu
Searching Large Neighborhoods for Integer Linear Programs with Contrastive Learning (Poster)
Integer Linear Programs (ILPs) are powerful tools for modeling and solving many combinatorial optimization problems. Recently, it has been shown that Large Neighborhood Search (LNS), as a heuristic algorithm, can find high-quality solutions to ILPs faster than Branch and Bound. However, how to find the right heuristics to maximize the performance of LNS remains an open problem. In this paper, we propose a novel approach, CL-LNS, that delivers state-of-the-art anytime performance on several ILP benchmarks measured by metrics including the primal gap, the primal integral, survival rates and the best performing rate. Specifically, CL-LNS collects positive and negative solution samples from an expert heuristic that is slow to compute and learns a more efficient one with contrastive learning.
Taoan Huang · Aaron Ferber · Yuandong Tian · Bistra Dilkina · Benoit Steiner
Reinforcement Learning for Sampling on Temporal Medical Imaging Sequences (Poster)
Accelerated magnetic resonance imaging resorts to either Fourier-domain subsampling or better reconstruction algorithms to deal with fewer measurements while still generating medical images of high quality. Determining the optimal sampling strategy given a fixed reconstruction protocol often has combinatorial complexity. In this work, we apply double deep Q-learning and REINFORCE algorithms to learn the sampling strategy for dynamic image reconstruction. We consider the data in the format of time series, and the reconstruction method is a pre-trained autoencoder-type neural network. We present a proof of concept that reinforcement learning algorithms are effective to discover the optimal sampling pattern which underlies the pre-trained reconstructor network (i.e., the dynamics in the environment). The code for replicating experiments can be found at https://github.com/zhishenhuang/RLsamp.
Zhishen Huang
Complex Preferences for Different Convergent Priors in Discrete Graph Diffusion (Poster)
Diffusion models have achieved state-of-the-art performance in generating many different kinds of data, including images, text, and videos. Despite their success, there has been limited research on how the underlying diffusion process and the final convergent prior can affect generative performance; this research has also been limited to continuous data types and a score-based diffusion framework. To fill this gap, we explore how different discrete diffusion kernels (which converge to different prior distributions) affect the performance of diffusion models for graphs. To this end, we developed a novel formulation of a family of discrete diffusion kernels which are easily adjustable to converge to different Bernoulli priors, and we study the effect of these different kernels on generative performance. We show that the quality of generated graphs is sensitive to the prior used, and that the optimal choice cannot be explained by obvious statistics or metrics, which challenges the intuitions suggested by previous works.
Alex Tseng · Nathaniel Diamant · Tommaso Biancalani · Gabriele Scalia
Sequential Attention for Feature Selection (Poster)
Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations towards the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
Taisuke Yasuda · Mohammad Hossein Bateni · Lin Chen · Matthew Fahrbach · Thomas Fu · Vahab Mirrokni
Tensor Proxies for Efficient Feature Cross Search (Poster)
Feature crossing is a popular method for augmenting the feature set of a machine learning model by taking the Cartesian product of a small number of existing categorical features. While feature crosses have traditionally been hand-picked by domain experts, a recent line of work has focused on the automatic discovery of informative feature crosses. Our work proposes a simple yet efficient and effective approach to this problem using tensor proxies as well as a novel application of the attention mechanism to convert the combinatorial problem of feature cross search to a continuous optimization problem. By solving the continuous optimization problem and then rounding the solution to a feature cross, we give a highly efficient algorithm for feature cross search that trains only a single model for feature cross searching, unlike prior greedy methods that require training a large number of models. Through extensive empirical evaluations, we show that our algorithm is not only efficient, but also discovers more informative feature crosses that allow us to achieve state-of-the-art empirical results for feature cross models. Furthermore, even without the rounding step, we obtain a novel DNN architecture for augmenting existing models with a small number of features to improve quality without introducing any feature crosses. This avoids the cost of storing additional large embedding tables for these feature crosses.
Taisuke Yasuda · Mohammad Hossein Bateni · Lin Chen · Matthew Fahrbach · Thomas Fu
Guided Evolution with Binary Predictors for ML Program Search (Poster)
Primitive-based evolutionary AutoML discovers novel state-of-the-art ML components by searching over programs built from low-level building blocks. While very expressive, these spaces have sparsely distributed good performing candidates. This poses great challenges in efficient search. Performance predictors have proven successful in speeding up search in smaller and denser Neural Architecture Search (NAS) spaces, but they have not yet been tried on these larger primitive-based search spaces. Through a unified graph representation to encode a wide variety of ML components, we train a binary classifier online to predict which of two given candidates is better. We then present an adaptive mutation method that leverages the learned binary predictor and show how it improves local search. We empirically demonstrate our method speeds up end-to-end evolution across a set of diverse problems including a 3.7x speedup on the symbolic search for ML optimizers and a 4x speedup for RL loss functions.
John Co-Reyes · Yingjie Miao · George Tucker · Aleksandra Faust · Esteban Real
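A toy sketch of the predictor-guided mutation idea: sample several candidate mutations and keep the one the learned binary comparator prefers. `mutate` and `better_than` are assumed placeholders standing in for the paper's mutation operator and trained binary predictor, and the selection scheme here is deliberately simplified.

```python
def guided_mutation(parent, mutate, better_than, n_candidates=8):
    # Propose several mutations of `parent` and keep the one the learned binary
    # comparator predicts to be strongest (placeholder interfaces, toy tournament).
    best = mutate(parent)
    for _ in range(n_candidates - 1):
        candidate = mutate(parent)
        if better_than(candidate, best):  # binary predictor: "is candidate better than best?"
            best = candidate
    return best
```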
Discrete Diffusion Reward Guidance Methods for Offline Reinforcement Learning (Poster)
As reinforcement learning challenges involve larger amounts of data in different forms, new techniques will be required in order to generate high-quality plans with only a compact representation of the original information. While novel diffusion generative policies have provided a way to model complex action distributions directly in the original, high-dimensional feature space, they suffer from slow inference speed and have not yet been applied with reduced dimension or to discrete tasks. In this work, we propose three diffusion-guidance techniques with a reduced representation of the state provided by quantile discretization: a gradient-based approach, a stochastic beam search approach, and a Q-learning approach. Our findings indicate that the gradient-based and beam search approaches are capable of improving scores on an offline reinforcement learning task by a significant margin.
Matthew Coleman · Olga Russakovsky · Christine Allen-Blanchette · Ye Zhu
Categorical SDEs with Simplex Diffusion (Poster)
Diffusion models typically operate in the standard framework of generative modelling by producing continuously-valued datapoints. To this end, they rely on a progressive Gaussian smoothing of the original data distribution, which admits an SDE interpretation involving increments of a standard Brownian motion. However, some applications such as text generation or reinforcement learning might naturally be better served by diffusing categorical-valued data, i.e., lifting the diffusion to a space of probability distributions. To this end, this short theoretical note proposes Simplex Diffusion, a means to directly diffuse datapoints located on an $n$-dimensional probability simplex. We show how this relates to the Dirichlet distribution on the simplex and how the analogous SDE is realized thanks to a multi-dimensional Cox-Ingersoll-Ross process (abbreviated as CIR), previously used in economics and mathematical finance. Finally, we make remarks as to the numerical implementation of trajectories of the CIR process, and discuss some limitations of our approach.
Pierre Richemond · Sander Dieleman · Arnaud Doucet
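For reference, the Cox-Ingersoll-Ross (CIR) process mentioned in the abstract is the square-root diffusion, written here in its standard generic form (the paper's exact parameterization may differ):

$$dX_t = a\,(b - X_t)\,dt + \sigma \sqrt{X_t}\, dW_t,$$

whose nonnegativity (and, under suitable parameters, Gamma stationary marginals) is what connects a normalized vector of such coordinates to the Dirichlet distribution on the simplex.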
SOBER: Highly Parallel Bayesian Optimization and Bayesian Quadrature over Discrete and Mixed Spaces (Poster)
Batch Bayesian optimisation and Bayesian quadrature have been shown to be sample-efficient methods of performing optimisation and quadrature where expensive-to-evaluate objective functions can be queried in parallel. However, current methods do not scale to large batch sizes — a frequent desideratum in practice (e.g. drug discovery or simulation-based inference). We present a novel algorithm, SOBER, which permits scalable and diversified batch global optimisation and quadrature with arbitrary acquisition functions and kernels over discrete and mixed spaces. The key to our approach is to reformulate batch selection for global optimisation as a quadrature problem, which relaxes acquisition function maximisation (non-convex) to kernel recombination (convex). Bridging global optimisation and quadrature can efficiently solve both tasks by balancing the merits of exploitative Bayesian optimisation and explorative Bayesian quadrature. We show that SOBER outperforms 11 competitive baselines on 12 synthetic and diverse real-world tasks.
Masaki Adachi · Satoshi Hayakawa · Saad Hamid · Martin Jørgensen · Harald Oberhauser · Michael A Osborne
Solving NP-hard Min-max Routing Problems as Sequential Generation with Equity Context (Poster)
Min-max routing problems aim to minimize the maximum tour length among agents as they collaboratively visit all cities, i.e., the completion time. These problems include impactful real-world applications but are known to be NP-hard. Existing methods face challenges, particularly in large-scale problems that require the coordination of numerous agents to cover thousands of cities. This paper proposes a new deep-learning framework to solve large-scale min-max routing problems. We model the simultaneous decision-making of multiple agents as a sequential generation process, allowing the utilization of scalable deep-learning models for sequential decision-making. In the sequentially approximated problem, we propose a scalable contextual Transformer model, Equity-Transformer, which generates sequential actions considering an equitable workload among agents. The effectiveness of Equity-Transformer is demonstrated through its superior performance in two representative min-max routing tasks: the min-max multiple traveling salesman problem (min-max mTSP) and the min-max multiple pick-up and delivery problem (min-max mPDP). Notably, our method achieves significant reductions in runtime (approximately 335 times) and in cost values (about 53%) compared to a competitive heuristic (LKH3) in the case of 100 vehicles with 1,000 cities for mTSP.
Jiwoo Son · Minsu Kim · Sanghyeok Choi · Hyeonah Kim · Jinkyoo Park
Optimizing protein fitness using Gibbs sampling with Graph-based Smoothing (Poster)
The ability to design novel proteins with higher fitness on a given task would be revolutionary for many fields of medicine. However, brute-force search through the combinatorially large space of sequences is infeasible. Prior methods constrain search to a small mutational radius from a reference sequence, but such heuristics drastically limit the design space. Our work seeks to remove the restriction on mutational distance while enabling efficient exploration. We propose Gibbs sampling with Graph-based Smoothing (GGS), which iteratively applies Gibbs with Gradients to propose advantageous mutations, using graph-based smoothing to remove noisy gradients that lead to false positives. Our method is state-of-the-art in discovering high-fitness proteins with up to 8 mutations from the training set. We study the GFP and AAV design problems, ablations, and baselines to elucidate the results. Code: https://github.com/kirjner/GGS
Andrew Kirjner · Jason Yim · Raman Samusevich · Tommi Jaakkola · Regina Barzilay · Ila R. Fiete
Symmetric Exploration in Combinatorial Optimization is Free! (Poster)
Recently, deep reinforcement learning (DRL) has shown promise in solving combinatorial optimization (CO) problems. However, such methods often require a large number of evaluations of the objective function, which can be time-consuming in real-world scenarios. To address this issue, we propose a "free" technique to enhance the performance of any DRL solver by exploiting symmetry without requiring additional objective function evaluations. Our key idea is to augment the training of DRL-based combinatorial optimization solvers with reward-preserving transformations. The proposed algorithm is likely to be impactful since it is simple, easy to integrate with existing solvers, and applicable to a wide range of combinatorial optimization tasks. Extensive empirical evaluations on NP-hard routing optimization, scheduling optimization, and de novo molecular optimization confirm that our method effortlessly improves the sample efficiency of state-of-the-art DRL algorithms.
Hyeonah Kim · Minsu Kim · Sungsoo Ahn · Jinkyoo Park
An Optimal Clustering Algorithm for the Labeled Stochastic Block Model (Poster)
This paper considers the clustering problem in the Labeled Stochastic Block Model (LSBM) from observations of labels. For this model, we assume that the cluster sizes increase linearly with the number of nodes $n$. Our goal is to develop an efficient algorithm to identify the clusters based on the observed labels. We reexamine instance-specific lower bounds on the expected number of misclassified items, which must be satisfied by any clustering algorithm. We propose Instance-Adaptive Clustering (IAC), the first algorithm that matches these lower bounds in expectation. IAC combines a one-time spectral clustering method with an iterative likelihood-based cluster assignment refinement procedure. This technique relies on the instance-specific lower bound and does not require any model parameters, including the number of clusters. IAC retains an overall computational complexity of $\mathcal{O}(n\,\text{polylog}(n))$. We demonstrate the efficacy of our approach through numerical experiments.
Kaito Ariu · Se-Young Yun · Alexandre Proutiere