Timezone: »
Learning from human preferences or preferencebased learning has been critical to major advancements in AI and machine learning. Since human beings are naturally more reliable at providing feedback on a relative scale compared to numerical values, collecting preference feedback is more budgetfriendly and involves less bias. The broad objective of this workshop is twofold:1) Bring together different communities where preferencebased learning has played a major role. This includes dueling bandits, multiagent games, econometrics, social choice theory, reinforcement learning, optimization, robotics and many more, for which we aim to create a suitable forum to exchange techniques, ideas, learn from each other and potentially create new and innovative research questions. 2) Connect theory to practice by identifying realworld systems which can benefit from incorporating preference feedback, such as marketing, revenue management, search engine optimization, recommender systems, healthcare, language modeling, interactive chatbots, text summarization, robotics, and so on.We will consider our workshop a success if it inspires researchers to embark on novel insights in the general area of preferencebased learning: Bringing attention from different communities to foster dissemination, crossfertilization and discussion at scale. Especially, building bridges between experimental researchers and theorists towards developing better models and practical algorithms, and encouraging participants to propose, sketch, and discuss new starting points, questions or applications.
Fri 12:00 p.m.  12:05 p.m.

Opening Remarks
SlidesLive Video » 
🔗 
Fri 12:05 p.m.  12:35 p.m.

MNLBandit: Sequential Learning Approach to Assortment Selection
(
Invited Talk
)
SlidesLive Video » 
Vineet Goyal 🔗 
Fri 12:35 p.m.  1:05 p.m.

Aligning Robots with Human Preferences
(
Invited Talk
)
SlidesLive Video » Aligning robot objectives with human preferences is a key challenge in robot learning. In this talk, I will start with discussing how active learning of human preferences can effectively query humans with the most informative questions to learn their preference reward functions. I will discuss some of the limitations of prior work, and how approaches such as fewshot learning can be integrated with active preference based learning for the goal of reducing the number of queries to a human expert and allowing for truly bringing in humans in the loop of learning neural reward functions. I will then talk about how we could go beyond active learning from a single human, and tap into large language models (LLMs) as another source of information to capture human preferences that are hard to specify. I will discuss how LLMs can be queried within a reinforcement learning loop and help with reward design. Finally I will discuss how the robot can also provide useful information to the human and be more transparent about its learning process. We demonstrate how the robot’s transparent behavior would guide the human to provide compatible demonstrations that are more useful and informative for learning. 
Dorsa Sadigh 🔗 
Fri 1:05 p.m.  1:50 p.m.

1st Poster Session
(
Poster Session
)

🔗 
Fri 1:50 p.m.  2:20 p.m.

Learning from Pairwise Preferences: From Search Rankings to ChatBots
(
Invited Talk
)
SlidesLive Video » 
Thorsten Joachims 🔗 
Fri 2:20 p.m.  2:50 p.m.

Eliciting Human Judgments for Moral Artificial Intelligence
(
Invited Talk
)
SlidesLive Video » 
Vincent Conitzer 🔗 
Fri 2:50 p.m.  3:05 p.m.

Pretrained deep models outperform GBDTs in LearningToRank under label scarcity
(
Oral
)
link »
SlidesLive Video » While deep learning (DL) models are stateoftheart in text and image domains, they have not yet consistently outperformed Gradient Boosted Decision Trees (GBDTs) on tabular LearningToRank (LTR) problems. Most of the recent performance gains attained by DL models in text and image tasks have used unsupervised pretraining, which exploits orders of magnitude more unlabeled data than labeled data. To the best of our knowledge, unsupervised pretraining has not been applied to the LTR problem, which often produces vast amounts of unlabeled data.In this work, we study whether unsupervised pretraining can improve LTR performance over GBDTs and other nonpretrained models. Using simple design choicesincluding SimCLRRank, our rankingspecific modification of SimCLR (an unsupervised pretraining method for images)we produce pretrained deep learning models that soundly outperform GBDTs (and other nonpretrained models) in the case where labeled data is vastly outnumbered by unlabeled data. We also show that pretrained models also often achieve significantly better robustness than nonpretrained models (GBDTs or DL models) in ranking outlier data. 
Charlie Hou · Kiran Thekumparampil · Michael Shavlovsky · Giulia Fanti · Yesh Dattatreya · Sujay Sanghavi 🔗 
Fri 3:05 p.m.  3:20 p.m.

ZerothOrder Optimization Meets Human Feedback: Provable Learning via Ranking Oracles
(
Oral
)
link »
SlidesLive Video » In this study, we delve into an emerging optimization challenge involving a blackbox objective function that can only be gauged via a ranking oracle—a situation frequently encountered in realworld scenarios, especially when the function is evaluated by human judges. A prominent instance of such a situation is Reinforcement Learning with Human Feedback (RLHF), an approach recently employed to enhance the performance of Large Language Models (LLMs) using human guidance \citep{ouyang2022training,liu2023languages,chatgpt,bai2022training}. We introduce ZORankSGD, an innovative zerothorder optimization algorithm designed to tackle this optimization problem, accompanied by theoretical assurances. Our algorithm utilizes a novel rankbased random estimator to determine the descent direction and guarantees convergence to a stationary point. We demonstrate the effectiveness of ZORankSGD in a novel application: improving the quality of images generated by a diffusion generative model with human ranking feedback. Throughout experiments, we found that ZORankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback. Overall, our work advances the field of zerothorder optimization by addressing the problem of optimizing functions with only ranking feedback, and offers a new and effective approach for aligning Artificial Intelligence (AI) with human intentions. 
Zhiwei Tang · Dmitry Rybin · TsungHui Chang 🔗 
Fri 3:20 p.m.  4:15 p.m.

2nd Poster Session
(
Poster session
)

🔗 
Fri 4:15 p.m.  4:45 p.m.

Vignettes on PairwiseFeedback Mechanisms for Learning with Uncertain Preferences
(
Invited Talk
)
SlidesLive Video » 
Sanmi Koyejo 🔗 
Fri 4:45 p.m.  5:15 p.m.

Efficient Optimization with Many Objectives
(
Invited Talk
)
SlidesLive Video » 
Eytan Bakshy 🔗 
Fri 5:15 p.m.  5:30 p.m.

Preference Proxies: Evaluating Large Language Models in capturing Human Preferences in HumanAI Tasks
(
Oral
)
link »
SlidesLive Video » In this work, we investigate the potential of Large Language Models (LLMs) to serve as effective human proxies by capturing human preferences in the context of collaboration with AI agents. Focusing on two key aspects of human preferences  explicability and subtask specification in team settings  we explore LLMs' ability to not only model mental states but also understand human reasoning processes. By developing scenarios where optimal AI performance relies on modeling human mental states and reasoning, our investigation involving two different preference types and a user study (with 17 participants) contributes valuable insights into the suitability of LLMs as ``Preference Proxies" in various humanAI applications, paving the way for future research on the integration of AI agents with human users in HumanAware AI tasks. 
Mudit Verma · Siddhant Bhambri · Subbarao Kambhampati 🔗 
Fri 5:30 p.m.  5:45 p.m.

Learning Optimal Advantage from Preferences and Mistaking it for Reward
(
Oral
)
link »
SlidesLive Video » Most recent work that involves learning reward functions from human preferences over pairs of trajectory segmentsas used in reinforcement learning from human feedback (RLHF), including for ChatGPT and many contemporary language modelsare built with the assumption that such human preferences are generated based only upon the reward accrued within those segments, which we call their partial return.But if this assumption is false because people base their preferences on information other than partial return, then what type of function is their algorithm learning from preferences? We argue that this function is better thought of as an approximation of the optimal advantage function, not a reward function as previously believed. 
William Knox · Stephane HatgisKessell · Sigurdur Adalgeirsson · Serena Booth · Anca Dragan · Peter Stone · Scott Niekum 🔗 
Fri 5:45 p.m.  6:30 p.m.

3rd Poster Session
(
Poster session
)

🔗 
Fri 6:30 p.m.  7:00 p.m.

Dueling Bandits for Online Preference Learning
(
Invited Talk
)
SlidesLive Video » 
Yisong Yue 🔗 
Fri 7:00 p.m.  7:30 p.m.

Is RLHF More Difficult than Standard RL?
(
Invited Talk
)
SlidesLive Video » 
Chi Jin 🔗 
Fri 7:30 p.m.  7:45 p.m.

Principled Reinforcement Learning with Human Feedback from Pairwise or $K$wise Comparisons
(
Oral
)
link »
SlidesLive Video »
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the BradleyTerryLuce (BTL) model and the PlackettLuce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound for max entropy IRL.

Banghua Zhu · Michael Jordan · Jiantao Jiao 🔗 
Fri 7:45 p.m.  8:00 p.m.

How to Query Human Feedback Efficiently in RL?
(
Oral
)
link »
SlidesLive Video » Reinforcement Learning with Human Feedback (RLHF) is a paradigm in which an RL agent learns to optimize a task using pairwise preferencebased feedback over trajectories, rather than explicit reward signals. While RLHF has demonstrated practical success in finetuning language models, existing empirical work does not address the challenge of how to efficiently sample trajectory pairs for querying human feedback. In this study, we propose an efficient sampling approach to acquiring exploratory trajectories that enable accurate learning of hidden reward functions before collecting any human feedback. Theoretical analysis demonstrates that our algorithm requires less human feedback for learning the optimal policy under preferencebased models with linear parameterization and unknown transitions, compared to the existing literature. Specifically, our framework can incorporate linear and lowrank MDPs with efficient sample complexity. Additionally, we investigate RLHF with actionbased comparison feedback and introduce an efficient querying algorithm tailored to this scenario. 
Wenhao Zhan · Masatoshi Uehara · Wen Sun · Jason Lee 🔗 


MultiObjective Agency Requires NonMarkovian Rewards
(
Poster
)
link »
As the capabilities of artificial agents improve, they are being increasingly deployed to service multiple diverse objectives and stakeholders. However, the composition of these objectives is often performed ad hoc, with no clear justification. This paper takes a normative approach to multiobjective agency: from a set of intuitively appealing axioms, we show that Markovian aggregation of Markovian reward functions is not possible when the time preference (discount factor) for each objective may vary. It follows that optimal multiobjective agents must admit rewards that are nonMarkovian (history dependent) with respect to the individual objectives. To this end, we propose a practical nonMarkovian aggregation scheme that overcomes the impossibility with only one additional parameter for each objective. Our work offers new insights into sequential, multiobjective agency and intertemporal choice, and has practical implications for the design of AI systems deployed to serve multiple generations of principals with varying time preference. 
Silviu Pitis 🔗 


Failure Modes of Learning Reward Models for LLMs and other Sequence Models
(
Poster
)
link »
To align large language models (LLMs) and other sequencebased models with human values, we typically assume that human preferences can be well represented using a "reward model". We infer the parameters of this reward model from data, and then train our models to maximize reward. Effective alignment with this approach relies on a strong reward model, and reward modeling becomes increasingly important as the dominion of deployed models grows. Yet in practice, we often assume the existence of a particular reward model, without regard to its potential shortcomings. In this preliminary work, I survey several failure modes of learned reward models, which may be organized into three broad categories: model misspecification, ambiguous preferences, and reward misgeneralization. Several avenues for future work are identified. It is likely that I have missed several points and related works; to that end, I greatly appreciate your correspondence. 
Silviu Pitis 🔗 


VideoGuided Skill Discovery
(
Poster
)
link »
We study how embodied agents can use passive data, such as videos, to guide the discovery of useful and diverse skills. Existing datasets have the potential to be an abundant and rich source of examples for robot learning, revealing not only what tasks to do, but also how to achieve them. Without structural priors, existing approaches to skill discovery are often underspecified and ineffective in realworld, highDoF settings. Our approach uses the temporal information in videos to learn structured representations of the world that can then be used to create shaped rewards for efficiently learning from openended play and finetuning to target tasks. We demonstrate the ability to effectively learn skills by leveraging actionfree video data in a kitchen manipulation setting and on synthetic control tasks. 
Manan Tomar · Dibya Ghosh · Vivek Myers · Anca Dragan · Matthew Taylor · Philip Bachman · Sergey Levine 🔗 


Learning from Pairwise Comparisons Under Preference Reversals
(
Poster
)
link »
We consider the problem of learning to rank from pairwise comparisons in the presence of contextual preference reversals. Preference reversal is a phenomenon well studied in the social psychology and cognitive science literature where users are known to reverse their preference over a pair of alternatives when a carefully chosen third alternative is also presented in the list from which they are required to make a choice. This pertinent effect has been largely ignored in standard representation learning models for pairwise comparisons. In this work, we propose a flexible pairwise comparison model capable of modeling the preference reversal effect. We show that the model is rich enough to capture intransitive preference relations that arise due to reversals. We develop a coupled interpretable neural network based algorithm that learns embeddings for the items from pairwise comparisons. Our network is interpretable as one part of the network learns the standard transitive score based BradleyTerryLuce (BTL) Model while the other part explicitly learns the preference reversal effect. We perform experiments to show the efficacy of the proposed network on synthetic datasets against a standard spectral ranking based algorithm and a standard deep network in terms of prediction accuracy on a heldout dataset and the ability of the model to capture intransitive relationships. 
Abdul Bakey Mir · Arun Rajkumar 🔗 


Randomized Smoothing (almost) in Real Time?
(
Poster
)
link »
Certifying the robustness of Deep Neural Networks (DNNs) is very important in safetycritical domains. Randomized Smoothing (RS) has been recently proposed as a scalable, modelagnostic method for robustness verification, which has achieved excellent results and has been extended for a large variety of adversarial perturbation scenarios. However, a hidden cost in RS is during interference, since it requires passing \emph{tensofthousands} perturbed samples through the DNN in order to perform the verification. In this work, we try to address this challenge, and explore what it would take to perform RS much faster, perhaps even in realtime, and what happens as we decrease the number of samples by orders of magnitude. Surprisingly, we find that \emph{the performance reduction in terms of average certified radius is not too large, even if we decrease the number of samples by two orders of magnitude, or more}. This could possibly pave the way even for realtime robustness certification, under suitable settings. We perform a detailed analysis, both theoretically and experimentally, and show promising results on the standard CIFAR10 and ImageNet datasets. 
Emmanouil Seferis 🔗 


Exploiting Action Distances for Reward Learning from Human Preferences
(
Poster
)
link »
Preferencebased Reinforcement Learning (PbRL) with binary preference feedbacks over trajectory pairs has proved to be quite effective in learning complex preferences of a human in the loop in domains with high dimensional state spaces and action spaces. While the human preference is primarily inferred from the feedback provided, we propose that, in situations where the human preferences are goaloriented, the policy being learned (jointly with the reward model) during training can also provide valuable learning signal about the probable goal based on the human preference. To utilize this information, we introduce an action distance measure based on the policy and use it as an auxiliary prediction task for reward learning. This measure not only provides insight into the transition dynamics of the environment but also informs about the reachability of states under the policy by giving a distance to goal measure. We choose six tasks with goaloriented preferences in the MetaWorld domains to evaluate the performance and sample efficiency of our approach. We show that our approach outperforms methods leveraging auxiliary tasks of learning environment dynamics or a nontemporal distance measure adapted by PbRL baselines. Additionally, we show that action distance measure can also accelerate policy learning which is reaffirmed by our experimental results. 
Mudit Verma · Siddhant Bhambri · Subbarao Kambhampati 🔗 


Reward Collapse in Aligning Large Language Models: A PromptAware Approach to Preference Rankings
(
Poster
)
link »
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT4 are in part unleashed by aligning them with reward models that are trained on human preferences represented as rankings of responses to prompts. In this paper, we document the phenomenon of $\textit{reward collapse}$, an empirical observation where the prevailing rankingbased approach results in an $\textit{identical}$ reward distribution for diverse prompts during the terminal phase of training. This outcome is undesirable as openended prompts like ``write a short story about your best friend'' should yield a continuous range of rewards for their completions, while specific prompts like ``what is the capital city of New Zealand'' should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the rankingbased objective function to incorporate promptrelated information during optimization. This insight allows us to derive closedform expressions for the reward distribution associated with a set of utility functions in an asymptotic setting. To overcome reward collapse, we introduce a promptaware optimization scheme that provably admits a promptdependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed promptaware utility functions significantly alleviate reward collapse during the training of reward models.

Ziang Song · Tianle Cai · Jason Lee · Weijie Su 🔗 


Direct Preference Optimization: Your Language Model is Secretly a Reward Model
(
Poster
)
link »
While largescale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and finetune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then finetuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during finetuning, or performing significant hyperparameter tuning. Our experiments show that DPO can effectively finetune LMs with human preferences as well or better than existing algorithms. Notably, finetuning with DPO matches or exceeds RLHF's ability to control sentiment of generations and improve response quality in summarization, while being substantially simpler to implement and train. 
Rafael Rafailov · Archit Sharma · Eric Mitchell · Stefano Ermon · Christopher Manning · Chelsea Finn 🔗 


Ranking with Abstention
(
Poster
)
link »
We introduce a novel framework of *ranking with abstention*, where the learner can abstain from making prediction at some limited cost $c$. We present a extensive theoretical analysis of this framework including a series of *$H$consistency bounds* for both the family of linear functions and that of neural networks with one hiddenlayer. These theoretical guarantees are the stateoftheart consistency guarantees in the literature, which are upper bounds on the target loss estimation error of a predictor in a hypothesis set $H$, expressed in terms of the surrogate loss estimation error of that predictor. We further argue that our proposed abstention methods are important when using common equicontinuous hypothesis sets in practice. We report the results of experiments illustrating the effectiveness of ranking with abstention.

Anqi Mao · Mehryar Mohri · Yutao Zhong 🔗 


Learning Higher Order Skills that Efficiently Compose
(
Poster
)
link »
Hierarchical reinforcement learning allows an agent to effectively solve complex tasks by leveraging the compositional structures of tasks and executing a sequence of skills. However, our examination shows that prior work focuses on learning individual skills without considering how to efficiently combine them, which can lead to suboptimal performance.To address this problem, we propose a novel framework, called secondorder skills (SOS), for learning skills to facilitate the efficient execution of skills in sequence. Specifically, second order skills (which can be generalized to higher orders) aim to learn skills from an extended perspective that takes into account the next skill required to accomplish a task.We theoretically demonstrate that our method guarantees more efficient performance in the downstream task compared to previous approaches that do not consider secondorder skills. Also, our empirical experiments show that learning secondorder skills results in improved learning performance compared to stateoftheart in baselines across diverse benchmark domains. 
Anthony Liu · Dong Ki Kim · Sungryull Sohn · Honglak Lee 🔗 


DIPRL: DemonstrationInferred Preference Learning in Minecraft
(
Poster
)
link »
In machine learning for sequential decisionmaking, an algorithmic agent learns to interact with an environment while receiving feedback in the form of a reward signal. However, in many unstructured realworld settings, such a reward signal is unknown and humans cannot reliably craft a reward signal that correctly captures desired behavior. To solve tasks in such unstructured and openended environments, we present DemonstrationInferred Preference Reinforcement Learning (DIPRL), an algorithm that leverages human demonstrations in three distinct ways, including training an autoencoder, seeding reinforcement learning (RL) training batches with demonstration data, and inferring preferences over behaviors to learn a reward function to guide RL. We evaluate DIPRL in a treechopping task in Minecraft. Results suggest that the method can guide an RL agent to learn a reward function that reflects human preferences and that DIPRL performs competitively relative to baselines. DIPRL is inspired by our previous work on combining demonstrations and pairwise preferences in Minecraft, which was awarded a research prize at the 2022 NeurIPS MineRL BASALT competition, Learning from Human Feedback in Minecraft. Example trajectory rollouts of DIPRL and baselines are located at https://sites.google.com/view/diprl. 
Ellen Novoseller · Vinicius G. Goecks · David Watkins · Josh Miller · Nicholas Waytowich 🔗 


Differentially Private Reward Estimation from Preference Based Feedback
(
Poster
)
link »
Preferencebased reinforcement learning (RL) has gained attention as a promising approach to align learning algorithms with human interests in various domains. Instead of relying on numerical rewards, preferencebased RL uses feedback from human labelers in the form of pairwise or $K$wise comparisons between actions. In this paper, we focus on reward learning in preferencebased RL and address the issue of estimating unknown parameters while protecting privacy. We propose two estimators based on the Randomized Response strategy that ensure label differential privacy. The first estimator utilizes maximum likelihood estimation (MLE), while the second estimator employs stochastic gradient descent (SGD). We demonstrate that both estimators achieve an estimation error of $\widetilde O(1/\sqrt{n})$ with $n$ number of samples. The additional cost of ensuring privacy for human labelers is proportional to $\frac{e^\epsilon +1 }{e^\epsilon 1}$ in the best case, where $\epsilon>0$ is the privacy

Sayak Ray Chowdhury · Xingyu Zhou 🔗 


Intention is what you need to estimate: Attentiondriven prediction of goal pose in a humancentric telemanipulation of a robotic hand
(
Poster
)
link »
This work entails remote telemanipulation of certain objects using Dexmo Haptic glove (DHG) and Allegro Robotic Hand (ARH). We introduce an estimation mechanism to quantify the expected goal pose of fingers of the human user, wearing the DHG, as its intent, defined in terms of the expected rotation angle of the object (about the viewing plane) that is held between the endeffectors of ARH. A significant amount of delay is observed to generate this intent due to communication and control latencies when the robot is remotely controlled. Hence, an attention based mechanism is leveraged to model the trajectory of estimated intent and predict its estimate for a lookahead of 'm' time units from the current n^{th} estimated sample to compensate for the delays. We evaluate the performances of the estimation mechanism, and the attention mechanism on the stated robotic setup in a realwork networking scenario against some benchmark methodologies. The effect of varying lookahead is analysed against the accuracy of estimation/prediction of the intent. The testing MSE achieved in prediction of the human intent (utilizing attention model) is reported to be 0.00047 for m=1, which characterizes as ~3842 times lesser in comparison to our previous work (utilizing LSTM). 
MUNEEB AHMED · Rajesh Kumar · Arzad Kherani · Brejesh Lall 🔗 


Representation Learning in Lowrank Slatebased Recommender Systems
(
Poster
)
link »
Reinforcement learning (RL) in recommendation systems offers the potential to optimize recommendations for longterm user engagement. However, the environment often involves large state and action spaces, which makes it hard to efficiently learn and do efficient exploration. In this work, we propose a sampleefficient representation learning algorithm, using the standard slate recommendation setup, to treat this as an online RL problem with lowrank Markov decision processes (MDPs). We also construct the recommender simulation environment with the proposed setup and sampling method. 
Yijia Dai · Wen Sun 🔗 


Borda Regret Minimization for Generalized Linear Dueling Bandits
(
Poster
)
link »
Dueling bandits are widely used to model preferential feedback prevalent in many applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propose a rich class of generalized linear dueling bandit models, which cover many existing models. We first prove a regret lower bound of order $\Omega(d^{2/3} T^{2/3})$ for the Borda regret minimization problem, where $d$ is the dimension of contextual vectors and $T$ is the time horizon. To attain this lower bound, we propose an explorethencommit type algorithm for the stochastic setting, which has a nearly matching regret upper bound $\tilde{O}(d^{2/3} T^{2/3})$. We also propose an EXP3type algorithm for the adversarial setting, where the underlying model parameter can change at each round. Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal. Empirical evaluations on both synthetic data and a simulated realworld environment are conducted to corroborate our theoretical analysis.

Yue Wu · Tao Jin · Qiwei Di · Hao Lou · Farzad Farnoud · Quanquan Gu 🔗 


Learning Populations of Preferences via Pairwise Comparison Queries
(
Poster
)
link »
Ideal point based preference learning using pairwise comparisons of type "Do you prefer $a$ or $b$?" has emerged as a powerful tool for understanding how we make preferences which plays a key role in many areas. Existing preference learning approaches assume homogeneity and focus on learning preference on average over the population or require a large number of queries per individual to localize individual preferences. However, in practical scenarios with heterogeneous preferences and limited availability of responses, these approaches are impractical. Therefore, we introduce the problem of learning the distribution of preferences over a population via pairwise comparisons using only one response per individual. In this scenario, learning each individual's preference is impossible. Hence the question of interest is: what can we learn about the distribution of preferences over the population? Due to binary answers from comparison queries, we focus on learning the mass of the underlying distribution in the regions (polytopes) created by the intersection of bisecting hyperplanes between queried pairs of points. We investigate this fundamental question in both 1D and higher dimensional settings with noiseless response to comparison queries. We show that the problem is identifiable in 1D setting and provide recovery guarantees. We also show that the problem is not identifiable for higher dimensional settings. We propose using a regularized recovery for higher dimensional settings and provide guarantees on the total variation distance between the true mass in each of the regions and the distribution learned via regularized constrained optimization problem. We validate our findings through simulations and experiments on real datasets. We also introduce a new dataset for this task collected on a real crowdsourcing platform.

Gokcan Tatli · Yi Chen · Ramya Vinayak 🔗 


A Ranking Game for Imitation Learning
(
Poster
)
link »
We propose a new framework for imitation learningtreating imitation as a twoplayer rankingbased game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. In imitation learning, nearoptimal expert data can be difficult to obtain, and even in the limit of infinite data cannot imply a total ordering over trajectories as preferences can. On the other hand, learning from preferences alone is challenging as a large number of preferences are required to infer a highdimensional reward function, though preference data is typically much easier to collect than expert demonstrations. The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism to incorporate learning from offline preferences and vice versa. We instantiate the proposed rankinggame framework with a novel ranking loss giving an algorithm that can simultaneously learn from expert demonstrations and preferences, gaining the advantages of both modalities. Our experiments show that the proposed method achieves stateoftheart sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. 
Harshit Sikchi · Akanksha Saran · Wonjoon Goo · Scott Niekum 🔗 


AdaptiveRec: Adaptively Construct Pairs for Contrastive Learning in Sequential Recommendation
(
Poster
)
link »
This paper presents a solution to the challenges faced by contrastive learning in sequential recommendation systems. In particular, it addresses the issue of false negative, which limits the effectiveness of recommendation algorithms. By introducing an advanced approach to contrastive learning, the proposed method improves the quality of item embeddings and mitigates the problem of falsely categorizing similar instances as dissimilar. Experimental results demonstrate performance enhancements compared to existing systems. The flexibility and applicability of the proposed approach across various recommendation scenarios further highlight its value in enhancing sequential recommendation systems. 
Jae Heyoung Jeon · Jung Hyun Ryu · Jewoong Cho · Myungjoo Kang 🔗 


Perceptual adjustment queries: An inverted measurement paradigm for lowrank metric learning
(
Poster
)
link »
We introduce a new type of informative and yet cognitively lightweight query mechanism for collecting human feedback, called the perceptual adjustment query (PAQ). The PAQ combines advantages from both ordinal and cardinal queries. We showcase the PAQ mechanism by collecting observations on a metric space involving an unknown Mahalanobis distance, and consider the problem of learning this metric from PAQ measurements. This gives rise to a type of high dimensional, lowrank matrix estimation problem under a new measurement scheme to which standard matrix estimators cannot be applied. Consequently, we develop a twostage estimator for metric learning from PAQs, and provide sample complexity guarantees for this estimator. We demonstrate the performance along with various properties of the estimator by extensive numerical simulations. 
Austin Xu · Andrew McRae · Jingyan Wang · Mark Davenport · Ashwin Pananjady 🔗 


Ratingbased Reinforcement Learning
(
Poster
)
link »
This paper develops a novel ratingbased reinforcement learning approach that uses human ratings to obtain human guidance in reinforcement learning. Different from the existing preferencebased and rankingbased reinforcement learning paradigms, based on human relative preferences over sample pairs, the proposed ratingbased reinforcement learning approach is based on human evaluation of individual trajectories without relative comparisons between sample pairs. The ratingbased reinforcement learning approach builds on a new prediction model for human ratings and a novel multiclass loss function. We conduct several experiment studies based on synthetic ratings and real human ratings to evaluate the effectiveness and benefits of the new ratingbased reinforcement learning approach. 
Devin White · Mingkang Wu · Ellen Novoseller · Vernon Lawhern · Nicholas Waytowich · Yongcan Cao 🔗 


HIPRL: Hallucinated Inputs for Preferencebased Reinforcement Learning in Continuous Domains
(
Poster
)
link »
Preferencebased Reinforcement Learning (PbRL) enables agents to learn policies based on preferences between trajectories rather than explicit reward functions. Previous approaches to PbRL are either experimental and successfully used in realworld applications but lack theoretical understanding, or they have strong theoretical guarantees but only for tabular settings.In this work, we propose a novel practical PbRL algorithm in the continuous domain called Hallucinated Inputs Preferencebased RL (HIPRL) which filled the gap between theory and practice. HIPRL parametrizes the set of transition models and uses hallucinated inputs to facilitate optimistic exploration in continuous stateaction spaces by controlling the epistemic uncertainty. We construct regret bounds for HIPRL and show that they are sublinear for Gaussian Process dynamic and reward models. Moreover, we experimentally demonstrate the effectiveness of HIPRL. 
Chen Bo Calvin Zhang · Giorgia Ramponi 🔗 


Fairness in Preferencebased Reinforcement Learning
(
Poster
)
link »
In this paper, we address the issue of fairness in preferencebased reinforcement learning (PbRL) in the presence of multiple objectives. The main objective is to design control policies that can optimize multiple objectives while treating each objective fairly. Toward this objective, we design a new fairnessinduced preferencebased reinforcement learning or FPbRL. The main idea of FPbRL is to learn vector reward functions associated with multiple objectives via new \textit{welfarebased} preferences rather than \textit{rewardbased} preference in PbRL, coupled with policy learning via maximizing a generalized Gini welfare function. Finally, we provide experiment studies on three different environments to show that the proposed FPbRL approach can achieve both efficiency and equity for learning effective and fair policies. 
Umer Siddique · Abhinav Sinha · Yongcan Cao 🔗 


Optimal Scalarizations for Sublinear Hypervolume Regret
(
Poster
)
link »
Scalarization is a general technique that can be deployed in any multiobjective setting to reduce multiple objectives into one, such as recently in RLHF for training reward models that align human preferences. Yet some have dismissed this classical approach because linear scalarizations are known to miss concave regions of the Pareto frontier. To that end, we aim to find simple nonlinear scalarizations that can explore a diverse set of $k$ objectives on the Pareto frontier, as measured by the dominated hypervolume. We show that hypervolume scalarizations with uniformly random weights are surprisingly optimal for provably minimizing the hypervolume regret, achieving an optimal sublinear regret bound of $O(T^{1/k})$, with matching lower bounds that preclude any algorithm from doing better asymptotically. As a theoretical case study, we consider the multiobjective stochastic linear bandits problem and demonstrate that by exploiting the sublinear regret bounds of the hypervolume scalarizations, we can derive a novel nonEuclidean analysis that produces improved hypervolume regret bounds of $\tilde{O}( d T^{1/2} + T^{1/k})$. We support our theory with strong empirical performance of using simple hypervolume scalarizations that consistently outperforms both the linear and Chebyshev scalarizations, as well as standard multiobjective algorithms in bayesian optimization, such as EHVI.

Richard Zhang 🔗 


FisherWeighted Merge of Contrastive Learning Models in Sequential Recommendation
(
Poster
)
link »
Along with the exponential growth of online platforms and services, recommendation systems have become essential for identifying relevant items based on user preferences. The domain of sequential recommendation aims to capture evolving user preferences over time. To address dynamic preference, various contrastive learning methods have been proposed to target data sparsity, a challenge in recommendation systems due to the limited useritem interactions. In this paper, we are the first to apply the FisherMerging method to Sequential Recommendation, addressing and resolving practical challenges associated with it. This approach ensures robust finetuning by merging the parameters of multiple models, resulting in improved overall performance. Through extensive experiments, we demonstrate the effectiveness of our proposed methods, highlighting their potential to advance the stateoftheart in sequential learning and recommendation systems. 
Jung Hyun Ryu · Jae Heyoung Jeon · Jewoong Cho · Myungjoo Kang 🔗 


Thomas: Learning to Explore Human Preference via Probabilistic Reward Model
(
Poster
)
link »
Recent breakthroughs in large language models and multimodal models underscore the impressive strides deep learning has made in tackling sophisticated tasks previously deemed achievable solely by humans. In particular, discerning human thoughts or interests via communication and feedback is garnering attention for its potential to enable machines to provide insightful responses or recommendations. Nonetheless, despite progressive developments, preference learning from human feedback is hindered by poor sample complexity, as it primarily employs preferred responses for tuning, consequently failing to holistically capture user preferences. Moreover, it is imperative to ensure diversity in the responses generated, as this diversity is instrumental in enabling users to ascertain their genuine preferences, which in turn, is conducive to the finetuning of the response generation model. In this study, we introduce a novel method known as Thomas, that utilizes active learning, Bayesian neural networks for capturing user preferences, and Thompson sampling as a decisionmaking strategy. This synergy ensures alignment of generated responses with user preferences, while preserving diversity, thus expediting the learning process. Experimental evaluations in synthetic environments affirm the proficiency of our method in swiftly adapting to user preferences and generating increasingly favored responses. 
Sang Truong · Duc Nguyen · Tho Quan · Sanmi Koyejo 🔗 


TwoSided Bandit Learning in FullyDecentralized Matching Markets
(
Poster
)
link »
Online learning in a decentralized twosided matching markets, where the demandside (players) compete to match with the supplyside (arms), has received substantial interest because it abstracts out the complex interactions in matching platforms (e.g. UpWork, TaskRabbit). However, past works \citep{liu2020competing,liu2021bandit,ucbd3,basu2021beyond,SODA} assume that the supplyside arms know their preference ranking of demandside players (onesided learning), and the players aim to learn the preference over arms through successive interactions. Moreover, several structural (and often impractical) assumptions on the problem are usually made for theoretical tractability. For example \cite{liu2020competing,liu2021bandit,SODA} assume that when a player and an arm is matched, the information of the matched pair becomes a common knowledge to all the players whereas \cite{ucbd3,basu2021beyond,ghosh2022decentralized} assume a serial dictatorship (or its variant) model where the preference rankings of the players are uniform across all arms. In this paper, we study the \emph{first} fully decentralized two sided learning, where we do not assume that the preference ranking over players are known to the arms apriori. Furthermore, we do not have any structural assumptions on the problem. We propose a multiphase explorethencommit type algorithm namely Epochbased CAETC (collision avoidance explore then commit) (\texttt{CAETC} in short) for this problem that does not require any communication across agents (players and arms) and hence fully decentralized. We show that the for the initial epoch length of $T_0$ and subsequent epochlengths of $2^{l/\gamma} T_0$ (for the $l$th epoch with $\gamma \in (0,1)$ as an input parameter to the algorithm), \texttt{CAETC} yields a player optimal expected regret of $\mathcal{O}[T_0 \left(\frac{K \log T}{T_0 (\Delta^{(i)})^2}\right)^{1/\gamma} + T_0 (T/T_0)^\gamma]$ for the $i$th player, where $T$ is the learning horizon, $K$ is the number of arms and $\Delta^{(i)}$ is an appropriately defined problem gap. Furthermore, we propose several other baselines for twosided learning for matching markets.

Tejas Pagare · Avishek Ghosh 🔗 


CalQL: Calibrated Offline RL PreTraining for Efficient Online FineTuning
(
Poster
)
link »
A compelling use case of offline reinforcement learning (RL) is to obtain a policy initialization from existing datasets followed by fast online finetuning with limited interaction. However, existing offline RL methods tend to behave poorly during finetuning. In this paper, we devise an approach for learning an effective initialization from offline data that also enables fast online finetuning capabilities. Our approach, calibrated Qlearning (CalQL), accomplishes this by learning a conservative value function initialization that underestimates the value of the learned policy from offline data, while also being calibrated, in the sense that the learned Qvalues are at a reasonable scale. We refer to this property as calibration, and define it formally as providing a lower bound on the true value function of the learned policy and an upper bound on the value of some other (suboptimal) reference policy, which may simply be the behavior policy. We show that offline RL algorithms that learn such calibrated value functions lead to effective online finetuning, enabling us to take the benefits of offline initializations in online finetuning. In practice, CalQL can be implemented on top of the conservative Q learning (CQL) for offline RL within a oneline code change. Empirically, CalQL outperforms stateoftheart methods on 9/11 finetuning benchmark tasks that we study in this paper. 
Mitsuhiko Nakamoto · Yuexiang Zhai · Anikait Singh · Max Sobol Mark · Yi Ma · Chelsea Finn · Aviral Kumar · Sergey Levine 🔗 


Preferential MultiAttribute Bayesian Optimization with Application to Exoskeleton Personalization
(
Poster
)
link »
Preferential Bayesian optimization (PBO) is a framework for optimization of a decisionmaker's (DM's) latent preferences. Existing work in PBO assumes these preferences can be encoded by a single latent utility function, which is then estimated from ordinal preference feedback over design variables. In practice, however, it is often challenging for DMs to provide such feedback reliably, leading to poor performance. This is especially true when multiple conflicting latent attributes govern the DM's preferences. For example, in exoskeleton personalization, users' preferences over gait designs are influenced by stability and walking speed, which can conflict with each other. We posit this is a primary reason why inconsistent preferences are often observed in practice. To address this challenge, we propose a framework for preferential multiattribute Bayesian optimization, where the goal is to help DMs efficiently explore the Pareto front of their preferences over attributes. Within this framework, we propose a Thompson samplingbased strategy to select new queries and show it performs well across three test problems, including a simulated exoskeleton gait personalization task. 
Raul Astudillo · Amy Li · Maegan Tucker · Chu Xin Cheng · Aaron Ames · Yisong Yue 🔗 


Uncoupled and Convergent Learning in TwoPlayer ZeroSum Markov Games
(
Poster
)
link »
We revisit the problem of learning in twoplayer zerosum Markov games, focusing on developing an algorithm that is *uncoupled*, *convergent*, and *rational*, with nonasymptotic convergence rates to Nash equilibrium. We start from the case of stateless matrix game with bandit feedback as a warmup, showing an $\mathcal{O}(t^{\frac{1}{8}})$ lastiterate convergence rate. To the best of our knowledge, this is the first result that obtains finite lastiterate convergence rate given access to only bandit feedback. We extend our result to the case of irreducible Markov games, providing a lastiterate convergence rate of $\mathcal{O}(t^{\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$. Finally, we study Markov games without any assumptions on the dynamics, and show a *path convergence* rate, a new notion of convergence we define, of $\mathcal{O}(t^{\frac{1}{10}})$. Our algorithm removes the synchronization and prior knowledge requirement of [Wei et al, 2021], which pursued the same goals as us for irreducible Markov games. Our algorithm is related to [Chen et al., 2021, Cen et al., 2021] and also builds on the entropy regularization technique. However, we remove their requirement of communications on the entropy values, making our algorithm entirely uncoupled.

Yang Cai · Haipeng Luo · ChenYu Wei · Weiqiang Zheng 🔗 


PredictthenOptimize v/s Probabilistic Approximations: Tackling Uncertainties and Error Propagation
(
Poster
)
link »
Proactive planning is a key necessity for businesses to function efficiently under uncertain andunforeseen circumstances. Planning for the futureinvolves solving optimization problems, whichare often naturally convex or are modeled as convex approximations to facilitate computation. Theprimary source of uncertainties in the real worldthat business are dealing with (eg. demand) cannot be reasonably approximated by deterministicvalues. Hence deterministic convex optimizationapproximation do not not yield reasonable solutions. Classically, one relies on assumptions onthe data generating process (like for eg. that demand is log normal) to formulate as a stochasticoptimization problem. However, in today’s world,such major uncertainties are often best predictedby machine learning methods. In this paper, wepropose a novel method to integrate predictionsfrom machine learning systems and optimizationsteps for a specific context of a resource utilisation problem that faces nonstationary incomingworkload. The proposed solution is robust andshows improved performance against using thetraditional pointpredictions directly in the optimization. The proposed solution can be easilyextended to different kind of machine learningmethods and objective functions. 
Priya Shanmugasundaram · Saurabh Jha · Kumar Muthuraman 🔗 


SampleEfficient Learning of POMDPs with Multiple Observations In Hindsight
(
Poster
)
link »
This paper studies the sampleefficiency of learning in Partially Observable Markov Decision Processes (POMDPs), a challenging problem in reinforcement learning that is known to be exponentially hard in the worstcase. Motivated by realworld settings such as loading in game playing, we propose an enhanced feedback model called ``multiple observations in hindsight'', where after each episode of interaction with the POMDP, the learner may collect multiple additional observations emitted from the encountered latent states, but may not observe the latent states themselves. We show that sampleefficient learning under this feedback model is possible for two new subclasses of POMDPs: \emph{multiobservation revealing POMDPs} and \emph{distinguishable POMDPs}. Both subclasses generalize and substantially relax \emph{revealing POMDPs}a widely studied subclass for which sampleefficient learning is possible under standard trajectory feedback. Notably, distinguishable POMDPs only require the emission distributions from different latent states to be \emph{different} instead of \emph{linearly independent} as required in revealing POMDPs. 
Jiacheng Guo · Minshuo Chen · Huan Wang · Caiming Xiong · Mengdi Wang · Yu Bai 🔗 


Rethinking Incentives in Recommender Systems: Are Monotone Rewards Always Beneficial?
(
Poster
)
link »
The past decade has witnessed the flourishing of a new profession as media content creators, who rely on revenue streams from online content recommendation platforms. The rewarding mechanism employed by these platforms creates a competitive environment among creators which affects their production choices and, consequently, content distribution and system welfare. In this work, we uncover a fundamental limit about a class of widely adopted mechanisms, coined Meritbased Monotone Mechanisms, by showing that they inevitably lead to a constant fraction loss of the welfare. To circumvent this limitation, we introduce Backward Rewarding Mechanisms (BRMs) and show that the competition games resulting from BRM possess a potential game structure, which naturally induces the strategic creators' behavior dynamics to optimize any given welfare metric. In addition, the class of BRM can be parameterized so that it allows the platform to directly optimize welfare within the feasible mechanism space even when the welfare metric is not explicitly defined. 
Fan Yao · Chuanhao Li · Karthik Abinav Sankararaman · Yiming Liao · Yan Zhu · Qifan Wang · Hongning Wang · Haifeng Xu 🔗 


Learning Formal Specifications from Membership and Preference Queries
(
Poster
)
link »
Active learning is a wellstudied approach to learning formal specifications, such as automata. In this work, we extend active specification learning by proposing a novel framework that strategically requests a combination of membership labels and pairwise preferences, a popular alternative to membership labels. The combination of pairwise preferences and membership labels allows for a more flexible approach to active specification learning, which previously relied on membership labels only. We instantiate our framework in two different domains, demonstrating the generality of our approach. Our results suggest that learning from both modalities allows us to robustly and conveniently identify specifications via membership and preferences. 
Ameesh Shah · Marcell VazquezChanlatte · Sebastian Junges · Sanjit Seshia 🔗 


Kernelized Offline Contextual Dueling Bandits
(
Poster
)
link »
Preferencebased feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring the human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that often the agent can choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and introduce the offline contextual dueling bandit setting. We give an upperconfidencebound style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts. 
Viraj Mehta · Ojash Neopane · Vikramjeet Das · Sen Lin · Jeff Schneider · Willie Neiswanger 🔗 


Preference Elicitation for Music Recommendations
(
Poster
)
link »
The cold start problem in recommender systems (RSs) makes the recommendation of highquality content to new users difficult. While preference elicitation (PE) can be used to “onboard” new users, PE in music recommendation presents unique challenges to classic PE methods, including: a vast item (music track) corpus, considerable withinuser preference diversity, multiple consumption modes (or downstream tasks), and a tight query “budget.” We develop a PE framework to address these issues, where the RS elicits user preferences w.r.t. item attributes (e.g., artists) to quickly learn coarsegrained preferences that cover a user’s tastes. We describe heuristic algorithms that dynamically select PE queries, and discuss experimental results of these methods onboarding new users in YouTube Music. 
Ofer Meshi · Jon Feldman · Li Yang · Ben Scheetz · Yanli Cai · Mohammad Hossein Bateni · Corbyn Salisbury · Vikram Aggarwal · Craig Boutilier 🔗 


SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits
(
Poster
)
link »
In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multiarmed bandit environment. Our work is the first work that focuses on such optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the value of the target policy. We then use this formulation to derive the optimal allocation of samples per action during data collection. We then introduce a novel algorithm SPEED (Structured Policy Evaluation Experimental Design) that tracks the optimal design and derive its regret with respect to the optimal design. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy. 
Subhojyoti Mukherjee · Qiaomin Xie · Josiah Hanna · Robert Nowak 🔗 


Augmenting Bayesian Optimization with Preferencebased Expert Feedback
(
Poster
)
link »
Bayesian optimization (BO) is a wellestablished method to optimize blackbox functions whose direct evaluations are costly. In this paper, we tackle the problem of incorporating expert knowledge into BO, with the goal of further accelerating the optimization, which has received little attention so far. We design a multitask learning architecture for this task, with the goal of jointly eliciting the expert knowledge and minimizing the objective function. In particular, this allows for the expert knowledge to be transferred into the BO task. We introduce a specific architecture based on Siamese neural networks to handle the knowledge elicitation from pairwise queries. Experiments on various benchmark functions show that the proposed method significantly speeds up BO even when the expert knowledge is biased. 
Daolang Huang · Louis Filstroff · Petrus Mikkola · Runkai Zheng · Milica Todorovic · Samuel Kaski 🔗 


A Head Start Matters: DynamicCalibrated Representation Alignment and Uniformity for Recommendations
(
Poster
)
link »
The Bayesian personalized ranking (BPR) loss is a commonly used objective in training recommender systems, upon which various auxiliary graphbased selfsupervised contrastive learning tasks are designed for improved model robustness. Previous research has also shown that the unsupervised contrastive loss shapes the learned representations from the perspectives of alignment and uniformity, and representations with lower supervised alignment and/or uniformity loss contribute to better model performance. Despite the progress, no one neither explores how the two representation qualities evolve along the learning trajectory, nor associates the behaviors with the combination of supervised and unsupervised representation alignment and uniformity (RAU). In this work, we first observe that different methods trades of alignment and uniformity to varying degrees, and hypothesize that optimizing over supervised RAU loss alone is not sufficient for an optimal tradeoff. Then, by analyzing how BPR loss relates to the unsupervised contrastive loss where the supervised RAU loss stems from, we migrate the relation to propose our framework which aligns embeddings from both supervised and unsupervised perspectives while promoting user/item embedding uniformity on the hypersphere. Within the framework, we design a 0layer embedding perturbation to the neural network on the useritem bipartite graph for minimal yet sufficient data augmentation, discarding the traditional ones such as edge drop. Extensive experiments on three datasets show that our framework improves model performance and quickly converges to user/item embeddings. 
Zhongyu Ouyang · Shifu Hou · Chunhui Zhang · Chuxu Zhang · Yanfang Ye 🔗 


Robustness of Inverse Reinforcement Learning
(
Poster
)
link »
Reinforcement learning research experienced substantial jumps in its progress after the first achievement on utilizing deep neural networks to approximate the stateaction value function in highdimensional states. While deep reinforcement learning algorithms are currently being employed in many different tasks from industrial control to biomedical applications, the fact that an MDP has to provide a clear reward function limits the tasks that can be achieved via reinforcement learning. In this line of research, some studies proposed to directly learn a policy from observing expert trajectories (i.e. imitation learning), and others proposed to learn a reward function from expert demonstrations (i.e. inverse reinforcement learning). In this paper we will focus on robustness and vulnerabilities of deep imitation learning and deep inverse reinforcement learning policies. Furthermore, we will layout nonrobust features learnt by the deep inverse reinforcement learning policies. We conduct experiments in the Arcade Learning Environment (ALE), and compare the nonrobust features learnt by the deep inverse reinforcement learning algorithms to vanilla trained deep reinforcement learning policies. We hope that our study can provide a basis for the future discussions on the robustness of both deep inverse reinforcement learning and deep reinforcement learning. 
Ezgi Korkmaz 🔗 


Training Diffusion Models with Reinforcement Learning
(
Poster
)
link »
Diffusion models are a class of flexible generative models trained with an approximation to the loglikelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as humanperceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multistep decisionmaking problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative rewardweighted likelihood approaches. Empirically, DDPO is able to adapt texttoimage diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve promptimage alignment using feedback from a visionlanguage model without the need for additional data collection or human annotation. 
Kevin Black · Michael Janner · Yilun Du · Ilya Kostrikov · Sergey Levine 🔗 


Optimistic Thompson Sampling for NoRegret Learning in Unknown Games
(
Poster
)
link »

Yingru Li · Liangqi LIU · Wenqiang Pu · ZhiQuan Luo 🔗 


Extracting Reward Functions from Diffusion Models
(
Poster
)
link »
Diffusion models have achieved remarkable results in image generation, and have similarly been used to learn highperforming policies in sequential decisionmaking tasks. Decisionmaking diffusion models can be trained on lowerquality data, and then be steered with a reward function to generate nearoptimal trajectories.We consider the problem of extracting a reward function by comparing a decisionmaking diffusion model that models lowreward behavior and one that models highreward behavior; a setting related to inverse reinforcement learning. We first define the notion of a relative reward function of two diffusion models and show conditions under which it exists and is unique. We then devise a practical learning algorithm for extracting it by aligning the gradients of a reward function  parametrized by a neural network  to the difference in outputs of both diffusion models.Our method finds correct reward functions in navigation environments, and we demonstrate that steering the base model with the learned reward functions results in significantly increased performance in standard locomotion benchmarks.Finally, we demonstrate that our approach generalizes beyond sequential decisionmaking by learning a rewardlike function from two largescale image generation diffusion models. The extracted reward function successfully assigns lower rewards to harmful images. 
Felipe Nuti · Tim Franzmeyer · Joao Henriques 🔗 


Optimizing Chatbot Fallback Intent Selections with Reinforcement Learning
(
Poster
)
link »
Large language models used in GPT4 and Alexa are limited by their ability to assess the validity of their own answers i.e., to fall back on a clarification intent when needed. Reinforcement learning can be used specifically to address this fallback selection problem, by adapting to semantic pitfalls of a given language model in a given environment. This is demonstrated in a simplified environment where the chatbot learns when best to ask for clarifications. After training it identifies correct intents in $<$ 2 interactions on average in over 99% of dialogues. In multiagent simulations where the user cooperates, the chatbot identifies correct intents in 1.3 interactions on average in 100% of dialogues.

Jeremy Curuksu 🔗 


QueryPolicy Misalignment in PreferenceBased Reinforcement Learning
(
Poster
)
link »
Preferencebased reinforcement learning (PbRL) provides a natural way to align RL agents’ behavior with human desired outcomes, but is often restrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries to maximally improve the overall quality of the reward model, but counterintuitively, we find that this may not necessarily lead to improved performance. To unravel this mystery, we identify a longneglected issue in the query selection schemes of existing PbRL studies: QueryPolicy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of reward model actually may not align with RL agents’ interests, thus offering little help on policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near onpolicy query and a specially designed hybrid experience replay, which together enforce the bidirectional querypolicy alignment. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. We showcase in comprehensive experiments that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing querypolicy misalignment in PbRL tasks. 
Xiao Hu · Jianxiong Li · Xianyuan Zhan · QingShan Jia · YaQin Zhang 🔗 


Distinguishing Feature Model for Learning From Pairwise Comparisons
(
Poster
)
link »
We consider the problem of learning to predict outcomes of unseen pairwise comparisons over a set of items when a small set of pairwise comparisons are available. When the underlying preferences are intransitive in nature, which is common occurrence in real world data sets, this becomes a challenging problem both in terms of modeling and learning. Towards this, we introduce a flexible and natural parametric model for pairwise comparisons that we call the \emph{Distinguishing Feature Model} (DF). Under this model, the items have an unknown but fixed embeddings and the pairwise comparison between a pair of items depends probabilistically on the feature in the embedding that can best distinguish the items. The proposed DF model generalizes the popular transitive BradleyTerryLuce model and with embedding dimension as low as $d = 3$, can capture arbitrarily long cyclic dependencies. Furthermore, we explicitly show the type of preference relations that cannot be modelled under the DF model for $d=3$. On the algorithmic side, we propose a Siamese style neural network architecture which can be used learn to predict well under the DF model while at the same time being interpretable in the sense that the embeddings learnt can be extracted directly from the learnt model. Our experimental results show that the model is either comparable or outperforms standard baselines in both synthetic and real world datasets.

Elisha Parhi · Arun Rajkumar 🔗 


Specifying Behavior Preference with Tiered Reward Functions
(
Poster
)
link »
Reinforcementlearning agents seek to maximize a reward signal through environmental interactions. As humans, our job in the learning process is to express which behaviors are preferable through designing reward functions. In this work, we consider the rewarddesign problem in tasks formulated as reaching desirable states and avoiding undesirable states. To start, we propose a strict partial ordering of the policy space. We prefer policies that reach the good states faster and with higher probability while avoiding the bad states longer. Then, we propose an environmentindependent tiered reward structure and show it is guaranteed to induce policies that are Paretooptimal according to our preference relation. 
Zhiyuan Zhou · Henry Sowerby · Michael L. Littman 🔗 


Who to imitate: Imitating desired behavior from diverse multiagent datasets
(
Poster
)
link »
AI agents are commonly trained with large datasets of unfiltered demonstrations of human behavior.However, not all behaviors are equally safe or desirable.We assume that desired traits for an AI agent can be approximated by a desired value function (DVF), that assigns scores to collective outcomes in the dataset.For example, in a dataset of vehicle interactions, the DVF might refer to the number of occurred incidents.We propose to first assess how well individual agents' behavior is aligned with the DVF, e.g., assessing how likely an agent is to cause incidents, to then only imitate agents with desired behaviors.To identify agents with desired behavior, we propose the concept of an agent's Exchange Value, which quantifies the expected change in collective value when substituting the agent into a random group. This concept is similar to Shapley Values used in Economics, but offers greater flexibility. We further introduce a variance maximization objective to compute Exchange Values from incomplete observations, effectively clustering agents by their unobserved traits. Using both human and simulated datasets, we learn aligned imitation policies that outperform relevant baselines. 
Tim Franzmeyer · Jakob Foerster · Edith Elkind · Phil Torr · Joao Henriques 🔗 


Competing Bandits in NonStationary Matching Markets
(
Poster
)
link »
Understanding complex dynamics of twosided online matching markets, where the demandside agents compete to match with the supplyside (arms), has recently received substantial interest. To that end, in this paper, we introduce the framework of decentralized twosided matching market under non stationary (dynamic) environments. We adhere to the serial dictatorship setting, where the demandside agents have unknown and different preferences over the supplyside (arms), but the arms have fixed and known preference over the agents. We propose and analyze an asynchronous and decentralized learning algorithm, namely NonStationary Competing Bandits (\texttt{NSCB}), where the agents play (restrictive) successive elimination type learning algorithms to learn their preference over the arms. The complexity in understanding such a system stems from the fact that the competing bandits choose their actions in an asynchronous fashion, and the lower ranked agents only get to learn from a set of arms, not \emph{dominated} by the higher ranked agents, which leads to \emph{forced exploration}. With carefully defined complexity parameters, we characterize this \emph{forced exploration} and obtain sublinear (logarithmic) regret of \texttt{NSCB}. Furthermore, we validate our theoretical findings via experiments. 
Avishek Ghosh · Abishek Sankararaman · Kannan Ramchandran · Tara Javidi · Arya Mazumdar 🔗 


Strategic Apple Tasting
(
Poster
)
link »
Algorithmic decisionmaking in highstakes domains often involves assigning decisions to agents with incentives to strategically modify their input to the algorithm. In addition to dealing with incentives, in many domains of interest (e.g. lending and hiring) the decisionmaker only observes feedback regarding their policy for rounds in which they assign a positive decision to the agent; this type of feedback is often referred to as apple tasting (or onesided) feedback. We formalize this setting as an online learning problem with appletasting feedback where a principal makes decisions about a sequence of $T$ agents, each of which is represented by a context that may be strategically modified. Our goal is to achieve sublinear strategic regret, which compares the performance of the principal to that of the best fixed policy in hindsight, if the agents were truthful when revealing their contexts. Our main result is a learning algorithm which incurs $\Tilde{\mathcal{O}}(\sqrt{T})$ strategic regret when the sequence of agents is chosen stochastically. We also give an algorithm capable of handling adversariallychosen agents, albeit at the cost of $\Tilde{\mathcal{O}}(T^{(d+1)/(d+2)})$ strategic regret (where $d$ is the dimension of the context). Our algorithms can be easily adapted to the setting where the principal receives bandit feedbackthis setting generalizes both the linear contextual bandit problem (by considering agents with incentives) and the strategic classification problem (by allowing for partial feedback).

Keegan Harris · Chara Podimata · Steven Wu 🔗 


Strategyproof DecisionMaking in Panel Data Settings and Beyond
(
Poster
)
link »
We consider the classical problem of decisionmaking using panel data, in which a decisionmaker gets noisy, repeated measurements of multiple units (or agents). We consider a setup where there is a preintervention period, when the principal observes the outcomes of each unit, after which the principal uses these observations to assign a treatment to each unit. Unlike this classical setting, we permit the units generating the panel data to be strategic, i.e. units may modify their preintervention outcomes in order to receive a more desirable intervention. The principal's goal is to design a strategyproof intervention policy, i.e. a policy that assigns units to their correct interventions despite their potential strategizing. We first identify a necessary and sufficient condition under which a strategyproof intervention policy exists, and provide a strategyproof mechanism with a simple closed form when one does exist. When there are two interventions, we establish that there always exists a strategyproof mechanism, and provide an algorithm for learning such a mechanism. For three or more interventions, we provide an algorithm for learning a strategyproof mechanism if there exists a sufficiently large gap in the principal's rewards between different interventions. Finally, we empirically evaluate our model using realworld panel data collected from product sales over 18 months. We find that our methods compare favorably to baselines which do not take strategic interactions into consideration, even in the presence of model misspecification. 
Keegan Harris · Anish Agarwal · Chara Podimata · Steven Wu 🔗 


Provable Offline Reinforcement Learning with Human Feedback
(
Poster
)
link »
In this paper, we investigate the problem of offline reinforcement learning with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new singlepolicy concentrability coefficient, which can be upper bounded by the pertrajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where stateactionwise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs. 
Wenhao Zhan · Masatoshi Uehara · Nathan Kallus · Jason Lee · Wen Sun 🔗 


Contextual Bandits and Imitation Learning with PreferenceBased Active Queries
(
Poster
)
link »
We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward. Instead, the learner can actively request the expert at each round to compare two actions and receive noisy preference feedback. The learner's objective is twofold: to minimize regret associated with the executed actions, while simultaneously, minimizing the number of comparison queries made to the expert. In this paper, we assume that the learner has access to a function class that can represent the expert's preference model under appropriate link functions and present an algorithm that leverages an online regression oracle with respect to this function class. For the contextual bandit setting, our algorithm achieves a regret bound that combines the best of both worlds, scaling as $O(\min\\{\sqrt{T}, d/\Delta\\})$, where $T$ represents the number of interactions, $d$ represents the eluder dimension of the function class, and $\Delta$ represents the minimum preference of the optimal action over any suboptimal action under all contexts. Our algorithm does not require the knowledge of $\Delta$, and the obtained regret bound is comparable to what can be achieved in the standard contextual bandits setting where the learner observes reward signals at each round. Additionally, our algorithm makes only $O(\min\\{T, d^2/\Delta^2\\})$ queries to the expert. We then extend our algorithm to the imitation learning setting, where the agent engages with an unknown environment in episodes of length $H$, and provide similar guarantees regarding regret and query complexity. Interestingly, with preferencebased feedback, our imitation learning algorithm can learn a policy outperforming a suboptimal expert, matching the result from interactive imitation learning algorithms [Ross and Bagnell, 2014] that require access to the expert's actions and also reward signals.

Ayush Sekhari · Karthik Sridharan · Wen Sun · Runzhe Wu 🔗 


Inverse Game Theory for Stackelberg Games: the Blessing of Bounded Rationality
(
Poster
)
link »
Optimizing strategic decisions (a.k.a. computing equilibrium) is key to the success of many noncooperative multiagent applications. However, in many realworld situations, we may face the exact opposite of this gametheoretic problem  instead of prescribing equilibrium of a given game, we may directly observe the agents' equilibrium behaviors but want to infer the underlying parameters of an unknown game. This research question, also known as inverse game theory, has been studied in multiple recent works in the context of Stackelberg games. Unfortunately, existing works exhibit quite negative results, showing statistical hardness and computational hardness, assuming follower's perfectly rational behaviors. Our work relaxes the perfect rationality agent assumption to the classic quantal response model, a more realistic behavior model of bounded rationality. Interestingly, we show that the smooth property brought by such bounded rationality model actually leads to provably more efficient learning of the follower utility parameters in general Stackelberg games. Systematic empirical experiments on synthesized games confirm our theoretical results and further suggest its robustness beyond the strict quantal response model. 
Jibang Wu · Weiran Shen · Fei Fang · Haifeng Xu 🔗 


Rewarded soups: towards Paretooptimal alignment by interpolating weights finetuned on diverse rewards
(
Poster
)
link »
Foundation models are first pretrained on vast unsupervised datasets and then finetuned on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet the imperfections in the proxy reward may hinder the training and lead to suboptimal results; the diversity of objectives in realworld tasks and human opinions exacerbate the issue. This paper proposes embracing the heterogeneity of diverse rewards by following a multipolicy strategy. Rather than focusing on a single a priori reward, we aim for Paretooptimal generalization across the entire space of preferences. To this end, we propose rewarded soup, first specializing multiple networks independently (one for each proxy reward) and then interpolating their weights linearly. This succeeds empirically because we show that the weights remain linearly connected when finetuned on diverse rewards from a shared pretrained initialization. We demonstrate the effectiveness of our approach for texttotext (summarization, Q&A, helpful assistant, review), textimage (image captioning), and control (locomotion) tasks. We hope to enhance the alignment of deep models, and how they interact with the world in all its diversity. 
Alexandre Rame · Guillaume Couairon · Corentin Dancette · JeanBaptiste Gaya · Mustafa Shukor · Laure Soulier · Matthieu Cord 🔗 


Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism
(
Poster
)
link »
In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where we aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the offpolicy distribution shift. In this paper, we focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DCC, rooted in econometrics and decision theory, is widely used to model a human decisionmaking process with forwardlooking and bounded rationality.We propose a \underline{D}ynamic\underline{C}hoice\underline{P}essimistic\underline{P}olicy\underline{O}ptimization (DCPPO) method. \ The method involves a threestage process: The first step is to estimate the human behavior policy and the stateaction value function via maximum likelihood estimation (MLE); the second step recovers the human reward function via minimizing Bellman mean squared error using the learned value functions; the third step is to plug in the learned reward and invoke pessimistic value iteration for finding a nearoptimal policy. With only singlepolicy coverage (i.e., optimal policy) of the dataset, we prove that the suboptimality of DCPPO \textit{almost} matches the classical pessimistic offline RL algorithm in terms of suboptimality’s dependency on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for offpolicy offline RLHF with dynamic discrete choice model. 
Zihao Li 🔗 
Author Information
Aadirupa Saha (TTIC)
Bio: Aadirupa Saha is currently a visiting faculty at Toyota Technological Institute at Chicago (TTIC). She obtained her PhD from the Department of Computer Science, Indian Institute of Science, Bangalore, advised by Aditya Gopalan and Chiranjib Bhattacharyya. She spent two years at Microsoft Research New York City as a postdoctoral researcher. During her PhD, Aadirupa interned at Microsoft Research, Bangalore, Inria, Paris, and Google AI, Mountain View. Her research interests include Bandits, Reinforcement Learning, Optimization, Learning theory, Algorithms. She has organized various workshops, tutorials and also served as a reviewer in top ML conferences. Research Interests: Machine Learning Theory (specifically Online Learning, Bandits, Reinforcement Learning), Optimization, Game Theory, Algorithms. She is recently interested in exploring ML problems at the intersection of Fairness, Privacy, Game theory and Mechanism design.
Mohammad Ghavamzadeh (Google Research)
Robert BusaFekete (Google Research)
Branislav Kveton (AWS AI Labs)
Viktor Bengs (University of Munich)
More from the Same Authors

2022 : Nonstationary Bandits and MetaLearning with a Small Set of Optimal Arms »
MohammadJavad Azizi · Thang Duong · Yasin AbbasiYadkori · Claire Vernade · Andras Gyorgy · Mohammad Ghavamzadeh 
2023 : Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation »
Thomas Kleine Büning · Aadirupa Saha · Christos Dimitrakakis · Haifeng Xu 
2023 : Active Learning with Crowd Sourcing Improves Information Retrieval »
Zhuotong Chen · Yifei Ma · Branislav Kveton · Anoop Deoras 
2023 Poster: Federated Online and Bandit Convex Optimization »
Kumar Kshitij Patel · Lingxiao Wang · Aadirupa Saha · Nati Srebro 
2023 Poster: Thompson Sampling with Diffusion Generative Prior »
YuGuan Hsieh · Shiva Kasiviswanathan · Branislav Kveton · Patrick Bloebaum 
2023 Poster: Multiplier Bootstrapbased Exploration »
Runzhe Wan · Haoyu Wei · Branislav Kveton · Rui Song 
2023 Poster: MultiTask OffPolicy Learning from Bandit Feedback »
Joey Hong · Branislav Kveton · Manzil Zaheer · Sumeet Katariya · Mohammad Ghavamzadeh 
2023 Poster: On SecondOrder Scoring Rules for Epistemic Uncertainty Quantification »
Viktor Bengs · Eyke Hüllermeier · Willem Waegeman 
2022 Workshop: Complex feedback in online learning »
Rémy Degenne · Pierre Gaillard · Wouter Koolen · Aadirupa Saha 
2022 Poster: Safe Exploration for Efficient Policy Evaluation and Comparison »
Runzhe Wan · Branislav Kveton · Rui Song 
2022 Poster: Deep Hierarchy in Bandits »
Joey Hong · Branislav Kveton · Sumeet Katariya · Manzil Zaheer · Mohammad Ghavamzadeh 
2022 Poster: Versatile Dueling Bandits: Bestofboth World Analyses for Learning from Relative Preferences »
Aadirupa Saha · Pierre Gaillard 
2022 Spotlight: Versatile Dueling Bandits: Bestofboth World Analyses for Learning from Relative Preferences »
Aadirupa Saha · Pierre Gaillard 
2022 Spotlight: Deep Hierarchy in Bandits »
Joey Hong · Branislav Kveton · Sumeet Katariya · Manzil Zaheer · Mohammad Ghavamzadeh 
2022 Spotlight: Safe Exploration for Efficient Policy Evaluation and Comparison »
Runzhe Wan · Branislav Kveton · Rui Song 
2022 Poster: Feature and Parameter Selection in Stochastic Linear Bandits »
Ahmadreza Moradipari · Berkay Turan · Yasin AbbasiYadkori · Mahnoosh Alizadeh · Mohammad Ghavamzadeh 
2022 Spotlight: Feature and Parameter Selection in Stochastic Linear Bandits »
Ahmadreza Moradipari · Berkay Turan · Yasin AbbasiYadkori · Mahnoosh Alizadeh · Mohammad Ghavamzadeh 
2022 Poster: Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models »
Viktor Bengs · Aadirupa Saha · Eyke Hüllermeier 
2022 Poster: Optimal and Efficient Dynamic Regret Algorithms for NonStationary Dueling Bandits »
Aadirupa Saha · Shubham Gupta 
2022 Spotlight: Optimal and Efficient Dynamic Regret Algorithms for NonStationary Dueling Bandits »
Aadirupa Saha · Shubham Gupta 
2022 Spotlight: Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models »
Viktor Bengs · Aadirupa Saha · Eyke Hüllermeier 
2021 Poster: MetaThompson Sampling »
Branislav Kveton · Mikhail Konobeev · Manzil Zaheer · Chihwei Hsu · Martin Mladenov · Craig Boutilier · Csaba Szepesvari 
2021 Spotlight: MetaThompson Sampling »
Branislav Kveton · Mikhail Konobeev · Manzil Zaheer · Chihwei Hsu · Martin Mladenov · Craig Boutilier · Csaba Szepesvari 
2021 Poster: ConfidenceBudget Matching for Sequential Budgeted Learning »
Yonathan Efroni · Nadav Merlis · Aadirupa Saha · Shie Mannor 
2021 Poster: Adversarial Dueling Bandits »
Aadirupa Saha · Tomer Koren · Yishay Mansour 
2021 Spotlight: ConfidenceBudget Matching for Sequential Budgeted Learning »
Yonathan Efroni · Nadav Merlis · Aadirupa Saha · Shie Mannor 
2021 Spotlight: Adversarial Dueling Bandits »
Aadirupa Saha · Tomer Koren · Yishay Mansour 
2021 Poster: Optimal regret algorithm for Pseudo1d Bandit Convex Optimization »
Aadirupa Saha · Nagarajan Natarajan · Praneeth Netrapalli · Prateek Jain 
2021 Spotlight: Optimal regret algorithm for Pseudo1d Bandit Convex Optimization »
Aadirupa Saha · Nagarajan Natarajan · Praneeth Netrapalli · Prateek Jain 
2021 Poster: Dueling Convex Optimization »
Aadirupa Saha · Tomer Koren · Yishay Mansour 
2021 Spotlight: Dueling Convex Optimization »
Aadirupa Saha · Tomer Koren · Yishay Mansour 
2020 Poster: From PAC to InstanceOptimal Sample Complexity in the PlackettLuce Model »
Aadirupa Saha · Aditya Gopalan 
2020 Poster: Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards »
Aadirupa Saha · Pierre Gaillard · Michal Valko 
2020 Poster: Influence Diagram Bandits: Variational Thompson Sampling for Structured Bandit Problems »
Tong Yu · Branislav Kveton · Zheng Wen · Ruiyi Zhang · Ole J. Mengshoel 
2019 Poster: Garbage In, Reward Out: Bootstrapping Exploration in MultiArmed Bandits »
Branislav Kveton · Csaba Szepesvari · Sharan Vaswani · Zheng Wen · Tor Lattimore · Mohammad Ghavamzadeh 
2019 Oral: Garbage In, Reward Out: Bootstrapping Exploration in MultiArmed Bandits »
Branislav Kveton · Csaba Szepesvari · Sharan Vaswani · Zheng Wen · Tor Lattimore · Mohammad Ghavamzadeh 
2017 Poster: ModelIndependent Online Learning for Influence Maximization »
Sharan Vaswani · Branislav Kveton · Zheng Wen · Mohammad Ghavamzadeh · Laks V.S Lakshmanan · Mark Schmidt 
2017 Poster: Online Learning to Rank in Stochastic Click Models »
Masrour Zoghi · Tomas Tunys · Mohammad Ghavamzadeh · Branislav Kveton · Csaba Szepesvari · Zheng Wen 
2017 Talk: Online Learning to Rank in Stochastic Click Models »
Masrour Zoghi · Tomas Tunys · Mohammad Ghavamzadeh · Branislav Kveton · Csaba Szepesvari · Zheng Wen 
2017 Talk: ModelIndependent Online Learning for Influence Maximization »
Sharan Vaswani · Branislav Kveton · Zheng Wen · Mohammad Ghavamzadeh · Laks V.S Lakshmanan · Mark Schmidt