Workshop
Complex feedback in online learning
Rémy Degenne · Pierre Gaillard · Wouter Koolen · Aadirupa Saha
Room 314  315
Sat 23 Jul, 5:45 a.m. PDT
While online learning has become one of the most successful and studied approaches in machine learning, in particular with reinforcement learning, online learning algorithms still interact with their environments in a very simple way.The complexity and diversity of the feedback coming from the environment in real applications is often reduced to the observation of a scalar reward. More and more researchers now seek to exploit fully the available feedback to allow faster and more humanlike learning.This workshop aims to present a broad overview of the feedback types being actively researched, highlight recent advances and provide a networking forum for researchers and practitioners.
Schedule
Sat 5:45 a.m.  6:00 a.m.

Opening remarks
(
Remarks
)
>

🔗 
Sat 6:00 a.m.  6:30 a.m.

Learning from Preference Feedback in Combinatorial Action Spaces
(
Invited Speaker
)
>
SlidesLive Video The feedback that users provide through their choices (e.g. clicks, purchases, streams) is one of the most common types of data readily available for training autonomous systems. However, naively training systems based on choice data has many pitfalls, since the observed choices are only boundedly rational at best  leading to biased training data and consequently biased learning results. Instead of correcting for these biases posthoc, this talk explores how specifically designed informationacquisition interventions can eliminate biases during data collection and provide reliable feedback in the form of pairwise preferences. I will discuss how to use this preference feedback in the Dueling Bandits framework and the Coactive Learning framework, highlighting how these two frameworks differ in their reliance on humandriven exploration versus algorithmdriven exploration. 
Thorsten Joachims 🔗 
Sat 6:30 a.m.  7:00 a.m.

Delayed Feedback in Generalised Linear Bandits Revisited
(
Invited Speaker
)
>
SlidesLive Video In this talk, I will revisit the stochastic generalised linear bandit problem with stochastically delayed rewards. For this problem, we show that it is possible to obtain improved regret bounds by using a standard optimistic algorithm which only uses data from the rewards which have been received. In contrast to prior work, we show that no inflation of the confidence sets due to the delays is required. This leads to improving the regret guarantees from Õ(d \sqrt{T} + \sqrt{dT} E[𝛕]) to Õ(d \sqrt{T} + d^{3/2}E[𝛕]), where E[𝛕] denotes the expected delay, d is the dimension and T the time horizon and we have suppressed logarithmic terms. Thus our results decouple the impact of the horizon and delay, and more closely match what has been seen in the stochastically delayed Karmed bandit setting. We verify our theoretical results through experiments on simulated data. This is joint work with Ben Howson and Sarah Filippi. 
Ciara PikeBurke 🔗 
Sat 7:00 a.m.  7:30 a.m.

Break
(
Break
)
>

🔗 
Sat 7:30 a.m.  8:00 a.m.

Online learning in digital markets
(
Invited speaker
)
>
SlidesLive Video The explosive growth of online markets has created complex ecosystems of algorithmic agents. To optimize their revenue, agents need to understand how the market works, and to do so they often resort to strategies that learn from past observations. In this talk, we describe some recent results characterizing the strengths and limitations of sequential decisionmaking approaches applied to various problems arising in digital markets. These include dynamic pricing, bilateral trading, supply chain management, and adaptive taxation. The analysis sheds light on how the learning rates depend on the interplay between the form of the revenue function and the feedback provided during the learning process. 
Nicolò CesaBianchi 🔗 
Sat 8:00 a.m.  8:30 a.m.

Beyond Learning from Demonstrations
(
Invited Speaker
)
>
SlidesLive Video A typical way to teach robots new behaviors involves offline learning from demonstrations. However, there are many other kinds of input that humans can give their robots that are often faster and more intuitive to communicate. In the first part of the talk, I will discuss one such input type that the robot can learn from online: physical corrections. In the second part of the talk, I will argue that moving beyond learning from demonstrations requires more than just thinking about other human feedback types for teaching tasks: we also need to consider new learning pipelines to facilitate seamless and aligned interaction between humans and robots. 
Andreea Bobu 🔗 
Sat 8:30 a.m.  8:50 a.m.

Nearoptimal Regret for Adversarial MDP with Delayed Bandit Feedback
(
Oral
)
>
SlidesLive Video
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve nearoptimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.

Tiancheng Jin · Tal Lancewicki · Haipeng Luo · Yishay Mansour · Aviv Rosenberg 🔗 
Sat 8:50 a.m.  9:10 a.m.

Contextual Inverse Optimization: Offline and Online Learning
(
Oral
)
>
SlidesLive Video We study the problems of offline and online contextual optimization with feedback information, where instead of observing the loss, we observe, afterthefact, the optimal action an oracle with full knowledge of the objective function would have taken. We aim to minimize regret, which is defined as the difference between our losses and the ones incurred by an allknowing oracle. In the offline setting, the decisionmaker has information available from past periods and needs to make one decision, while in the online setting, the decisionmaker optimizes decisions dynamically over time based a new set of feasible actions and contextual functions in each period. For the offline setting, we characterize the optimal minimax policy, establishing the performance that can be achieved as a function of the underlying geometry of the information induced by the data. In the online setting, we leverage this geometric characterization to optimize the cumulative regret. We develop an algorithm that yields the first regret bound for this problem that is logarithmic in the time horizon. 
Omar Besbes · Yuri Fonseca · Ilan Lobel 🔗 
Sat 9:10 a.m.  10:30 a.m.

Lunch Break
(
Break
)
>

🔗 
Sat 10:30 a.m.  11:00 a.m.

Decentralized Learning in Online Queuing Systems
(
Invited Speaker
)
>
link
SlidesLive Video Motivated by packet routing in computer networks, online queuing systems are composed of queues receiving packets at different rates. Repeatedly, they send packets to servers, each of them treating only at most one packet at a time. In the centralized case, the number of accumulated packets remains bounded (i.e., the system is \textit{stable}) as long as the ratio between service rates and arrival rates is larger than 1. In the decentralized case, individual noregret strategies ensures stability when this ratio is larger than 2. Yet, myopically minimizing regret disregards the long term effects due to the carryover of packets to further rounds. On the other hand, minimizing long term costs leads to stable Nash equilibria as soon as the ratio exceeds ee−1. Stability with decentralized learning strategies with a ratio below 2 was a major remaining question. We first argue that for ratios up to 2, cooperation is required for stability of learning strategies, as selfish minimization of policy regret, a \textit{patient} notion of regret, might indeed still be unstable in this case. We therefore consider cooperative queues and propose the first learning decentralized algorithm guaranteeing stability of the system as long as the ratio of rates is larger than 1, thus reaching performances comparable to centralized strategies. 
Vianney Perchet 🔗 
Sat 11:00 a.m.  11:20 a.m.

Giving Complex Feedback in Online Student Learning with MetaExploration
(
Oral
)
>
SlidesLive Video Creating interactive software, such as websites or games, is a particularly engaging way to learn computer science. However, teaching and giving feedback on such software is hard — standard approaches require instructors to hand grade studentimplemented interactive programs. As a result, online platforms that serve millions, like Code.org, are unable to provide any feedback on assignments for implementing interactive programs, which critically hinders students’ ability to learn. Recent work proposes to train reinforcement learning agents to interact with a student’s program, aiming to explore states indicative of errors. However, this approach only provides binary feedback of whether a program is correct or not, while students require finergrained feedback on the specific errors in their programs to understand their mistakes. In this work, we show that exploring to discover errors can be cast as a metaexploration problem. This enables us to construct a principled objective for discovering errors and an algorithm for optimizing this objective, which provides finegrained feedback. We evaluate our approach on a set of 700K real anonymized student programs from a Code.org interactive assignment. Our approach provides feedback with 94.3% accuracy, improving over existing approaches by over 18.8% and coming within 1.5% of humanlevel accuracy. 
Evan Liu · Moritz Stephan · Allen Nie · Chris Piech · Emma Brunskill · Chelsea Finn 🔗 
Sat 11:20 a.m.  11:40 a.m.

Threshold Bandit Problem with Link Assumption between Pulls and Duels
(
Oral
)
>
SlidesLive Video Building on prior work (Xu et al., 2020), we propose a newer version of the RankSearch algorithm, which aims to solve the Threshold BanditProblem when both duels and pulls are possible.Our algorithm, which requires an additional assumption connecting the borda scores of armswith their mean rewards, is able to perform in settings in which previous RankSearch algorithmscannot. We conduct experiments comparing theperformance of the new algorithm with that ofolder versions of RankSearch in various settings.Finally we prove upper bounds for the total number of duels and pulls required by our proposedalgorithm, which we call Rank Search with Link(RSL). 
Keshav Narayan · Aarti Singh 🔗 
Sat 11:40 a.m.  12:00 p.m.

Metalearning from Learning Curves Challenge: Lessons learned from the First Round and Design of the Second Round
(
Oral
)
>
SlidesLive Video Metalearning from learning curves is an important yet often neglected research area in the Machine Learning community. We introduce a series of Reinforcement Learningbased metalearning challenges, in which an agent searches for the best suited algorithm for a given dataset, based on feedback of learning curves from the environment. The first round attracted participants both from academia and industry. This paper analyzes the results of the first round (accepted to the competition program of Anonymousconference1), to draw insights into what makes a metalearner successful at learning from learning curves. With the lessons learned from the first round and the feedback from the participants, we have designed the second round of our challenge with a new protocol and a new metadataset. The second round of our challenge is accepted at the Anonymousconference2 and currently ongoing. 
Manh Hung Nguyen · Lisheng Sun · Nathan Grinsztajn · Isabelle Guyon 🔗 
Sat 12:00 p.m.  12:30 p.m.

Break
(
Break
)
>

🔗 
Sat 12:30 p.m.  1:30 p.m.

Poster session
(
Poster Session
)
>

🔗 
Sat 1:30 p.m.  2:00 p.m.

Prescriptive solutions in games: from theory to scale
(
Invited Speaker
)
>
SlidesLive Video This presentation will first motivate games as a benchmark for studying environments with complex feedback. Then it will shortly review the main game theory solution concepts that can be used as an objective to learn in games. Finally, the presentation will highlight a few recent algorithmic advances and results in 2 classes of games (imperfect information games and mean field games). 
Julien Perolat 🔗 
Sat 2:00 p.m.  2:20 p.m.

ActiveHedge: Hedge meets Active Learning
(
Oral
)
>
SlidesLive Video
We consider the classical problem of multiclass prediction with expert advice, but with an active learning twist. In this new setting the learner will only query the labels of a small number of examples, but still aims to minimize regret to the best expert as usual; the learner is also allowed a very short ``burnin'' phase where it can fastforward and query certain highlyinformative examples. We design an algorithm that utilizes Hedge (aka Exponential Weights) as a subroutine, and we show that under a very particular combinatorial constraint on the matrix of expert predictions we can obtain a very strong regret guarantee while querying very few labels. This constraint, which we refer to as $\zeta$compactness, or just compactness, can be viewed as a nonstochastic variant of the disagreement coefficient, another popular parameter used to reason about the sample complexity of active learning in the IID setting. We also give a polynomial time algorithm to calculate the $\zeta$compactness of a matrix up to an approximation factor of 3.

Bhuvesh Kumar · Jacob Abernethy · Venkatesh Saligrama 🔗 
Sat 2:20 p.m.  2:30 p.m.

Closing remarks
(
Remarks
)
>

🔗 


Optimal Parameterfree Online Learning with Switching Cost
(
Poster
)
>
Parameterfreeness in online learning refers to the adaptivity of an algorithm with respect to the optimal decision in hindsight. In this paper, we design such algorithms in the presence of switching cost  the latter penalizes the optimistic updates required by parameterfreeness, leading to a delicate design tradeoff. Based on a novel dual space scaling strategy, we propose a simple yet powerful algorithm for Online Linear Optimization (OLO) with switching cost, which improves the existing suboptimal regret bound (Zhang et al., 2022a) to the optimal rate. The obtained benefit is extended to the expert setting, and the practicality of our algorithm is demonstrated through a sequential investment task. 
Zhiyu Zhang · Ashok Cutkosky · Ioannis Paschalidis 🔗 


Challenging Common Assumptions in Convex Reinforcement Learning
(
Poster
)
>
The classic Reinforcement Learning (RL) formulation is concerned with maximizing a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all the objectives that are convex function of the state distribution induced by a policy. Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, riskaverse RL, and pure exploration. In classic RL, it is common to optimize an infinite trials objective, which accounts for the state distribution instead of the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice. This is theoretically sound since the infinite trials and finite trials objectives can be proved to coincide, and thus lead to the same optimal policy. In this paper, we show that this hidden assumption does not hold in the convex RL setting. In particular, we show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error. Since the finite trials setting is the default in both simulated and realworld RL, we believe that shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, riskaverse RL, and pure exploration among others. 
Mirco Mutti · Riccardo De Santi · Piersilvio De Bartolomeis · Marcello Restelli 🔗 


Nonstationary Bandits and MetaLearning with a Small Set of Optimal Arms
(
Poster
)
>
SlidesLive Video
We study a sequential decision problem where the learner faces a sequence of $K$armed stochastic bandit tasks. The tasks may be designed by an adversary, but the adversary is constrained to choose the optimal arm of each task in a smaller (but unknown) subset of $M$ arms. The task boundaries might be known (the bandit metalearning setting), or unknown (the nonstationary bandit setting). We design an algorithm based on a reduction to bandit submodular maximization, and show that, in the regime of large number of tasks and small number of optimal arms, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for nonstationary bandit problems. For the bandit metalearning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M \tau}+N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit metalearning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2}\sqrt{M K \tau})$ regret.

MohammadJavad Azizi · Thang Duong · Yasin AbbasiYadkori · Claire Vernade · Andras Gyorgy · Mohammad Ghavamzadeh 🔗 


Provably Correct SGDbased Exploration for Linear Bandit
(
Poster
)
>
SlidesLive Video
Linear bandits are commonly applied to applications with sequential decisionmaking, where balancing the exploration and exploitation is of paramount importance. Yet existing approaches, e.g., upper confidence bound (UCB) algorithms, usually suffer from large computation complexity per round, can be overly pessimistic, and usually make sudden updates to the learned policies. In this paper, we take an online stochastic gradient descent (SGD) based approach to perform incremental updates on the estimated parameters. Our approach not only saves memory and computation but also gives smooth update to the learned policy. Theoretically, we prove that our proposed algorithm can achieve $\tilde O(d\sqrt{n})$ regret, where $n$ is the number of total time steps of the algorithm, matching existing nearoptimal regret bounds in UCBtype algorithms. Furthermore, our approach does not require the ``diversity'' conditions, which bound the minimum eigenvalues of the covariance matrix, as presented in related literature. We further conduct experiments to demonstrate the consistently superb performance of our algorithms.

Jialin Dong · Lin Yang 🔗 


You Only Live Once: SingleLife Reinforcement Learning via Learned Reward Shaping
(
Poster
)
>
Reinforcement learning algorithms are typically designed to learn a performant policy that can repeatedly and autonomously complete a task, typically starting from scratch. However, many realworld situations operate under a different set of assumptions: the goal might not be to learn a policy that can do the task repeatedly, but simply to perform a new task successfully once, while leveraging some prior knowledge or experience. For example, imagine a robot that is exploring another planet, where it cannot get help or supervision from humans. If it needs to navigate to a crater that it has never seen before in search of water, it only needs to reach this particular crater once. It must do so without the benefit of episodic resets and tackle a new, unknown terrain, but it can leverage prior experience it acquired on Earth. We formalize this problem setting, which we call singlelife reinforcement learning (SLRL), where an agent must complete a task once while contending with some form of novelty in a single trial without interventions, given some prior data. In this setting, we find that algorithms designed for standard episodic reinforcement learning can struggle, as they have trouble recovering from novel states especially when informative rewards are not provided. Motivated by this observation, we also propose an algorithm, Qweighted adversarial learning (QWALE), which employs a distribution matching strategy that leverages the agent's prior experience as guidance in novel situations. Our experiments on several singlelife continuous control problems indicate that methods based on our distribution matching formulation are 2060% more successful because they can more quickly recover from novel, outofdistribution states. 
Annie Chen · Archit Sharma · Sergey Levine · Chelsea Finn 🔗 


Stochastic Rising Bandits for Online Model Selection
(
Poster
)
>
SlidesLive Video
This paper is in the field of stochastic MultiArmed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. arm). We study a particular case of the rested and restless bandits in which the arms’ expected payoff is monotonically nondecreasing. This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds. We design an algorithm for the rested case (RedUCB) and one for the restless case (RlessUCB), providing a regret bound depending on the properties of the instance and, under certain circumstances, of $\mathcal{O}(T^{2/3})$. We empirically compare our algorithms with stateoftheart methods for nonstationary MABs over several synthetically generated tasks and an online model selection problem for a realworld dataset. Finally, using synthetic and realworld data, we illustrate the effectiveness of the proposed approaches compared with stateoftheart algorithms for the nonstationary bandits.

Alberto Maria Metelli · Francesco Trovò · Matteo Pirola · Marcello Restelli 🔗 


Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP
(
Poster
)
>
SlidesLive Video
We present regret minimization algorithms for stochastic contextual MDPs under minimum reachability assumption, using an access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context and the most challenging setting where the dynamics is unknown and contextdependent. For the latter, our algorithm obtains $ \max\{H,\frac{1}{p_{min}}\}HS^{3/2}\sqrt{AT\log\frac{\max\{\mathcal{F},\mathcal{P}\}}{\delta}} $ regret bound, up to polylogarithmic factors, with probability $1\delta$, where $\Fp$ and $\F$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear and etc.). In addition, we present a lower bound of $\Omega(\sqrt{T H S A \ln(\mathcal{F}/S)/\ln(A)})$, on the expected regret which holds even in the case of known dynamics.

Orin Levy · Yishay Mansour 🔗 


Strategies for Safe MultiArmed Bandits with Logarithmic Regret and Risk
(
Poster
)
>
SlidesLive Video We investigate a natural but surprisingly unstudied approach to the multiarmed bandit problem under safety risk constraints. Each arm is associated with an unknown law on safety risks and rewards, and the learner's goal is to maximise reward whilst not playing unsafe arms, as determined by a given threshold on the mean risk.We formulate a pseudoregret for this setting that enforces this safety constraint in a perround way by softly penalising any violation, regardless of the gain in reward due to the same. This has practical relevance to scenarios such as clinical trials, where one must maintain safety for each round rather than in an aggregated sense.We describe doubly optimistic strategies for this scenario, which maintain optimistic indices for both safety risk and reward. We show that schema based on both frequentist and Bayesian indices satisfy tight gapdependent logarithmic regret bounds, and further that these play unsafe arms only logarithmically many times in total. This theoretical analysis is complemented by simulation studies demonstrating the effectiveness of the proposed schema, and probing the domains in which their use is appropriate. 
Tianrui Chen · Aditya Gangrade · Venkatesh Saligrama 🔗 


Provably FeedbackEfficient Reinforcement Learning via Active Reward Learning
(
Poster
)
>
SlidesLive Video
An appropriate reward function is of paramount importance in specifying a task in reinforcement learning (RL). Yet, it is known to be extremely challenging in practice to design a correct reward function for even simple tasks.Humanintheloop (HiL) RL allows humans to communicate complex goals to the RL agent by providing various types of feedback. However, despite achieving great empirical successes, HiL RL usually requires \emph{too much} feedback from a human teacher and also suffers from insufficient theoretical understanding.In this paper, we focus on addressing this issue from a theoretical perspective, aiming to provide provably feedbackefficient algorithmic frameworks that take humanintheloop to specify rewards of given tasks.We provide an \emph{activelearning}based RL algorithm that first explores the environment without specifying a reward function and then asks a human teacher for only a few queries about the rewards of a task at some stateaction pairs. After that, the algorithm guarantees to provide a nearly optimal policy for the task with high probability. We show that, even with the presence of random noise in the feedback, the algorithm only takes $\tilde{O}(H\dim_{R})$ queries on the reward function to provide an $\epsilon$optimal policy for any $\epsilon > 0$. Here $H$ is the horizon of the RL environment, and $\dim_{R}$ specifies the complexity of the function class representing the reward function. In contrast, standard RL algorithms require to query the reward function for at least $\Omega(\poly(d, 1/\epsilon))$ stateaction pairs where $d$ depends on the complexity of the environmental transition.

Dingwen Kong · Lin Yang 🔗 


Dynamical Linear Bandits for LongLasting Vanishing Rewards
(
Poster
)
>
In many realworld sequential decisionmaking problems, an action might not immediately reflect on the feedback and spread its effects over a long time horizon. For instance, in online advertising, investing in a platform produces an increase in awareness, but the actual reward (e.g., a conversion) might occur far in the future. Furthermore, whether a conversion takes place depends on several factors: how fast the awareness grows; possible awareness vanishing effects; synergy or interference with other advertising platforms. Previous work has investigated the MultiArmed Bandit framework with the possibility of delayed and aggregated feedback, without a particular structure on how an action propagates into the future, disregarding possible hidden dynamical effects. In this paper, we introduce a novel setting, Dynamical Linear Bandits (DLB), an extension of linear bandits characterized by a hidden state. When an action is performed, the learner observes a noisy reward whose mean is a linear function of the hidden state and the action, and the hidden state evolves according to a linear dynamics. In this way, the effects of each action are delayed by the system evolution, persist over time, and the interplay between the action components is taken into account. After introducing the setting and discussing the notion of optimal policy, we provide an anytime optimistic regret minimization algorithm Dynamical Linear Upper Confidence Bound (DynLinUCB) that suffers regret of order O(c d sqrt(T)), where c is a constant dependent on the properties of the linear dynamical evolution. Finally, we conduct a numerical validation on a synthetic environment to show the effectiveness of DynLinUCB in comparison with bandit baselines. 
Marco Mussi · Alberto Maria Metelli · Marcello Restelli 🔗 


Online Learning with OffPolicy Feedback
(
Poster
)
>
SlidesLive Video We study the problem of online learning in adversarial bandit problems under a partial observability model called offpolicy feedback. In this sequential decision making problem, the learner cannot directly observe its rewards, but instead sees the ones obtained by another unknown policy run in parallel (behavior policy). Instead of a standard explorationexploitation dilemma, the learner has to face another challenge in this setting: due to limited observations outside of their control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy, achieving improved performance against comparators that are wellcovered by the observations. We also provide an extension to the setting of adversarial linear contextual bandits, and verify the theoretical guarantees via a set of experiments. Our key algorithmic idea is adapting the notion of pessimistic reward estimators that has been recently popular in the context of offpolicy reinforcement learning. 
Germano Gabbianelli · Matteo Papini · Gergely Neu 🔗 


On Adaptivity and Confounding in Contextual Bandit Experiments
(
Poster
)
>
Multiarmed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding factors or delayed feedback. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when dayofweek effects may confound inferences. In response, this paper formulates a general model of contextual bandit experiments with nonstationary contexts, which act as the confounders for inferences and can be also viewed as the distribution shifts in the earlier periods of the experiments. In addition, this general model allows the target distribution or population distribution that is used to determine the best action to be different from the empirical distribution over the contexts observed during the experiments. The paper proposes \emph{deconfounded Thompson sampling}, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action  suggesting optimal adaptivity  but also satisfies strong performance guarantees in the presence of dayofweek effects and delayed observations  suggesting unusual robustness. 
Chao Qin · Daniel Russo 🔗 


Unimodal MonoPartite Matching in a Bandit Setting
(
Poster
)
>
We tackle a new emerging problem, which is finding an optimal monopartite matching in a weighted graph. The semibandit version, where a full matching is sampled at each iteration, has been addressed by (Sentenac et al., 2021), creating an algorithm with an expected regret matching O(L log(L)/∆ log(T)) with 2L players, T iterations and a minimum reward gap ∆. We reduce this bound in two steps. First, as in (Gauthier et al.,2021) and (Gauthier et al., 2022) we use the unimodality property of the expected reward on the appropriate graph to design an algorithm with a regret in O(L /∆ log(T)). Secondly, we show that by moving the focus towards the main question ‘Is user i better than user j?’ this regret becomes O(L ∆ / D^2 log(T)), where D>∆ derives from a better way of comparing users. Some experimental results finally show these theoretical results are corroborated in practice. 
Matthieu Rodet · Romaric Gaudel 🔗 


Beyond IID: datadriven decisionmaking in heterogeneous environments
(
Poster
)
>
In this work, we study datadriven decisionmaking and depart from the classical identically and independently distributed (i.i.d.) assumption. We present a new framework in which historical samples are generated from unknown and different distributions, which we dub heterogeneous environments. These distributions ar e assumed to lie in a heterogeneity ball with known radius and centered around the (also) unknown future (outofsample) distribution on which the performance of a decision will be evaluated. We quantify the asymptotic worstcase regret that is achievable by central datadriven policies such as Sample Average Approximation, but also by rateoptimal ones, as a function of the radius of the heterogeneity ball. Our work shows that the type of achievable performance varies considerably across different combinations of problem classes and notions of heterogeneity. We demonstrate the versatility of our framework by comparing achievable guarantees for the heterogeneous version of widely studied datadriven problems such as pricing, skirental, and newsvendor. En route, we establish a new connection between datadriven decisionmaking and distributionally robust optimization. 
Omar Besbes · Will Ma · Omar Mouchtaki 🔗 


Big Control Actions Help Multitask Learning of Unstable Linear Systems
(
Poster
)
>
SlidesLive Video A fundamental problem in one of the most popular models for continuous environments; i.e., linear dynamical systems, is to learn dynamics matrices from unstable state trajectories. While this problem is wellstudied for a single system, little is known about multitask learning methods that utilize potential commonalities to estimate the dynamics matrices more accurately. The longstanding obstacle is that idiosyncratic instabilities nullify the benefit of sharing data across systems. We address this issue by introducing the new method of \emph{big random control actions}, and develop a novel performance analysis for that. To the authors' knowledge, this is the first theoretical guarantee for multitask learning under instability when system matrices are \emph{unknown} linear combinations of \emph{unknown} bases matrices. The techniques can be extended to multitask learning problems in other settings with nonstationary or temporally dependent data. 
Aditya Modi · Ziping Xu · Mohamad Kazem Shirani Faradonbeh · Ambuj Tewari 🔗 


Adversarial Attacks Against Imitation and Inverse Reinforcement Learning
(
Poster
)
>
Learning from raw highdimensional observations became possible with the help of deep neural networks. With this initial enhancement the progress of reinforcement learning research is experiencing one of its highest peaks. The policies trained with deep reinforcement learning are being deployed in many different settings from medical to industrial control. Yet the fact that reinforcement learning still needs a reward function to learn functioning policies can be restrictive for certain types of tasks. To be able to learn in these tasks several studies focused on different ways of learning by observing a set of trajectories from an optimal policy. One line of research on this focuses on learning a reward function from a set of observations, referred to as inverse reinforcement learning, and another line focuses on learning an optimal policy from the observations of trajectories, referred to as imitation learning. In our paper we investigate the robustness of the stateoftheart deep imitation learning policies and deep inverse reinforcement learning policies towards adversarial vectors. We demonstrate that the simple vanilla trained deep reinforcement learning policies are more robust compared to deep inverse reinforcement learning and deep imitation learning policies trained in complex state representation MDPs. 
Ezgi Korkmaz 🔗 


Choosing Answers in EpsilonBestAnswer Identification for Linear Bandits
(
Poster
)
>
SlidesLive Video
In pureexploration problems, information is gathered sequentially to answer a question on the stochastic environment.While bestarm identification for linear bandits has been extensively studied in recent years, few works have been dedicated to identifying one arm that is $\varepsilon$close to the best one (and not exactly the best one).In this problem with several correct answers, an identification algorithm should focus on one candidate among those answers and verify that it is correct.We demonstrate that picking the answer with highest mean does not allow an algorithm to reach asymptotic optimality in terms of expected sample complexity.Instead, a \textit{furthest answer} should be identified.Using that insight to choose the candidate answer carefully, we develop a simple procedure to adapt bestarm identification algorithms to tackle $\varepsilon$bestanswer identification in transductive linear stochastic bandits. Finally, we propose an asymptotically optimal algorithm for this setting, which is shown to achieve competitive empirical performance against existing modified bestarm identification algorithms.

Marc Jourdan · Rémy Degenne 🔗 


InteractionGrounded Learning with Actioninclusive Feedback
(
Poster
)
>
SlidesLive Video Consider the problem setting of InteractionGrounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Prior analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL’s success in many potential scenarios such as Braincomputer interface (BCI) or Humancomputer interface (HCI) applications. We address this by creating an algorithm and analysis which allows IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and largescale experiments based on supervised datasets to demonstrate the effectiveness of the new approach. 
Tengyang Xie · Akanksha Saran · Dylan Foster · Lekan Molu · Ida Momennejad · Nan Jiang · Paul Mineiro · John Langford 🔗 


On the Importance of Critical Period in Multistage Reinforcement Learning
(
Poster
)
>
SlidesLive Video The initial years of an infant's life are known as the critical period, during which the overall development of learning performance is significantly impacted due to neural plasticity. In recent studies, an AI agent, with a deep neural network mimicking mechanisms of actual neurons, exhibited a learning period similar to human's critical period. Especially during this initial period, the appropriate stimuli play a vital role in developing learning ability. However, transforming human cognitive bias into an appropriate faping reward is quite challenging, and prior works on critical period do not focus on finding the appropriate stimulus. To take a step further, we propose multistage reinforcement learning to emphasize finding "appropriate stimulus" around the critical period. Inspired by humans' early cognitivedevelopmental stage, we use multistage guidance near the critical period, and demonstrate the appropriate shaping reward (stage2 guidance) in terms of the AI agent's performance, efficiency, and stability. 
Junseok Park · Inwoo Hwang · Min Whoo Lee · Hyunseok Oh · Minsu Lee · Youngki Lee · ByoungTak Zhang 🔗 


ActiveHedge: Hedge meets Active Learning
(
Poster
)
>
SlidesLive Video
We consider the classical problem of multiclass prediction with expert advice, but with an active learning twist. In this new setting the learner will only query the labels of a small number of examples, but still aims to minimize regret to the best expert as usual; the learner is also allowed a very short ``burnin'' phase where it can fastforward and query certain highlyinformative examples. We design an algorithm that utilizes Hedge (aka Exponential Weights) as a subroutine, and we show that under a very particular combinatorial constraint on the matrix of expert predictions we can obtain a very strong regret guarantee while querying very few labels. This constraint, which we refer to as $\zeta$compactness, or just compactness, can be viewed as a nonstochastic variant of the disagreement coefficient, another popular parameter used to reason about the sample complexity of active learning in the IID setting. We also give a polynomial time algorithm to calculate the $\zeta$compactness of a matrix up to an approximation factor of 3.

Bhuvesh Kumar · Jacob Abernethy · Venkatesh Saligrama 🔗 


Nearoptimal Regret for Adversarial MDP with Delayed Bandit Feedback
(
Poster
)
>
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve nearoptimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.

Tiancheng Jin · Tal Lancewicki · Haipeng Luo · Yishay Mansour · Aviv Rosenberg 🔗 


Giving Complex Feedback in Online Student Learning with MetaExploration
(
Poster
)
>
SlidesLive Video Creating interactive software, such as websites or games, is a particularly engaging way to learn computer science. However, teaching and giving feedback on such software is hard — standard approaches require instructors to hand grade studentimplemented interactive programs. As a result, online platforms that serve millions, like Code.org, are unable to provide any feedback on assignments for implementing interactive programs, which critically hinders students’ ability to learn. Recent work proposes to train reinforcement learning agents to interact with a student’s program, aiming to explore states indicative of errors. However, this approach only provides binary feedback of whether a program is correct or not, while students require finergrained feedback on the specific errors in their programs to understand their mistakes. In this work, we show that exploring to discover errors can be cast as a metaexploration problem. This enables us to construct a principled objective for discovering errors and an algorithm for optimizing this objective, which provides finegrained feedback. We evaluate our approach on a set of 700K real anonymized student programs from a Code.org interactive assignment. Our approach provides feedback with 94.3% accuracy, improving over existing approaches by over 18.8% and coming within 1.5% of humanlevel accuracy. 
Evan Liu · Moritz Stephan · Allen Nie · Chris Piech · Emma Brunskill · Chelsea Finn 🔗 


Metalearning from Learning Curves Challenge: Lessons learned from the First Round and Design of the Second Round
(
Poster
)
>
Metalearning from learning curves is an important yet often neglected research area in the Machine Learning community. We introduce a series of Reinforcement Learningbased metalearning challenges, in which an agent searches for the best suited algorithm for a given dataset, based on feedback of learning curves from the environment. The first round attracted participants both from academia and industry. This paper analyzes the results of the first round (accepted to the competition program of Anonymousconference1), to draw insights into what makes a metalearner successful at learning from learning curves. With the lessons learned from the first round and the feedback from the participants, we have designed the second round of our challenge with a new protocol and a new metadataset. The second round of our challenge is accepted at the Anonymousconference2 and currently ongoing. 
Manh Hung Nguyen · Lisheng Sun · Nathan Grinsztajn · Isabelle Guyon 🔗 


Threshold Bandit Problem with Link Assumption between Pulls and Duels
(
Poster
)
>
SlidesLive Video Building on prior work (Xu et al., 2020), we propose a newer version of the RankSearch algorithm, which aims to solve the Threshold BanditProblem when both duels and pulls are possible.Our algorithm, which requires an additional assumption connecting the borda scores of armswith their mean rewards, is able to perform in settings in which previous RankSearch algorithmscannot. We conduct experiments comparing theperformance of the new algorithm with that ofolder versions of RankSearch in various settings.Finally we prove upper bounds for the total number of duels and pulls required by our proposedalgorithm, which we call Rank Search with Link(RSL). 
Keshav Narayan · Aarti Singh 🔗 


Contextual Inverse Optimization: Offline and Online Learning
(
Poster
)
>
SlidesLive Video We study the problems of offline and online contextual optimization with feedback information, where instead of observing the loss, we observe, afterthefact, the optimal action an oracle with full knowledge of the objective function would have taken. We aim to minimize regret, which is defined as the difference between our losses and the ones incurred by an allknowing oracle. In the offline setting, the decisionmaker has information available from past periods and needs to make one decision, while in the online setting, the decisionmaker optimizes decisions dynamically over time based a new set of feasible actions and contextual functions in each period. For the offline setting, we characterize the optimal minimax policy, establishing the performance that can be achieved as a function of the underlying geometry of the information induced by the data. In the online setting, we leverage this geometric characterization to optimize the cumulative regret. We develop an algorithm that yields the first regret bound for this problem that is logarithmic in the time horizon. 
Omar Besbes · Yuri Fonseca · Ilan Lobel 🔗 