Systems that can learn interactively from their end-users are quickly becoming widespread in real-world applications. Typically, humans provide tagged rewards or scalar feedback for such interactive learning systems. However, humans offer a wealth of implicit information (such as multimodal cues in the form of natural language, speech, eye movements, facial expressions, gestures, etc.) which interactive learning algorithms can leverage during the process of human-machine interaction to create a grounding for human intent, and thereby better assist end-users. A closed-loop sequential decision-making domain offers unique challenges when learning from humans: (1) the data distribution may be influenced by the choices of the algorithm itself, so interactive ML algorithms need to adaptively learn from human feedback; (2) the nature of the environment itself changes rapidly; (3) humans may express their intent in various forms of feedback amenable to naturalistic real-world settings, going beyond tagged rewards or demonstrations. By organizing this workshop, we aim to bring together interdisciplinary experts in interactive machine learning, reinforcement learning, human-computer interaction, cognitive science, and robotics to explore and foster discussions on such challenges. We envision that this exchange of ideas within and across disciplines can build new bridges, address some of the most valuable challenges in interactive learning with implicit human feedback, and provide guidance to young researchers interested in growing their careers in this space.
Sat 12:00 p.m. - 12:10 p.m.
|
Organizers: Introductory Remarks
(
Remarks
)
SlidesLive Video » |
🔗 |
Sat 12:10 p.m. - 12:35 p.m.
|
Dorsa Sadigh: Interactive Learning in the Era of Large Models
(
Invited talk
)
SlidesLive Video » In this talk, I will discuss the role of language in learning from interactions with humans. I will first talk about how language instructions along with latent actions can enable shared autonomy in robotic manipulation problems. I will then talk about creative ways of tapping into the rich context of large models to enable more aligned AI agents. Specifically, I will discuss a few vignettes about how we can leverage LLMs and VLMs to learn human preferences, allow for grounded social reasoning, or enable teaching humans using corrective feedback. I will finally conclude the talk by discussing how large models can be effective pattern machines that can identify patterns in a token-invariant fashion, enable pattern transformation and extrapolation, and even show some evidence of pattern optimization for solving control problems. |
🔗 |
Sat 12:35 p.m. - 1:00 p.m.
|
Jesse Thomason: Considering The Role of Language in Embodied Systems
(
Invited talk
)
SlidesLive Video » Pretrained language models (PTLMs) are "all the rage" right now. From the perspective of folks who have been working at the intersection of language, vision, and robotics since before it was cool, the noticeable impact is that researchers outside NLP feel like they should plug language into their work. However, these models are exclusively trained on text data, usually only for next word prediction, and potentially for next word prediction but under a fine-tuned words-as-actions policy with thousands of underpaid human annotators in the loop (e.g., RLHF). Even when a PTLM is "multimodal," that usually means "training also involved images and their captions, which describe the literal content of the image." What meaning can we hope to extract from those kinds of models in the context of embodied, interactive systems? In this talk, I'll cover some applications our lab has worked through in the space of language and embodied systems, with a broader lens towards open questions about the limits and (in)appropriate applications of current PTLMs with those systems. |
🔗 |
Sat 1:00 p.m. - 1:30 p.m.
|
Coffee break + Poster Session
|
🔗 |
Sat 1:30 p.m. - 1:55 p.m.
|
Jonathan Grizou: Aiming for internal consistency, the 4th pillar of interactive learning
(
Invited talk
)
SlidesLive Video » I will propose a 2x2 matrix to position interactive learning systems and argue that the 4th corner of that space is yet to be fully explored by our research efforts. By positioning recent work on that matrix, I hope to highlight a possible research direction and expose barriers to be overcome. In that effort, I will attempt a live demonstration of IFTT-PIN, a self-calibrating interface we developed that permits a user to control an interface using signals whose meaning is initially unknown. |
🔗 |
Sat 1:55 p.m. - 2:20 p.m.
|
Daniel Brown: Pitfalls and paths forward when learning rewards from human feedback
(
Invited talk
)
SlidesLive Video » Human feedback is often incomplete, suboptimal, biased, and ambiguous, leading to misidentification of the human's true reward function and suboptimal agent behavior. I will discuss these pitfalls as well as some of our recent work that seeks to overcome these problems via techniques that calibrate to user biases, learn from multiple feedback types, use human feedback to align robot feature representations, and enable interpretable reward learning. |
🔗 |
Sat 2:20 p.m. - 3:10 p.m.
|
Panel Session 1
(
Panel
)
SlidesLive Video » |
🔗 |
Sat 3:10 p.m. - 4:10 p.m.
|
Lunch break
|
🔗 |
Sat 4:10 p.m. - 4:35 p.m.
|
Bradley Knox: The EMPATHIC Framework for Task Learning from Implicit Human Feedback
(
Invited talk
)
SlidesLive Video » Reactions such as gestures, facial expressions, and vocalizations are an abundant, naturally occurring channel of information that humans provide during interactions. A robot or other agent could leverage an understanding of such implicit human feedback to improve its task performance at no cost to the human. This approach contrasts with common agent teaching methods based on demonstrations, critiques, or other guidance that need to be attentively and intentionally provided. In this talk, we first define the general problem of learning from implicit human feedback and then propose to address this problem through a novel data-driven framework, EMPATHIC. This two-stage method consists of (1) mapping implicit human feedback to relevant task statistics such as rewards, optimality, and advantage; and (2) using such a mapping to learn a task. We instantiate the first stage and three second-stage evaluations of the learned mapping. To do so, we collect a dataset of human facial reactions while participants observe an agent execute a sub-optimal policy for a prescribed training task. We train a deep neural network on this data and demonstrate its ability to (1) infer relative reward ranking of events in the training task from prerecorded human facial reactions; (2) improve the policy of an agent in the training task using live human facial reactions; and (3) transfer to a novel domain in which it evaluates robot manipulation trajectories. |
🔗 |
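As a rough illustration of the two-stage recipe above, the sketch below fits a supervised mapping from facial-reaction features to a reward statistic and then uses it to rank unseen events. The arrays, model size, and training setup are placeholder assumptions, not the EMPATHIC dataset or implementation.

```python
# Minimal sketch of the two-stage idea, not the EMPATHIC implementation.
# Assumes hypothetical arrays: `reaction_feats` (facial-reaction features per
# event) and `reward_labels` (task statistics observed during training).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
reaction_feats = rng.normal(size=(500, 16))                          # placeholder features
reward_labels = reaction_feats[:, 0] + 0.1 * rng.normal(size=500)    # placeholder targets

# Stage 1: learn a mapping from implicit human reactions to a reward statistic.
mapper = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
mapper.fit(reaction_feats, reward_labels)

# Stage 2: use the learned mapping to score and rank unseen events
# (e.g., candidate robot trajectories) from live or prerecorded reactions.
new_event_feats = rng.normal(size=(10, 16))
scores = mapper.predict(new_event_feats)
ranking = np.argsort(scores)[::-1]
print("predicted reward ranking of events:", ranking)
```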
Sat 4:35 p.m. - 5:00 p.m.
|
David Abel: Three Dogmas of Reinforcement Learning
(
Invited talk
)
SlidesLive Video » Modern reinforcement learning has been in large part shaped by three dogmas. The first is what I call the environment spotlight, which refers to our focus on environments rather than agents. The second is our implicit treatment of learning as finding a solution, rather than endless adaptation. The last is the reward hypothesis, which states that all goals and purposes can be well thought of as maximization of a reward signal. In this talk I discuss how these dogmas have shaped our views on learning. I argue that, when agents learn from human feedback, we ought to dispense entirely with the first two dogmas, while we must recognize and embrace the nuance implicit in the third. |
🔗 |
Sat 5:00 p.m. - 5:25 p.m.
|
Paul Mineiro: Contextual Bandits without Rewards
(
Invited talk
)
SlidesLive Video » Contextual bandits are highly practical, but the need to specify a scalar reward limits their adoption. This motivates the study of contextual bandits where a latent reward must be inferred from post-decision observables, a.k.a. Interaction-Grounded Learning (IGL). An information-theoretic argument indicates the need for additional assumptions to succeed, and I review some sufficient conditions from the recent literature. I conclude with speculation about composing IGL with active learning. |
🔗 |
Sat 5:25 p.m. - 6:00 p.m.
|
Contributed Talks
(
Talks
)
SlidesLive Video » |
🔗 |
Sat 6:00 p.m. - 6:30 p.m.
|
Coffee break + Poster Session
|
🔗 |
Sat 6:30 p.m. - 7:00 p.m.
|
Taylor Kessler Faulkner: Robots Learning from Real People
(
Invited talk
)
SlidesLive Video » Robots deployed in the wild can improve their performance by using input from human teachers. Furthermore, both robots and humans can benefit when robots adapt to and learn from the people around them. However, real people can act in imperfect ways, and are often unable to provide input in large quantities. In this talk, I will discuss some of the past research I have conducted toward addressing these issues, which has focused on creating learning algorithms that can learn from imperfect teachers. I will also talk about my current work on the Robot-Assisted Feeding project in the Personal Robotics Lab at the University of Washington, which I am approaching through a similar lens of working with real teachers and possibly imperfect information. |
🔗 |
Sat 7:00 p.m. - 7:50 p.m.
|
Panel Session 2
(
Panel
)
SlidesLive Video » |
🔗 |
Sat 7:50 p.m. - 8:00 p.m.
|
Organizers: Concluding Remarks
(
Remarks
)
SlidesLive Video » |
🔗 |
-
|
Legible Robot Motion from Conditional Generative Models
(
Poster
)
link »
In human-robot collaboration, legible motion that clearly conveys its intentions and goals is essential. This is because forecasting a robot's next move can lead to an improved user experience, safety, and task efficiency. Current methods for generating legible motion utilize hand-designed cost functions and classical motion planners, but there is need for data-driven policies that are trained end-to-end on demonstration data. In this paper we propose Generative Legible Motion Models (GLMM), a framework that utilizes conditional generative models to learn legible trajectories from human demonstrations. We find that GLMM produces motion that is 76% more legible than standard goal-conditioned generative models and 83% more legible than generative models without goal conditioning. |
Matthew Bronars · Danfei Xu 🔗 |
-
|
Asymptotically Optimal Fixed-Budget Best Arm Identification with Variance-Dependent Bounds
(
Poster
)
link »
We investigate the problem of fixed-budget best arm identification (BAI) for minimizing expected simple regret. In an adaptive experiment, a decision maker draws one of multiple treatment arms based on past observations and observes the outcome of the drawn arm. After the experiment, the decision maker recommends the treatment arm with the highest expected outcome. We evaluate the decision based on the expected simple regret, which is the difference between the expected outcomes of the best arm and the recommended arm. Due to inherent uncertainty, we evaluate the regret using the minimax criterion. First, we derive asymptotic lower bounds for the worst-case expected simple regret, which are characterized by the variances of potential outcomes (leading factor). Based on the lower bounds, we propose the Two-Stage Hirano-Imbens-Ridder (TS-HIR) strategy, which utilizes the HIR estimator (Hirano et al., 2003) in recommending the best arm. Our theoretical analysis shows that the TS-HIR strategy is asymptotically minimax optimal, meaning that the leading factor of its worst-case expected simple regret matches our derived worst-case lower bound. Additionally, we consider extensions of our method, such as the asymptotic optimality for the probability of misidentification. Finally, we validate the proposed method's effectiveness through simulations. |
Masahiro Kato · Masaaki Imaizumi · Takuya Ishihara · Toru Kitagawa 🔗 |
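A much-simplified sketch of the two-stage, variance-adaptive idea: explore uniformly to estimate per-arm variances, spend the remaining budget in proportion to the estimated standard deviations, and recommend an arm. The recommendation here uses plain sample means; the paper's TS-HIR strategy instead recommends via the HIR estimator, and the allocation rule below is only an assumption for illustration.

```python
# Simplified two-stage sketch of variance-adaptive fixed-budget BAI.
# The actual TS-HIR strategy recommends via the HIR estimator
# (Hirano et al., 2003); plain sample means are used here for brevity.
import numpy as np

def two_stage_bai(pull, n_arms, budget, stage1_frac=0.3):
    # Stage 1: uniform exploration to estimate per-arm variances.
    n1 = max(2, int(stage1_frac * budget / n_arms))
    obs = [[pull(a) for _ in range(n1)] for a in range(n_arms)]
    stds = np.array([np.std(o, ddof=1) + 1e-8 for o in obs])

    # Stage 2: spend the remaining budget in proportion to estimated stds.
    remaining = budget - n1 * n_arms
    alloc = np.floor(remaining * stds / stds.sum()).astype(int)
    for a in range(n_arms):
        obs[a].extend(pull(a) for _ in range(alloc[a]))

    means = np.array([np.mean(o) for o in obs])
    return int(np.argmax(means))  # recommended arm

# Toy example: Gaussian arms with unequal variances.
true_means, true_stds = [0.0, 0.2, 0.5], [1.0, 0.3, 2.0]
rng = np.random.default_rng(1)
pull = lambda a: rng.normal(true_means[a], true_stds[a])
print("recommended arm:", two_stage_bai(pull, n_arms=3, budget=600))
```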
-
|
RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback
(
Poster
)
link »
To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to learn reward models from diverse sources of human feedback, and to consider human factors involved in providing feedback of different types. However, systematic study of learning from diverse types of feedback is held back by limited standardized tooling available to researchers. To bridge this gap, we propose RLHF-Blender, a configurable, interactive interface for learning from human feedback. RLHF-Blender provides a modular experimentation framework and implementation that enables researchers to systematically investigate the properties and qualities of human feedback for reward learning. The system facilitates the exploration of various feedback types, including demonstrations, rankings, comparisons, and natural language instructions, as well as studies considering the impact of human factors on their effectiveness. We discuss a set of concrete research opportunities enabled by RLHF-Blender. More information is available at our website. |
Yannick Metz · David Lindner · Raphaël Baur · Daniel Keim · Mennatallah El-Assady 🔗 |
-
|
Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation
(
Poster
)
link »
We study a strategic variant of the multi-armed bandit problem, which we coin the \emph{strategic click-bandit}. This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards. Like in classical bandits, rewards follow a fixed unknown distribution. However, we assume that the click-through rate of each arm is chosen strategically by the arm (e.g., a host on Airbnb) in order to maximize the number of times it gets clicked. The algorithm designer does not know the post-click rewards nor the arms' actions (i.e., strategically chosen click-rates) in advance, and must learn both values over time. To solve this problem, we design an incentive-aware learning algorithm, UCB-S, which achieves two goals simultaneously: (a) aligning incentives by incentivizing desirable arm actions under uncertainty; (b) learning unknown parameters. We approximately characterize all Nash equilibria among arms under UCB-S and show a $\tilde O(\sqrt{KT})$ regret bound in every equilibrium. We also show that incentive-unaware algorithms generally fail to achieve low regret in the strategic click-bandit setup.
|
Thomas Kleine Büning · Aadirupa Saha · Christos Dimitrakakis · Haifeng Xu 🔗 |
-
|
A Generative Model for Text Control in Minecraft
(
Poster
)
link »
Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools are made available for further research. |
Shalev Lifshitz · Keiran Paster · Harris Chan · Jimmy Ba · Sheila McIlraith 🔗 |
-
|
Follow-ups Also Matter: Improving Contextual Bandits via Post-serving Contexts
(
Poster
)
link »
The standard contextual bandit problem assumes that all the relevant contexts are observed before the algorithm chooses an arm. This modeling paradigm, while useful, often falls short when dealing with problems in which additional valuable contexts can be observed after arm selection. For example, content recommendation platforms like YouTube, Instagram, and TikTok receive many additional features about a user's reward after the user clicks a content item (e.g., how long the user stayed, what the user's watch speed was, etc.). To improve online learning efficiency in these applications, we study a novel contextual bandit problem with post-serving contexts and design a new algorithm, poLinUCB, that achieves tight regret under standard assumptions. Core to our technical proof is a robustified and generalized version of the well-known Elliptical Potential Lemma (EPL), which can accommodate noise in data. Such robustification is necessary for tackling our problem, though we believe it could also be of general interest. Extensive empirical tests on both synthetic and real-world datasets demonstrate the significant benefit of utilizing post-serving contexts as well as the superior performance of our algorithm over the state-of-the-art approaches. |
Chaoqi Wang · Ziyu Ye · Zhe Feng · Ashwinkumar Badanidiyuru · Haifeng Xu 🔗 |
-
|
Bayesian Inverse Transition Learning for Offline Settings
(
Poster
)
link »
Offline Reinforcement learning is commonly used for sequential decision-making in domains such as healthcare and education, where the rewards are known and the transition dynamics T must be estimated on the basis of batch data. A key challenge for all tasks is how to learn a reliable estimate of the transition dynamics T that produces near-optimal policies that are safe enough so that they never take actions that are far away from the best action with respect to their value functions, and informative enough so that they communicate the uncertainties they have. Using an expert's feedback, we propose a new constraint-based approach that captures our desiderata for reliably learning a posterior distribution of the transition dynamics T that is free from gradients. Our results demonstrate that by using our constraints, we learn a high-performing policy, while considerably reducing the policy's variance over different datasets. We also explain how combining uncertainty estimation with these constraints can help us infer a partial ranking of actions that produce higher returns, and helps us infer safer and more informative policies for planning. |
Leo Benac · Sonali Parbhoo · Finale Doshi-Velez 🔗 |
-
|
Imitation Learning with Human Eye Gaze via Multi-Objective Prediction
(
Spotlight + Poster
)
link »
Approaches for teaching learning agents via human demonstrations have been widely studied and successfully applied to multiple domains. However, the majority of imitation learning work utilizes only behavioral information from the demonstrator, i.e. which actions were taken, and ignores other useful information. In particular, eye gaze information can give valuable insight towards where the demonstrator is allocating visual attention, and holds the potential to improve agent performance and generalization. In this work, we propose Gaze Regularized Imitation Learning (GRIL), a novel context-aware, imitation learning architecture that learns concurrently from both human demonstrations and eye gaze to solve tasks where visual attention provides important context. We apply GRIL to a visual navigation task, in which an unmanned quadrotor is trained to search for and navigate to a target vehicle in a photorealistic simulated environment. We show that GRIL outperforms several state-of-the-art gaze-based imitation learning algorithms, simultaneously learns to predict human visual attention, and generalizes to scenarios not present in the training data. Supplemental videos and code can be found at https://sites.google.com/view/gaze-regularized-il/. |
Ravi Thakur · MD Sunbeam · Vinicius G. Goecks · Ellen Novoseller · Ritwik Bera · Vernon Lawhern · Greg Gremillion · John Valasek · Nicholas Waytowich 🔗 |
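A minimal sketch of the gaze-regularized objective described above: a shared encoder feeds a behavior-cloning head and a gaze-prediction head, and the training loss combines the two terms. Layer sizes, tensor shapes, and the weighting coefficient `lam` are placeholders, not GRIL's actual architecture.

```python
# Minimal sketch of a gaze-regularized imitation loss; shapes, layer sizes,
# and the weighting coefficient are placeholders, not GRIL's actual design.
import torch
import torch.nn as nn

class GazeRegularizedPolicy(nn.Module):
    def __init__(self, obs_dim=64, n_actions=4, gaze_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.action_head = nn.Linear(128, n_actions)   # behavior-cloning head
        self.gaze_head = nn.Linear(128, gaze_dim)      # predicts gaze coordinates

    def forward(self, obs):
        z = self.encoder(obs)
        return self.action_head(z), self.gaze_head(z)

def gril_style_loss(model, obs, expert_actions, expert_gaze, lam=0.5):
    action_logits, gaze_pred = model(obs)
    bc_loss = nn.functional.cross_entropy(action_logits, expert_actions)
    gaze_loss = nn.functional.mse_loss(gaze_pred, expert_gaze)
    return bc_loss + lam * gaze_loss  # joint imitation + gaze objective

# Toy batch of demonstrations with recorded eye gaze.
obs = torch.randn(32, 64)
acts = torch.randint(0, 4, (32,))
gaze = torch.rand(32, 2)
model = GazeRegularizedPolicy()
loss = gril_style_loss(model, obs, acts, gaze)
loss.backward()
print("loss:", float(loss))
```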
-
|
Learning from a Learning User for Optimal Recommendations
(
Spotlight + Poster
)
link »
In real-world recommendation problems, especially those with a formidably large item space, users have to gradually learn to estimate the utility of any fresh recommendations from their experience about previously consumed items. This in turn affects their interaction dynamics with the system and can invalidate previous algorithms built on the omniscient user assumption. In this paper, we formalize a model to capture such "learning users" and design an efficient system-side learning solution, coined Noise-Robust Active Ellipsoid Search (RAES), to confront the challenges brought by the non-stationary feedback from such a learning user. Interestingly, we prove that the regret of RAES deteriorates gracefully as the convergence rate of user learning becomes worse, until reaching linear regret when the user's learning fails to converge. Experiments on synthetic datasets demonstrate the strength of RAES for such a contemporaneous system-user learning problem. Our study provides a novel perspective on modeling the feedback loop in recommendation problems. |
Fan Yao · Chuanhao Li · Denis Nekipelov · Hongning Wang · Haifeng Xu 🔗 |
-
|
Principal-Driven Reward Design and Agent Policy Alignment via Bilevel-RL
(
Poster
)
link »
In reinforcement learning (RL), a reward function is often assumed at the outset of a policy optimization procedure. Learning in such a fixed reward paradigm in RL can neglect important policy optimization considerations, such as state space coverage and safety. Moreover, it can fail to encompass broader impacts in terms of social welfare, sustainability, or market stability, potentially leading to undesirable emergent behavior and misaligned policies. To mathematically encapsulate the problem of aligning RL policy optimization with such externalities, we consider a bilevel optimization problem and connect it to a principal-agent framework, where the principal specifies the broader goals and constraints of the system at the upper level and the agent solves a Markov Decision Process (MDP) at the lower level. The upper level deals with learning a suitable reward parametrization corresponding to the broader goals, and the lower level deals with learning the policy for the agent. We propose Principal-driven Policy Alignment via Bilevel RL (PPA-BRL), which efficiently aligns the policy of the agent with the principal's goals. We explicitly analyze the dependence of the principal's trajectory on the lower-level policy, and prove the convergence of PPA-BRL to a stationary point of the problem. We illuminate the merits of this framework in view of alignment with several examples spanning energy-efficient manipulation tasks, social welfare-based tax design, and cost-effective robotic navigation. |
Souradip Chakraborty · Amrit Bedi · Alec Koppel · Furong Huang · Mengdi Wang 🔗 |
-
|
Temporally-Extended Prompts Optimization for SAM in Interactive Medical Image Segmentation
(
Poster
)
link »
The $\textit{Segment Anything Model}$ (SAM) has recently emerged as a foundation model for addressing image segmentation. Owing to the intrinsic complexity of medical images and the high annotation cost, the medical image segmentation (MIS) community has been encouraged to investigate SAM's zero-shot capabilities to facilitate automatic annotation. Inspired by the extraordinary accomplishments of the $\textit{interactive}$ medical image segmentation (IMIS) paradigm, this paper focuses on assessing the potential of SAM's zero-shot capabilities within the IMIS paradigm to amplify its benefits in the MIS domain. Regrettably, we observe that SAM's vulnerability to prompt modalities (e.g., points, bounding boxes) becomes notably pronounced in IMIS. This leads us to develop a mechanism that adaptively offers suitable prompt modalities for human experts. We refer to the mechanism above as $\textit{temporally-extended prompts optimization}$ (TEPO) and model it as a Markov decision process, solvable through reinforcement learning. Numerical experiments on the standardized benchmark $\texttt{BraTS2020}$ demonstrate that the learned TEPO agent can further enhance SAM's zero-shot capability in the MIS context.
|
Chuyun Shen · Wenhao Li · Ya Zhang · Xiangfeng Wang 🔗 |
-
|
Survival Instinct in Offline Reinforcement Learning and Implicit Human Bias in Data
(
Spotlight + Poster
)
link »
We present a novel observation about the behavior of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and a certain human bias implicit in common data collection practices. As we prove in this work, pessimism endows the agent with a survival instinct, i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. We argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones. Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is "nudged" to learn a desirable behavior with imperfect reward but purposely biased data coverage. |
Anqi Li · Dipendra Misra · Andrey Kolobov · Ching-An Cheng 🔗 |
-
|
Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences
(
Poster
)
link »
Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement for AI agents. Interactive reward learning from trajectory comparisons (a.k.a. RLHF) is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide richer feedback than binary preference labels, leading to intolerably high feedback complexity and poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows the users to tweak the agent behavior through symbolic concepts (e.g., increasing the softness or speed of agents' movement). We propose two practical methods that can learn to model any kind of behavioral attributes from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors relatively effortlessly, by providing feedback just around ten times. This is over an order of magnitude less than that required by the popular learning-from-human-preferences baselines. |
Lin Guan · Karthik Valmeekam · Subbarao Kambhampati 🔗 |
-
|
Complementing a Policy with a Different Observation Space
(
Poster
)
link »
We consider the problem of improving upon a black-box policy which operates on a different observation space than the learner. Such problems occur when augmenting an existing hand-engineered system with a new machine learning model or in a shared autonomy / human-AI complementarity context. We prove that following the naive policy gradient can lead to a decrease in performance because of incorrect grounding in a different observation space. Then, if we have access to both sets of observation at train time, we derive a method for correctly estimating a policy gradient via an application of the backdoor criterion. If we don't, we prove that under certain assumptions, we can use the proxy correction to correctly estimate a direction of improvement. |
Gokul Swamy · Sanjiban Choudhury · J. Bagnell · Steven Wu 🔗 |
-
|
Cognitive Models as Simulators: Using Cognitive Models to Tap into Implicit Human Feedback
(
Poster
)
link »
In this work, we substantiate the idea of $\textit{cognitive models as simulators}$, which is to have AI systems interact with, and collect feedback from, cognitive models instead of humans, thereby making the training process safer, cheaper, and faster. We leverage this idea in the context of learning a fair behavior toward a counterpart exhibiting various emotional states — as implicit human feedback. As a case study, we adopt the Ultimatum game (UG), a canonical task in behavioral and brain sciences for studying fairness. We show that our reinforcement learning (RL) agents learn to exhibit differential, rationally-justified behaviors under various emotional states of their UG counterpart. We discuss the implications of our work for AI and cognitive science research, and its potential for interactive learning with implicit human feedback.
|
Ardavan S. Nobandegani · Thomas Shultz · Irina Rish 🔗 |
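A toy sketch of the "cognitive model as simulator" loop, assuming a simulated Ultimatum-game responder whose acceptance threshold shifts with its emotional state and a simple epsilon-greedy proposer. The thresholds, acceptance model, and learning rule are illustrative assumptions rather than the paper's cognitive model.

```python
# Sketch of an RL proposer interacting with a simulated Ultimatum-game
# responder whose acceptance threshold depends on its emotional state.
# Thresholds and the learning rule are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
OFFERS = np.linspace(0.1, 0.5, 9)                               # share offered to responder
THRESHOLDS = {"content": 0.2, "neutral": 0.3, "angry": 0.4}     # placeholder cognitive model

def responder_accepts(offer, emotion):
    # Simulated counterpart: noisy acceptance around its emotion-dependent threshold.
    return rng.random() < 1.0 / (1.0 + np.exp(-30 * (offer - THRESHOLDS[emotion])))

# One epsilon-greedy value table per emotional state of the counterpart.
emotions = list(THRESHOLDS)
q = {e: np.zeros(len(OFFERS)) for e in emotions}
counts = {e: np.zeros(len(OFFERS)) for e in emotions}

for t in range(20000):
    emotion = emotions[rng.integers(len(emotions))]
    i = rng.integers(len(OFFERS)) if rng.random() < 0.1 else int(np.argmax(q[emotion]))
    reward = (1.0 - OFFERS[i]) if responder_accepts(OFFERS[i], emotion) else 0.0
    counts[emotion][i] += 1
    q[emotion][i] += (reward - q[emotion][i]) / counts[emotion][i]

for e in emotions:
    print(e, "-> learned offer:", round(float(OFFERS[int(np.argmax(q[e]))]), 2))
```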
-
|
Unraveling the ARC Puzzle: Mimicking Human Solutions with Object-Centric Decision Transformer
(
Poster
)
link »
In the pursuit of artificial general intelligence (AGI), we tackle Abstraction and Reasoning Corpus (ARC) tasks using a novel two-pronged approach. We employ the Decision Transformer in an imitation learning paradigm to model human problem-solving, and introduce an object-detection algorithm, the Push and Pull clustering method. This dual strategy enhances AI's ARC problem-solving skills and provides insights for AGI progression. Yet, our work reveals the need for advanced data collection tools, robust training datasets, and refined model structures. This study highlights potential improvements for Decision Transformers and propels future AGI research. |
JAEHYUN PARK · Jaegyun Im · Sanha Hwang · Mintaek Lim · Sabina Ualibekova · Sejin Kim · Sundong Kim 🔗 |
-
|
Selective Sampling and Imitation Learning via Online Regression
(
Poster
)
link »
We consider the problem of Imitation Learning (IL) by actively querying a noisy expert for feedback. While imitation learning has been empirically successful, much of prior work assumes access to noiseless expert feedback, which is not practical in many applications. In fact, when one only has access to noisy expert feedback, algorithms that rely on purely offline data (non-interactive IL) can be shown to need a prohibitively large number of samples to be successful. In contrast, in this work, we provide an interactive algorithm for IL that uses selective sampling to actively query the noisy expert for feedback. Our contributions are twofold: First, we provide a new selective sampling algorithm that works with general function classes and multiple actions, and obtains the best-known bounds for the regret and the number of queries. Next, we extend this analysis to the problem of IL with noisy expert feedback and provide a new IL algorithm that makes limited queries. Our algorithm for selective sampling leverages function approximation, and relies on an online regression oracle w.r.t. the given model class to predict actions, and to decide whether to query the expert for its label. On the theoretical side, the regret bound of our algorithm is upper bounded by the regret of the online regression oracle, while the query complexity additionally depends on the eluder dimension of the model class. We complement this with a lower bound that demonstrates that our results are tight. We extend our selective sampling algorithm for IL with general function approximation and provide bounds on both the regret and the number of queries made to the noisy expert. A key novelty here is that our regret and query complexity bounds only depend on the number of times the optimal policy (and not the noisy expert, or the learner) goes to states that have a small margin. |
Ayush Sekhari · Karthik Sridharan · Wen Sun · Runzhe Wu 🔗 |
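The query rule can be illustrated with a minimal margin-based sketch: a linear online model scores actions, and the noisy expert is queried only when the gap between the top two scores is small. The linear model, threshold, and update rule stand in for the paper's general online regression oracle and are assumptions for illustration only.

```python
# Minimal margin-based selective-sampling sketch: query the noisy expert only
# when the online model is uncertain. Model, threshold, and update rule are
# placeholders for the paper's general online regression oracle.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_actions, T, margin_thresh = 8, 3, 2000, 0.2
W = np.zeros((n_actions, d))              # one linear scorer per action
true_W = rng.normal(size=(n_actions, d))  # stands in for the noisy expert

queries = 0
for t in range(T):
    x = rng.normal(size=d)                # context
    scores = W @ x
    act = int(np.argmax(scores))          # action the learner executes
    top2 = np.sort(scores)[-2:]
    if top2[1] - top2[0] < margin_thresh:  # small margin: query the expert
        queries += 1
        noisy_label = int(np.argmax(true_W @ x + 0.1 * rng.normal(size=n_actions)))
        target = np.eye(n_actions)[noisy_label]
        W += 0.05 * np.outer(target - softmax(scores), x)  # online update
    # large margin: trust the model's prediction and skip the query.

print("expert queries used:", queries, "out of", T)
```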
-
|
Learning Shared Safety Constraints from Multi-task Demonstrations
(
Poster
)
link »
Regardless of the particular task we want them to perform in an environment, there are often shared safety constraints we want our agents to respect. For example, regardless of whether it is making a sandwich or clearing the table, a kitchen robot should not break a plate. Manually specifying such a constraint can be both time-consuming and error-prone. We show how to learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints. Intuitively, we learn constraints that forbid highly rewarding behavior that the expert could have taken but chose not to. Unfortunately, the constraint learning problem is rather ill-posed and typically leads to overly conservative constraints that forbid all behavior that the expert did not take. We counter this by leveraging diverse demonstrations that naturally occur in multi-task settings to learn a tighter set of constraints. We validate our method with simulation experiments on high-dimensional continuous control tasks. |
Konwoo Kim · Gokul Swamy · Zuxin Liu · Ding Zhao · Sanjiban Choudhury · Steven Wu 🔗 |
-
|
Strategic Apple Tasting
(
Poster
)
link »
Algorithmic decision-making in high-stakes domains often involves assigning decisions to agents with incentives to strategically modify their input to the algorithm. In addition to dealing with incentives, in many domains of interest (e.g. lending and hiring) the decision-maker only observes feedback regarding their policy for rounds in which they assign a positive decision to the agent; this type of feedback is often referred to as apple tasting (or one-sided) feedback. We formalize this setting as an online learning problem with apple-tasting feedback where a principal makes decisions about a sequence of $T$ agents, each of which is represented by a context that may be strategically modified. Our goal is to achieve sublinear strategic regret, which compares the performance of the principal to that of the best fixed policy in hindsight, if the agents were truthful when revealing their contexts. Our main result is a learning algorithm which incurs $\Tilde{\mathcal{O}}(\sqrt{T})$ strategic regret when the sequence of agents is chosen stochastically. We also give an algorithm capable of handling adversarially-chosen agents, albeit at the cost of $\Tilde{\mathcal{O}}(T^{(d+1)/(d+2)})$ strategic regret (where $d$ is the dimension of the context). Our algorithms can be easily adapted to the setting where the principal receives bandit feedback---this setting generalizes both the linear contextual bandit problem (by considering agents with incentives) and the strategic classification problem (by allowing for partial feedback).
|
Keegan Harris · Chara Podimata · Steven Wu 🔗 |
-
|
Discovering User Types: Characterization of User Traits by Task-Specific Behaviors in Reinforcement Learning
(
Poster
)
link »
We often want to infer user traits when personalizing interventions. Approaches like Inverse RL can learn traits formalized as parameters of a Markov Decision Process but are data intensive. Instead of inferring traits for individuals, we study the relationship between RL worlds and the set of user traits. We argue that understanding the breakdown of "user types" within a world -- broad sets of traits that result in the same behavior -- helps rapidly personalize interventions. We show that seemingly different RL worlds admit the same set of user types and formalize this observation as an equivalence relation defined on worlds. We show that these equivalence classes capture many different worlds. We argue that the richness of these classes allows us to transfer insights on intervention design between toy and real worlds. |
Lars L. Ankile · Brian Ham · Kevin Mao · Eura Shin · Siddharth Swaroop · Finale Doshi-Velez · Weiwei Pan 🔗 |
-
|
Provable Offline Reinforcement Learning with Human Feedback
(
Poster
)
link »
In this paper, we investigate the problem of offline reinforcement learning with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs. |
Wenhao Zhan · Masatoshi Uehara · Nathan Kallus · Jason Lee · Wen Sun 🔗 |
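Step (1) can be sketched with the familiar Bradley-Terry-style logistic likelihood over trajectory pairs, shown below; the reward network, trajectory encoding, and data are placeholders, and the pessimistic planning step over a confidence set is not shown.

```python
# Sketch of step (1): maximum-likelihood reward estimation from trajectory
# preferences with a Bradley-Terry-style logistic model. The network and the
# trajectory encoding are placeholders; the pessimistic planning step is omitted.
import torch
import torch.nn as nn

class TrajectoryReward(nn.Module):
    def __init__(self, obs_dim=10):
        super().__init__()
        self.r = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, traj):                    # traj: (T, obs_dim)
        return self.r(traj).sum()               # trajectory return = sum of step rewards

model = TrajectoryReward()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical offline dataset: (preferred trajectory, dispreferred trajectory).
data = [(torch.randn(20, 10), torch.randn(20, 10)) for _ in range(256)]

for epoch in range(5):
    for tau_plus, tau_minus in data:
        margin = model(tau_plus) - model(tau_minus)
        loss = -torch.nn.functional.logsigmoid(margin)   # negative log-likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()
```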
-
|
SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks
(
Poster
)
link »
We introduce SwiftSage, a novel agent framework inspired by the dual-process theory of human cognition, designed to excel in action planning for complex interactive reasoning tasks. SwiftSage integrates the strengths of behavior cloning and prompting large language models (LLMs) to enhance task completion performance. The framework comprises two primary modules: the Swift module, representing fast and intuitive thinking, and the Sage module, emulating deliberate thought processes. The Swift module is a small encoder-decoder LM fine-tuned on the oracle agent's action trajectories, while the Sage module employs LLMs such as GPT-4 for subgoal planning and grounding. We develop a heuristic method to harmoniously integrate the two modules, resulting in a more efficient and robust problem-solving process. In 30 tasks from the ScienceWorld benchmark, SwiftSage significantly outperforms other methods such as SayCan, ReAct, and Reflexion, demonstrating its effectiveness in solving complex real-world tasks. |
Yuchen Lin · Yicheng Fu · Karina Yang · Prithviraj Ammanabrolu · Faeze Brahman · Shiyu Huang · Chandra Bhagavatula · Yejin Choi · Xiang Ren 🔗 |
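A rough sketch of one way such a fast/slow switching loop can be wired, assuming stub modules: a cheap policy proposes an action with a confidence score, and the slower planner is consulted when confidence is low or the previous action failed. The switching heuristic and both module stubs are assumptions, not SwiftSage's actual method.

```python
# Rough sketch of a fast/slow switching loop. Both modules are stubs and the
# switching heuristic is a simplifying assumption, not SwiftSage's actual rule.
import random

def swift_module(observation):
    """Cheap, intuitive policy: returns (action, confidence)."""
    return "go to kitchen", random.random()

def sage_module(observation, history):
    """Expensive deliberate planner (stand-in for an LLM subgoal planner)."""
    return ["open fridge", "pick up milk"]

def run_episode(env_step, max_steps=20, conf_thresh=0.7):
    observation, history, plan = "start", [], []
    for _ in range(max_steps):
        if plan:                                  # still executing the slow plan
            action = plan.pop(0)
        else:
            action, conf = swift_module(observation)
            if conf < conf_thresh or (history and not history[-1][2]):
                plan = sage_module(observation, history)   # fall back to slow thinking
                action = plan.pop(0)
        observation, success, done = env_step(action)
        history.append((observation, action, success))
        if done:
            break
    return history

# Dummy environment step for illustration: (observation, success, done).
dummy_env = lambda action: (f"after:{action}", random.random() > 0.2, False)
print(len(run_episode(dummy_env)), "steps executed")
```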
-
|
How to Query Human Feedback Efficiently in RL?
(
Poster
)
link »
Reinforcement Learning with Human Feedback (RLHF) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals. While RLHF has demonstrated practical success in fine-tuning language models, existing empirical work does not address the challenge of how to efficiently sample trajectory pairs for querying human feedback. In this study, we propose an efficient sampling approach to acquiring exploratory trajectories that enable accurate learning of hidden reward functions before collecting any human feedback. Theoretical analysis demonstrates that our algorithm requires less human feedback for learning the optimal policy under preference-based models with linear parameterization and unknown transitions, compared to the existing literature. Specifically, our framework can incorporate linear and low-rank MDPs with efficient sample complexity. Additionally, we investigate RLHF with action-based comparison feedback and introduce an efficient querying algorithm tailored to this scenario. |
Wenhao Zhan · Masatoshi Uehara · Wen Sun · Jason Lee 🔗 |
-
|
Contextual Bandits and Imitation Learning with Preference-Based Active Queries
(
Poster
)
link »
We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward. Instead, the learner can actively request the expert at each round to compare two actions and receive noisy preference feedback. The learner's objective is two-fold: to minimize regret associated with the executed actions, while simultaneously minimizing the number of comparison queries made to the expert. In this paper, we assume that the learner has access to a function class that can represent the expert's preference model under appropriate link functions and present an algorithm that leverages an online regression oracle with respect to this function class. For the contextual bandit setting, our algorithm achieves a regret bound that combines the best of both worlds, scaling as $O(\min\{\sqrt{T}, d/\Delta\})$, where $T$ represents the number of interactions, $d$ represents the eluder dimension of the function class, and $\Delta$ represents the minimum preference of the optimal action over any suboptimal action under all contexts. Our algorithm does not require the knowledge of $\Delta$, and the obtained regret bound is comparable to what can be achieved in the standard contextual bandits setting where the learner observes reward signals at each round. Additionally, our algorithm makes only $O(\min\{T, d^2/\Delta^2\})$ queries to the expert. We then extend our algorithm to the imitation learning setting, where the agent engages with an unknown environment in episodes of length $H$, and provide similar guarantees regarding regret and query complexity. Interestingly, with preference-based feedback, our imitation learning algorithm can learn a policy outperforming a sub-optimal expert, matching the result from interactive imitation learning algorithms [Ross and Bagnell, 2014] that require access to the expert's actions and also reward signals.
|
Ayush Sekhari · Karthik Sridharan · Wen Sun · Runzhe Wu 🔗 |
-
|
Bayesian Active Meta-Learning under Prior Misspecification
(
Poster
)
link »
We study a setting in which an active meta-learner aims to separate the idiosyncrasies of a particular task environment from information that will transfer between task environments. In a Bayesian setting, this is accomplished by leveraging a prior distribution on the amount of transferable and task-specific information an observation will yield, inducing a large dependency on this prior when data is scarce or environments change frequently. However, a misspecified prior can lead to bias in the inferences made on the basis of the resulting posterior --- i.e., to the acquisition of non-transferable information. For an active meta-learner, this poses a dilemma: should they seek transferable information on the basis of their possibly misspecified prior beliefs, or task-specific information that enables better identification of the current task environment? Using the framework of Bayesian experimental design, we develop a novel diagnostic to detect the risk of non-transferable information acquisition, and leverage this diagnostic to propose an intuitive yet principled way to navigate the meta-learning dilemma --- namely, seek task-specific information when there is risk of non-transferable information acquisition, and transferable information otherwise. We provide a proof-of-concept of our approach in the context of an experiment with synthetic participants. |
Sabina Sloman · Ayush Bharti · Samuel Kaski 🔗 |
-
|
UCB Provably Learns From Inconsistent Human Feedback
(
Spotlight + Poster
)
link »
In this paper, we study how to learn from inconsistent human feedback in the setting of combinatorial bandits with semi-bandit feedback -- where an online learner in every time step chooses a size-$k$ set of arms, observes a stochastic reward for each arm, and endeavors to maximize the sum of the per-arm rewards in the set. We consider the challenging setting where these per-arm rewards are not only set-dependent, but also {\em inconsistent}: the expected reward of arm "a" can be larger than arm "b" in one set, but smaller in another. Inconsistency is often observed in practice, falls outside the purview of many popular semi-bandit models, and in general can result in it being combinatorially hard to find the optimal set. Motivated by the observed practice of using UCB-based algorithms even in settings where they are not strictly justified, our main contribution is to present a simple assumption -- weak optimal set consistency. We show that this assumption allows for inconsistent set-dependent arm rewards, and also subsumes many widely used models for semi-bandit feedback. Most importantly, we show that it ensures that a simple UCB-based algorithm finds the optimal set, and achieves $O\left(\min(\frac{k^3 n \log T}{\epsilon}, k^2\sqrt{n T \log T})\right)$ regret (which nearly matches the lower bound).
|
Shuo Yang · Tongzheng Ren · Inderjit Dhillon · Sujay Sanghavi 🔗 |
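A bare-bones sketch of UCB-style size-k set selection with semi-bandit feedback: choose the k arms with the highest upper-confidence indices and update each chosen arm from its own observed reward. The simulated rewards below are stationary per arm, so the sketch does not reproduce the set-dependent inconsistency the paper analyzes; it only shows the index-based selection rule.

```python
# Bare-bones UCB set selection with semi-bandit feedback. The simulated rewards
# here are stationary per arm, unlike the set-dependent, inconsistent rewards
# the paper analyzes; this only illustrates the index-based selection rule.
import numpy as np

rng = np.random.default_rng(0)
n, k, T = 10, 3, 5000
true_means = rng.uniform(0, 1, n)
counts, means = np.zeros(n), np.zeros(n)

for t in range(1, T + 1):
    ucb = means + np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    ucb[counts == 0] = np.inf                       # force each arm to be tried once
    chosen = np.argsort(ucb)[-k:]                   # size-k set with highest indices
    rewards = rng.binomial(1, true_means[chosen])   # per-arm (semi-bandit) feedback
    counts[chosen] += 1
    means[chosen] += (rewards - means[chosen]) / counts[chosen]

print("best true set :", np.sort(np.argsort(true_means)[-k:]))
print("most played set:", np.sort(np.argsort(counts)[-k:]))
```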
-
|
Contextual Set Selection Under Human Feedback With Model Misspecification
(
Poster
)
link »
A common and efficient way to elicit human feedback is to present users with a set of options, and record their relative preferences on the presented options. The contextual combinatorial bandits problem captures this setting algorithmically; however, it implicitly assumes an underlying consistent reward model for the options. The setting of human feedback (which e.g. may use different reviewers for different samples) means that there may not be any such model -- it is *misspecified*. We first derive a lower-bound for our setting, and then show that model misspecification can lead to catastrophic failure of the C$^2$UCB algorithm (which is otherwise near-optimal when there is no misspecification). We then propose two algorithms: the first algorithm (MC$^2$UCB) requires knowledge of the level of misspecification $\epsilon$ (i.e., the absolute deviation from the closest well-specified model). The second algorithm is a general framework that extends to unknown $\epsilon$. Our theoretical analysis shows that both algorithms achieve near-optimal regret. Further empirical evaluations, conducted both in a synthetic environment and a real-world application of movie recommendations, demonstrate the adaptability of our algorithm to various degrees of misspecification. This highlights the algorithm's ability to effectively learn from human feedback, even with model misspecification.
|
Shuo Yang · Rajat Sen · Sujay Sanghavi 🔗 |
-
|
Building Community Driven Libraries of Natural Programs
(
Poster
)
link »
A typical way in which a machine acquires knowledge from humans is through programs -- sequences of executable commands that can be composed hierarchically. By building a library of programs, a machine can quickly learn how to perform complex tasks. However, as programs are typically created for specific situations, they become brittle when the contexts change, making it difficult to compound knowledge learned from different teachers and contexts. We present natural programming, a library building procedure where each program is represented as a \emph{search problem} containing both a goal and linguistic hints on how to decompose it into sub-goals. A natural program is executed via search in the manner of hierarchical planning and guided by a large language model, effectively adapting learned programs to new contexts. After each successful execution, natural programming learns by improving search, rather than memorizing the solution sequence of commands. Simulated studies and a human experiment (n=360) on a simple crafting environment demonstrate that natural programming can robustly compose programs learned from different users and contexts, solving more complex tasks when compared to baselines that maintain libraries of command sequences. |
Leonardo Hernandez Cano · Yewen Pu · Robert Hawkins · Josh Tenenbaum · Armando Solar-Lezama 🔗 |
-
|
Modeled Cognitive Feedback to Calibrate Uncertainty for Interactive Learning
(
Poster
)
link »
Many interactive learning environments use some measure of uncertainty to estimate how likely the model output is to be correct. The reliability of these estimates is diminished when changes in the environment cause incoming data to drift away from the data the model was trained on. While interactive learning approaches can use implicit feedback to help tune machine learning models to identify and respond to concept drift more quickly, this approach still requires waiting for user feedback before the problem of concept drift can be addressed. We propose that modeled cognitive feedback can supplement implicit feedback by providing human-tuned features to train an uncertainty model that is more resilient to concept drift. In this paper, we introduce modeled cognitive feedback to support interactive learning, and show that an uncertainty model with cognitive features performs better than a baseline model in an environment with concept drift. |
Jaelle Scheuerman · Zachary Bishof · Chris Michael 🔗 |
-
|
Improving Bionic Limb Control through Reinforcement Learning in an Interactive Game Environment
(
Poster
)
link »
Enhancing the accuracy and robustness of bionic limb controllers that decode motor intent is a pressing challenge in the field of prosthetics. State-of-the-art research has mostly focused on Supervised Learning techniques to tackle this problem. However, obtaining high-quality labeled data that accurately represents muscle activity during daily usage remains difficult. In this work, we investigate the potential of Reinforcement Learning to further improve the decoding of human motion intent by incorporating usage-based data. We propose a new method which starts with a control policy, pretrained on a static recording of electromyographic (EMG) ground truth data. We then fine-tune the pretrained classifier with dynamic EMG data obtained during interaction with a game environment developed for this work. We evaluate our approach in real-time experiments, showing substantial improvements for human-in-the-loop performance. The method proves more effective in predicting simultaneous finger movements, doubling the decoding accuracy both during gameplay and in a separate motion test. |
Kilian Freitag · Rita Laezza · Jan Zbinden · Max Ortiz-Catalan 🔗 |
-
|
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
(
Poster
)
link »
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet the imperfections in the proxy reward may hinder the training and lead to suboptimal results; the diversity of objectives in real-world tasks and human opinions exacerbate the issue. This paper proposes embracing the heterogeneity of diverse rewards by following a multi-policy strategy. Rather than focusing on a single a priori reward, we aim for Pareto-optimal generalization across the entire space of preferences. To this end, we propose rewarded soup, first specializing multiple networks independently (one for each proxy reward) and then interpolating their weights linearly. This succeeds empirically because we show that the weights remain linearly connected when fine-tuned on diverse rewards from a shared pre-trained initialization. We demonstrate the effectiveness of our approach for text-to-text (summarization, helpful assistant), text-image (image captioning, text-to-image generation, visual grounding), and control (locomotion) tasks. We hope to enhance the alignment of deep models, and how they interact with the world in all its diversity. |
Alexandre Rame · Guillaume Couairon · Corentin Dancette · Mustafa Shukor · Jean-Baptiste Gaya · Laure Soulier · Matthieu Cord 🔗 |
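The interpolation step itself is easy to sketch: average the parameters of networks fine-tuned from a shared initialization, one per proxy reward, with user-chosen coefficients. The tiny MLP below is a placeholder for a large fine-tuned foundation model, and the `weight_soup` helper is a hypothetical name, not the paper's code.

```python
# Sketch of the "rewarded soup" step: linearly interpolate the weights of
# models fine-tuned on different proxy rewards from a shared initialization.
# The tiny MLP is a placeholder for a large fine-tuned foundation model.
import copy
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

base = make_model()
# Pretend these were fine-tuned on two different proxy rewards.
expert_a, expert_b = copy.deepcopy(base), copy.deepcopy(base)
for p in expert_a.parameters():
    p.data += 0.01 * torch.randn_like(p)
for p in expert_b.parameters():
    p.data += 0.01 * torch.randn_like(p)

def weight_soup(models, coeffs):
    """Return a new model whose parameters are the convex combination of `models`."""
    soup = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in soup.state_dict().items():
            param.copy_(sum(c * m.state_dict()[name] for c, m in zip(coeffs, models)))
    return soup

# A user preference over the two rewards selects a point on the interpolation.
soup = weight_soup([expert_a, expert_b], coeffs=[0.3, 0.7])
print(soup(torch.randn(1, 16)))
```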
-
|
Inverse Preference Learning: Preference-based RL without a Reward Function
(
Poster
)
link »
Reward functions are difficult to design and often hard to align with human intent. Preference-based Reinforcement Learning (RL) algorithms address these problems by learning reward functions from human feedback. However, the majority of preference-based RL methods naïvely combine supervised reward models with off-the-shelf RL algorithms. Contemporary approaches have sought to improve performance and query complexity by using larger and more complex reward architectures such as transformers. Instead of using highly complex architectures, we develop a new and parameter-efficient algorithm, Inverse Preference Learning (IPL), specifically designed for learning from offline preference data. Our key insight is that for a fixed policy, the $Q$-function encodes all information about the reward function, effectively making them interchangeable. Using this insight, we completely eliminate the need for a learned reward function. Our resulting algorithm is simpler and more parameter-efficient. Across a suite of continuous control and robotics benchmarks, IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions while having fewer algorithmic hyperparameters and learned network parameters.
|
Joey Hejna · Dorsa Sadigh 🔗 |
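The key identity behind dropping the reward model can be checked directly: for a fixed policy $\pi$, $r(s,a) = Q(s,a) - \gamma \mathbb{E}_{s'}[V(s')]$ with $V(s) = \mathbb{E}_{a \sim \pi}[Q(s,a)]$. The tabular sketch below only verifies this bookkeeping on random inputs; it is not the IPL algorithm.

```python
# Tabular sketch of the "Q encodes the reward" identity used to drop the
# explicit reward model: r(s, a) = Q(s, a) - gamma * E_{s'}[V(s')], with
# V(s) = E_{a~pi}[Q(s, a)]. This only checks the bookkeeping; it is not IPL.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # transition probabilities P[s, a, s']
pi = rng.dirichlet(np.ones(A), size=S)          # fixed policy pi[s, a]
r = rng.normal(size=(S, A))                     # ground-truth reward

# Policy evaluation: Q(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s').
Q = np.zeros((S, A))
for _ in range(2000):
    V = (pi * Q).sum(axis=1)
    Q = r + gamma * P @ V

# Inverse direction: recover the reward implied by Q under the same policy.
V = (pi * Q).sum(axis=1)
r_implied = Q - gamma * P @ V
print("max reconstruction error:", np.abs(r_implied - r).max())
```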
-
|
Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction
(
Poster
)
link »
Crosslingual conditional generation (e.g., machine translation) has long enjoyed the benefits of scaling. Nonetheless, there are still issues that scale alone may not overcome. For instance, in the absence of additional context, a source query in one language may yield several translation options in another language. Only one translation could be acceptable, however, depending on the translator's preferences and goals. Choosing the incorrect option might significantly affect translation usefulness and quality. We propose a novel method, interactive-chain prompting --- a series of question, answering, and generation intermediate steps between a Translator model and a User model --- that reduces translation into a list of subproblems addressing ambiguities and then resolves such subproblems before producing the final translated text. To check ambiguity resolution capabilities and evaluate translation quality, we create a dataset exhibiting different linguistic phenomena which lead to ambiguities at inference for four languages. To encourage further exploration in this direction, we release all datasets. We note that interactive-chain prompting, using eight interactions as exemplars, consistently surpasses prompt-based methods with direct access to background information to resolve ambiguities. |
Jonathan Pilault · Xavier Garcia · Arthur Brazinskas · Orhan Firat 🔗 |
-
|
Reinforcement learning with Human Feedback: Learning Dynamic Choices via Pessimism
(
Poster
)
link »
In this paper we study offline Reinforcement Learning with Human Feedback (RLHF), where we aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices, which is widely used to model a human decision-making process with forward-looking and bounded rationality. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method and prove that the suboptimality of DCPPO almost matches that of the classical pessimistic offline RL algorithm in terms of the suboptimality's dependency on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF with the dynamic discrete choice model. |
Zihao Li · Zhuoran Yang · Mengdi Wang 🔗 |
-
|
Accelerating exploration and representation learning with offline pre-training
(
Poster
)
link »
Sequential decision-making agents struggle with long-horizon tasks, since solving them requires multi-step reasoning. Most reinforcement learning (RL) algorithms address this challenge through improved credit assignment, by introducing memory capabilities, or by altering the agent's intrinsic motivation (i.e., exploration) or its worldview (i.e., knowledge representation). Many of these components could be learned from offline data. In this work, we follow the hypothesis that exploration and representation learning can be improved by separately learning two different models from a single offline dataset. We show that learning a state representation with noise-contrastive estimation and, separately, a model of auxiliary reward from a single collection of human demonstrations can significantly improve sample efficiency on the challenging NetHack benchmark. We also ablate various components of our experimental setting and highlight crucial insights. |
Bogdan Mazoure · Jake Bruce · Doina Precup · Rob Fergus · Ankit Anand 🔗 |
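A hedged sketch of a noise-contrastive (InfoNCE-style) representation objective of the kind mentioned in the entry above, written in PyTorch; the encoder, batch construction, and any NetHack-specific details are placeholders rather than the paper's actual setup.

import torch
import torch.nn.functional as F

def info_nce_loss(encoder, obs, next_obs, temperature=0.1):
    """Contrast each observation with its true successor against in-batch negatives."""
    z = F.normalize(encoder(obs), dim=-1)            # (B, D) anchor embeddings
    z_next = F.normalize(encoder(next_obs), dim=-1)  # (B, D) positive embeddings
    logits = z @ z_next.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(z.size(0), device=z.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)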
-
|
Active Learning with Crowd Sourcing Improves Information Retrieval
(
Poster
)
link »
In this work, we show how to collect and use human feedback to improve complex models in information retrieval systems. Human feedback often improves model performance, yet little work has demonstrated how to combine human feedback and model tuning in an end-to-end setup with public resources. To this end, we develop a system called Crowd-Coachable Retriever (CCR), which uses crowd-sourced workers and open-source software to improve information retrieval systems by asking humans to label the best document, from a short list of retrieved candidates, for one randomly chosen query at a time. We make two main contributions. First, although the exploration space contains millions of possible documents, we carefully select a few candidates for each query to reduce the human workload. Second, we use latent-variable methods to cross-validate human labels and improve their quality. We benchmark CCR on two large-scale information retrieval datasets, where we actively learn the most relevant documents using baseline models and crowd workers, without accessing the labels provided in the original datasets. We show that CCR robustly improves model performance beyond the zero-shot baselines, and we discuss some key differences with active learning simulations based on holdout data. |
Zhuotong Chen · Yifei Ma · Branislav Kveton · Anoop Deoras 🔗 |
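A hedged sketch of one round of the coaching loop described in the entry above; `retriever.top_k`, `retriever.fine_tune`, `crowd_label`, and `aggregate_labels` are hypothetical stand-ins for the retrieval model, its update step, the crowd-sourcing call, and the latent-variable label aggregation, respectively.

import random

def ccr_round(retriever, queries, corpus, crowd_label, aggregate_labels,
              shortlist_size=5, num_queries=100):
    """One round of crowd-coached retrieval: sample queries, shortlist, label, retrain."""
    training_pairs = []
    for query in random.sample(queries, num_queries):
        # Keep the human workload small: only a short list of top candidates is shown.
        shortlist = retriever.top_k(query, corpus, k=shortlist_size)
        # Several workers pick the best document; their labels are cross-validated
        # (e.g., via a latent-variable worker-quality model) before being trusted.
        votes = [crowd_label(query, shortlist) for _ in range(3)]
        best_doc = aggregate_labels(votes)
        training_pairs.append((query, best_doc))
    retriever.fine_tune(training_pairs)
    return retriever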
-
|
Visual-based Policy Learning with Latent Language Encoding
(
Spotlight + Poster
)
link »
Large Language Models (LLMs) have shown remarkable success in assisting robot learning tasks, e.g., complex household planning. However, the performance of pretrained LLMs relies heavily on domain-specific templated text data, which may be unavailable in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text-only inputs lack the ability to evolve through non-expert interactions with the environment. In this work, we introduce a novel learning paradigm that generates robots' executable actions in the form of text, derived solely from visual observations, using language-based summarization of these observations as the bridge between the two domains. Our proposed paradigm stands apart from previous works, which use either language instructions or a combination of language and visual data as inputs. Moreover, our method does not require oracle text summaries of the scene, eliminating the need for human involvement in the learning loop and making it more practical for real-world robot learning tasks. The proposed paradigm consists of two modules: the SUM module, which interprets the environment from visual observations and produces a text summary of the scene, and the APM module, which generates executable action policies based on the natural-language descriptions provided by the SUM module. We demonstrate that our method can employ two fine-tuning strategies, imitation learning and reinforcement learning, to adapt effectively to the target testing tasks. We conducted extensive experiments with various SUM/APM model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. Our experimental results demonstrate that our method surpasses existing baselines, confirming the effectiveness of this novel learning paradigm. |
Jielin Qiu · Mengdi Xu · William Han · Bo Li · Ding Zhao 🔗 |
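A hedged sketch of the two-module pipeline (SUM, then APM) described in the entry above; the wrappers below are hypothetical interfaces, not the authors' released code.

def act(sum_model, apm_model, image_observation):
    """Visual observation -> language summary -> executable action text (sketch only)."""
    # SUM: interpret the scene from pixels and emit a natural-language description.
    scene_summary = sum_model.describe(image_observation)
    # APM: map the description to an executable action string, e.g. "walk to kitchen".
    # In the paper's framing, APM could be fine-tuned with imitation learning or RL.
    action_text = apm_model.generate(scene_summary)
    return action_text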
-
|
Guided Policy Search for Parameterized Skills using Adverbs
(
Poster
)
link »
We present a method for using adverb phrases to adjust skill parameters via learned adverb-skill groundings. These groundings allow an agent to use adverb feedback provided by a human to directly update a skill policy in a manner similar to traditional local policy search methods. We show that our method can be used as a drop-in replacement for these policy search methods when dense reward from the environment is not available but human language feedback is. We demonstrate improved sample efficiency over modern policy search methods in two experiments. |
Benjamin Spiegel · George Konidaris 🔗 |
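A minimal sketch of the adverb-grounded update described in the entry above, in illustrative notation (the symbols are not taken from the paper): a human adverb phrase $u_t$ (e.g., "a little more gently") is mapped by a learned grounding $g$ to a direction in the skill's parameter space and applied like a gradient step in local policy search,
$$ \theta_{t+1} \;=\; \theta_t + \alpha\, g(\theta_t, u_t), $$
where $\theta_t$ are the current skill parameters and $\alpha$ is a step size; $g$ plays the role that an estimated policy gradient would play if dense environment reward were available.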
Author Information
Andi Peng (MIT)
Akanksha Saran (Microsoft Research)
Andreea Bobu (University of California Berkeley)
Tengyang Xie (University of Illinois at Urbana-Champaign)
Pierre-Yves Oudeyer (Inria)
Dr. Pierre-Yves Oudeyer is Research Director (DR1) at Inria and head of the Inria and Ensta-ParisTech FLOWERS team (France). Before that, he was a permanent researcher at Sony Computer Science Laboratory for 8 years (1999-2007). After working on computational models of language evolution, he now works on developmental and social robotics, focusing on sensorimotor development, language acquisition, and life-long learning in robots. Strongly inspired by infant development, the mechanisms he studies include artificial curiosity, intrinsic motivation, the role of morphology in learning motor control, human-robot interfaces, joint attention and joint intentional understanding, and imitation learning. He has published a book and more than 80 papers in international journals and conferences, holds 8 patents, has given several invited keynote lectures at international conferences, and has received several prizes for his work in developmental robotics and on the origins of language. In particular, he is a laureate of the ERC Starting Grant EXPLORERS. He is editor of the IEEE CIS Newsletter on Autonomous Mental Development, and associate editor of IEEE Transactions on Autonomous Mental Development, Frontiers in Neurorobotics, and the International Journal of Social Robotics. He also works actively on the diffusion of science to the general public, through popular science articles, participation in radio and TV programs, and science exhibitions. Web: http://www.pyoudeyer.com and http://flowers.inria.fr
Anca Dragan (University of California, Berkeley)
John Langford (Microsoft Research)
More from the Same Authors
-
2021 : Provable RL with Exogenous Distractors via Multistep Inverse Dynamics »
Yonathan Efroni · Dipendra Misra · Akshay Krishnamurthy · Alekh Agarwal · John Langford -
2021 : Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning »
Tengyang Xie · Nan Jiang · Huan Wang · Caiming Xiong · Yu Bai -
2022 : Interaction-Grounded Learning with Action-inclusive Feedback »
Tengyang Xie · Akanksha Saran · Dylan Foster · Lekan Molu · Ida Momennejad · Nan Jiang · Paul Mineiro · John Langford -
2022 : A Study of Causal Confusion in Preference-Based Reward Learning »
Jeremy Tien · Zhiyang He · Zackory Erickson · Anca Dragan · Daniel S Brown -
2023 : Preventing Reward Hacking with Occupancy Measure Regularization »
Cassidy Laidlaw · Shivam Singhal · Anca Dragan -
2023 : Video-Guided Skill Discovery »
Manan Tomar · Dibya Ghosh · Vivek Myers · Anca Dragan · Matthew Taylor · Philip Bachman · Sergey Levine -
2023 : Bridging RL Theory and Practice with the Effective Horizon »
Cassidy Laidlaw · Stuart Russell · Anca Dragan -
2023 : Learning Optimal Advantage from Preferences and Mistaking it for Reward »
William Knox · Stephane Hatgis-Kessell · Sigurdur Adalgeirsson · Serena Booth · Anca Dragan · Peter Stone · Scott Niekum -
2023 : The SocialAI School: Insights from Developmental Psychology Towards Artificial Socio-Cultural Agents »
Grgur Kovač · Rémy Portelas · Peter F Dominey · Pierre-Yves Oudeyer -
2023 Poster: Contextual Reliability: When Different Features Matter in Different Contexts »
Gaurav Ghosal · Amrith Setlur · Daniel S Brown · Anca Dragan · Aditi Raghunathan -
2023 Poster: Diagnosis, Feedback, Adaptation: A Human-in-the-Loop Framework for Test-Time Policy Adaptation »
Andi Peng · Aviv Netanyahu · Mark Ho · Tianmin Shu · Andreea Bobu · Julie Shah · Pulkit Agrawal -
2023 Poster: Automatically Auditing Large Language Models via Discrete Optimization »
Erik Jones · Anca Dragan · Aditi Raghunathan · Jacob Steinhardt -
2023 Poster: Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning »
Thomas Carta · Clément Romac · Thomas Wolf · sylvain lamprier · Olivier Sigaud · Pierre-Yves Oudeyer -
2023 Tutorial: Discovering Agent-Centric Latent States in Theory and in Practice »
John Langford · Alex Lamb -
2023 Expo Talk Panel: Vowpal Wabbit: year in review and looking ahead in an LLM world »
John Langford · Byron Xu · Cheng Tan · Jack Gerrits · Lili Wu · Mark Rucker · Olga Vrousgou -
2022 : Beyond Learning from Demonstrations »
Andreea Bobu -
2022 Poster: Asking for Knowledge (AFK): Training RL Agents to Query External Knowledge Using Language »
Iou-Jen Liu · Xingdi Yuan · Marc-Alexandre Côté · Pierre-Yves Oudeyer · Alex Schwing -
2022 Poster: Adversarially Trained Actor Critic for Offline Reinforcement Learning »
Ching-An Cheng · Tengyang Xie · Nan Jiang · Alekh Agarwal -
2022 Spotlight: Asking for Knowledge (AFK): Training RL Agents to Query External Knowledge Using Language »
Iou-Jen Liu · Xingdi Yuan · Marc-Alexandre Côté · Pierre-Yves Oudeyer · Alex Schwing -
2022 Oral: Adversarially Trained Actor Critic for Offline Reinforcement Learning »
Ching-An Cheng · Tengyang Xie · Nan Jiang · Alekh Agarwal -
2022 Poster: Personalization Improves Privacy-Accuracy Tradeoffs in Federated Learning »
Alberto Bietti · Chen-Yu Wei · Miroslav Dudik · John Langford · Steven Wu -
2022 Poster: Contextual Bandits with Large Action Spaces: Made Practical »
Yinglun Zhu · Dylan Foster · John Langford · Paul Mineiro -
2022 Spotlight: Contextual Bandits with Large Action Spaces: Made Practical »
Yinglun Zhu · Dylan Foster · John Langford · Paul Mineiro -
2022 Spotlight: Personalization Improves Privacy-Accuracy Tradeoffs in Federated Learning »
Alberto Bietti · Chen-Yu Wei · Miroslav Dudik · John Langford · Steven Wu -
2022 Poster: Estimating and Penalizing Induced Preference Shifts in Recommender Systems »
Micah Carroll · Anca Dragan · Stuart Russell · Dylan Hadfield-Menell -
2022 Spotlight: Estimating and Penalizing Induced Preference Shifts in Recommender Systems »
Micah Carroll · Anca Dragan · Stuart Russell · Dylan Hadfield-Menell -
2022 : Learning to interact: PARTIAL OBSERVABILITY + GAME Theory of mind on steroids »
Anca Dragan -
2022 : Learning to interact: PARTIAL OBSERVABILITY The actions you take as part of the task are the queries! »
Anca Dragan -
2022 : Q&A »
Dorsa Sadigh · Anca Dragan -
2022 Tutorial: Learning for Interactive Agents »
Dorsa Sadigh · Anca Dragan -
2022 : Learning objectives and preferences: WHAT DATA? From diverse types of human data »
Anca Dragan -
2022 : Introduction »
John Langford -
2021 : Robust Robot Learning via Human-Guided Intermediate Representations »
Andreea Bobu -
2021 : RL Foundation Panel »
Matthew Botvinick · Thomas Dietterich · Leslie Kaelbling · John Langford · Warrren B Powell · Csaba Szepesvari · Lihong Li · Yuxi Li -
2021 Poster: Interaction-Grounded Learning »
Tengyang Xie · John Langford · Paul Mineiro · Ida Momennejad -
2021 Spotlight: Interaction-Grounded Learning »
Tengyang Xie · John Langford · Paul Mineiro · Ida Momennejad -
2021 Poster: Batch Value-function Approximation with Only Realizability »
Tengyang Xie · Nan Jiang -
2021 Poster: ChaCha for Online AutoML »
Qingyun Wu · Chi Wang · John Langford · Paul Mineiro · Marco Rossi -
2021 Poster: TeachMyAgent: a Benchmark for Automatic Curriculum Learning in Deep RL »
Clément Romac · Rémy Portelas · Katja Hofmann · Pierre-Yves Oudeyer -
2021 Spotlight: ChaCha for Online AutoML »
Qingyun Wu · Chi Wang · John Langford · Paul Mineiro · Marco Rossi -
2021 Spotlight: TeachMyAgent: a Benchmark for Automatic Curriculum Learning in Deep RL »
Clément Romac · Rémy Portelas · Katja Hofmann · Pierre-Yves Oudeyer -
2021 Spotlight: Batch Value-function Approximation with Only Realizability »
Tengyang Xie · Nan Jiang -
2021 Poster: Policy Gradient Bayesian Robust Optimization for Imitation Learning »
Zaynah Javed · Daniel Brown · Satvik Sharma · Jerry Zhu · Ashwin Balakrishna · Marek Petrik · Anca Dragan · Ken Goldberg -
2021 Spotlight: Policy Gradient Bayesian Robust Optimization for Imitation Learning »
Zaynah Javed · Daniel Brown · Satvik Sharma · Jerry Zhu · Ashwin Balakrishna · Marek Petrik · Anca Dragan · Ken Goldberg -
2021 Town Hall: Town Hall »
John Langford · Marina Meila · Tong Zhang · Le Song · Stefanie Jegelka · Csaba Szepesvari -
2021 Poster: Value Alignment Verification »
Daniel Brown · Jordan Schneider · Anca Dragan · Scott Niekum -
2021 Spotlight: Value Alignment Verification »
Daniel Brown · Jordan Schneider · Anca Dragan · Scott Niekum -
2021 Expo Workshop: Real World RL: Azure Personalizer & Vowpal Wabbit »
Sheetal Lahabar · Etienne Kintzler · Mark Rucker · Bogdan Mazoure · Qingyun Wu · Pavithra Srinath · Jack Gerrits · Olga Vrousgou · John Langford · Eduardo Salinas -
2020 : Invited Talk 7: Prof. Anca Dragan from UC Berkeley »
Anca Dragan -
2020 : "Active Learning through Physically-embodied, Synthesized-from-“scratch” Queries" »
Anca Dragan -
2020 : Discussion Panel »
Krzysztof Dembczynski · Prateek Jain · Alina Beygelzimer · Inderjit Dhillon · Anna Choromanska · Maryam Majzoubi · Yashoteja Prabhu · John Langford -
2020 Workshop: Workshop on eXtreme Classification: Theory and Applications »
Anna Choromanska · John Langford · Maryam Majzoubi · Yashoteja Prabhu -
2020 Poster: Learning Human Objectives by Evaluating Hypothetical Behavior »
Siddharth Reddy · Anca Dragan · Sergey Levine · Shane Legg · Jan Leike -
2020 Poster: Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning »
Dipendra Kumar Misra · Mikael Henaff · Akshay Krishnamurthy · John Langford -
2019 Workshop: Exploration in Reinforcement Learning Workshop »
Benjamin Eysenbach · Surya Bhupatiraju · Shixiang Gu · Harrison Edwards · Martha White · Pierre-Yves Oudeyer · Kenneth Stanley · Emma Brunskill -
2019 : panel discussion with Craig Boutilier (Google Research), Emma Brunskill (Stanford), Chelsea Finn (Google Brain, Stanford, UC Berkeley), Mohammad Ghavamzadeh (Facebook AI), John Langford (Microsoft Research) and David Silver (Deepmind) »
Peter Stone · Craig Boutilier · Emma Brunskill · Chelsea Finn · John Langford · David Silver · Mohammad Ghavamzadeh -
2019 : Poster Session 1 (all papers) »
Matilde Gargiani · Yochai Zur · Chaim Baskin · Evgenii Zheltonozhskii · Liam Li · Ameet Talwalkar · Xuedong Shang · Harkirat Singh Behl · Atilim Gunes Baydin · Ivo Couckuyt · Tom Dhaene · Chieh Lin · Wei Wei · Min Sun · Orchid Majumder · Michele Donini · Yoshihiko Ozaki · Ryan P. Adams · Christian Geißler · Ping Luo · zhanglin peng · · Ruimao Zhang · John Langford · Rich Caruana · Debadeepta Dey · Charles Weill · Xavi Gonzalvo · Scott Yang · Scott Yak · Eugen Hotaj · Vladimir Macko · Mehryar Mohri · Corinna Cortes · Stefan Webb · Jonathan Chen · Martin Jankowiak · Noah Goodman · Aaron Klein · Frank Hutter · Mojan Javaheripi · Mohammad Samragh · Sungbin Lim · Taesup Kim · SUNGWOONG KIM · Michael Volpp · Iddo Drori · Yamuna Krishnamurthy · Kyunghyun Cho · Stanislaw Jastrzebski · Quentin de Laroussilhe · Mingxing Tan · Xiao Ma · Neil Houlsby · Andrea Gesmundo · Zalán Borsos · Krzysztof Maziarz · Felipe Petroski Such · Joel Lehman · Kenneth Stanley · Jeff Clune · Pieter Gijsbers · Joaquin Vanschoren · Felix Mohr · Eyke Hüllermeier · Zheng Xiong · Wenpeng Zhang · Wenwu Zhu · Weijia Shao · Aleksandra Faust · Michal Valko · Michael Y Li · Hugo Jair Escalante · Marcel Wever · Andrey Khorlin · Tara Javidi · Anthony Francis · Saurajit Mukherjee · Jungtaek Kim · Michael McCourt · Saehoon Kim · Tackgeun You · Seungjin Choi · Nicolas Knudde · Alexander Tornede · Ghassen Jerfel -
2019 : invited talk by John Langford (Microsoft Research): How do we make Real World Reinforcement Learning revolution? »
John Langford -
2019 Poster: Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback »
Chicheng Zhang · Alekh Agarwal · Hal Daumé III · John Langford · Sahand Negahban -
2019 Poster: On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference »
Rohin Shah · Noah Gundotra · Pieter Abbeel · Anca Dragan -
2019 Oral: Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback »
Chicheng Zhang · Alekh Agarwal · Hal Daumé III · John Langford · Sahand Negahban -
2019 Oral: On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference »
Rohin Shah · Noah Gundotra · Pieter Abbeel · Anca Dragan -
2019 Poster: CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning »
Cédric Colas · Pierre-Yves Oudeyer · Olivier Sigaud · Pierre Fournier · Mohamed Chetouani -
2019 Poster: Provably efficient RL with Rich Observations via Latent State Decoding »
Simon Du · Akshay Krishnamurthy · Nan Jiang · Alekh Agarwal · Miroslav Dudik · John Langford -
2019 Poster: Learning a Prior over Intent via Meta-Inverse Reinforcement Learning »
Kelvin Xu · Ellis Ratner · Anca Dragan · Sergey Levine · Chelsea Finn -
2019 Poster: Contextual Memory Trees »
Wen Sun · Alina Beygelzimer · Hal Daumé III · John Langford · Paul Mineiro -
2019 Oral: Provably efficient RL with Rich Observations via Latent State Decoding »
Simon Du · Akshay Krishnamurthy · Nan Jiang · Alekh Agarwal · Miroslav Dudik · John Langford -
2019 Oral: Learning a Prior over Intent via Meta-Inverse Reinforcement Learning »
Kelvin Xu · Ellis Ratner · Anca Dragan · Sergey Levine · Chelsea Finn -
2019 Oral: CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning »
Cédric Colas · Pierre-Yves Oudeyer · Olivier Sigaud · Pierre Fournier · Mohamed Chetouani -
2019 Oral: Contextual Memory Trees »
Wen Sun · Alina Beygelzimer · Hal Daumé III · John Langford · Paul Mineiro -
2018 Poster: An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning »
Dhruv Malik · Malayandi Palaniappan · Jaime Fisac · Dylan Hadfield-Menell · Stuart Russell · Anca Dragan -
2018 Poster: GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms »
Cédric Colas · Olivier Sigaud · Pierre-Yves Oudeyer -
2018 Oral: An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning »
Dhruv Malik · Malayandi Palaniappan · Jaime Fisac · Dylan Hadfield-Menell · Stuart Russell · Anca Dragan -
2018 Oral: GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms »
Cédric Colas · Olivier Sigaud · Pierre-Yves Oudeyer -
2018 Poster: Learning Deep ResNet Blocks Sequentially using Boosting Theory »
Furong Huang · Jordan Ash · John Langford · Robert Schapire -
2018 Oral: Learning Deep ResNet Blocks Sequentially using Boosting Theory »
Furong Huang · Jordan Ash · John Langford · Robert Schapire -
2017 Poster: Contextual Decision Processes with low Bellman rank are PAC-Learnable »
Nan Jiang · Akshay Krishnamurthy · Alekh Agarwal · John Langford · Robert Schapire -
2017 Talk: Contextual Decision Processes with low Bellman rank are PAC-Learnable »
Nan Jiang · Akshay Krishnamurthy · Alekh Agarwal · John Langford · Robert Schapire -
2017 Poster: Logarithmic Time One-Against-Some »
Hal Daumé · Nikos Karampatziakis · John Langford · Paul Mineiro -
2017 Poster: Active Learning for Cost-Sensitive Classification »
Akshay Krishnamurthy · Alekh Agarwal · Tzu-Kuo Huang · Hal Daumé III · John Langford -
2017 Talk: Active Learning for Cost-Sensitive Classification »
Akshay Krishnamurthy · Alekh Agarwal · Tzu-Kuo Huang · Hal Daumé III · John Langford -
2017 Talk: Logarithmic Time One-Against-Some »
Hal Daumé · Nikos Karampatziakis · John Langford · Paul Mineiro -
2017 Tutorial: Real World Interactive Learning »
Alekh Agarwal · John Langford