Workshop
Neural Conversational AI Workshop - What’s left to TEACH (Trustworthy, Enhanced, Adaptable, Capable and Human-centric) chatbots?
Hyundong Cho · Nayeon Lee · Ninareh Mehrabi · Hsuan Su · Jonathan May · Hung-yi Lee · Ahmad Beirami
Meeting Room 313
The recent breathtaking progress made in generative natural language processing (NLP) has been propelled by large language models and innovative learning methods that intersects machine learning (ML) and NLP such as Reinforcement Learning with Human Feedback (RLHF), leading to the creation of impressive chatbots like ChatGPT. However their lack of groundedness, factuality, and interoperability with tools and custom APIs limits them to mostly creative endeavors due to low fidelity and reliability. On the contrary, digital assistants in the real world such as Siri, Alexa, and Google Assistant can interface with proprietary APIs, but they still cover a relatively narrow set of use cases that are mostly simple single-turn interactions. Through the combination of each of their strengths, the goal of deploying truly conversational and capable digital assistants that are also trustworthy seems tantalizingly close. What are the remaining challenges for this goal, and how can the ML and NLP communities come together to overcome them? The goal of this workshop is to bring together machine learning researchers and dialogue researchers from academia and industry to encourage knowledge transfer and collaboration on these central questions to discover ideas that can further expand the use cases of conversational AI. The ideal outcome of the workshop is to identify a set of concrete research directions to enable the next generation of digital assistants.
Schedule
Sat 12:00 p.m. - 12:15 p.m.
|
Opening Remarks
(
Opening Remarks
)
SlidesLive Video |
🔗 |
Sat 12:15 p.m. - 1:00 p.m.
|
Invited Talk: New Frontiers in the Evaluation of Conversational Agents by João Sedoc
(
Keynote Talk
)
SlidesLive Video The rapid advances in large language models brought about disruptive innovations in the field of conversational agents. However, recent advances also present new challenges in evaluating the quality of such systems, as well as the underlying models and methods. As conversational agents increasingly match and or even surpass human performance in dimensions like 'coherence,' we must shift our focus to the qualities of conversational agents that are fundamental to human-like conversation (e.g., empathy and emotion). In this talk, I will focus on how we can integrate psychological metrics for evaluating conversational agents along dimensions such as emotion, empathy, and user traits. I will also introduce our Item Response Theory (IRT) framework, an innovative approach for evaluating the quality of agents across various dimensions. Finally, I will discuss future directions of conversational agent evaluation. |
🔗 |
Sat 1:00 p.m. - 1:30 p.m.
|
Poster & Demo Session
(
Poster
)
|
🔗 |
Sat 1:30 p.m. - 2:15 p.m.
|
Invited Talk: Improving Open Language Models by Learning from Organic Interactions by Jason Weston
(
Keynote Talk
)
SlidesLive Video We discuss techniques that can be used to learn how to improve AIs (dialogue models) by interacting with organic users ``in the wild''. Training models with organic data is challenging because such interactions include both high quality conversations and feedback, as well as adversarial and toxic behavior. We thus study techniques that enable learning from helpful teachers while avoiding learning from people who are trying to trick the model into unhelpful or toxic responses. We present BlenderBot 3x, an update on the conversational model BlenderBot 3, trained on 6M such interactions from participating users of the system, which we also publicly release. BlenderBot 3x is both preferred in conversation to BlenderBot 3, and is shown to produce safer responses in challenging situations. We then discuss how we believe continued use of these techniques -- and improved variants -- can lead to further gains. |
🔗 |
Sat 2:15 p.m. - 3:00 p.m.
|
Invited Talk: Building a dialogue agent for Diplomacy by Emily Dinan
(
Keynote Talk
)
SlidesLive Video Last November we announced Cicero, the first AI agent capable of playing the board game Diplomacy at a human level. In this talk, we'll focus on the language-related aspects of this work, and in particular, how we built a dialogue agent capable of negotiating, coordinating, and strategizing with other humans through natural language in this complex seven-player game. |
🔗 |
Sat 3:00 p.m. - 4:30 p.m.
|
Lunch Break
(
Break
)
|
🔗 |
Sat 4:30 p.m. - 5:15 p.m.
|
Invited Talk: LLMs with long-term memory and better factuality by Zhou Yu
(
Keynote Talk
)
SlidesLive Video Seamlessly communicating with machines has always been the ultimate goal of artificial intelligence. This talk addresses the two key milestones towards general intelligence: how to effectively track infinite long history and improve the generation content's factuality. Specifically, we will talk about a stateful transformer architecture that can achieve effective memory read and write. We will also address how to use reinforcement learning with human feedback to improve generation content's faithfulness. |
🔗 |
Sat 5:15 p.m. - 6:00 p.m.
|
Invited Talk: Embeddings and Retrieval Augmented Generation by Arvind R Neelakantan
(
Keynote Talk
)
Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts. This talk will first focus on our work on embeddings that are useful to perform natural language and code tasks like semantic search, clustering, topic modeling, and classification. Then we will dive deeper into the application of embeddings for retrieval augmented generation with language models. |
Arvind Neelakantan 🔗 |
Sat 6:00 p.m. - 6:30 p.m.
|
Poster & Demo Session
(
Poster
)
|
🔗 |
Sat 6:30 p.m. - 7:15 p.m.
|
Invited Talk: Safer Generative ConvAI by Pascale Fung
(
Keynote Talk
)
SlidesLive Video Generative models for Conversational AI are less than a decade old, but they hold great promise for human-machine interactions. Machine responses based on generative models can seem quite fluent and human-like, empathetic and funny, knowledgeable and professional. However, behind the confident voice of generative ConvAI systems, they can also be hallucinating misinformation, giving biased and harmful views, and are still not "safe" enough for many real life applications. The expressive power of generative ConvAI models and their undesirable behaviors are two sides of the same coin. How can we harness the fluency, diversity, engagingness of generative ConvAI models while mitigating the downside? In this talk, I will present some of our recent work in making generative ConvAI safer via mitigating hallucinations, misinformation, and toxicity. |
Pascale FUNG 🔗 |
Sat 7:15 p.m. - 8:00 p.m.
|
Invited Talk: Neuro-Symbolic Dialogue Management using Prompt-Based Transfer Learning for Dialogue Act Controlled Open-Domain NLG by Marilyn Walker
(
Keynote Talk
)
SlidesLive Video In order to create interesting and engaging conversational interactions with users, open domain SocialBots need to interact using a range of dialogue acts (DAs). For example, a SocialBot should be able to ask factual and opinion questions, inform the user of facts and express opinions, agree and disagree with the user, provide appraisals and acknowledgements, make recommendations or suggestions, and confirm what the user said. For many applications it is also necessary to ground these DAs in knowledge of some kind, either structured or unstructured. In the past, such dialogue-act controlled response generation was typically trained from a large paired corpus that maps from a domain-specific meaning representation that specifies the desired DA and associated attributes, to one or more reference utterances. However recent advances in pretrained language models offer new possibilities for semantically controlled NLG. Here we show that we can achieve near perfect DA and semantic attribute control using Prompt-Based Transfer learning (PBL). We apply an overgenerate and rank method to compare eight few-shot prompt styles that include a novel method of generating from textual pseudo-references using a textual style transfer approach, a second novel approach that provides definitions of DAs in the prompts, inspired by previous work on schema-guided NLG, and a baseline of simply linearizing the MR. To our knowledge, this is the first work on NLG for dialogue that automatically evaluates and ranks outputs using DA accuracy. We then show that we can use PBL to successfully transfer these conversational DAs from WikiData triples in one domain, namely Video Games, to Wikidata triples in three other domains, namely Music, Movies and TV, providing a universal dialogue policy that can be used across all 4 domains in Athena, UCSC's Alexa Prize SocialBot. |
Marilyn Walker 🔗 |
Sat 8:00 p.m. - 8:15 p.m.
|
Closing Remarks
(
Closing Remarks
)
SlidesLive Video |
🔗 |
-
|
DiversiGATE: A Comprehensive Framework for Reliable Large Language Models
(
Poster
)
In this paper, we introduce "DiversiGATE", a unified framework that consolidates diverse methodologies for LLM verification. The proposed framework comprises two main components: Diversification and Aggregation which provide a holistic perspective on existing verification approaches, such as Self-Consistency, Math Prompter and WebGPT. Furthermore, we propose a novel "SelfLearner" model that conforms to the DiversiGATE framework which can learn from its own outputs and refine its performance over time, leading to improved accuracy. To evaluate the effectiveness of SelfLearner, we conducted a rigorous series of experiments, including tests on synthetic data as well as on popular arithmetic reasoning benchmarks such as GSM8K. Our results demonstrate that our approach outperforms traditional LLMs, achieving a considerable 54.8%-> 61.8% improvement on the GSM8K benchmark. |
Shima Imani · Ali Beyram · Harsh Shrivastava 🔗 |
-
|
Can Large Language Models Reason Algorithmically in an Interactive Environment?
(
Poster
)
We are proposing a novel benchmark to evaluate the performance of a large language model to reason following a certain algorithmic procedure such as depth first search (DFS).Our evaluation protocol is designed to be interactive, for example in DFS, the edge connected to one node will only be availble to the tested model after the model has reached this node.Thus, in order to perform such a DFS procedure, the model will need to be able to maintain a memory of which nodes have been visited, and reason about the next node it will go to.We create similar interaction environment with three different algorithms, namely binary search, depth-first search, and breadth-first search.We evaluate the algorithmic reasoning ability of six models using our proposed benchmark and found that there still exists a significant gap between open-sourced Vicuna-13B and the GPT-3.5 model.We hope our benchmark and the experimental findings inspire future works on the direction of algorithmic reasoning in large language models. |
Siwei Yang · Yi Xu · Shitong Xu · Zhongkai Zhao · Bingchen Zhao 🔗 |
-
|
AutoML-GPT: Large Language Model for AutoML
(
Poster
)
With the emerging trend of GPT models, we establish a framework, AutoML-GPT, integrates with a comprehensive set of tools and libraries, granting access to a wide range of data preprocessing techniques, feature engineering methods, and model selection algorithms. Users can specify their requirements, constraints, and evaluation metrics through a conversational interface.Throughout the process, AutoML-GPT employs advanced techniques for hyperparameter optimization, and model selection, ensuring that the resulting model achieves optimal performance. The system effectively manages the complexity of the machine learning pipeline, guiding users towards the best choices without requiring deep domain knowledge.Through our experimental results on diverse datasets, we demonstrate that AutoML-GPT significantly reduces the time and effort required for machine learning tasks. Its ability to leverage the vast knowledge encoded in large language models enables it to provide valuable insights, identify potential pitfalls, and suggest effective solutions to common challenges faced during model training. |
Yun Da Tsai · Yu-Che Tsai · Bo-Wei Huang · Chun-Pai Yang · Shou-De Lin 🔗 |
-
|
TRAC: Trustworthy Retrieval Augmented Chatbot
(
Poster
)
Although neural conversational AIs have demonstrated fantastic performance, they often generate incorrect information, or \textit{hallucinations}. Retrieval augmented generation has emerged as a promising solution to reduce these hallucinations. However, these techniques still cannot guarantee correctness. Focusing on question answering, we propose a framework that can provide statistical guarantees for the retrieval augmented question answering system by combining conformal prediction and global testing. In addition, we use Bayesian optimization to choose hyperparameters of the global test to maximize the performance of the system. Our empirical results on the Natural Questions dataset demonstrate that our method can provide the desired coverage guarantee while minimizing the average prediction set size. |
Shuo Li · Sangdon Park · Insup Lee · Osbert Bastani 🔗 |
-
|
LLM2Loss: Leveraging Language Models for Explainable Model Diagnostics
(
Poster
)
Trained on a vast amount of data, Large Language models (LLMs) have achieved unprecedented success and generalization in modeling fairly complex textual inputs in the abstract space, making them powerful tools for zero-shot learning. Such capability is extended to other modalities such as the visual domain using cross-modal foundation models such as CLIP, and as a result, semantically meaningful representations are extractable from visual inputs. In this work, we leverage this capability and propose an approach that can provide semantic insights into a model's patterns of successes, failures, and biases. Given a black box model, its training data, and task definition, we first calculate its task-related loss for each data point. We then extract a semantically meaningful representation for each training data point (such as CLIP embeddings from its visual encoder) and train a lightweight model which maps this semantically meaningful representation of a data point to its task loss. We show that an ensemble of such lightweight models can be used to generate insights on the performance of the black-box model, in terms of identifying its patterns of failures and biases. |
Shervin Ardeshir 🔗 |
-
|
In-Context Exemplars as Clues to Retrieving \\ from Large Associative Memory
(
Poster
)
Recently, large language models (LLMs) have made remarkable progress in natural language processing (NLP). One of the most notable abilities of LLMs is in-context learning (ICL), which enables LLMs to learn patterns from in-context exemplars without training. However, there remains limited intuition for how in-context learning works. In this paper, we show that ICL can be highly related to retrieving from a modern Hopfield Network (MHN), a model of associative memory that is biologically plausible. We establish a theoretical interpretation of ICL based on an extension of the framework of MHNs. Based on our theory, we propose an Active Exemplar Selection approach that is more efficient than commonly used selection methods. Furthermore, we empirically investigate the influence of exemplars on ICL for different tasks. Our study sheds new light on the mechanism of ICL by connecting it to memory retrieval, with potential implications for advancing the understanding of LLM. |
Jiachen Zhao 🔗 |
-
|
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
(
Poster
)
Large Language Models (LLMs) are increasingly being integrated into various applications. The functionalities of recent LLMs can be flexibly modulated via natural language prompts. This renders them susceptible to targeted adversarial prompting, e.g., Prompt Injection (PI) attacks. So far, it was assumed that the user is directly prompting the LLM. But, what if it is not the user prompting? We argue that LLM-Integrated Applications blur the line between data and instructions. We reveal new attack vectors, using Indirect Prompt Injection, that enable adversaries to remotely (without a direct interface) exploit LLM-integrated applications by strategically injecting prompts into data likely to be retrieved. We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities, including data theft, worming, information ecosystem contamination, and other novel security risks. We demonstrate our attacks' practical viability against both real-world systems, such as Bing's GPT-4 powered Chat and code-completion engines, and synthetic applications built on GPT-4. We show how processing retrieved prompts can act as arbitrary code execution, manipulate the application's functionality, and control how and if other APIs are called. Despite the increasing integration and reliance on LLMs, effective mitigations of these emerging threats are currently lacking. By raising awareness of these vulnerabilities and providing key insights into their implications, we aim to promote the safe and responsible deployment of these powerful models and the development of robust defenses that protect users from potential attacks. |
Sahar Abdelnabi · Kai Greshake · Shailesh Mishra · Christoph Endres · Thorsten Holz · Mario Fritz 🔗 |
-
|
Large Language Models can Share Images, Too!
(
Poster
)
This paper explores the image-sharing capability of Large Language Models (LLMs), such as InstructGPT, ChatGPT, and GPT-4, in a zero-shot setting, without the help of visual foundation models.Inspired by the two-stage process of image-sharing in human dialogues, we propose a two-stage framework that allows LLMs to predict potential image-sharing turns and generate related image descriptions using our effective restriction-based prompt template.With extensive experiments, we unlock the image-sharing capability of LLMs in zero-shot prompting, with GPT-4 achieving state-of-the-art performance.Additionally, we uncover the emergent image-sharing ability in zero-shot prompting, demonstrating the effectiveness of restriction-based prompts in both stages of our framework.Based on this framework, we augment the PhotoChat dataset with images generated by Stable Diffusion at predicted turns, namely PhotoChat++.To our knowledge, this is the first study to assess the image-sharing ability of LLMs in a zero-shot setting without visual foundation models.The source code and the dataset will be released after publication. |
Young-Jun Lee · Jonghwan Hyeon · Ho-Jin Choi 🔗 |
-
|
Conformal Prediction with Large Language Models for Multi-Choice Question Answering
(
Poster
)
As large language models continue to be developed at scale, robust uncertainty quantification techniques will become crucial for their safe deployment in high-stakes scenarios. In this work, we explore how conformal prediction can be used to provide uncertainty quantification in language models for the specific task of multiple-choice question-answering. We find that the uncertainty estimates from conformal prediction are tightly correlated with prediction accuracy, an observation that can be useful for downstream applications such as selective classification and filtering out low-quality predictions. We also investigate the exchangeability assumption required by conformal prediction to out-of-subject questions, which may be a more realistic scenario for many practical applications. Our work contributes towards more trustworthy and reliable usage of large language models in safety-critical situations, where robust guarantees of error rate are required. |
Charles Lu · Bhawesh Kumar · Gauri Gupta · Anil Palepu · David Bellamy · Ramesh Raskar · Andrew Beam 🔗 |
-
|
Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents
(
Poster
)
We investigate the challenge of task planning for multi-task embodied agents in open-world environments. Two main difficulties are identified: 1) executing plans in an open-world environment (e.g., Minecraft) necessitates accurate and multi- step reasoning due to the long-term nature of tasks, and 2) as vanilla planners do not consider how easy the current agent can achieve a given sub-task when ordering parallel sub-goals within a complicated plan, the resulting plan could be inefficient or even infeasible. To this end, we propose “Describe, Explain, Plan and Select” (DEPS), an interactive planning approach based on Large Language Models (LLMs). DEPS facilitates better error correction on initial LLM-generated plan by integrating description of the plan execution process and providing self-explanation of feedback when encountering failures during the extended planning phases. Furthermore, it includes a goal selector, which is a trainable module that ranks parallel candidate sub-goals based on the estimated steps of completion, consequently refining the initial plan. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly doubles the overall performances. Further testing reveals our method’s general effectiveness in popularly adopted non-open-ended domains as well (i.e., ALFWorld and tabletop manipulation). The ablation and exploratory studies detail how our design beats the counterparts and provide a promising update on the ObtainDiamond grand challenge with our approach. |
Zihao Wang · Shaofei Cai · Guanzhou Chen · Anji Liu · Xiaojian Ma · Yitao Liang 🔗 |
-
|
Teaching Arithmetic to Small Transformers
(
Poster
)
Large language models like GPT-4 exhibit emergent general-purpose capabilities, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from scratch, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective.We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that include intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed.We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities. |
Nayoung Lee · Kartik Sreenivasan · Jason Lee · Kangwook Lee · Dimitris Papailiopoulos 🔗 |
-
|
Situated Interaction with Real-Time State Conditioning of Language Models
(
Poster
)
Recent advances in large language model fine-tuning datasets and techniques have made them flourish as general dialogue-based assistants that are well-suited to strictly turn-based interactions. However, maintaining consistency in long-range, multi-turn dialogues remains a challenge with many applications restricting conversations to a short window. Current multi-modal vision-based interactions are also limited to turn-based interactions on a static sequence of tokenized images with VQA-style referential querying. In this work, we present an approach for performing real-time, vision-based dynamic interaction with an auto-regressive language model. Our approach enables long-range consistency through continual visual grounding of language model inputs. Grounding makes use of a winnowing mechanism to reduce a raw stream of pixels hierarchically, to a series of discrete events as conditioning variables for the language model. We present a novel dataset and benchmark for situated, visual interaction in the form of exercise coaching, and show that our approach can generate relevant and useful responses grounded in a real-time camera stream. |
Sunny Panchal · Guillaume Berger · Antoine Mercier · Cornelius Böhm · Florian Dietrichkeit · Xuanlin Li · Reza Pourreza · Pulkit Madan · Apratim Bhattacharyya · Mingu Lee · Mark Todorovich · Ingo Bax · Roland Memisevic
|
-
|
LLM Guided Inductive Inference for Solving Compositional Problems
(
Poster
)
While large language models (LLMs) have demonstrated impressive performancein question-answering tasks, their performance is limited when the questions requireknowledge that is not included in the model’s training data and can only be acquiredthrough direct observation or interaction with the real world. Existing methods decompose reasoning tasks through the use of modules invoked sequentially, limiting their ability to answer deep reasoning tasks. We introducea method, Recursion based extensible LLM (REBEL), which handles open-world, deep reasoning tasks by employing automated reasoning techniques like dynamic planning and forward-chaining strategies. REBEL allows LLMs to reason via recursive problem decomposition and utilization of external tools. The tools that REBEL uses are specified only by natural language description. We further demonstrate REBEL capabilities on a set of problems that require a deeply nested use of external tools in a compositional and conversational setting. |
Abhigya Sodani · Lauren Moos · Matthew Mirman 🔗 |
-
|
Scalable Conversational Moderation: Promoting Constructive Dialogue to Reduce Online Polarization
(
Poster
)
As the number of online users continues to grow and societal polarization deepens, there is a growing need for effective moderation strategies that can scale alongside these trends. Traditional approaches to automated moderation, such as banning or deleting comments, often exacerbate polarization by driving users toward echo chambers. In this paper, we propose a novel approach to automatic moderation called conversational moderation that leverages conversational AI as moderators to create a more accommodating and constructive online environment. In this paper, we propose a novel approach to automatic moderation called conversational moderation that leverages conversational AI as moderators to create a more accommodating and constructive online environment. In this paper, we present the first study that leverages large language models as conversational AI to function as moderators and evaluates their performance in guiding simulated continuations of controversial conversations on Reddit towards more constructive outcomes. We take an iterative strategy to prompt engineering using self-talk to adapt large language models as various types of moderator bots. Our preliminary experiments reveal that prompts integrating conflict resolution and effective communication techniques can yield improvements in coherency and understandingness, but the high level of subjectivity in this task renders these results statistically insignificant. Our findings thus far demonstrate that even state-of-the-art language models often repeating boilerplate guidelines and thus fail to effectively conduct conversational moderation. |
Hyundong Cho · Jonathan May 🔗 |
-
|
Trust and ethical considerations in a multi-modal, explainable AI-driven chatbot tutoring system: The case of collaboratively solving Rubik’s Cube
(
Poster
)
Artificial intelligence (AI) has the potential of transforming education with its power of uncovering insights from massive data about student learning patterns. However, ethical and trustworthy concerns about AI have been raised but are unsolved. Prominent ethical issues in high school AI education include data privacy, information leakage, abusive language, and fairness. This paper describes technological components that were built to address ethical and trustworthy concerns in a multi-modal collaborative platform (called ALLURE chatbot) for high school students to collaborate with AI to solve the Rubik’s cube. In data privacy, we want to ensure that the informed consent of children or parents, and teachers, are at the center of any data that is managed. Since children are involved, language, whether textual, audio or visual, is acceptable both from users and AI and the system is able to steer interaction away from dangerous situations. In information management, we also want to ensure that the system, while learning to improve over time, does not leak information about users from one group to another. |
Kausik Lakkaraju · Vedant Khandelwal · Biplav Srivastava · Forest Agostinelli · Hengtao Tang · Prathamjeet Singh · Dezhi Wu · Matt Irvin · Ashish Kundu 🔗 |
-
|
Can Chatbots “Understand”? Evidence of Meaning in Language Models Trained on Programs
(
Poster
)
We present evidence that language models can learn meaning despite being trained only to perform next token prediction on text, specifically a corpus of programs. Each program is preceded by a specification in the form of (textual) input-output examples. Working with programs enables us to precisely define concepts relevant to meaning in language (e.g., correctness and semantics), making program synthesis well-suited as an intermediate testbed for characterizing the presence (or absence) of meaning in language models.We first train a Transformer model on the corpus of programs, then probe the trained model's hidden states as it completes a program given a specification. Despite providing no inductive bias toward learning the semantics of the language, we find that a linear probe is able to extract abstractions of both current and future program states from the model states. We also demonstrate that the model learns to generate correct programs that are, on average, shorter than those in the training set, which is evidence that language model outputs may differ from the training distribution in semantically meaningful ways. In summary, this paper does not propose any new techniques for improving current language models, but develops an experimental framework for and provides insights into the acquisition and representation of (formal) meaning in language models. |
Charles Jin · Martin Rinard 🔗 |
-
|
Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning
(
Poster
)
Language models still struggle on moral reasoning, despite their impressive performance in many other tasks. In particular, the Moral Scenarios task in MMLU (Multi-task Language Understanding) is among the worst performing tasks for many language models, including GPT-3. In this work, we propose a new prompting framework, Thought Experiments, to teach language models to do better moral reasoning using counterfactuals. Experiment results show that our framework elicits counterfactual questions and answers from the model, which in turn helps improve the accuracy on Moral Scenarios task by 9% to 16% compared to other zero-shot baselines. Interestingly, unlike math reasoning tasks, zero-shot Chain-of-Thought (CoT) reasoning doesn’t work out of the box. Zero-shot CoT reduces accuracy by around 4% compared to direct zero-shot. We further observed that with minimal human supervision in the form of 5 few-shot examples, the accuracy of the task can be improved to as much as 80.45%. |
Xiao Ma · Swaroop Mishra · Ahmad Beirami · Alex Beutel · Jilin Chen 🔗 |
-
|
Robustness through Loss Consistency Regularization
(
Poster
)
In the continually evolving landscape of Natural Language Processing (NLP), enhancing the robustness and resilience of deep learning models is critical. Traditional models employ Empirical Risk Minimization (ERM), but its susceptibility to distribution shifts and adversarial attacks undermines its efficacy. To address these limitations, many utilize Data Augmentation followed by ERM (DA-ERM) and consistency regularization. Unfortunately, these methods are not applicable to covariant data augmentation, where the label of the augmented data hinges on the augmentation process itself and hence cannot work on generative models. In this paper, we present a novel technique called Data Augmented Loss Invariant Regularization (DAIR), which operates directly at the loss level, circumventing the restrictions of conventional methods and extending its applicability to covariant data augmentation. Importantly, DAIR's robustness is independent of network architecture, problem setup, or task, thereby expanding its suitability for a broad range of NLP challenges. Finally, our experiments on Task-Oriented Dialog highlight DAIR's superiority over conventional methods, setting new benchmarks in NLP tasks with minimal extra computational cost. |
Tianjian Huang · Shaunak A Halbe · Chinnadhurai Sankar · Pooyan Amini · Satwik Kottur · Alborz Geramifard · Meisam Razaviyayn · Ahmad Beirami 🔗 |
-
|
Idiolect: A Reconfigurable Voice Coding Assistant
(
Poster
)
Idiolect is an open source (https://github.com/OpenASR/idiolect) tool for voice coding and a novel approach to building chatbots that lets users define custom commands and grammars on-the-fly. Unlike traditional chatbots, Idiolect does not impersonate an obedient assistant, but instead offers a highly-configurable interface for automating repetitive programming tasks. Featuring advanced support for error correction and audio-visual feedback, Idiolect promises to enhance the usability and accessibility of manual developer tools by offering an alternate input method for keyboardless programming. The following paper explores the mechanics of voice programming in Idiolect, and sheds light into the challenges faced and insights gained during its development. |
Breandan Considine · Nicholas Albion · Xujie Si 🔗 |
-
|
LMQL Chat: Scripted Chatbot Development
(
Poster
)
We introduce LMQL Chat, a powerful open-source framework for building interactive systems on top of large language models, making it easy to create conversational agents with features like tool usage, internal reflection or safety constraints.We provide a video demonstration at https://drive.google.com/file/d/1lCedYCdsLkHgF29-MAIBltXBREeR6QJt/view?usp=sharing. |
Luca Beurer-Kellner · Marc Fischer · Martin Vechev 🔗 |
-
|
Assessing Spoken Language Understanding Pipeline of a Multimodal Dialogue System for Kids Learning Math at Home
(
Poster
)
Enriching the quality of early childhood education with interactive math learning at home systems, empowered by recent advances in conversational AI technologies, is slowly becoming a reality. With this motivation, we implement a multimodal dialogue system to support play-based math learning experiences at home, guiding kids to master basic math concepts. This work explores Spoken Language Understanding (SLU) pipeline within a task-oriented dialogue system developed for children, with cascading Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) components evaluated on our home deployment data with kids going through gamified math learning activities. We validate the advantages of a multi-task architecture for NLU and experiment with a diverse set of pretrained language representations for Intent Recognition and Entity Extraction tasks in the math learning domain. To recognize kids' speech in realistic home environments, we investigate several ASR systems, including the commercial Google Cloud and the latest open-source Whisper solutions with varying model sizes. We evaluate the SLU pipeline by testing our best-performing NLU models on noisy ASR output to inspect the challenges of understanding children for math learning in authentic homes. |
Eda Okur · Roddy Fuentes Alba · Saurav Sahay · Lama Nachman 🔗 |
-
|
Disclosing the Biases in Large Language Models via Reward Based Questioning
(
Poster
)
The success of large language models has been utterly demonstrated in recent times. Using these models and fine tuning for the specific task at hand results in high performance. However, these models also learn biased representations from the data they have been trained on. In particular, several studies recently showed that language models can learn to be biased towards certain genders. Quite recently, several studies tried to eliminate this bias via proposing human feedback included in fine-tuning. In our study we show that by changing the question asked to the language model the log probabilities of the bias measured in the responses changes dramatically. Furthermore, in several cases the language model ends up providing a completely opposite response. The recent language models finetuned on the prior gender bias datasets do not resolve the actual problem, but rather alleviate the problem for the dataset on which the model is fine-tuned. We believe our results might lay the foundation for further alignment and safety problems in large language models. |
Ezgi Korkmaz 🔗 |