Artificial intelligence (AI) and Human-Computer Interaction (HCI) share common roots: early work on conversational agents laid the foundation for both fields. In the decades that followed, however, economic and political influences drove the two fields apart. The recent rise of data-centric methods in machine learning has propelled few-shot, emergent AI capabilities, resulting in a raft of practical tools. In particular, modern AI techniques now power new ways for machines and humans to interact. Recently, a wave of HCI tasks has been proposed to the machine learning community; these tasks direct AI research by contributing new datasets and benchmarks and by challenging existing modeling techniques, learning methodologies, and evaluation protocols. This workshop offers a forum for researchers to discuss these new research directions: identifying important challenges, showcasing new computational and scientific ideas that can be applied, sharing datasets and tools that are already available, and proposing those that should be further developed.
Sat 12:00 p.m. - 12:10 p.m. | Meet & Setup for Morning Poster Session
Sat 12:10 p.m. - 12:20 p.m. | Opening Remarks (Presentation)
Sat 12:20 p.m. - 1:00 p.m. | “AI For Good” Isn’t Good Enough: A Call for Human-Centered AI by James Landay (Presentation)
Sat 1:00 p.m. - 1:10 p.m. | Break
Sat 1:10 p.m. - 1:50 p.m. | Designing Easy and Useful Human Feedback by Anca Dragan (Presentation)
Sat 1:50 p.m. - 2:50 p.m. | Morning Poster Session (Poster Session)
Sat 2:50 p.m. - 4:30 p.m. | Lunch
Sat 4:30 p.m. - 4:40 p.m. | Setup for Afternoon Poster Session
Sat 4:40 p.m. - 5:20 p.m. | Beyond RLHF: A Human-Centered Approach to AI Development and Evaluation by Meredith Ringel Morris (Presentation)
Sat 5:20 p.m. - 6:00 p.m. | Human-Centered AI Transparency: Lessons Learned and Open Questions in the Age of LLMs by Q. Vera Liao (Presentation)
Sat 6:00 p.m. - 6:10 p.m. | Break
Sat 6:10 p.m. - 6:50 p.m. | Detecting and Countering Untrustworthy Artificial Intelligence by Nikola Banovic (Presentation)
Sat 6:50 p.m. - 7:50 p.m. | Afternoon Poster Session (Poster Session)
Sat 7:50 p.m. - 8:00 p.m. | Closing Remarks (Presentation)
-
Designing interactions with AI to support the scientific peer review process (Morning Poster)
Peer review processes include a series of activities, from review writing to meta-review authoring. Recent advances in AI exhibit the potential to augment complex human writing activities. However, it is still not clear how to design interactive systems that leverage AI to support the scientific peer review process, or what the potential trade-offs are. In this paper, we prototype a system, MetaWriter, which uses three forms of AI to support meta-review authoring and offers useful functionalities including review aspect highlights, viewpoint extraction, and hybrid draft generation. In a within-subjects experiment, 32 participants wrote meta-reviews using MetaWriter and a baseline environment with no machine support. We show that MetaWriter can expedite and improve the meta-review authoring process, but participants raised concerns about trust, over-reliance, and agency. We further discuss insights on designing interactions with AI to support the scientific peer review process. |
Lu Sun · Stone Tao · Junjie Hu · Steven Dow 🔗 |
-
Symbiotic Co-Creation with AI (Morning Poster)
The quest for symbiotic co-creation between humans and artificial intelligence (AI) has received considerable attention in recent years. This paper explores the challenges and opportunities associated with human-AI interaction, focusing on the unique qualities that distinguish symbiotic interactions from conventional human-to-tool relationships. The role of representation learning and multimodal models in enabling symbiotic co-creation is discussed, emphasising their potential to overcome the limitations of language and tap into deeper layers of symbolic representation. In addition, the concept of AI as design material is explored, highlighting how the latent spatial representation of generative models becomes a field of possibilities for human creators. It also explores novel creative affordances of AI interfaces, including combinational, exploratory and transformational creativity. The paper concludes by highlighting the transformative potential of AI in enhancing human creativity and shaping new frontiers of collaborative creation. |
Ninon Lizé Masclef 🔗 |
-
Discovering User Types: Characterization of User Traits by Task-Specific Behaviors in Reinforcement Learning (Morning Poster)
We often want to infer user traits when personalizing interventions. Approaches like Inverse RL can learn traits formalized as parameters of a Markov Decision Process but are data intensive. Instead of inferring traits for individuals, we study the relationship between RL worlds and the set of user traits. We argue that understanding the breakdown of "user types" within a world -- broad sets of traits that result in the same behavior -- helps rapidly personalize interventions. We show that seemingly different RL worlds admit the same set of user types and formalize this observation as an equivalence relation defined on worlds. We show that these equivalence classes capture many different worlds. We argue that the richness of these classes allows us to transfer insights on intervention design between toy and real worlds. |
Lars L. Ankile · Brian Ham · Kevin Mao · Eura Shin · Siddharth Swaroop · Finale Doshi-Velez · Weiwei Pan 🔗 |
-
Exploring Open Domain Image Super-Resolution through Text (Morning Poster)
In this work, we propose for the first time a zero-shot approach for flexible open domain extreme super-resolution of images which allows users to interactively explore plausible solutions by using language prompts. Our approach exploits a recent diffusion based text-to-image (T2I) generative model. We modify the generative process of the T2I diffusion model to analytically enforce data consistency of the solution and explore diverse contents of null-space using text guidance. Our approach results in diverse solutions which are simultaneously consistent with input text and the low resolution images. |
Kanchana Vaishnavi Gandikota · Paramanand Chandramouli 🔗 |
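To make the phrase "analytically enforce data consistency of the solution and explore diverse contents of null-space" in the abstract above concrete, the sketch below shows the standard range-null-space decomposition commonly used for this purpose. It is a sketch under the assumption that the degradation operator A (e.g., downsampling) has full row rank and a known pseudo-inverse; the authors' exact formulation may differ in detail.

```latex
% Sketch: range-null-space decomposition for data consistency in zero-shot
% super-resolution. A is the known degradation operator, A^+ its pseudo-inverse,
% y the low-resolution input, and x_{0|t} the diffusion model's current estimate.
\hat{x}_{0|t} \;=\; \underbrace{A^{+} y}_{\text{range space: fixed by } y}
\;+\; \underbrace{\bigl(I - A^{+}A\bigr)\, x_{0|t}}_{\text{null space: free, steered by text}}
\qquad\Longrightarrow\qquad A\,\hat{x}_{0|t} = y .
```

Because A A^+ = I for a full-row-rank downsampling operator, the reconstruction matches the low-resolution input exactly, while the text prompt only influences the null-space component.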
-
An Interactive Human-Machine Learning Interface for Collecting and Learning from Complex Annotations (Morning Poster)
Human-Computer Interaction has been shown to lead to improvements in machine learning systems by boosting model performance, accelerating learning and building user confidence. In this work, we propose a human-machine learning interface for binary classification tasks with the goal of allowing humans to provide richer forms of supervision and feedback that go beyond standard binary labels as annotations for a dataset. We aim to reverse the expectation that human annotators adapt to the constraints imposed by labels, by allowing for extra flexibility in the form that supervision information is collected. For this, we introduce the concept of task-oriented meta-evaluations and propose a prototype tool to efficiently capture the human insights or knowledge about a task. Finally we discuss the challenges which face future extensions of this work. |
Jonathan Erskine · Raul Santos-Rodriguez · Alexander Hepburn · Matt Clifford 🔗 |
-
Exploring Mobile UI Layout Generation using Large Language Models Guided by UI Grammar (Morning Poster)
The recent advances in Large Language Models (LLMs) have stimulated interest among researchers and industry professionals, particularly in their application to tasks concerning mobile user interfaces (UIs). This position paper investigates the use of LLMs for UI layout generation. Central to our exploration is the introduction of UI grammar, a novel approach we propose to represent the hierarchical structure inherent in UI screens. The aim of this approach is to guide the generative capacities of LLMs more effectively and improve the explainability and controllability of the process. Initial experiments conducted with GPT-4 showed the promising capability of LLMs to produce high-quality user interfaces via in-context learning. Furthermore, our preliminary comparative study suggested the potential of the grammar-based approach in improving the quality of generative results in specific aspects. |
Yuwen Lu · Ziang Tong · Anthea Zhao · Chengzhi Zhang · Toby Li 🔗 |
-
Do Users Write More Insecure Code with AI Assistants? (Morning Poster)
We conduct the first large-scale user study examining how users interact with an AI code assistant to solve a variety of security-related tasks across different programming languages. Overall, we find that participants who had access to an AI assistant based on OpenAI's codex-davinci-002 model wrote less secure code than those without access. Additionally, participants with access to an AI assistant were more likely to believe they wrote secure code than those without access to the AI assistant. Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g., re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities. Finally, in order to better inform the design of future AI assistants, we provide an in-depth analysis of participants' language and interaction behavior, as well as release our user interface as an instrument to conduct similar studies in the future. |
Megha Srivastava 🔗 |
-
Give Weight to Human Reactions: Optimizing Complementary AI in Practical Human-AI Teams (Morning Poster)
With the rapid development of decision aids that are driven by AI models, the practice of human-AI joint decision making has become increasingly prevalent. To improve the human-AI team performance in decision making, earlier studies mostly focus on enhancing humans' capability in better utilizing a given AI-driven decision aid. In this paper, we tackle this challenge through a complementary approach---we aim to adjust the designs of the AI model underlying the decision aid by taking humans' reaction to AI into consideration. In particular, as humans are observed to accept AI advice more when their confidence in their own decision is low, we propose to train AI models with a human-confidence-based instance weighting strategy, instead of solving the standard empirical risk minimization problem. Under an assumed, threshold-based model characterizing when humans will adopt the AI advice, we first derive the optimal instance weighting strategy for training AI models. We then validate the efficacy of our proposed method in improving the human-AI joint decision making performance through systematic experimentation on both synthetic and real-world datasets. |
Hasan Amin · Zhuoran Lu · Ming Yin 🔗 |
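As a concrete illustration of the instance-weighting idea described in the abstract above, the sketch below trains a classifier whose per-example weights grow as the human decision-maker's self-reported confidence falls, so training concentrates on cases where AI advice is likely to be adopted. The particular weighting function and the use of scikit-learn are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def confidence_weights(human_confidence, floor=0.1):
    """Map human confidence in [0, 1] to instance weights.

    Illustrative choice: weight = 1 - confidence (clipped at a floor), so
    examples where humans are unsure, and hence likely to adopt AI advice,
    dominate the training objective.
    """
    return np.clip(1.0 - human_confidence, floor, 1.0)

def train_complementary_model(X, y, human_confidence):
    """Weighted empirical risk minimization instead of the standard ERM."""
    w = confidence_weights(np.asarray(human_confidence))
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=w)  # scikit-learn supports per-sample weights
    return model

# Usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
conf = rng.uniform(size=500)          # stand-in for measured human confidence
model = train_complementary_model(X, y, conf)
```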
-
Adaptive interventions for both accuracy and time in AI-assisted human decision making (Morning Poster)
In settings where users are both time-pressured and need high accuracy, such as doctors working in Emergency Rooms, we want to provide AI assistance that both increases accuracy and reduces time. However, different types of AI assistance have different benefits: some reduce time taken while increasing overreliance on AI, while others do the opposite. We therefore want to adapt what AI assistance we show depending on various properties (of the question and of the user) in order to best tradeoff our two objectives. We introduce a study where users have to prescribe medicines to aliens, and use it to explore the potential for adapting AI assistance. We find evidence that it is beneficial to adapt our AI assistance depending on the question, leading to good tradeoffs between time taken and accuracy. Future work would consider machine-learning algorithms (such as reinforcement learning) to automatically adapt quickly. |
Siddharth Swaroop · Zana Buçinca · Krzysztof Gajos · Finale Doshi-Velez 🔗 |
-
Iterative Disambiguation: Towards LLM-Supported Programming and System Design (Morning Poster)
LLMs offer unprecedented capabilities for generating code and prose; creating systems that take advantage of these capabilities can be challenging. We propose an artifact-centered iterative disambiguation process for using LLMs to iteratively refine an LLM-based system of subcomponents, each of which is in turn defined and/or implemented by an LLM. A system implementing this process could expand the experience of end-user computing to include user-defined programs capable of nearly any computable activity; here, we propose one approach to explore iterative disambiguation for end-user system design. |
J.D. Zamfirescu-Pereira · Bjorn Hartmann 🔗 |
-
HateXplain2.0: An Explainable Hate Speech Detection Framework Utilizing Subjective Projection from Contextual Knowledge Space to Disjoint Concept Space (Morning Poster)
Finetuning large pre-trained language models on specific datasets is a popular approach in Natural Language Processing (NLP) classification tasks. However, this can lead to overfitting and reduce model explainability. In this paper, we propose a framework that uses the projection of sentence representations onto task-specific conceptual spaces for improved explainability. Each conceptual space corresponds to a class and is learned through a transformer operator optimized during classification tasks. The dimensions of the concept spaces can be trained and optimized. Our framework shows that each dimension is associated with specific words which represent the corresponding class. To optimize the training of the operators, we introduce intra- and inter-space losses. Experimental results on two datasets demonstrate that our model achieves better accuracy and explainability. On the HateXplain dataset, our model shows at least a 10% improvement in various explainability metrics. |
Md Fahim · Md Shihab Shahriar · Sabik Irbaz · Syed Ishtiaque Ahmed · Mohammad Ruhul Amin 🔗 |
-
Active Reinforcement Learning from Demonstration in Continuous Action Spaces (Morning Poster)
Learning from Demonstration (LfD) is a human-in-the-loop paradigm that aims to overcome the limitations of safety considerations and weak data efficiency in Reinforcement Learning (RL). Active Reinforcement Learning from Demonstration (ARLD) takes LfD a step further by actively involving the human expert only during critical moments, reducing the costs associated with demonstrations. While successful ARLD strategies have been developed for RL environments with discrete actions, their potential in continuous action environments has not been thoroughly explored. In this work, we propose a novel ARLD strategy specifically designed for continuous environments. Our strategy involves estimating the uncertainty of the current RL agent directly from the variance of the stochastic policy within the state-of-the-art Soft Actor-Critic RL model. We demonstrate that our strategy outperforms both a naive attempt to adapt existing ARLD strategies to continuous environments and the passive LfD strategy. These results validate the potential of ARLD in continuous environments and lay the foundation for future research in this direction. |
Ming-Hsin Chen · Si-An Chen · Hsuan-Tien (Tien) Lin 🔗 |
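A minimal sketch of the querying rule described above: ask the demonstrator to act whenever the agent's stochastic policy is uncertain about what to do. The `policy` interface, the Gym-style `env.step` signature, and the variance threshold are hypothetical placeholders; the paper derives uncertainty from the variance of SAC's stochastic policy but may compute and use it differently.

```python
import numpy as np

def should_query_expert(policy, state, std_threshold=0.5):
    """Query the human demonstrator when the agent itself is uncertain.

    `policy(state)` is assumed to return the mean and standard deviation of a
    Gaussian SAC policy over continuous actions.
    """
    mean, std = policy(state)
    return np.max(std) > std_threshold   # uncertain in at least one action dimension

def rollout_step(env, state, policy, expert, std_threshold=0.5):
    """One environment step that mixes autonomous actions with expert demonstrations."""
    if should_query_expert(policy, state, std_threshold):
        action = expert(state)            # costly human demonstration
        from_expert = True
    else:
        mean, _ = policy(state)
        action = mean                     # act greedily w.r.t. the learned policy
        from_expert = False
    next_state, reward, done, info = env.step(action)
    return next_state, action, reward, done, from_expert
```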
-
SAP-sLDA: An Interpretable Interface for Exploring Unstructured Text (Morning Poster)
A common way to explore text corpora is through low-dimensional projections of the documents, where one hopes that thematically similar documents will be clustered together in the projected space. However, popular algorithms for dimensionality reduction of text corpora, like Latent Dirichlet Allocation (LDA), often produce projections that do not capture human notions of document similarity. We propose SAP-sLDA, a semi-supervised human-in-the-loop method for learning topics that preserve semantically meaningful relationships between documents in low-dimensional projections. On synthetic corpora, SAP-sLDA yields more interpretable projections than baseline methods with only a fraction of labels provided. On a real corpus, we obtain qualitatively similar results. |
Charumathi Badrinath · Weiwei Pan · Finale Doshi-Velez 🔗 |
-
Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus (Morning Poster)
Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, with the hope of bypassing the challenging task of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted with missing object descriptions or misaligned structure information. As a result, although using view hierarchies can offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that only takes the screenshot of the UI and a region of interest on the screen---the focus---as the input. This general architecture of Spotlight is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore the multi-task learning and few-shot prompting capacities of the proposed models, demonstrating promising results in the multi-task learning direction. |
Gang Li · Yang Li 🔗 |
-
Demystifying the Role of Feedback in GPT Self-Repair for Code Generation (Morning Poster)
Large Language Models (LLMs) have shown remarkable aptitude in generating code from natural language specifications, but still struggle on challenging programming tasks. Self-repair---in which the user provides executable unit tests and the model uses these to debug and fix mistakes in its own code---may improve performance in these settings without significantly altering the way in which programmers interface with the system. However, existing studies on how and when self-repair works effectively have been limited in scope, and one might wonder how self-repair compares to keeping a software engineer in the loop to give feedback on the code model's outputs. In this paper, we analyze GPT-3.5 and GPT-4's ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. We find that when the cost of generating both feedback and repaired code is taken into account, performance gains from self-repair are marginal and can only be seen with GPT-4. In contrast, when human programmers are used to provide feedback, the success rate of repair increases by as much as 57%. These findings suggest that self-repair still trails far behind what can be achieved with a feedback-giving human kept closely in the loop. |
Theo X. Olausson · Jeevana Priya Inala · Chenglong Wang · Jianfeng Gao · Armando Solar-Lezama 🔗 |
-
Semi-supervised Concept Bottleneck Models (Morning Poster)
Concept bottleneck models (CBMs) enhance the interpretability of deep neural networks by adding a concept layer between the input and output layers. However, this improvement comes at the cost of labeling concepts, which can be prohibitively expensive. To tackle this issue, we develop a semi-supervised learning (SSL) approach to CBMs that can make accurate predictions given only a handful of concept annotations. Our approach incorporates a strategy for effectively regulating erroneous pseudo-labels within standard SSL approaches. We conduct experiments on a range of labeling scenarios and show that our approach can reduce the labeling cost quite significantly without sacrificing prediction performance. |
Jeeon Bae · Sungbin Shin · Namhoon Lee 🔗 |
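The pseudo-labeling loop at the heart of semi-supervised concept learning can be sketched as below. The confidence filter stands in for the paper's strategy for regulating erroneous pseudo-labels, whose exact form is not given here; the random-forest concept predictors, threshold, and binary 0/1 concept labels are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label_concepts(X_lab, C_lab, X_unlab, threshold=0.9):
    """Fit one concept predictor per concept on the few labeled examples, then
    keep only high-confidence pseudo-labels for the unlabeled pool.

    Assumes binary 0/1 concept labels with both values present in C_lab.
    """
    n_concepts = C_lab.shape[1]
    C_pseudo = np.full((X_unlab.shape[0], n_concepts), -1)      # -1 = left unlabeled
    for k in range(n_concepts):
        clf = RandomForestClassifier(n_estimators=100).fit(X_lab, C_lab[:, k])
        proba = clf.predict_proba(X_unlab)
        conf = proba.max(axis=1)
        keep = conf >= threshold                                 # regulate noisy labels
        C_pseudo[keep, k] = proba.argmax(axis=1)[keep]
    return C_pseudo   # labeled + confident pseudo-labeled concepts then train the CBM
```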
-
Towards Mitigating Spurious Correlations in Image Classifiers with Simple Yes-no Feedback (Morning Poster)
Modern deep learning models have achieved remarkable performance. However, they often rely on spurious correlations between data and labels that exist only in the training data, resulting in poor generalization performance. We present CRAYON (Correlation Rectification Algorithms by Yes Or No), effective, scalable, and practical solutions to refine models with spurious correlations using simple yes-no feedback on model interpretations. CRAYON addresses key limitations of existing approaches that heavily rely on costly human intervention and empowers popular model interpretation techniques to mitigate spurious correlations in two distinct ways: CRAYON-ATTENTION guides saliency maps to focus on relevant image regions, and CRAYON-PRUNING prunes irrelevant neurons to remove their influence. Extensive evaluation on three benchmark image datasets and three state-of-the-art methods demonstrates that our methods effectively mitigate spurious correlations, achieving comparable or even better performance than existing approaches that require more complex feedback. |
Seongmin Lee · Ali Payani · Polo Chau 🔗 |
-
Mitigating Label Bias via Decoupled Confident Learning (Morning Poster)
There has been growing attention to algorithmic fairness in contexts where ML algorithms inform consequential decisions for humans. This has led to a surge in methodologies to mitigate algorithmic bias. However, such methodologies largely assume that observed labels in training data are correct. Yet, bias in labels is pervasive across important domains, including healthcare, hiring, and social media content moderation. In particular, human-generated labels are prone to encoding societal biases. While the presence of labeling bias has been discussed conceptually, there is a lack of methodologies to address this problem. We propose a pruning method--Decoupled Confident Learning (DeCoLe)--specifically designed to mitigate label bias. After illustrating its performance on a synthetic dataset, we apply DeCoLe in the context of hate speech detection, where label bias has been recognized as an important challenge, and show that it successfully identifies biased labels and outperforms competing approaches. |
Yunyi Li · Maria De-Arteaga · Maytal Saar-Tsechansky 🔗 |
-
Towards Semantically-Aware UI Design Tools: Design, Implementation and Evaluation of Semantic Grouping Guidelines (Morning Poster)
A coherent semantic structure, where semantically-related elements are appropriately grouped, is critical for proper understanding of a UI. Ideally, UI design tools should help designers establish coherent semantic grouping. To work towards this, we contribute five semantic grouping guidelines that capture how human designers think about semantic grouping and are amenable to implementation in design tools. They were obtained from empirical observations on existing UIs, a literature review, and iterative refinement with UI experts' feedback. We validated our guidelines through an expert review and heuristic evaluation; results indicate these guidelines capture valuable information about semantic structure. We demonstrate the guidelines’ use for building systems by implementing a set of computational metrics. These metrics detected many of the same severe issues that human design experts marked in a comparative study. Running our metrics on a larger UI dataset suggests many real UIs exhibit grouping violations. |
Peitong Duan · Bjorn Hartmann · Karina Nguyen · Yang Li · Marti Hearst · Meredith Morris 🔗 |
-
Interactively Optimizing Layout Transfer for Vector Graphics (Morning Poster)
Vector graphics are an industry-standard way to represent and share a broad range of visual designs. Designers often explore layout alternatives and generate them by moving and resizing elements. The motivation for this can range from establishing a different visual flow, adapting a design to a different aspect ratio, standardizing spacing, or redirecting the design's visual emphasis. Existing designs can serve as a source of inspiration for layout modification across these goals. However, generating these layout alternatives still requires significant manual effort in rearranging large groups of elements. We present VLT, short for Vector Layout Transfer, a novel graphic design tool that enables flexible transfer of layouts between designs. It provides designers with multiple levels of semantic layout editing controls, powered by automatic graphics correspondence and layout optimization algorithms. |
Jeremy Warner · Shuyao Zhou · Bjorn Hartmann 🔗 |
-
Prediction without Preclusion: Recourse Verification with Reachable Sets (Morning Poster)
Machine learning models are often used to decide who will receive a loan, a job interview, or a public service. Standard techniques to build these models use features that characterize people but overlook their actionability. In domains like lending and hiring, models can assign predictions that are fixed, meaning that consumers who are denied loans and interviews are permanently locked out from access to credit and employment. In this work, we introduce a formal testing procedure to flag models that assign these "predictions without recourse," called recourse verification. We develop machinery to reliably test the feasibility of recourse for any model given user-specified actionability constraints. We demonstrate how these tools can ensure recourse and adversarial robustness in real-world datasets and use them to study the infeasibility of recourse in real-world lending datasets. Our results highlight how models can inadvertently assign fixed predictions that permanently bar access, and the need to design algorithms that account for actionability when developing models and providing recourse. |
Avni Kothari · Bogdan Kulynych · Lily Weng · Berk Ustun 🔗 |
-
Unsupervised Learning of Distributional Properties can Supplement Human Labeling and Increase Active Learning Efficiency in Anomaly Detection (Morning Poster)
Exfiltration of data via email is a serious cybersecurity threat for many organizations. Active Learning (AL) is a promising approach for labeling data efficiently, but it needs to choose an efficient order in which cases are to be labeled, and there are uncertainties as to what scoring procedure should be used to prioritize cases for labeling. We propose an adaptive AL sampling strategy that leverages the underlying prior data distribution, as well as model uncertainty, to produce batches of cases to be labeled that contain instances of rare anomalies. We show that (1) the model benefits from a batch of representative and informative instances of both normal and anomalous examples, (2) unsupervised anomaly detection plays a useful role in building the machine learning model in the early stages of training when relatively little labeling has been done thus far. Our approach to AL for anomaly detection outperformed existing AL approaches on three UCI benchmarks and on one real-world redacted email data set. |
Jaturong Kongmanee · Mark Chignell · Khilan Jerath · Abhay Raman 🔗 |
-
Are Good Explainers Secretly Human-in-the-Loop Active Learners? (Morning Poster)
Explainable AI (XAI) techniques have become popular for multiple use-cases in the past few years. Here we consider their use in studying model predictions to gather additional training data. We argue that this is equivalent to Active Learning, where the query strategy involves a human-in-the-loop. We provide a mathematical approximation for the role of the human, and present a general formalization of the end-to-end workflow. This enables us to rigorously compare this use with standard Active Learning algorithms, while allowing for extensions to the workflow. An added benefit is that their utility can be assessed via simulation instead of conducting expensive user studies. We also present some initial promising results. |
Emma Thuong Nguyen · Abhishek Ghose 🔗 |
-
Personalized Prediction of Recurrent Stress Events Using Self-Supervised Learning on Multimodal Time-Series Data (Morning Poster)
Chronic stress can significantly affect physical and mental health. The advent of wearable technology allows for the tracking of physiological signals, potentially leading to innovative stress prediction and intervention methods. However, challenges such as label scarcity and data heterogeneity render stress prediction difficult in practice. To counter these issues, we have developed a multimodal personalized stress prediction system using wearable biosignal data. We employ self-supervised learning (SSL) to pre-train the models on each subject’s data, allowing the models to learn the baseline dynamics of the participant’s biosignals prior to fine-tuning the stress prediction task. We test our model on the Wearable Stress and Affect Detection (WESAD) dataset, demonstrating that our SSL models outperform non-SSL models while utilizing less than 5% of the annotations. These results suggest that our approach can personalize stress prediction to each user with minimal annotations. This paradigm has the potential to enable personalized prediction of a variety of recurring health events using complex multimodal data streams. |
Tanvir Islam · Peter Washington 🔗 |
-
Human-in-the-Loop Out-of-Distribution Detection with False Positive Rate Control (Morning Poster)
Robustness to Out-of-Distribution (OOD) samples is essential for the successful deployment of machine learning models in the open world. Since it is not possible to have a priori access to a variety of OOD data before deployment, several recent works have focused on designing scoring functions to quantify OOD uncertainty. These methods often find a threshold that achieves 95% true positive rate (TPR) on the In-Distribution (ID) data used for training and use this threshold for detecting OOD samples. However, this can lead to a very high false positive rate (FPR): in a comprehensive evaluation on the Open-OOD benchmark, the FPR ranges between 60% and 96% across several ID and OOD dataset combinations. In contrast, practical systems deal with a variety of OOD samples on the fly, and critical applications, e.g., medical diagnosis, demand guaranteed control of the FPR. To meet these challenges, we propose a mathematically grounded framework for human-in-the-loop OOD detection, wherein expert feedback is used to update the threshold. This allows the system to adapt to variations in the OOD data while adhering to the quality constraints. We propose an algorithm that uses anytime-valid confidence intervals based on the Law of Iterated Logarithm (LIL). Our theoretical results show that the system meets FPR constraints while minimizing the human feedback for points that are in-distribution. Another key feature of the system is that it can work with any existing post-hoc OOD uncertainty-quantification method. We evaluate our system empirically on a mixture of benchmark OOD datasets in image classification tasks, with CIFAR-10 and CIFAR-100 as in-distribution datasets, and show that our method can maintain an FPR of at most 5% while maximizing TPR. |
Harit Vishwakarma · Heguang Lin · Ramya Vinayak 🔗 |
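To make the threshold-update idea above concrete, the sketch below adapts an OOD-score threshold from expert feedback using a simple high-probability upper bound on the empirical FPR. The bound shown is a conservative Hoeffding-style stand-in for the paper's tighter anytime-valid LIL-based intervals, and all names and update rules are illustrative.

```python
import numpy as np

def fpr_upper_bound(n_false_positives, n_id_feedback, delta=0.05):
    """Conservative high-probability upper bound on the true false positive rate.

    Hoeffding-style bound used purely for illustration; the paper instead uses
    tighter anytime-valid intervals based on the Law of Iterated Logarithm (LIL).
    """
    n = max(n_id_feedback, 1)
    slack = np.sqrt(np.log(2.0 * n / delta) / (2.0 * n))
    return n_false_positives / n + slack

def update_threshold(threshold, score, expert_says_id, state,
                     target_fpr=0.05, step=0.01):
    """Adapt the OOD-score threshold from expert feedback on flagged points."""
    if expert_says_id:
        state["n_id"] += 1
        if score > threshold:          # ID point flagged as OOD: a false positive
            state["n_fp"] += 1
    if fpr_upper_bound(state["n_fp"], state["n_id"]) > target_fpr:
        threshold += step              # raise the bar for calling a point OOD
    return threshold

state = {"n_fp": 0, "n_id": 0}         # running feedback counts
```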
-
ConvGenVisMo: Evaluation of conversational generative vision models (Morning Poster)
Conversational generative vision models (CGVMs) like Visual ChatGPT (Wu et al., 2023) have recently emerged from the synthesis of computer vision and natural language processing techniques. These models enable more natural and interactive communication between humans and machines, because they can understand verbal inputs from users and generate responses in natural language along with visual outputs. To make informed decisions about the usage and deployment of these models, it is important to analyze their performance through a suitable evaluation framework on realistic datasets. In this paper, we present ConvGenVisMo, a framework for the novel task of evaluating CGVMs. ConvGenVisMo introduces a new benchmark evaluation dataset for this task, and also provides a suite of existing and new automated evaluation metrics to evaluate the outputs. All ConvGenVisMo assets, including the dataset and the evaluation code, will be made available publicly on GitHub. |
Narjes Nikzad Khasmakhi · Meysam Asgari-chenaghlu · Nabiha Asghar · Philipp Schaer · Dietlind Zühlke 🔗 |
-
ConceptEvo: Interpreting Concept Evolution in Deep Learning Training (Morning Poster)
We present ConceptEvo, a unified interpretation framework for deep neural networks (DNNs) that reveals the inception and evolution of learned concepts during training. Our work fills a critical gap in DNN interpretation research, as existing methods focus on post-hoc interpretation after training. ConceptEvo presents two novel technical contributions: (1) an algorithm that generates a unified semantic space that enables side-by-side comparison of different models during training; and (2) an algorithm that discovers and quantifies important concept evolutions for class predictions. Through a large-scale human evaluation with 260 participants and quantitative experiments, we show that CONCEPTEVO discovers evolutions across different models that are meaningful to humans and important for predictions. ConceptEvo works for both modern (ConvNeXt) and classic DNNs (e.g., VGGs, InceptionV3). |
Haekyu Park · Seongmin Lee · Benjamin Hoover · Austin Wright · Omar Shaikh · Rahul Duggal · Nilaksh Das · Kevin Li · Judy Hoffman · Polo Chau 🔗 |
-
Partial Label Learning meets Active Learning: Enhancing Annotation Efficiency through Binary Questioning (Morning Poster)
Supervised learning is an effective approach to machine learning, but it can be expensive to acquire labeled data. Active learning (AL) and partial label learning (PLL) are two techniques that can be used to reduce the annotation costs of supervised learning. AL is a strategy for reducing the annotation budget by selecting and labeling the most informative samples, while PLL is a weakly supervised learning approach to learn from partially annotated data by identifying the true hidden label. In this paper, we propose a novel approach that combines AL and PLL techniques to improve annotation efficiency. Our method leverages AL to select informative binary questions and PLL to identify the true label from the set of possible answers. We conduct extensive experiments on various benchmark datasets and show that our method achieves state-of-the-art (SoTA) performance with significantly reduced annotation costs. Our findings suggest that our method is a promising solution for cost-effective annotation in real-world applications. |
Shivangana Rawat · Chaitanya Devaguptapu · Vineeth N Balasubramanian 🔗 |
-
A More Robust Baseline for Active Learning by Injecting Randomness to Uncertainty Sampling (Morning Poster)
Active learning is important for human-computer interaction in the domain of machine learning. It strategically selects important unlabeled examples that need human annotation, reducing the labeling workload. One strong baseline strategy for active learning is uncertainty sampling, which determines importance by model uncertainty. Nevertheless, uncertainty sampling sometimes fails to outperform random sampling, thus not achieving the fundamental goal of active learning. To address this, the work investigates a simple yet overlooked remedy: injecting some randomness into uncertainty sampling. The remedy rescues uncertainty sampling from failure cases while maintaining its effectiveness in success cases. Our analysis reveals how the remedy balances the bias in the original uncertainty sampling with a small variance. Furthermore, we empirically demonstrate that injecting a mere $10$% of randomness achieves competitive performance across many benchmark datasets. The findings suggest randomness-injected uncertainty sampling can serve as a more robust baseline and a preferred choice for active learning practitioners.
|
Po-Yi Lu · Chun-Liang Li · Hsuan-Tien (Tien) Lin 🔗 |
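The remedy studied above is essentially an epsilon-greedy variant of uncertainty sampling; a minimal sketch follows. The entropy-based uncertainty score and the 10% mixing rate mirror the abstract, but the exact sampling scheme in the paper may differ.

```python
import numpy as np

def entropy(proba):
    """Predictive entropy as an uncertainty score (higher = more uncertain)."""
    p = np.clip(proba, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_batch(model, X_pool, batch_size, epsilon=0.10, rng=None):
    """Uncertainty sampling with a fraction `epsilon` of random picks injected."""
    rng = rng or np.random.default_rng()
    scores = entropy(model.predict_proba(X_pool))
    ranked = np.argsort(-scores)                        # most uncertain first
    n_random = int(round(epsilon * batch_size))
    n_uncertain = batch_size - n_random
    chosen = list(ranked[:n_uncertain])
    remaining = np.setdiff1d(np.arange(len(X_pool)), chosen)
    chosen += list(rng.choice(remaining, size=n_random, replace=False))
    return np.array(chosen)                             # indices to send for labeling
```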
-
Dissenting Explanations: Leveraging Disagreement to Reduce Model Overreliance (Morning Poster)
While explainability is a desirable characteristic of increasingly complex black-box models, modern explanation methods have been shown to be inconsistent and contradictory. The semantics of explanations is not always fully understood: to what extent do explanations "explain" a decision, and to what extent do they merely advocate for a decision? Can we help humans gain insights from explanations accompanying correct predictions and not over-rely on incorrect predictions advocated for by explanations? With this perspective in mind, we introduce the notion of dissenting explanations: conflicting predictions with accompanying explanations. We first explore the advantage of dissenting explanations in the setting of model multiplicity, where multiple models with similar performance may have different predictions. In such cases, providing dissenting explanations could be done by invoking the explanations of disagreeing models. Through a pilot study, we demonstrate that dissenting explanations reduce overreliance on model predictions without reducing overall accuracy. Motivated by the utility of dissenting explanations, we present both global and local methods for their generation. |
Omer Reingold · Judy Hanwen Shen · Aditi Talati 🔗 |
-
Co-creating a globally interpretable model with human input (Afternoon Poster)
We consider an aggregated human-AI collaboration aimed at generating a joint interpretable model. The model takes the form of Boolean decision rules, where human input is provided in the form of logical conditions or as partial templates. This focus on the combined construction of a model offers a different perspective on joint decision making: previous efforts have typically focused on aggregating outcomes rather than decision logic. We demonstrate the proposed approach through two examples and highlight the usefulness and challenges of the approach. |
Rahul Nair 🔗 |
-
Creating a Bias-Free Dataset of Food Delivery App Reviews with Data Poisoning Attacks (Afternoon Poster)
Although artificial intelligence (AI) models have created many benefits and achievements in our time, they can also cause unexpected consequences when the models are biased. One of the main reasons why AI models are biased is data poisoning attacks. Therefore, it is important for AI model developers to understand how biased their training data is when selecting the training dataset used to develop fair AI models. While researchers have reported several datasets intended for training, existing studies did not consider the possibility of data poisoning attacks and the bias a dataset might carry as a result. To reduce this gap, we created and validated a dataset that reflects the possibility of bias in individual reviews of food delivery apps. This study contributes to the community of AI model developers aiming to create fair AI models by proposing a bias-free dataset of food delivery app reviews with data poisoning attacks as an example. |
Hyunmin Lee · SeungYoung Oh · JinHyun Han · Hyunggu Jung 🔗 |
-
Toward Model Selection Through Measuring Dataset Similarity on TensorFlow Hub (Afternoon Poster)
For novice developers, it is a challenge to select the most suitable model without prior knowledge of artificial intelligence (AI) development. The main objective of our system is to provide an automated approach to presenting models through dataset similarity. We present a system that enables novice developers to select the best model among existing models from the online community of TensorFlow Hub (TF Hub). Our strategy was to use the similarity of two datasets as a measure to determine the best model. By conducting a systematic review, we identified multiple limitations, each of which corresponds to a function to be implemented in our proposed system. We then created a model selection system that enables novice developers to select the most suitable ML model without prior knowledge of AI by implementing the identified functions. The analysis of this study reveals that our proposed system performed better by successfully addressing three out of six identified limitations. |
SeungYoung Oh · Hyunmin Lee · JinHyun Han · Hyunggu Jung 🔗 |
-
LayerDiffusion: Layered Controlled Image Editing with Diffusion Models (Afternoon Poster)
Text-guided image editing has recently experienced rapid development. However, simultaneously performing multiple editing actions on a single image, such as background replacement and specific subject attribute changes, while maintaining consistency between the subject and the background remains challenging. In this paper, we propose LayerDiffusion, a semantic-based layered controlled image editing method. Our method enables non-rigid editing and attribute modification of specific subjects while preserving their unique characteristics and seamlessly integrating them into new backgrounds. We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy combined with layered diffusion training. During the diffusion process, an iterative guidance strategy is used to generate a final image that aligns with the textual description. Experimental results demonstrate the effectiveness of our method in generating highly coherent images that closely align with the given textual description. The edited images maintain a high similarity to the features of the input image and surpass the performance of current leading image editing methods. LayerDiffusion opens up new possibilities for controllable image editing. |
Pengzhi Li · Qinxuan Huang · Yikang Ding · Zhiheng Li 🔗 |
-
The corrupting influence of AI as a boss or Counterparty (Afternoon Poster)
In a recent article, Kobis et al. (2021) propose a framework identifying four primary roles in which Artificial Intelligence (AI) causes unethical or corrupt human behaviour, namely: role model, delegate, partner, and advisor. In this article we propose two further roles: AI as boss and AI as counterparty. We argue that the AI boss exerts coercive power over its employees, whilst the different perceptual abilities of an AI counterparty provide an opportunity for humans to behave differently towards them than they would with human analogues. Unethical behaviour towards the AI counterparty is rationalised because it is not human. For both roles, the human will typically not have any choice about their participation in the interaction. |
Hal Ashton · Matija Franklin 🔗 |
-
Participatory Personalization in Classification (Afternoon Poster)
Machine learning models are often personalized with information that is protected, sensitive, self-reported, or costly to acquire. These models use information about people, but do not facilitate nor inform their consent: individuals cannot opt out of reporting personal information to a model, nor tell if they benefit from personalization in the first place. We introduce a family of classification models, called participatory systems, that let individuals opt into personalization at prediction time. We present a model-agnostic algorithm to learn participatory systems for personalization with categorical group attributes. We conduct a comprehensive empirical study of participatory systems in clinical prediction tasks, benchmarking them with common approaches for personalization and imputation. Our results demonstrate that participatory systems can facilitate and inform consent while improving performance and data use across all groups who report personal data. |
Hailey Joren · Chirag Nagpal · Katherine Heller · Berk Ustun 🔗 |
-
Black-Box Batch Active Learning for Regression (Afternoon Poster)
Batch active learning is a popular approach for efficiently training machine learning models on large, initially unlabelled datasets, which repeatedly acquires labels for a batch of data points. However, many recent batch active learning methods are white-box approaches limited to differentiable parametric models: they score unlabeled points using acquisition functions based on model embeddings or first- and second-order derivatives. In this paper, we propose black-box batch active learning for regression tasks as an extension of white-box approaches: crucially, our method only relies on model predictions. This approach is compatible with a wide range of machine learning models, including regular and Bayesian deep learning models and non-differentiable models such as random forests. This allows us to extend a wide range of existing state-of-the-art white-box batch active learning methods (BADGE, BAIT, LCMD) to black-box models. We evaluate our approach through extensive experiments on regression datasets, achieving surprisingly strong performance compared to white-box approaches for deep learning models. |
Andreas Kirsch 🔗 |
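One simple way to realize "predictions only" batch selection is to represent each candidate point by its vector of predictions under a small ensemble (or Monte Carlo dropout) and greedily pick a diverse, high-disagreement batch in that space. The sketch below is an illustrative stand-in, not the specific black-box extensions of BADGE, BAIT, or LCMD developed in the paper.

```python
import numpy as np

def prediction_features(models, X_pool):
    """Represent each unlabeled point by its predictions under an ensemble."""
    return np.stack([m.predict(X_pool) for m in models], axis=1)   # (n_pool, n_models)

def select_batch(models, X_pool, batch_size):
    """Greedy k-center-style selection in prediction space (fully black-box)."""
    F = prediction_features(models, X_pool)
    first = int(np.argmax(F.var(axis=1)))             # start from highest disagreement
    chosen = [first]
    dist = np.linalg.norm(F - F[first], axis=1)
    for _ in range(batch_size - 1):
        nxt = int(np.argmax(dist))                    # farthest point from the batch so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(F - F[nxt], axis=1))
    return np.array(chosen)
```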
-
Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis (Afternoon Poster)
We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few-minute-long talking-person video with its audio track as the training data, and arbitrary texts as the driving input, we aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not yet been technically achieved due to two challenges: (1) it is challenging to mimic the timbre from out-of-domain audio for a traditional multi-speaker Text-to-Speech system; (2) it is hard to render high-fidelity and lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that well disentangles the text content, timbre information, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, our method overcomes the two challenges above and generates identity-preserving speech and realistic talking-person video. Experiments demonstrate that our method can synthesize realistic, identity-preserving, and audio-visually synchronized talking avatar videos. |
Zhenhui Ye · Ziyue Jiang · Yi Ren · Jinglin Liu · Chen Zhang · Xiang Yin · Zejun MA · Zhou Zhao 🔗 |
-
PromptCrafter: Crafting Text-to-Image Prompt through Mixed-Initiative Dialogue with LLM (Afternoon Poster)
Text-to-image generation models are able to generate images across a diverse range of subjects and styles based on a single prompt. Recent works have proposed a variety of interaction methods that help users understand the capabilities of models and utilize them. However, how to support users in efficiently exploring a model's capability and creating effective prompts are still open-ended research questions. In this paper, we present PromptCrafter, a novel mixed-initiative system that allows step-by-step crafting of text-to-image prompts. Through the iterative process, users can efficiently explore the model's capability and clarify their intent. PromptCrafter also supports users in refining prompts by answering clarifying questions generated by a Large Language Model. Lastly, users can revert to a desired step by reviewing the work history. In this workshop paper, we discuss the design process of PromptCrafter and our plans for follow-up studies. |
Seungho Baek · Hyerin Im · Jiseung Ryu · Ju Hyeong Park · Tak Yeon Lee 🔗 |
-
Designing Decision Support Systems Using Counterfactual Prediction Sets (Afternoon Poster)
Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to make human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has been proven challenging. In this context, it has been recently argued that an alternative type of decision support systems may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of systems from the perspective of online learning and develop a methodology based on the successive elimination algorithm that does not require, nor assumes, an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption on the experts’ predictions over the prediction sets to achieve an exponential improvement in regret in comparison with vanilla successive elimination. We conduct a large-scale human subject study ($n =2,751$) to verify our counterfactual monotonicity assumption and compare our methodology to several competitive baselines. The results suggest that decision support systems that limit experts’ level of agency may be practical and may offer greater performance than those allowing experts to always exercise their own agency.
|
Eleni Straitouri · Manuel Gomez-Rodriguez 🔗 |
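For context, a prediction set of the kind the decision support system above shows to experts can be built with split conformal prediction from held-out calibration data, as sketched below. The 1 - softmax nonconformity score is one common choice and an assumption here, as is the finite-sample correction; the paper's own conformal predictor and online learning machinery are more involved.

```python
import numpy as np

def conformal_threshold(cal_proba, cal_labels, alpha=0.1):
    """Split-conformal calibration with the 1 - softmax(true class) score."""
    n = len(cal_labels)
    scores = 1.0 - cal_proba[np.arange(n), cal_labels]
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction
    return np.quantile(scores, level, method="higher")

def prediction_set(test_proba, q_hat):
    """All labels whose nonconformity score falls below the calibrated threshold."""
    return [np.flatnonzero(1.0 - p <= q_hat) for p in test_proba]
```

With probability roughly 1 - alpha, the set returned for a new example contains the ground-truth label, which is what lets the system restrict the expert's choices without (in expectation) excluding the correct answer.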
-
Crowdsourced Clustering via Active Querying: Practical Algorithm with Theoretical Guarantees (Afternoon Poster)
We propose a novel, practical, simple, and computationally efficient active querying algorithm for crowdsourced clustering that does not require knowledge of unknown problem parameters. We show that our algorithm succeeds in recovering the clusters when the crowdworkers provide answers with an error probability less than $1/2$ and provide sample complexity bounds on the number of queries made by our algorithm to guarantee successful clustering. While the bounds depend on the error probabilities, the algorithm itself does not require this knowledge. In addition to the theoretical guarantees, we implement and deploy the proposed algorithm on a real crowdsourcing platform to characterize its performance in real-world settings.
|
Yi Chen · Ramya Vinayak · Babak Hassibi 🔗 |
-
Large Language Models as a Proxy For Human Evaluation in Assessing the Comprehensibility of Disordered Speech Transcription (Afternoon Poster)
Automatic Speech Recognition (ASR) systems, despite significant advances in recent years, still have much room for improvement particularly in the recognition of disordered speech. Even so, erroneous transcripts from ASR models can help people with disordered speech be better understood. Evaluating the efficacy of ASR for this use case requires a methodology for measuring the impact of transcription errors on the intended meaning and comprehensibility. Human evaluation is the gold standard for this, but it can be laborious, slow, and expensive. Here, we tuned and evaluated large language models (LLMs) and found them to be a better proxy for human evaluators compared to typical sentence similarity metrics. We further present a case-study of using our approach to make ASR model deployment decisions in a live video conversation setting. |
Katrin Tomanek · Jimmy Tobin · Subhashini Venugopalan · Richard Cave · Katie Seaver · Rus Heywood · Jordan Green 🔗 |
-
Rethinking Model Evaluation as Narrowing the Socio-Technical Gap (Afternoon Poster)
The recent development of generative and large language models (LLMs) poses new challenges for model evaluation that the research community and industry are grappling with. While the versatile capabilities of these models ignite excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single, often so-called "general-purpose", model. In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments of whether, and how much, human needs in downstream use cases can be satisfied by a given model (the socio-technical gap). By drawing on lessons from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world socio-requirements and to embrace diverse evaluation methods with an acknowledgment of the trade-offs between realism to socio-requirements and pragmatic costs. By mapping HCI and current NLG evaluation methods, we identify opportunities for new evaluation methods for LLMs to narrow the socio-technical gap and pose open questions. |
Q. Vera Liao · Ziang Xiao 🔗 |
-
Neuro-Symbolic Models of Human Moral Judgment: LLMs as Automatic Feature Extractors (Afternoon Poster)
As AI systems gain prominence in society, concerns about their safety become crucial to address. There have been repeated calls to align powerful AI systems with human morality. However, attempts to do this have used black-box systems that cannot be interpreted or explained. In response, we introduce a methodology leveraging the natural language processing abilities of large language models (LLMs) and the interpretability of symbolic models to form competitive neuro-symbolic models for predicting human moral judgment. Our method involves using LLMs to extract morally-relevant features from a stimulus and then passing those features through a cognitive model that predicts human moral judgment. This approach achieves state-of-the-art performance on the MoralExceptQA benchmark, improving on the previous F1 score by 20 points and accuracy by 18 points, while also enhancing model interpretability by baring all key features in the model's computation. We propose future directions for harnessing LLMs to develop more capable and interpretable neuro-symbolic models, emphasizing the critical role of interpretability in facilitating the safe integration of AI systems into society. |
joseph kwon · Sydney Levine · Josh Tenenbaum 🔗 |
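The two-stage pipeline described above can be sketched as follows: an LLM scores a handful of morally relevant features for each scenario, and a small interpretable model maps those features to a judgment. The `query_llm` function, the feature list, and the prompt are hypothetical placeholders, and logistic regression stands in for the paper's cognitive model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["harm_caused", "rule_violation", "benefit_to_others"]   # illustrative only

def extract_features(scenario, query_llm):
    """Ask an LLM to rate each morally relevant feature on a 0-10 scale.

    `query_llm(prompt) -> str` is a hypothetical text-completion interface.
    """
    ratings = []
    for feat in FEATURES:
        prompt = (f"On a scale of 0-10, rate the degree of {feat} in the following "
                  f"scenario: {scenario}\nAnswer with a single number.")
        ratings.append(float(query_llm(prompt).strip()))
    return np.array(ratings)

def fit_judgment_model(scenarios, human_judgments, query_llm):
    """Interpretable model (here, logistic regression) over LLM-extracted features."""
    X = np.stack([extract_features(s, query_llm) for s in scenarios])
    return LogisticRegression().fit(X, human_judgments)
```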
-
Towards Never-ending Learning of User Interfaces (Afternoon Poster)
Machine learning models have been trained to predict semantic information about user interfaces (UIs) to make apps more accessible, easier to test, and to automate. Currently, most models rely on datasets of static screenshots that are labeled by human crowd-workers, a process that is costly and surprisingly error-prone for certain tasks. For example, workers labeling whether a UI element is “tappable” from a screenshot must guess using visual signifiers, and do not have the benefit of tapping on the UI element in the running app and observing the effects. In this paper, we present the Never-ending UI Learner, an app crawler that automatically installs real apps from a mobile app store and crawls them to infer semantic properties of UIs by interacting with UI elements, discovering new and challenging training examples to learn from, and continually updating machine learning models designed to predict these semantics. The Never-ending UI Learner so far has crawled for more than 5,000 device-hours, performing over half a million actions on 6,000 apps to train a highly accurate tappability model. |
Jason Wu · Rebecca Krosnick · Eldon Schoop · Amanda Swearngin · Jeffrey Bigham · Jeffrey Nichols 🔗 |
-
CHILLI: A data context-aware perturbation method for XAI (Afternoon Poster)
The trustworthiness of Machine Learning (ML) models can be difficult to assess, but is critical in high-risk or ethically sensitive applications. Many models are treated as a 'black-box' where the reasoning or criteria for a final decision is opaque to the user. To address this, some existing Explainable AI (XAI) approaches approximate model behaviour using perturbed data. However, such methods have been criticised for ignoring feature dependencies, with explanations being based on potentially unrealistic data. We propose a novel framework, CHILLI, for incorporating data context into XAI by generating contextually aware perturbations, which are faithful to the training data of the base model being explained. This is shown to improve both the soundness and accuracy of the explanations. |
Saif Anwar · Nathan Griffiths · Abhir BHALERAO · Thomas Popham · Mark Bell 🔗 |
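One way to realise "contextually aware perturbations" is to sample them from a distribution fitted to the training data around the query point instead of perturbing each feature independently. The sketch below illustrates that idea with a local Gaussian neighbourhood and a weighted ridge surrogate; the bandwidth choices and the surrogate are assumptions, not the exact CHILLI procedure.

```python
# Sketch: perturbations drawn from a locally fitted Gaussian respect feature dependencies.
import numpy as np
from sklearn.linear_model import Ridge

def contextual_perturbations(x, X_train, n_samples=500, k=50, rng=None):
    rng = rng or np.random.default_rng(0)
    # Estimate a local, dependency-aware covariance from the k nearest training points.
    neighbours = X_train[np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]]
    cov = np.cov(neighbours, rowvar=False) + 1e-6 * np.eye(x.shape[0])
    return rng.multivariate_normal(mean=x, cov=cov, size=n_samples)

def local_explanation(x, X_train, black_box_predict):
    Z = contextual_perturbations(x, X_train)
    y = black_box_predict(Z)                          # query the model being explained
    weights = np.exp(-np.linalg.norm(Z - x, axis=1))  # closer perturbations count more
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return surrogate.coef_                            # per-feature local attribution
```

Compared with independent per-feature noise (as in vanilla LIME-style perturbation), samples drawn this way stay close to the training manifold, which is the faithfulness property the abstract argues for.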
-
|
Computational Approaches for App-to-App Retrieval and Design Consistency Check
(
Afternoon Poster
)
link »
Extracting semantic representations from mobile user interfaces (UIs) and using the representations for designers' decision-making processes have shown the potential to be effective computational design support tools. Current approaches rely on machine learning models trained on small-sized mobile UI datasets to extract semantic vectors and use screenshot-to-screenshot comparison to retrieve similar-looking UIs given query screenshots. However, the usability of these methods is limited: they are often not open-sourced, they have complex training pipelines that are difficult for practitioners to follow, and they are unable to perform screenshot set-to-set (i.e., app-to-app) retrieval. To this end, we (1) employ visual models trained with large web-scale images and test whether they could extract a UI representation in a zero-shot way and outperform existing specialized models, and (2) use mathematically founded methods to enable app-to-app retrieval and design consistency analysis. Our experiments show that our methods not only improve upon previous retrieval models but also enable multiple new applications. |
Seokhyeon Park · Wonjae Kim · Young-Ho Kim · Jinwook Seo 🔗 |
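A sketch of the set-to-set idea under stated assumptions: each screenshot is embedded with a frozen web-scale visual encoder (CLIP used zero-shot is one option), embeddings are L2-normalised, and two apps are compared with a symmetric Chamfer-style distance over their screenshot sets. The embedding step and the particular set distance are assumptions, not necessarily the paper's formulation.

```python
# Sketch: an app is the set of its screenshot embeddings; apps are compared set-to-set.
import numpy as np

def chamfer_distance(A: np.ndarray, B: np.ndarray) -> float:
    """A, B: (n_screens, d) L2-normalised screenshot embeddings of two apps."""
    sim = A @ B.T  # cosine similarities, since rows are unit-norm
    # For each screen, distance to its closest counterpart in the other app, averaged both ways.
    return float((1 - sim.max(axis=1)).mean() + (1 - sim.max(axis=0)).mean()) / 2

def retrieve_similar_apps(query_app: np.ndarray, corpus: dict[str, np.ndarray], top_k: int = 5):
    scores = {name: chamfer_distance(query_app, emb) for name, emb in corpus.items()}
    return sorted(scores, key=scores.get)[:top_k]  # smallest distance = most similar app
```

The same pairwise-similarity matrix can also flag design-consistency outliers, e.g. a screen whose nearest neighbour within its own app is unusually far away.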
-
|
Language Models can Solve Computer Tasks
(
Afternoon Poster
)
link »
Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent recursively criticizes and improves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. We compare multiple LLMs and find that RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function. Furthermore, we demonstrate RCI prompting's effectiveness in enhancing LLMs' reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting. We find that RCI combined with CoT performs better than either separately. |
Geunwoo Kim · Pierre Baldi · Stephen Mcaleer 🔗 |
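A minimal sketch of the recursive criticize-and-improve (RCI) loop described above. `query_llm` is an assumed completion helper and the prompt wording is illustrative; the paper's exact templates and grounding of actions in MiniWoB++ differ.

```python
# Sketch: the model criticizes its own output, then revises it based on that critique.
from typing import Callable

def rci(task: str, query_llm: Callable[[str], str], n_rounds: int = 2) -> str:
    output = query_llm(f"Task: {task}\nPropose the actions to perform.")
    for _ in range(n_rounds):
        critique = query_llm(
            f"Task: {task}\nProposed actions:\n{output}\n"
            "Find any problems with this plan.")
        output = query_llm(
            f"Task: {task}\nProposed actions:\n{output}\n"
            f"Critique:\n{critique}\nImprove the actions based on the critique.")
    return output
```

The loop needs no task-specific reward function or demonstrations beyond what is placed in the prompt, which is why it is attractive for new computer tasks.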
-
|
Designing Data: Proactive Data Collection and Iteration for Machine Learning Using Reflexive Planning, Monitoring, and Density Estimation
(
Afternoon Poster
)
link »
Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time-intensive and rarely comprehensive. Thus, new methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability. We present designing data, an iterative, bias-mitigating approach to data collection connecting HCI concepts with ML techniques. Our process includes (1) Pre-Collection Planning, to reflexively prompt and document expected data distributions; (2) Collection Monitoring, to systematically encourage sampling diversity; and (3) Data Familiarity, to identify samples that are unfamiliar to a model using density estimation. We instantiate designing data through our own data collection and applied ML case study. We find models trained on “designed” datasets generalize better across intersectional groups than those trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets. |
Aspen Hopkins · Fred Hohman · Luca Zappella · Dominik Moritz · Xavi Suau 🔗 |
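A hedged sketch of the data-familiarity step, assuming samples are first mapped to embedding vectors; kernel density estimation is one possible density estimator here, not necessarily the one used in the paper.

```python
# Sketch: low density under the training distribution = unfamiliar sample.
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_familiarity(train_embeddings: np.ndarray, bandwidth: float = 0.5) -> KernelDensity:
    return KernelDensity(bandwidth=bandwidth).fit(train_embeddings)

def familiarity_scores(kde: KernelDensity, new_embeddings: np.ndarray) -> np.ndarray:
    # Higher log-density = more familiar; the lowest-scoring samples are the ones
    # to inspect for labeling errors or to prioritise in the next collection round.
    return kde.score_samples(new_embeddings)
```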
-
|
Breadcrumbs to the Goal: Goal-Conditioned Exploration from Human-in-the-loop feedback
(
Afternoon Poster
)
link »
Exploration and reward specification are fundamental and intertwined challenges for reinforcement learning. Solving sequential decision making tasks with a non-trivial element of exploration requires either specifying carefully designed reward functions or relying on indiscriminate, novelty seeking exploration bonuses. Human supervisors can provide effective guidance in the loop to direct the exploration process, but prior methods to leverage this guidance require constant synchronous high-quality human feedback, which is expensive and impractical to obtain. In this work, we propose a technique, Human Guided Exploration (HuGE), that can leverage low-quality feedback from non-expert users (infrequent, asynchronous, and noisy) to guide exploration for reinforcement learning, without requiring careful reward specification. The key idea is to separate the challenges of directed exploration and policy learning: human feedback is used to direct exploration, while self-supervised policy learning is used to independently learn unbiased behaviors from the collected data. We show that this procedure can leverage noisy, asynchronous human feedback to learn tasks with no hand-crafted reward design or exploration bonuses. We show that HuGE is able to learn a variety of challenging multi-stage robotic navigation and manipulation tasks in simulation using crowdsourced feedback from non-expert users. Moreover, this paradigm can be scaled to learning directly on real-world robots. |
Marcel Torne Villasevil · Max Balsells I Pamies · Zihan Wang · Samedh Desai · Tao Chen · Pulkit Agrawal · Abhishek Gupta 🔗 |
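A sketch of the separation the abstract emphasises, with every component (environment API, progress/ranking model, hindsight-based policy update) as an illustrative placeholder rather than the paper's implementation: human comparisons only steer where to explore from, while the policy is trained self-supervised on whatever data is collected.

```python
# Sketch: noisy, asynchronous human comparisons shape exploration, not the reward.
def update_progress_model(progress_model, human_comparisons):
    # human_comparisons: list of (state_a, state_b, a_is_closer_to_goal) tuples,
    # possibly stale or noisy; fit a ranking/progress model on them.
    progress_model.fit(human_comparisons)

def select_frontier_state(visited_states, progress_model):
    # Explore from the visited state the learned progress model ranks as closest to the goal.
    return max(visited_states, key=progress_model.score)

def huge_iteration(env, policy, visited_states, progress_model, human_comparisons):
    if human_comparisons:                 # feedback may arrive infrequently, or not at all
        update_progress_model(progress_model, human_comparisons)
    start = select_frontier_state(visited_states, progress_model)
    trajectory = env.rollout_from(start, policy, explore=True)
    visited_states.extend(trajectory.states)
    # Self-supervised policy learning: relabel reached states as goals, so no
    # hand-crafted reward or exploration bonus is needed and noisy feedback
    # cannot bias the learned behavior.
    policy.train_hindsight(trajectory)
```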
-
|
Human-Aligned Calibration for AI-Assisted Decision Making
(
Afternoon Poster
)
link »
Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. Then, the decision maker is supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has often been argued that the confidence value should correspond to a well calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulty developing a good sense of when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values: an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker’s confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone in the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker’s confidence on her own prediction is a sufficient condition for alignment. Experiments on a real AI-assisted decision making scenario where a classifier provides decision support to human decision makers validate our theoretical results and suggest that alignment may lead to better decisions. |
Nina Corvelo Benz · Manuel Gomez-Rodriguez 🔗 |
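A numerical sketch of the alignment idea under simplifying assumptions: recalibrate the classifier's confidence separately within bins of the decision maker's own confidence, i.e. a coarse, binned approximation of multicalibration with respect to human confidence. The binning scheme and data layout are assumptions, not the paper's construction.

```python
# Sketch: per-(human-confidence, model-confidence) cell recalibration.
import numpy as np

def human_aligned_calibration(model_conf, human_conf, correct, n_bins=10):
    """model_conf, human_conf in [0, 1]; correct is 0/1; all equal-length arrays."""
    model_conf = np.asarray(model_conf, dtype=float)
    human_conf = np.asarray(human_conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    h_bin = np.clip(np.digitize(human_conf, edges) - 1, 0, n_bins - 1)
    m_bin = np.clip(np.digitize(model_conf, edges) - 1, 0, n_bins - 1)
    calibrated = model_conf.copy()
    for h in range(n_bins):
        for m in range(n_bins):
            cell = (h_bin == h) & (m_bin == m)
            if cell.any():
                # Replace the raw confidence with the empirical accuracy in this cell,
                # so the reported value is calibrated conditional on human confidence.
                calibrated[cell] = correct[cell].mean()
    return calibrated
```

Within each human-confidence group, trusting the prediction more when the recalibrated value is higher is then a monotone (and hence easy to discover) policy, which is the property the abstract argues for.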
-
|
How Can AI Reason Your Character?
(
Afternoon Poster
)
link »
Inferring decision preferences by observing others' behavior is a crucial skill for artificial agents that collaborate with humans. While some attempts have been made in this realm, the inference speed and accuracy of current methods still need improvement. The main obstacle to achieving higher accuracy lies in the stochastic nature of human behavior, a consequence of the stochastic reward system underlying human decision-making. To address this, we propose an instant inference network (IIN) that infers the stochastic character of partially observable agents. The agent's character is parameterized by weights assigned to reward components in reinforcement learning, resulting in a single policy for each character. To train the IIN for inferring diverse characters, we develop a universal policy comprising a set of policies reflecting different characters. Once the IIN is trained to cover diverse characters using the universal policy, it can return character parameters instantly from observed behavior trajectories. The simulation results confirm that the proposed solution outperforms state-of-the-art algorithms in inference accuracy, despite having lower computational complexity. |
Dongsu Lee · Minhae Kwon 🔗 |
-
|
State trajectory abstraction and visualization method for explainability in reinforcement learning
(
Afternoon Poster
)
link »
Explainable AI (XAI) has demonstrated the potential to help reinforcement learning (RL) practitioners understand how RL models work. However, XAI for users who have considerable domain knowledge but lack machine learning (ML) expertise is understudied. Solving such a problem would enable RL experts to communicate with domain experts in producing ML solutions that better meet their intentions. This study examines a trajectory-based approach to the problem. Trajectory-based XAI appears promising in enabling non-RL experts to understand an RL model’s behavior by viewing a visual representation of that behavior, consisting of trajectories that depict the transitions between the major states of the RL model. This paper proposes a framework to create and evaluate a visual representation of RL models' behavior that is easy for both RL and non-RL experts to understand. |
Yoshiki Takagi · roderick tabalba · Jason Leigh 🔗 |
-
|
LeetPrompt: Leveraging Collective Human Intelligence to Study LLMs
(
Afternoon Poster
)
link »
Writing effective instructions (or prompts) is rapidly evolving into a dark art, spawning websites dedicated to collecting, sharing, and even selling instructions. Yet, the research efforts evaluating large language models (LLMs) either limit instructions to a predefined set or, worse, make anecdotal claims without rigorously testing sufficient instructions. In reaction to this cottage industry of instruction design, we introduce LeetPrompt: a platform where people can interactively explore the space of instructions to solve problems. LeetPrompt automatically evaluates human-LLM interactions to provide insights about both LLMs as well as human interaction behavior. With LeetPrompt, we conduct a within-subjects user study (N=20) across 10 problems from 5 domains: biology, physics, math, programming, and general knowledge. By analyzing 1178 instructions used to invoke GPT-4, we present the following findings: First, we find that participants are able to design instructions for all tasks, including those that problem setters deemed unlikely to be solved. Second, all automatic mechanisms fail to generate instructions to solve all tasks. Third, the lexical diversity of instructions is significantly correlated with whether people were able to solve the problem, highlighting the need for diverse instructions when evaluating LLMs. Fourth, many instruction strategies are unsuccessful, highlighting the misalignment between participants' conceptual model of the LLM and its functionality. Fifth, participants with prior experience in prompting and in math spend significantly more time on LeetPrompt. Sixth, we find that people use more diverse instruction strategies than the automatic baselines. Finally, LeetPrompt facilitates a learning effect: participants self-reported improving as they solved each subsequent problem. |
Sebastin Santy · Ayana Bharadwaj · Sahith Dambekodi · Alex Albert · Cathy Yuan · Ranjay Krishna 🔗 |
-
|
Uncertainty Fingerprints: Interpreting Model Decisions with Human Conceptual Hierarchies
(
Afternoon Poster
)
link »
Understanding machine learning model uncertainty is essential to comprehend model behavior, ensure safe deployment, and intervene appropriately. However, model confidences treat the output classes independently, ignoring relationships between classes that can reveal reasons for uncertainty, such as model confusion between related classes or an input with multiple valid labels. By leveraging human knowledge about related classes, we expand model confidence values into a hierarchy of concepts, creating an uncertainty fingerprint. An uncertainty fingerprint describes the model's confidence in every possible decision, distinguishing how the model proceeded from a broad idea to its precise prediction. Using hierarchical entropy, we compare fingerprints based on the model's decision-making process to categorize types of model uncertainty, identify common failure modes, and update dataset hierarchies. |
Angie Boggust · Hendrik Strobelt · Arvind Satyanarayan 🔗 |
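A toy sketch of turning flat class probabilities into a hierarchy-aware fingerprint. The two-concept hierarchy and the plain Shannon entropy used here are illustrative assumptions, not the paper's hierarchy or its hierarchical-entropy formulation.

```python
# Sketch: aggregate leaf probabilities up a human concept hierarchy, then
# compare entropy at the concept level vs. the leaf level.
import numpy as np

HIERARCHY = {                       # concept -> leaf classes it covers (toy example)
    "animal": ["dog", "cat", "wolf"],
    "vehicle": ["car", "truck"],
}

def fingerprint(leaf_probs: dict[str, float]) -> dict[str, float]:
    fp = dict(leaf_probs)
    for concept, leaves in HIERARCHY.items():
        fp[concept] = sum(leaf_probs.get(leaf, 0.0) for leaf in leaves)
    return fp

def entropy(probs) -> float:
    p = np.array([x for x in probs if x > 0])
    return float(-(p * np.log(p)).sum())

# Example: the model is sure the input is an animal but confused between dog and wolf,
# a different kind of uncertainty than confusion across unrelated concepts.
fp = fingerprint({"dog": 0.45, "wolf": 0.45, "cat": 0.05, "car": 0.03, "truck": 0.02})
concept_entropy = entropy([fp["animal"], fp["vehicle"]])                       # low
leaf_entropy = entropy(fp[c] for c in ["dog", "wolf", "cat", "car", "truck"])  # higher
```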
-
|
Adaptive User-centered Neuro-symbolic Learning for Multimodal Interaction with Autonomous Systems
(
Afternoon Poster
)
link »
Recent advancements in Machine Learning, particularly Deep Learning, have enabled autonomous systems to perceive and comprehend objects and their environments in a perceptual subsymbolic manner. These systems can now perform object detection, sensor data fusion, and language understanding tasks. However, a growing need exists to enhance these systems to understand objects and their environments more conceptually and symbolically. It is essential to consider both the explicit teaching provided by humans (e.g., describing a situation or explaining how to act) and the implicit teaching obtained by observing human behavior (e.g., through the system's sensors) to achieve this level of powerful artificial intelligence. Thus, the system must be designed with multimodal input and output capabilities to support implicit and explicit interaction models. In this position paper, we argue for considering both types of inputs, as well as human-in-the-loop and incremental learning techniques, for advancing the field of artificial intelligence and enabling autonomous systems to learn like humans. We propose several hypotheses and design guidelines and highlight a use case from related work to achieve this goal. |
Amr Gomaa · Michael Feld 🔗 |
-
|
Workflow Discovery from Dialogues in the Low Data Regime
(
Afternoon Poster
)
link »
Text-based dialogues are now widely used to solve real-world problems. In cases where solution strategies are already known, they can sometimes be codified into workflows and used to guide humans or artificial agents through the task of helping clients. In this work, we introduce a new problem formulation that we call Workflow Discovery (WD), in which we are interested in the situation where a formal workflow may not yet exist; still, we wish to discover the set of actions that have been taken to resolve a particular problem. We also examine a sequence-to-sequence (Seq2Seq) approach for this novel task using multiple Seq2Seq models. We present experiments where we extract workflows from dialogues in the Action-Based Conversations Dataset (ABCD) and the MultiWOZ dataset. We propose and evaluate an approach that conditions models on the set of possible actions, and we show that using this strategy, we can improve WD performance in the out-of-distribution setting. Further, on ABCD, a modified variant of our Seq2Seq method achieves state-of-the-art performance on the related but different tasks of Action State Tracking (AST) and Cascading Dialogue Success (CDS) across many evaluation metrics. |
Amine El Hattami · Issam Laradji · Stefania Raimondo · David Vazquez · Pau Rodriguez · Christopher Pal 🔗 |
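A sketch of the action-conditioning strategy the abstract highlights: serialising the set of possible actions into the Seq2Seq input so the model can stay within that set. `seq2seq_generate` stands in for any encoder-decoder model (for example, a fine-tuned T5), and the input/output format is an assumption, not the paper's exact serialisation.

```python
# Sketch: condition workflow discovery on the allowed action set via the input text.
def build_wd_input(dialogue_turns: list[str], possible_actions: list[str]) -> str:
    return ("Dialogue: " + " ".join(dialogue_turns) +
            " Possible actions: " + "; ".join(possible_actions) +
            " Workflow:")

def discover_workflow(dialogue_turns, possible_actions, seq2seq_generate) -> list[str]:
    text = seq2seq_generate(build_wd_input(dialogue_turns, possible_actions))
    # The generated workflow is assumed to be a ";"-separated sequence of action names.
    return [step.strip() for step in text.split(";") if step.strip()]
```

Because the candidate actions appear verbatim in the input, the model can generalise to out-of-distribution dialogues by copying from the provided set rather than memorising actions seen in training.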
-
|
Informed Novelty Detection in Sequential Data by Per-Cluster Modeling
(
Afternoon Poster
)
link »
Novelty detection in discrete sequences is a challenging task, since deviations from the process generating the normal data are often small or intentionally hidden. In many applications data is generated by several distinct processes so that models trained on all the data tend to over-generalize and novelties remain undetected. We propose to approach this challenge through decomposition: by clustering the data we break down the problem, obtaining simpler modeling tasks in each cluster which can be modeled more accurately. However, this comes at a cost, since the amount of training data per cluster is reduced. This is a particular problem for discrete sequences where state-of-the-art models are data-hungry. The success of this approach thus depends on the quality of the clustering, i.e., whether the individual learning problems are sufficiently simpler than the joint problem. In this paper we adapt a state-of-the-art visual analytics tool for discrete sequence clustering to obtain informed clusters from domain experts, since clustering discrete sequences automatically is a challenging and domain-specific task. We use LSTMs to further model each of the clusters. Our empirical evaluation indicates that this informed clustering outperforms automatic ones and that our approach outperforms standard novelty detection methods for discrete sequences in three real-world application scenarios. |
Linara Adilova · Siming Chen · Michael Kamp 🔗 |
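A sketch of the decomposition under stated assumptions: the clusters come from the expert-informed visual-analytics step, `model_factory` builds one sequence model per cluster (for example, an LSTM language model over event tokens), and novelty is scored as negative log-likelihood under the assigned cluster's model; the threshold choice is illustrative.

```python
# Sketch: one model per informed cluster, novelty scored against the right cluster.
def train_per_cluster(clustered_sequences: dict[str, list], model_factory) -> dict:
    models = {}
    for cluster_id, sequences in clustered_sequences.items():
        model = model_factory()   # e.g. a fresh LSTM language model over event tokens
        model.fit(sequences)
        models[cluster_id] = model
    return models

def novelty_score(sequence, cluster_id: str, models: dict) -> float:
    # Higher score = less likely under the cluster's own model = more novel.
    return -models[cluster_id].log_likelihood(sequence)

def is_novel(sequence, cluster_id: str, models: dict, threshold: float) -> bool:
    return novelty_score(sequence, cluster_id, models) > threshold
```

The trade-off discussed in the abstract shows up here directly: each per-cluster model sees less data, so the approach pays off only when the clusters make the individual modeling problems genuinely simpler.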
-
|
How vulnerable are doctors to unsafe hallucinatory AI suggestions? A framework for evaluation of safety in clinical human-AI cooperation
(
Afternoon Poster
)
link »
As artificial intelligence-based decision support systems aim at assisting human specialists in high-stakes environments, studying the safety of the human-AI team as a whole is crucial, especially in light of the danger posed by hallucinatory AI treatment suggestions from now-ubiquitous large language models. In this work, we propose a method for safety assessment of the human-AI team in high-stakes decision-making scenarios. By studying the interactions between doctors and a decision support tool in a physical intensive care simulation centre, we conclude that most unsafe (i.e. potentially hallucinatory) AI recommendations would be stopped by the clinical team. Moreover, eye-tracking-based attention measurements indicate that doctors focus more on unsafe than safe AI suggestions. |
Paul Festor · Myura Nagendran · Anthony Gordon · Matthieu Komorowski · Aldo Faisal 🔗 |
-
|
feather - a Python SDK to share and deploy models
(
Afternoon Poster
)
link »
At its core, feather was a tool that allowed model developers to build shareable user interfaces for their models in under 20 lines of code. Using the Python SDK, developers specified visual components that users would interact with (e.g., a FileUpload component to allow users to upload a file). Our service then provided 1) a URL that allowed others to access and use the model visually via a user interface, and 2) an API endpoint to allow programmatic requests to a model. In this paper, we discuss feather's motivations and the value we intended to offer AI researchers and developers. For example, the SDK can support multi-step models and can be extended to run automatic evaluation against held-out datasets. We additionally provide comprehensive technical and implementation details. N.B. feather is presently a dormant project. We will open source the SDK for research purposes upon acceptance. |
Nihir Vedd 🔗 |