

Poster in Workshop: Neural Conversational AI Workshop - What’s left to TEACH (Trustworthy, Enhanced, Adaptable, Capable and Human-centric) chatbots?

Situated Interaction with Real-Time State Conditioning of Language Models

Sunny Panchal · Guillaume Berger · Antoine Mercier · Cornelius Böhm · Florian Dietrichkeit · Xuanlin Li · Reza Pourreza · Pulkit Madan · Apratim Bhattacharyya · Mingu Lee · Mark Todorovich · Ingo Bax · Roland Memisevic


Abstract:

Recent advances in fine-tuning datasets and techniques have enabled large language models to flourish as general dialogue-based assistants, though they remain best suited to strictly turn-based interaction. Maintaining consistency in long-range, multi-turn dialogues remains a challenge, and many applications restrict conversations to a short window. Current multi-modal, vision-based systems are likewise limited to turn-based exchanges over a static sequence of tokenized images with VQA-style referential querying. In this work, we present an approach for real-time, vision-based dynamic interaction with an auto-regressive language model. Our approach enables long-range consistency through continual visual grounding of the language model's inputs: a winnowing mechanism hierarchically reduces the raw stream of pixels to a series of discrete events that serve as conditioning variables for the language model. We present a novel dataset and benchmark for situated visual interaction in the form of exercise coaching, and show that our approach can generate relevant and useful responses grounded in a real-time camera stream.
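To make the conditioning pattern concrete, below is a minimal Python sketch of the general idea described in the abstract: a winnowing stage reduces dense frames to sparse, discrete events, and those events are appended to the language model's context so its responses stay grounded in the live stream. This is not the authors' implementation; all names (Winnower, StateConditionedLM, extract_score, the fake frame stream) are hypothetical stand-ins.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Event:
    """A discrete event distilled from the raw pixel stream."""
    timestamp: float
    label: str  # e.g. "rep_completed", "user_paused", "form_break"

class Winnower:
    """Hierarchically reduces raw frames to sparse, discrete events.

    Stage 1 (every frame): cheap per-frame feature extraction (stubbed here).
    Stage 2 (on state change): emit an Event only when the tracked signal
    crosses a threshold, so the language model is conditioned on a sparse
    event stream rather than a dense sequence of tokenized images.
    """
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.prev_score = 0.0

    def extract_score(self, frame: bytes) -> float:
        # Stand-in for a real vision model (e.g. a pose or action classifier).
        return (sum(frame) % 100) / 100.0

    def step(self, frame: bytes, now: float) -> Optional[Event]:
        score = self.extract_score(frame)
        event = None
        # Rising-edge detection: emit an event only on an upward transition.
        if score >= self.threshold > self.prev_score:
            event = Event(timestamp=now, label="rep_completed")
        self.prev_score = score
        return event

class StateConditionedLM:
    """Keeps a running context of events interleaved with dialogue so an
    auto-regressive language model stays grounded in the live stream."""
    def __init__(self):
        self.context: List[str] = []

    def observe(self, event: Event) -> None:
        # Each event is appended to the prompt as a conditioning variable.
        self.context.append(f"[{event.timestamp:.1f}s] EVENT: {event.label}")

    def respond(self) -> str:
        # Stand-in for an actual LM call conditioned on self.context.
        reps = sum("rep_completed" in line for line in self.context)
        return f"Nice work, that's {reps} reps so far. Keep your core tight!"

# Driving loop: winnow the camera stream and let the model speak on events.
winnower, coach = Winnower(), StateConditionedLM()
fake_stream = [bytes([(i * 37) % 256]) * 4 for i in range(30)]  # stand-in frames
for i, frame in enumerate(fake_stream):
    event = winnower.step(frame, now=i / 30.0)  # timestamps at 30 fps
    if event is not None:
        coach.observe(event)
        print(coach.respond())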
