Situated Interaction with Real-Time State Conditioning of Language Models
Sunny Panchal · Guillaume Berger · Antoine Mercier · Cornelius Böhm · Florian Dietrichkeit · Xuanlin Li · Reza Pourreza · Pulkit Madan · Apratim Bhattacharyya · Mingu Lee · Mark Todorovich · Ingo Bax · Roland Memisevic

Recent advances in fine-tuning datasets and techniques have enabled large language models to flourish as general dialogue-based assistants well suited to strictly turn-based interactions. However, maintaining consistency in long-range, multi-turn dialogues remains a challenge, and many applications restrict conversations to a short window. Current multi-modal, vision-based interactions are likewise limited to turn-based exchanges over a static sequence of tokenized images with VQA-style referential querying. In this work, we present an approach for performing real-time, vision-based dynamic interaction with an auto-regressive language model. Our approach enables long-range consistency through continual visual grounding of language model inputs. Grounding uses a winnowing mechanism that hierarchically reduces a raw stream of pixels to a series of discrete events, which serve as conditioning variables for the language model. We present a novel dataset and benchmark for situated, visual interaction in the form of exercise coaching, and show that our approach can generate relevant and useful responses grounded in a real-time camera stream.
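To make the idea concrete, below is a minimal, hypothetical sketch of the pattern the abstract describes: a perception stage winnows a raw frame stream down to sparse discrete events, and those events are appended to the context of an auto-regressive language model, which decides when to respond. This is not the authors' implementation; all names here (Frame, detect_event, DummyCoachLM, interact) are illustrative placeholders.

```python
# Hypothetical sketch: condition an auto-regressive LM on discrete events
# distilled from a video stream. Placeholder names throughout.
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

Frame = List[float]  # stand-in for a decoded video frame


@dataclass
class Event:
    timestamp: float  # seconds since session start
    label: str        # e.g. "rep_completed", "back_not_straight"


def detect_event(frame: Frame) -> Optional[str]:
    """Placeholder perception head: most frames yield no event."""
    return "rep_completed" if frame and frame[0] > 0.9 else None


def winnow(frames: Iterable[Frame], fps: float = 30.0) -> Iterator[Event]:
    """Hierarchically reduce the raw pixel stream to sparse discrete events."""
    for i, frame in enumerate(frames):
        label = detect_event(frame)
        if label is not None:
            yield Event(timestamp=i / fps, label=label)


class DummyCoachLM:
    """Stand-in for an auto-regressive LM that decides when (not) to speak."""

    def generate(self, prompt: str) -> Optional[str]:
        if prompt.rstrip().endswith("rep_completed"):
            return "Nice rep, keep your core tight."
        return None


def interact(frames: Iterable[Frame], lm: DummyCoachLM,
             max_history: int = 32) -> Iterator[str]:
    """Append each event to a bounded context and yield any LM responses."""
    history: List[str] = []
    for event in winnow(frames):
        history.append(f"<event t={event.timestamp:.1f}s> {event.label}")
        history = history[-max_history:]  # keep a bounded conditioning window
        response = lm.generate("\n".join(history))
        if response:
            history.append(f"<coach> {response}")
            yield response


if __name__ == "__main__":
    # Two seconds of dummy frames with a single "rep" at the end.
    stream = [[0.1]] * 59 + [[0.95]]
    for reply in interact(stream, DummyCoachLM()):
        print(reply)
```

The point of the sketch is the interface, not the models: the language model never sees raw pixels, only a short, symbolic event history, which is what allows grounding to remain consistent over long sessions.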

Author Information

Sunny Panchal (Qualcomm AI Research)
Guillaume Berger (Qualcomm Technologies Inc.)
Antoine Mercier (Qualcomm Technologies Inc.)
Cornelius Böhm (Aignostics GmbH)
Florian Dietrichkeit (LifeBonus Gesundheitsmanagement GmbH)
Xuanlin Li (UCSD)
Reza Pourreza (Qualcomm)
Pulkit Madan (Qualcomm)
Apratim Bhattacharyya (Qualcomm AI Research)
Mingu Lee (Qualcomm AI Research)
Mark Todorovich (Qualcomm)
Ingo Bax (Qualcomm AI Research)
Roland Memisevic (Qualcomm AI Research)
