Recent advances in fine-tuning datasets and techniques for large language models have made these models flourish as general dialogue-based assistants that are well suited to strictly turn-based interactions. However, maintaining consistency in long-range, multi-turn dialogues remains a challenge, and many applications restrict conversations to a short window. Current multi-modal, vision-based interactions are likewise limited to turn-based exchanges over a static sequence of tokenized images with VQA-style referential querying. In this work, we present an approach for performing real-time, vision-based dynamic interaction with an auto-regressive language model. Our approach enables long-range consistency through continual visual grounding of language model inputs. Grounding uses a winnowing mechanism that hierarchically reduces a raw stream of pixels to a series of discrete events, which serve as conditioning variables for the language model. We present a novel dataset and benchmark for situated, visual interaction in the form of exercise coaching, and show that our approach can generate relevant and useful responses grounded in a real-time camera stream.
Author Information
Sunny Panchal (Qualcomm AI Research)
Guillaume Berger (Qualcomm Technologies Inc.)
Antoine Mercier (Qualcomm Technologies Inc.)
Cornelius Böhm (Aignostics GmbH)
Florian Dietrichkeit (LifeBonus Gesundheitsmanagement GmbH)
Xuanlin Li (UCSD)
Reza Pourreza (Qualcomm)
Pulkit Madan (Qualcomm)
Apratim Bhattacharyya (Qualcomm AI Research)
Mingu Lee (Qualcomm AI Research)
Mark Todorovich (Qualcomm)
Ingo Bax (Qualcomm AI Research)
Roland Memisevic (Qualcomm AI Research)
More from the Same Authors
- 2023: Look, Remember and Reason: Visual Reasoning with Grounded Rationales
  Apratim Bhattacharyya · Sunny Panchal · Reza Pourreza · Pulkit Madan · Mingu Lee · Roland Memisevic
- 2023 Poster: Reparameterized Policy Learning for Multimodal Trajectory Optimization
  Zhiao Huang · Litian Liang · Zhan Ling · Xuanlin Li · Chuang Gan · Hao Su
- 2023 Oral: Reparameterized Policy Learning for Multimodal Trajectory Optimization
  Zhiao Huang · Litian Liang · Zhan Ling · Xuanlin Li · Chuang Gan · Hao Su
- 2022 Poster: Improving Policy Optimization with Generalist-Specialist Learning
  Zhiwei Jia · Xuanlin Li · Zhan Ling · Shuang Liu · Yiran Wu · Hao Su
- 2022 Spotlight: Improving Policy Optimization with Generalist-Specialist Learning
  Zhiwei Jia · Xuanlin Li · Zhan Ling · Shuang Liu · Yiran Wu · Hao Su
- 2021: Neural Video Codec Demo
  Reza Pourreza