

Oral in Affinity Workshop: LatinX in AI (LXAI) Research Workshop

Modeling Dynamic Social Vision Highlights Gaps Between Deep Learning and Humans

Kathy Garcia · Emalie McMahon · Colin Conwell · Michael Bonner · Leyla Isik

Keywords: [ vision ] [ social perception ] [ NeuroAI ] [ deep learning ] [ fMRI ]


Abstract:

Deep learning models trained on computer vision tasks are widely considered the most successful models of human vision to date. This work has largely focused on behavioral and neural responses to static images, but recent work has suggested that in addition to the ventral visual stream specializing in static object recognition, there is a lateral visual stream that processes dynamic, social content. Here, we investigated the ability of 350+ modern image, video, and language models to predict human ratings of, and neural responses to, the visual-social content of short video clips. We find that, unlike on prior benchmarks, even the best image-trained models do a poor job of explaining human behavioral judgements and neural responses. Language models outperform vision models in predicting behavior, but are less effective at modeling neural responses. In early- and mid-level lateral visual regions, video-trained models predicted neural responses far better than image-trained models. Overall, however, prediction by all models was lower for lateral than for ventral visual regions, particularly in the superior temporal sulcus. Together, these results identify a major gap in AI's ability to match human social vision, and highlight the importance of studying vision in dynamic, natural contexts.
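The abstract does not spell out the prediction pipeline, but model-to-brain comparisons of this kind are typically run as encoding models: a regularized linear regression maps each model's per-video features onto the measured neural (or behavioral) responses, and accuracy is scored on held-out clips. The sketch below is a generic, illustrative version of that approach on synthetic data; the array sizes, regression settings, and scoring are assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal, illustrative encoding-model sketch (not the authors' pipeline):
# map per-video model features to per-video responses with cross-validated
# ridge regression, and score held-out predictions by Pearson correlation.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Assumed, synthetic sizes: 200 clips, 512-dim features, 50 response targets
n_videos, n_features, n_targets = 200, 512, 50
X = rng.standard_normal((n_videos, n_features))      # stand-in for model activations
W = rng.standard_normal((n_features, n_targets))
Y = X @ W + 5.0 * rng.standard_normal((n_videos, n_targets))  # stand-in responses

scores = np.zeros(n_targets)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for train, test in cv.split(X):
    model = RidgeCV(alphas=np.logspace(-2, 4, 13))
    model.fit(X[train], Y[train])
    pred = model.predict(X[test])
    # accumulate per-target correlation between predicted and held-out responses
    for t in range(n_targets):
        scores[t] += np.corrcoef(pred[:, t], Y[test][:, t])[0, 1] / cv.get_n_splits()

print(f"mean held-out correlation across targets: {scores.mean():.3f}")
```

In practice, the features would come from the image, video, or language models under study, the targets would be voxel- or region-level fMRI responses (or behavioral ratings), and the cross-validated correlations would be compared against a noise ceiling; none of those specifics are given in this abstract.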
