Oral in Affinity Workshop: LatinX in AI (LXAI) Research Workshop
Modeling Dynamic Social Vision Highlights Gaps Between Deep Learning and Humans
Kathy Garcia · Emalie McMahon · Colin Conwell · Michael Bonner · Leyla Isik
Keywords: [ vision ] [ social perception ] [ NeuroAI ] [ deep learning ] [ fMRI ]
Deep learning models trained on computer vision tasks are widely considered the most successful models of human vision to date. This work has largely focused on behavioral and neural responses to static images, but recent work suggests that, in addition to the ventral visual stream, which specializes in static object recognition, a lateral visual stream processes dynamic, social content. Here, we investigated the ability of 350+ modern image, video, and language models to predict human ratings of, and neural responses to, the visual-social content of short video clips. We find that, unlike on prior benchmarks, even the best image-trained models explain human behavioral judgments and neural responses poorly. Language models outperform vision models in predicting behavior but are less effective at modeling neural responses. In early- and mid-level lateral visual regions, video-trained models predict neural responses far better than image-trained models. Overall, however, prediction by all models is lower for lateral than for ventral visual regions, particularly in the superior temporal sulcus. Together, these results identify a major gap in AI's ability to match human social vision and highlight the importance of studying vision in dynamic, natural contexts.
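The abstract summarizes, rather than specifies, the prediction pipeline. A common approach in model-to-brain benchmarks of this kind is to extract frozen activations from each candidate model for every stimulus and map them to voxel responses (or behavioral ratings) with cross-validated ridge regression. The sketch below illustrates that approach under stated assumptions; the data shapes, random stand-in features, and scoring metric are illustrative, not details taken from the paper.

```python
# Minimal sketch of a voxelwise encoding analysis of the kind the abstract
# describes: frozen model features per video clip are mapped to brain
# responses with cross-validated ridge regression. All shapes and data
# below are illustrative assumptions, not details from the paper.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Assumed shapes: 200 video clips, 512-d model embeddings, 1000 voxels.
n_clips, n_features, n_voxels = 200, 512, 1000
X = rng.standard_normal((n_clips, n_features))  # stand-in for model activations
Y = rng.standard_normal((n_clips, n_voxels))    # stand-in for fMRI responses

def encoding_scores(X, Y, n_splits=5, alphas=np.logspace(-2, 5, 8)):
    """Cross-validated voxelwise prediction: fit ridge regression on training
    clips, then score each voxel by the Pearson r between its predicted and
    held-out responses."""
    scores = np.zeros((n_splits, Y.shape[1]))
    folds = KFold(n_splits, shuffle=True, random_state=0).split(X)
    for i, (tr, te) in enumerate(folds):
        # Standardize features using training statistics only.
        scaler = StandardScaler().fit(X[tr])
        model = RidgeCV(alphas=alphas).fit(scaler.transform(X[tr]), Y[tr])
        pred = model.predict(scaler.transform(X[te]))
        # Pearson r per voxel, computed as the mean product of z-scores.
        p = (pred - pred.mean(0)) / pred.std(0)
        t = (Y[te] - Y[te].mean(0)) / Y[te].std(0)
        scores[i] = (p * t).mean(0)
    return scores.mean(0)  # mean r per voxel across folds

voxel_r = encoding_scores(X, Y)
print(f"median cross-validated r across voxels: {np.median(voxel_r):.3f}")
```

Scoring each voxel by the correlation between predicted and held-out responses, averaged across folds, is one standard way such benchmarks place image-, video-, and language-model features on a common footing; the paper may well use a different regression or noise-ceiling correction.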