ICML Poster Multi-Factor Adaptive Vision Selection for Egocentric Video Question Answering



Haoyu Zhang · Meng Liu · Zixin Liu · Xuemeng Song · Yaowei Wang · Liqiang Nie



2024 Poster

Abstract:

Interpreting the world from a human perspective remains a challenge in Artificial Intelligence (AI), and it is particularly evident in egocentric video question answering, which must contend with small object recognition, noise suppression, and spatial-temporal reasoning. To address these challenges, we introduce the Multi-Factor Adaptive Vision Selection (MFAS) framework. MFAS integrates a patch partition and merging module for enhanced small object recognition, a prior-guided patch selection module for noise suppression and focused analysis, and a hierarchical aggregation network that aggregates visual semantics under the guidance of the question. Extensive experiments on several public egocentric datasets validate the effectiveness and generalization of our framework. Code and data are available at https://github.com/Hyu-Zhang/EgoVideoQA.
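
To make the idea of patch selection concrete, the following is a minimal sketch, not the MFAS implementation. The abstract does not say which priors guide the selection, so this example assumes a single hypothetical prior: the cosine similarity between each patch feature and a question feature. The names `cosine` and `select_patches`, and the toy two-dimensional features, are illustrative assumptions only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_patches(patch_feats, question_feat, keep):
    """Return the indices of the `keep` patches most similar to the question.

    Assumption: question similarity stands in for the "prior"; the actual
    priors used by MFAS are not specified in the abstract.
    Indices are returned in their original order so spatial layout is preserved.
    """
    scores = [cosine(p, question_feat) for p in patch_feats]
    ranked = sorted(range(len(patch_feats)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: four patches with 2-D features and one question feature.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
question = [1.0, 0.1]
kept = select_patches(patches, question, keep=2)
print(kept)
```

Dropping low-scoring patches before any later aggregation step is one simple way to suppress background noise; a real system would likely combine several scoring factors rather than the single one assumed here.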
