Seeing Without Light: A Benchmark Thermal Dataset for Robust Gesture Recognition
Abstract
In real-world environments, gesture recognition systems must cope with varying camera-to-subject distances and with illumination changes, including low-light and no-light conditions. We study gesture recognition using both RGB and thermal data. To this end, we collect a dataset of 749 clips from 107 subjects performing seven gestures at distances of 4 ft, 6 ft, and 8 ft, recorded in synchronized RGB and thermal modalities. We evaluate five model families, including RGB–thermal dual‑stream fusion architectures. Dual‑stream fusion achieves up to 98.9\% accuracy when trained on all distances. Cross‑distance generalization degrades when models are trained on a single distance and improves substantially when multiple distances are included in training. Moreover, models trained on thermal data transfer better to RGB in a zero‑shot setting than models trained on RGB transfer to thermal, revealing a clear modality asymmetry that affects real‑world deployment.