Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas
Abstract
Long-form TV dramas pose a formidable challenge for comprehensive video understanding: deciphering their complex storylines often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce DramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines spanning more than 900 unique characters, which necessitates integrating auditory, linguistic, and visual cues for speaker recognition. (2) We propose DramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM autonomously aggregates contextual evidence via multimodal tool use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances, where acoustic biometrics are inherently unreliable. All data and code will be made publicly available.