Beyond Pixels: Embedding-Guided Keyframe Selection for VidLLMs
Abstract
Temporal sub-sampling is an unavoidable bottleneck for every Video Large Language Model (VidLLM), yet it is almost universally performed by pixel-space heuristics before any model computation. We propose VKF (Vector Keyframe Selection), which uses the VidLLM's own SigLIP vision encoder to greedily select the k most semantically diverse frames via farthest-first traversal, inheriting that algorithm's 2-approximation guarantee for the k-center problem. On UCF101-Tiny (10 action classes) with SmolVLM2-500M, VKF achieves 71.0% Top-1 accuracy, outperforming equidistant sampling by 6 pp and random sampling by 8 pp. Critically, running the selection encoder with INT8 quantization exactly matches the FP32 accuracy of 71.0% while halving memory relative to FP16 (172.9 MB → 86.4 MB). We introduce Mean Minimum Distance (MMD) as a model-aligned semantic coverage metric, which correlates with downstream accuracy at Spearman ρ = −0.786 (p = 0.021), and we release UCF101-Tiny together with reproducible code.
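The farthest-first traversal named above can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function name `farthest_first`, the choice of frame 0 as the starting point, and the use of Euclidean distance on L2-normalised embeddings are all assumptions made for the example, and the input is taken to be a precomputed array of per-frame embeddings with no all-zero rows.

```python
import numpy as np


def farthest_first(embeddings: np.ndarray, k: int, start: int = 0) -> list[int]:
    """Greedy k-center selection over per-frame embeddings.

    `embeddings` has shape (n_frames, dim). The start index and the
    distance metric are illustrative assumptions, not taken from the paper.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    n = x.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of frames")

    # L2-normalise so Euclidean distance is monotone in cosine distance.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)

    selected = [start]
    # Distance from every frame to its nearest already-selected frame.
    min_dist = np.linalg.norm(x - x[start], axis=1)

    for _ in range(k - 1):
        # Pick the frame that is farthest from the current selection.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update nearest-selected distances with the new center.
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[nxt], axis=1))

    # Return indices in temporal order for downstream use.
    return sorted(selected)


# Toy example: five 2-D "embeddings" forming three directions.
toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0], [-1.0, 0.0]])
picked = farthest_first(toy, k=3)
```

Each iteration costs one pass over the frames, so selecting k of n frames takes O(nk) distance evaluations once the embeddings are computed. The greedy rule is what carries the 2-approximation bound for the k-center objective: the largest distance from any frame to its nearest selected frame is at most twice the optimum.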