WatchLog: Efficient and Interpretable Event Reasoning for Endpoint Detection and Response Logs with Multimodal LLMs
Abstract
Endpoint Detection and Response (EDR) systems are crucial for identifying malicious activities on endpoint devices, yet existing methods struggle to efficiently model ultra-long log sequences and to provide interpretable reasoning for security analysts. We propose WatchLog, a novel framework that represents raw logs as video-structured data, enabling scalable and expressive video-language modeling of endpoint behaviors. Each event is encoded as a key–value-guided image, and the resulting images are temporally organized into a video sequence. To capture long-range dependencies, WatchLog employs a temporal cross-attention adapter that enables pixel-wise interaction across time. The adapter acts as an auxiliary temporal reasoning pathway, aligning spatial representations with relevant temporal contexts while preserving the original behavioral semantics. We adopt a two-stage pre-training strategy followed by supervised fine-tuning to generate behavior explanations grounded in event-level semantics and detection outcomes. Experiments on our newly constructed EDR8M-20R dataset and a public benchmark demonstrate that WatchLog consistently outperforms state-of-the-art methods in detection accuracy and recall, while offering more interpretable reasoning traces and significantly improved inference efficiency. Extensive ablation studies further support the robustness and interpretability of the proposed method.
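The temporal cross-attention adapter sketched in the abstract, where each spatial position of an event image attends across time steps, can be illustrated roughly as follows. This is a minimal single-head sketch under assumed shapes; the function name, projection dimensions, and residual formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_cross_attention(frames, d_k=16, seed=0):
    """Hypothetical sketch: each spatial position attends over the
    T time steps of an event-image sequence.
    frames: (T, P, D) -- T events, P spatial positions, D channels."""
    rng = np.random.default_rng(seed)
    T, P, D = frames.shape
    # Random projections stand in for learned weights
    Wq = rng.standard_normal((D, d_k)) / np.sqrt(D)
    Wk = rng.standard_normal((D, d_k)) / np.sqrt(D)
    Wv = rng.standard_normal((D, D)) / np.sqrt(D)
    # Rearrange so attention runs over the time axis per position
    x = frames.transpose(1, 0, 2)                 # (P, T, D)
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # (P, T, d_k) / (P, T, D)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k), axis=-1)  # (P, T, T)
    out = attn @ v                                # (P, T, D)
    # Residual connection keeps the original spatial semantics intact
    return (x + out).transpose(1, 0, 2)           # (T, P, D)

frames = np.random.default_rng(1).standard_normal((8, 64, 32))
out = temporal_cross_attention(frames)
print(out.shape)
```

The residual form mirrors the abstract's claim that the adapter acts as an auxiliary temporal pathway while preserving the original behavioral representation; output shapes match the input so the adapter can be dropped into an existing vision backbone.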