STT-LLM: Structural-Temporal Tokenization for Adapting LLMs to Longitudinal Clinical Profiles
Abstract
Large language models (LLMs) have shown strong generalization across natural language tasks, but their application to longitudinal clinical profiles remains underexplored. In sports anti-doping, biological profiles are analyzed to support early detection of prohibited substance use and identification of anomalous biological patterns, both of which require joint modeling of temporal dynamics and metabolic relationships. We propose STT-LLM, a structural-temporal tokenization framework that adapts LLMs to longitudinal clinical analysis without modifying their backbone architectures. STT-LLM constructs biologically grounded structural-temporal embeddings and transforms them into LLM-compatible tokens via specialized tokenizers that explicitly encode pathway structure and temporal evolution. We evaluate STT-LLM on real-world longitudinal datasets from athletes, showing consistent improvements over native LLM tokenization strategies in both sequence prediction and anomaly detection. In addition, we present a case study in which STT-LLM provides contextual reasoning that aligns more closely with expert assessments than that of baseline models. These results highlight tokenization as both a key bottleneck and an opportunity for adapting LLMs to clinical data.