Poster in Affinity Event: GlobalSouthML @ ICML 2026

Evaluation of Contextual Understanding in Large Language Models

Mamta Nallaretnam ⋅ Subavarshana Arumugam ⋅ Kithuni Wickramasinghe ⋅ Chamath Gunapala ⋅ Uthayasanker Thayasivam ⋅ Kamal Premaratne ⋅ Pragatheeswaran Vipulanandan

Abstract

Large Language Models (LLMs) demonstrate impressive performance across diverse NLP tasks, yet their ability to exhibit genuine contextual understanding remains uncertain. Traditional evaluation metrics such as perplexity, BLEU, or surface-level accuracy fail to reveal how well LLMs extract, integrate, and reason over contextual information. This gap is particularly critical in question answering, where models must align responses with contextually grounded knowledge rather than memorized associations. We propose a novel knowledge-graph-based evaluation framework that introduces S3KG, a hybrid similarity measure integrating structural and semantic similarity into a continuous evaluation score, alongside a diagnostic framework for categorizing reasoning errors. Together, these components establish a reproducible pipeline for measuring correctness, faithfulness, and interpretability in LLM-generated responses.
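
The abstract does not give the S3KG formula, so the following is only an illustrative sketch of how a structural score and a semantic score over two knowledge graphs might be blended into one continuous value. Everything in it is an assumption rather than the authors' method: the triple representation, the Jaccard-based structural score, the token-overlap stand-in for semantic similarity, the weight `alpha`, and all function names are hypothetical.

```python
from typing import Set, Tuple

# Assumption: a knowledge graph is a set of (subject, relation, object) triples.
Triple = Tuple[str, str, str]


def jaccard(a: Set, b: Set) -> float:
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def structural_similarity(ref: Set[Triple], cand: Set[Triple]) -> float:
    """Exact-match overlap of triples (a simple proxy for graph structure)."""
    return jaccard(ref, cand)


def _tokens(triple: Triple) -> Set[str]:
    """Lowercased word tokens drawn from all three parts of a triple."""
    return {tok for part in triple for tok in part.lower().split()}


def semantic_similarity(ref: Set[Triple], cand: Set[Triple]) -> float:
    """Average best token overlap per reference triple.

    Token overlap is a stand-in here; an embedding-based similarity
    could be substituted without changing the rest of the sketch.
    """
    if not ref:
        return 1.0 if not cand else 0.0
    if not cand:
        return 0.0
    best = [max(jaccard(_tokens(r), _tokens(c)) for c in cand) for r in ref]
    return sum(best) / len(best)


def hybrid_score(ref: Set[Triple], cand: Set[Triple], alpha: float = 0.5) -> float:
    """Weighted blend of the two scores; alpha is a hypothetical mixing weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (alpha * structural_similarity(ref, cand)
            + (1.0 - alpha) * semantic_similarity(ref, cand))


reference = {("Paris", "capital of", "France")}
candidate = {("Paris", "is capital of", "France")}
score = hybrid_score(reference, candidate)
```

In this toy case the two triples differ as exact strings, so the structural term is zero, while the token overlap keeps the semantic term high. A blended score can therefore give partial credit to a paraphrased but contextually grounded answer, which an exact-match metric would not.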