Evaluation of Contextual Understanding in Large Language Models
Abstract
Large Language Models (LLMs) demonstrate impressive performance across diverse NLP tasks, yet their ability to exhibit genuine contextual understanding remains uncertain. Traditional evaluation metrics such as perplexity, BLEU, or surface-level accuracy fail to reveal how well LLMs extract, integrate, and reason over contextual information. This gap is particularly critical in question answering, where models must align responses with contextually grounded knowledge rather than memorized associations. We propose a novel knowledge graph-based evaluation framework with two components: S3KG, a hybrid similarity measure that integrates structural and semantic similarity into a continuous evaluation score, and a diagnostic framework for categorizing reasoning errors. Together, these components establish a reproducible pipeline for measuring correctness, faithfulness, and interpretability in LLM-generated responses.