Rethinking Scaffolding in LLM Tutors: The Interactional Mismatch Between Benchmarks and Real-World Deployments
Abstract
Driven by the promise of scalable, personalised student support, aligning large language models with educational values has become an active area of research. A central pedagogical value in this effort is scaffolding: guiding students through graduated steps toward a solution rather than providing direct answers. Alignment and evaluation methods for embedding scaffolding behaviour into chatbots, however, rest on an implicit assumption: that students will take up the scaffolding and engage in the conversation. In this paper, we empirically examine whether this assumption holds. We introduce an evaluation pipeline that measures chatbot scaffolding and student control, and apply it to nine datasets comprising 9,490 chats, spanning AI tutor benchmarks as well as scaffolding-aligned and unaligned chatbots. Our analysis reveals that students in real-world settings frequently bypass scaffolding and exercise substantially more control over the interaction than benchmarks assume. We argue that this mismatch exposes two structural limitations in current paradigms: (1) the affordances of the chat interface, which permit students to bypass scaffolding and treat the chatbot as an on-demand answer tool, and (2) the lack of external learning context available to existing AI tutors, without which they cannot accurately tailor their scaffolding.