A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets
Abstract
Predictive code completion greatly accelerates developers' work. Spreadsheets, despite being far more widely used, offer virtually no such auto-completion features. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges are (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1), we symbolically generate action sequences using parametrized heuristics and refine them (LLM + human) to create 58 sequences comprising 13K actions from publicly available spreadsheets. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the remaining future actions upon acceptance, and repeats until the target spreadsheet is obtained. Using language models (LMs) as baseline predictive systems, we analyze what the benchmark reveals, including properties of saved actions and false positives, efficiency, and the effects of user profiles, triggers, context, and prediction length.
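To make the online evaluation protocol concrete, the following is a minimal sketch of the replay loop it describes. The `Action` type, the `Predictor` signature, and the exact-match acceptance oracle are assumptions for illustration; the benchmark's actual action representation and acceptance criterion may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical types: an Action stands for one recorded spreadsheet edit
# (set a cell, fill a range, insert a row, ...); a Predictor maps the action
# history so far to a proposed list of next actions (possibly empty).
Action = str
Predictor = Callable[[List[Action]], List[Action]]

@dataclass
class EvalResult:
    accepted: int = 0   # predicted actions the user accepted (actions saved)
    rejected: int = 0   # predictions offered but rejected (false positives)
    performed: int = 0  # actions the user still had to perform manually

def online_eval(trace: Sequence[Action], predict: Predictor) -> EvalResult:
    """Replay a recorded action trace, querying the predictor after each step.

    A prediction is 'accepted' here iff it exactly matches the next actions
    in the trace -- a simplified oracle standing in for the benchmark's
    acceptance criterion.
    """
    history: List[Action] = []
    remaining = list(trace)
    result = EvalResult()
    while remaining:
        # The user performs the next action manually.
        history.append(remaining.pop(0))
        result.performed += 1
        # The system proposes zero or more follow-up actions.
        proposal = predict(history)
        if proposal and proposal == remaining[: len(proposal)]:
            # Accepted: the proposed actions are applied automatically,
            # so the remaining future actions shrink accordingly.
            history.extend(proposal)
            remaining = remaining[len(proposal):]
            result.accepted += len(proposal)
        elif proposal:
            result.rejected += 1
    return result
```

The loop terminates when the remaining actions are exhausted, i.e., when the target spreadsheet is obtained; `accepted` and `rejected` correspond directly to the saved-action and false-positive properties the benchmark analyzes.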