BEST: Benchmarking Efficiency in Space and Time for LLM-Generated Code
Aocheng Shen ⋅ Boyu Zhang ⋅ Jiaze Li ⋅ Ruixuan Ma ⋅ Qiankun Zhang ⋅ Wang ⋅ Bin Yuan ⋅ Shenghao Liu ⋅ Xianjun Deng
Abstract
Large language models (LLMs) have revolutionized research in software engineering, and among various tasks, LLM-based code synthesis is particularly promising. A recent line of benchmarks evaluates LLM-generated code for time efficiency, beyond mere correctness. However, *space*, another vital dimension of code efficiency, is rarely evaluated by prior benchmarks. To fill this gap, this paper introduces *BEST*, the first benchmark for evaluating the efficiency of LLM-generated code in *both time and space*. It comprises $440$ coding tasks rigorously constructed by experts. In addition, we propose a fine-grained *subtask-based* evaluation scheme that divides each task into multiple subtasks with different input scales and difficulties. Each subtask is accompanied by an expert-crafted standard implementation that serves as the efficiency baseline and achieves the *Pareto optimum*. Building on BEST, we introduce a unified dual-indicator (time and space) metric, dual@$k$, which generalizes the standard pass@$k$ metric via a carefully constructed *weight matrix* over subtasks. Through extensive experiments with dual@$k$ across $47$ LLMs on BEST, our evaluation demonstrates that while LLMs exhibit weak capabilities in generating time-efficient code, their capabilities in space-efficient code generation are even worse. The benchmark is provided in the supplementary material.
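For reference, the sketch below shows the widely used unbiased estimator of the standard pass@$k$ metric that dual@$k$ generalizes. This is only the baseline metric; the weight-matrix construction and the time/space indicators that define dual@$k$ itself are given in the paper body and are not reproduced here.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given n generated samples of which c are functionally correct."""
    if n - c < k:
        # Every size-k subset of the n samples contains a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 20 samples, 5 correct, k = 10
print(pass_at_k(n=20, c=5, k=10))  # probability a 10-sample draw passes
```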