Making Execution Time a Trainable Reward for Code Generation
Abstract
Reinforcement learning for code generation usually treats execution as a correctness verifier: a program receives reward when it passes hidden tests. Using execution time as reward is harder, because weak tests, noisy timings, and sparse rewards can make a fast-but-wrong program look attractive. We study a controlled setting where competitive-programming problems are augmented with generated large-input tests, so correct human reference solutions have enough runtime spread to support timing-based comparisons. We then turn execution time into a reward by first checking correctness and only then comparing runtimes to human references, either as a graded quality-percentile score or as a collapsed binary top-percentile event. On Qwen 2.5 7B, the best binary percentile reward improves strict speed-conditioned pass@1 from 18.1% to 27.7%, while preserving most of the pure correctness score.