Style Conventions Override Performance Predictions in Coding LLMs
Matthew Kotzbauer
Abstract
When a coding model judges which of two equivalent programs will run faster, what is it comparing? We build a benchmark to test whether performance reasoning can override pretrained style conventions when the two conflict. The benchmark contains 92 pairs of equivalent Python snippets across 16 idiom families, in which the idiomatic version is measured to be slower by a factor of $\geq 1.05\times$. We query ten models from five providers in a forced-choice format with a confidence score. Every model scores near zero on at least one family while reporting high confidence, and each model's failures concentrate on different families, which points to model-specific style biases. A logistic classifier over 16 syntax features reaches 0.924 accuracy, outperforming every frontier LLM tested; the best of these reaches 0.793. The results suggest that style conventions learned during training may interfere with causal performance reasoning in coding models.
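As a rough illustration of the inclusion criterion described above, the following sketch times one candidate pair and checks it against the $1.05\times$ threshold. It is a minimal sketch, not the benchmark's actual harness: the two snippets, the helper names (`best_time`, `slowdown`, `qualifies`), and the timing settings are all assumptions made for illustration. Which snippet is slower depends on the interpreter and machine, so the check measures the ratio rather than assuming it.

```python
import timeit

MIN_SLOWDOWN = 1.05  # threshold stated in the abstract


def idiomatic(xs):
    # Hypothetical "idiomatic" form: generator expression fed to sum().
    return sum(x * x for x in xs)


def alternative(xs):
    # Hypothetical equivalent form: materialize a list first.
    return sum([x * x for x in xs])


def best_time(fn, arg, repeat=5, number=200):
    # Take the minimum over several repeats to reduce timing noise.
    return min(timeit.repeat(lambda: fn(arg), repeat=repeat, number=number))


def slowdown(slow_fn, fast_fn, arg):
    # Ratio > 1 means slow_fn took longer than fast_fn in this run.
    return best_time(slow_fn, arg) / best_time(fast_fn, arg)


def qualifies(slow_fn, fast_fn, arg, threshold=MIN_SLOWDOWN):
    # A pair is only meaningful if both snippets compute the same result.
    if slow_fn(arg) != fast_fn(arg):
        raise ValueError("snippets are not equivalent on this input")
    return slowdown(slow_fn, fast_fn, arg) >= threshold


if __name__ == "__main__":
    data = list(range(1000))
    ratio = slowdown(idiomatic, alternative, data)
    print(f"slowdown ratio: {ratio:.3f}, qualifies: {ratio >= MIN_SLOWDOWN}")
```

Under these assumptions, a pair would enter the benchmark only if `qualifies` returns `True` with the idiomatic snippet in the slow position; a real harness would likely also control for input size and interpreter version.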