CIRBench: Evaluating Large Language Models as LLVM IR Optimizers
Zi Yang ⋅ Haifeng Ding ⋅ Fei Liu ⋅ Yingying Cheng ⋅ Han Cheng ⋅ Zhilei Chai ⋅ Haojie Zhou
Abstract
Large language models are beginning to introduce a new paradigm for compilation: instead of only assisting at the source level, they can operate directly on **intermediate representations (IRs)**, the compiler’s internal code representation. Early studies suggest that LLM-guided optimization can sometimes rival traditional compiler optimizations on selected programs, but the evidence remains fragmented, and the community still lacks a rigorous IR-level benchmark that tests whether a model not only understands IR but can also rewrite it under compiler-grade semantic constraints with meaningful performance impact. We present **CIRBench**, a benchmark of 800 curated IR instances spanning four compiler-oriented tracks: **Analysis** infers IR properties, **Repair** fixes invalid IR, **Refactor** applies a single semantics-preserving compiler optimization, and **Transform** performs performance-oriented rewrites; together, these tracks mirror the core optimization responsibilities of modern compilers. CIRBench combines IR verification, semantic equivalence checking, and end-to-end performance measurement into a unified, layered, correctness-aware evaluation of LLMs on IR. Across six mainstream LLMs, CIRBench shows that current models fail on many IR analysis and rewriting instances and at the median underperform the compiler baseline, although we also observe a maximum speedup of $4.96\times$ over -O3. These findings highlight both the opportunities and the remaining challenges of using LLMs inside optimizing compilers.