SemRep: Code Transformation with Semantics-Preserving Representations
Abstract
Code transformation is a foundational capability in software development. Its effectiveness depends on a high-quality code representation that captures the semantics of the input code and guides the transformation. Existing approaches treat code transformation as an end-to-end learning task, either leaving the representation needed for semantic reasoning implicit in model weights or relying on expensive compiler-level abstractions. We present SemRep, a framework that improves code transformation through generative code representation learning. Our key insight is to use semantics-preserving transformations as the intermediate representation: generating them serves as a training task for the model, and they also guide the subsequent instruction-specific code transformations. Across general code editing and CUDA kernel optimization, SemRep outperforms strong closed-weight baselines by 6.9% and 43% in correctness, 13.9% in generalization, and 6.7% in robustness. Combined with an evolutionary coding agent, SemRep finds optimizations that larger 685B-parameter baselines fail to discover, while achieving the same performance with 25% less inference compute.