StitchCUDA: An Automated Multi-Agent End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning
Shiyang Li ⋅ Zijian Zhang ⋅ Winson Chen ⋅ Yuebo Luo ⋅ Mingyi Hong ⋅ Caiwen Ding
Abstract
Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging because it depends on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise in automated GPU kernel generation, prior work focuses mainly on single-kernel optimization and does not extend to end-to-end programs, hindering practical deployment. To address this challenge, we propose \textsc{StitchCUDA}, a multi-agent framework for end-to-end GPU program generation with three specialized agents: a \textit{Planner} that orchestrates the overall system design, a \textit{Coder} that implements it step by step, and a \textit{Verifier} that checks correctness and profiles performance using Nsys/NCU. To fundamentally improve the \textit{Coder}'s ability in end-to-end GPU programming, \textsc{StitchCUDA} integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, combining a rubric reward with a rule-based reward from real executions. As a result, the \textit{Coder} learns to apply advanced CUDA programming techniques (e.g., custom kernel fusion, cuBLAS epilogues), and we effectively prevent the \textit{Coder}'s reward hacking (e.g., copying PyTorch code or hardcoding outputs) during benchmarking. Experiments on KernelBench show that \textsc{StitchCUDA} achieves a nearly 100\% success rate on end-to-end GPU programming tasks, with 1.72$\times$ better speedup over the multi-agent baseline and 2.73$\times$ over the RL model baselines.