LEGO: An LLM-Enabled Hierarchical Optimizer for Tensor Computation Graphs with Structure-Aware Search and Compositional Synthesis
Abstract
Automating end-to-end GPU kernel generation with Large Language Models (LLMs) faces a critical tension between global performance and exploration efficiency. We present LEGO, a hierarchical framework that resolves this trade-off via a parallel multi-agent search over a recursive AND-OR FusionTree. LEGO synergizes two complementary flows: Top-Down Construction decomposes complex graphs into valid, context-isolated sub-problems to guarantee correctness and enable parallel exploration, while Bottom-Up Mutation speculatively fuses verified sub-plans to recover global locality for peak performance. This bi-directional mechanism effectively prunes the search space, avoiding repetitive unguided sampling, while naturally parallelizing exploration and enabling the discovery of sophisticated fusion strategies. Evaluations demonstrate that LEGO achieves 2.18x–13.48x speedups over PyTorch Eager and reduces end-to-end exploration time by up to 2.47x (with a 7x token reduction) compared to monolithic baselines across diverse end-to-end models.
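To make the AND-OR FusionTree abstraction concrete, the following is a minimal, hypothetical sketch (the names `FusionNode`, `kind`, and `count_plans` are illustrative, not LEGO's actual API): AND nodes represent a decomposition whose sub-problems must all be solved, while OR nodes represent alternative fusion strategies for the same sub-graph, so the number of complete plans multiplies across AND children and sums across OR children.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FusionNode:
    """Illustrative node in an AND-OR tree over fusion plans (not LEGO's API)."""
    op: str                      # sub-graph or operator this node covers
    kind: str = "OR"             # "AND": solve all children; "OR": choose one alternative
    children: List["FusionNode"] = field(default_factory=list)

def count_plans(node: FusionNode) -> int:
    """Count distinct complete plans under an AND-OR tree:
    AND multiplies child counts, OR sums them; a leaf is one plan."""
    if not node.children:
        return 1
    counts = [count_plans(c) for c in node.children]
    if node.kind == "AND":
        total = 1
        for c in counts:
            total *= c
        return total
    return sum(counts)

# Example: a root graph decomposed (AND) into two sub-problems,
# each with two alternative fusion strategies (OR).
root = FusionNode("graph", "AND", [
    FusionNode("attention", "OR", [FusionNode("fused"), FusionNode("split")]),
    FusionNode("mlp", "OR", [FusionNode("fused"), FusionNode("split")]),
])
print(count_plans(root))  # 2 * 2 = 4 complete plans
```

This structure shows why pruning helps: verifying sub-plans at OR nodes independently (top-down) avoids re-sampling the full product space, while recombining verified children of an AND node (bottom-up) recovers cross-sub-graph fusion opportunities.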