Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming
Guray Ozen
Exploiting the formidable computational capabilities of modern GPU tensor cores remains a challenging endeavor for developers. Existing programming models such as CUDA and OpenCL are ill-suited to the non-SIMT nature of tensor cores, leaving a significant gap in the landscape of GPU programming languages. Vendors have primarily relied on library-based solutions or enhancements to mainstream machine learning frameworks, sacrificing the fine-grained control that CUDA afforded in the SIMT era. In this paper, we introduce NVDSL, a Python-embedded domain-specific language built on the MLIR compiler infrastructure. NVDSL abstracts away the intricate details of tensor core programming: it lets programmers target Hopper's warpgroup (128 threads, or 4 warps) directly and express sophisticated algorithms, such as multistage pipelining and warp specialization, with remarkable simplicity. We demonstrate its efficacy through two optimized GEMM kernels that achieve cuBLAS-like performance with clear, concise code. NVDSL is publicly available in upstream MLIR, and the work was presented at EuroLLVM 2024: https://www.youtube.com/watch?v=V3Q9IjsgXvA.
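To give a flavor of the programming model described above, the following is a minimal sketch of how a multistage GEMM kernel might read in NVDSL. It assumes a decorator-based, host-driven API modeled loosely on the upstream MLIR NVDSL examples; every name used here (NVDSL.mlir_func, NVDSL.mlir_gpu_launch, block_id, tma_load, wgmma, wgmma_accumulator, store_tile) is an illustrative assumption, not the exact upstream API.

# Illustrative sketch only: the module path and all helpers below are
# assumptions modeled on the upstream MLIR NVDSL examples, not verified API.
from nvdsl import (NVDSL, block_id, tma_load, wgmma,
                   wgmma_accumulator, store_tile)  # hypothetical module

TILE_M, TILE_N, TILE_K = 128, 128, 64

@NVDSL.mlir_func
def gemm_multistage(a, b, d, num_stages):
    # One CTA, i.e. a single warpgroup (4 warps / 128 threads), computes
    # one TILE_M x TILE_N tile of the output D.
    grid = (d.rows // TILE_M, d.cols // TILE_N, 1)

    @NVDSL.mlir_gpu_launch(grid=grid, block=(128, 1, 1))
    def gemm_kernel():
        bx, by = block_id(0), block_id(1)        # which output tile this CTA owns
        acc = wgmma_accumulator(TILE_M, TILE_N)  # tensor-core accumulator
        for k in range(0, a.cols, TILE_K):
            stage = (k // TILE_K) % num_stages   # rotate shared-memory buffers
            # TMA copies the next A/B tiles into shared memory; with
            # num_stages buffers, these copies overlap the MMAs below.
            a_tile = tma_load(a, bx * TILE_M, k, stage=stage)
            b_tile = tma_load(b, k, by * TILE_N, stage=stage)
            acc = wgmma(a_tile, b_tile, acc)     # warpgroup-wide tensor-core MMA
        store_tile(d, bx * TILE_M, by * TILE_N, acc)

    gemm_kernel()

A warp-specialized variant would instead split the threads into a producer role that drives the TMA copies and a consumer role that issues the MMAs; the point of NVDSL is that such a structural change stays at the Python level rather than being buried in low-level intrinsics.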