
Implementing block-sparse matrix multiplication kernels using Triton
Priya Mishra · Trevor Gale · Matei Zaharia · Cliff Young · Deepak Narayanan
Event URL: https://openreview.net/forum?id=doa11nN5vG

MegaBlocks is the state-of-the-art system for efficient training of MoE models, built on block-sparse matrix multiplication kernels. The library is currently restricted to a specific block size in the sparse matrices, a specific data type, and a specific GPU architecture. This is due to the CUDA kernels used for the block-sparse matrix products in the MoE layers: these kernels have been hand-tuned and manually optimized for the highest performance under a specific choice of parameters. In this work, we evaluate rewriting these kernels in Triton, a Python-embedded domain-specific language (DSL) for writing high-performance GPU kernels. We show that it is possible to achieve the same level of performance as the hand-tuned CUDA kernels while maintaining portability across GPU architectures and easily supporting different block sizes and data types without any code changes. We identify the challenges and advantages of using Triton for implementing these block-sparse matrix multiplication kernels.
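To make the underlying computation concrete, the following is a minimal NumPy sketch of a block-sparse matrix product of the kind these kernels implement. It is illustrative only and not MegaBlocks' or Triton's actual API: the sparse operand is stored as a list of nonzero square tiles with block-row/block-column indices (a simplified BCSR-like layout of our own choosing), and the block size is an ordinary parameter, mirroring how a Triton kernel can take the block size as a compile-time constant rather than a hand-tuned fixed value.

```python
import numpy as np

def blocksparse_matmul(a, blocks, block_rows, block_cols, block_size, n):
    """Compute A @ B where A is dense (m x k) and B (k x n) is block-sparse.

    B is described only by its nonzero tiles: blocks[i] is a
    (block_size x block_size) array sitting at block-row block_rows[i]
    and block-column block_cols[i]. Zero tiles are skipped entirely,
    which is where the sparsity saves work.
    """
    m = a.shape[0]
    out = np.zeros((m, n), dtype=a.dtype)
    bs = block_size
    for tile, br, bc in zip(blocks, block_rows, block_cols):
        # Each nonzero tile of B contributes one small dense matmul,
        # accumulated into the matching column slab of the output.
        out[:, bc * bs:(bc + 1) * bs] += a[:, br * bs:(br + 1) * bs] @ tile
    return out
```

For example, a 4x4 operand with two nonzero 2x2 tiles multiplies identically to its dense equivalent, but only the two stored tiles are ever touched. In a real Triton kernel the per-tile loop becomes the grid of thread blocks, and each small matmul is a tiled inner product in shared memory.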

Author Information

Priya Mishra (Computer Science Department, Stanford University)
Trevor Gale (Google Brain)
Matei Zaharia (Stanford University)
Cliff Young
Deepak Narayanan (Microsoft Research)
