Invited Talk
in
Workshop: 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)
Enabling extremely fast inference and training performance using dataflow and a custom chip
Urmish Thakker
As the pursuit of larger language models continues to push the boundaries of computational demands, the traditional silicon chip is facing a daunting memory/power wall. In this talk, we present a novel chip design, SN40L, to tackle this challenge. This chip combines a reconfigurable dataflow architecture with a tightly coupled three-tier memory hierarchy to enable efficient execution of compute-intensive training and memory-bound inference workloads across a wide variety of neural network architectures. We discuss the advantages of this chip through three case studies. In the first case study, we discuss how the dataflow architecture, coupled with on-chip SRAM and HBM, enables operator fusion that delivers 1000+ tokens/second inference performance without sacrificing precision. In the second case study, we examine the training performance of various model architectures and compare it against traditional kernel-by-kernel execution architectures. We show how the dataflow architecture can accelerate LLM training for traditional dense, sparse, and novel state-space models, while allowing extremely large models to be trained on a smaller hardware footprint. In the third case study, we discuss how the tightly coupled DRAM, HBM, and SRAM can be used to develop a new neural network architecture that scales efficiently to 100+ billion parameters. This architecture uses a modular, coarse-grained approach to mixture of experts that allows incremental model updates for new capabilities and knowledge, and enables smaller-footprint execution during inference.
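To make the coarse-grained mixture-of-experts idea concrete, the sketch below shows one plausible reading of it, not the speaker's implementation: each expert is a self-contained module that can be added after initial training, routing happens at a coarse granularity (per sequence rather than per token), and only the routed experts execute at inference time, so unused experts can stay in slower, cheaper memory. All class and method names (ExpertFFN, CoarseGrainedMoE, add_expert) are hypothetical illustrations.

```python
# Hedged illustration only: a minimal coarse-grained, modular MoE layer,
# NOT the SN40L design or SambaNova's actual model architecture.
import torch
import torch.nn as nn


class ExpertFFN(nn.Module):
    """A self-contained feed-forward expert module."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class CoarseGrainedMoE(nn.Module):
    """Routes each sequence (not each token) to a single expert, so whole
    expert modules can live in off-chip memory and only the selected one
    needs to be resident in fast memory during inference."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            ExpertFFN(d_model, d_hidden) for _ in range(num_experts)
        )

    def add_expert(self, d_hidden: int):
        """Incrementally add a new expert (e.g. for a new domain or skill)
        without retraining existing ones; the router gains one output."""
        d_model = self.router.in_features
        self.experts.append(ExpertFFN(d_model, d_hidden))
        old = self.router
        new_router = nn.Linear(d_model, len(self.experts))
        with torch.no_grad():
            new_router.weight[: old.out_features].copy_(old.weight)
            new_router.bias[: old.out_features].copy_(old.bias)
        self.router = new_router

    def forward(self, x):  # x: [batch, seq, d_model]
        # Route on the mean-pooled sequence representation.
        scores = self.router(x.mean(dim=1))      # [batch, num_experts]
        choice = scores.argmax(dim=-1)           # [batch]
        out = torch.empty_like(x)
        for idx in choice.unique():
            mask = choice == idx
            # Only the selected experts are executed.
            out[mask] = self.experts[int(idx)](x[mask])
        return out


# Usage sketch
layer = CoarseGrainedMoE(d_model=64, d_hidden=256, num_experts=4)
x = torch.randn(8, 16, 64)
y = layer(x)
layer.add_expert(d_hidden=256)  # incremental capability update
```

The per-sequence routing and the append-style add_expert method are illustrative choices meant to show why a coarse-grained, modular decomposition supports incremental updates and a smaller inference footprint; the talk's actual partitioning across DRAM, HBM, and SRAM is specific to the SN40L memory hierarchy.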