Workshop: Hardware-aware efficient training (HAET)

A 28nm 8-bit Floating-Point CNN Training Processor with Hardware-Efficient Dynamic Sparsification and 4.7X Training Speedup

Shreyas Kolala Venkataramanaiah · Jian Meng · Han-Sok Suh · Injune Yeo · Jyotishman Saikia · Sai Kiran Cherupally · Yichi Zhang · Zhiru Zhang · Jae-sun Seo


We present an 8-bit floating-point (FP8) training processor that implements (1) highly parallel tensor cores that maintain high utilization throughout the forward propagation (FP), backward propagation (BP), and weight update (WU) phases of training, (2) hardware-efficient channel gating for dynamic output activation sparsity, (3) dynamic weight sparsity based on group Lasso, and (4) gradient skipping based on FP prediction error. We develop a custom ISA to flexibly support different CNN topologies and training parameters. The 28nm prototype chip achieves a 7.3X reduction in FLOPs, an energy efficiency of 16.4 TFLOPS/W, and a 4.7X overall training latency speedup on both supervised and self-supervised training tasks.
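
As a rough illustration of the group Lasso idea behind item (3), the sketch below computes a group Lasso penalty (the sum of per-group L2 norms, scaled by a regularization strength) and then zeroes out whole groups whose norm is small. The abstract does not specify how the chip forms weight groups, which threshold it uses, or when pruning is applied, so the function names `group_lasso_penalty` and `prune_groups`, the parameters `lam` and `threshold`, and the example weights are all hypothetical.

```python
import math


def group_lasso_penalty(groups, lam):
    """Return lam times the sum of L2 norms, one norm per weight group."""
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in groups)


def prune_groups(groups, threshold):
    """Zero every group whose L2 norm is below threshold (assumed pruning rule)."""
    pruned = []
    for g in groups:
        norm = math.sqrt(sum(w * w for w in g))
        pruned.append(list(g) if norm >= threshold else [0.0] * len(g))
    return pruned


# Hypothetical weights: three groups of two weights each.
groups = [[3.0, 4.0], [0.01, 0.02], [0.0, 1.0]]
penalty = group_lasso_penalty(groups, lam=0.1)
pruned = prune_groups(groups, threshold=0.1)
```

Because the penalty acts on each group's norm rather than on individual weights, training tends to drive entire groups toward zero together, which is what makes the resulting sparsity structured and convenient to skip in hardware.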