A 28nm 8-bit Floating-Point CNN Training Processor with Hardware-Efficient Dynamic Sparsification and 4.7X Training Speedup
Shreyas Kolala Venkataramanaiah · Jian Meng · Han-Sok Suh · Injune Yeo · Jyotishman Saikia · Sai Kiran Cherupally · Yichi Zhang · Zhiru Zhang · Jae-sun Seo

Sat Jul 23 07:00 AM (PDT)

We present an 8-bit floating-point (FP8) training processor that implements (1) highly parallel tensor cores that maintain high utilization throughout the forward propagation (FP), backward propagation (BP), and weight update (WU) phases of training, (2) hardware-efficient channel gating for dynamic output activation sparsity, (3) dynamic weight sparsity based on group Lasso, and (4) gradient skipping based on FP prediction error. We develop a custom ISA to flexibly support different CNN topologies and training parameters. The 28nm prototype chip demonstrates large improvements in FLOPs reduction (7.3X), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7X), for both supervised and self-supervised training tasks.
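To illustrate the group Lasso idea behind the dynamic weight sparsity, the sketch below shows how an L2 penalty over whole output-channel groups drives entire channels toward zero, producing structured sparsity that hardware can exploit by skipping those channels. This is a minimal NumPy illustration under stated assumptions, not the chip's implementation; the function names and the pruning threshold are hypothetical.

```python
import numpy as np

def group_lasso_penalty(weights):
    """Sum of per-channel L2 norms, sum_g ||W_g||_2, over output channels.

    weights: conv tensor of shape (out_channels, in_channels, kh, kw).
    Adding this term to the training loss shrinks whole channels to zero.
    """
    flat = weights.reshape(weights.shape[0], -1)
    return np.sum(np.linalg.norm(flat, axis=1))

def prune_channels(weights, threshold=1e-2):
    """Zero out output channels whose group norm falls below a threshold.

    The threshold value here is an illustrative assumption.
    """
    flat = weights.reshape(weights.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1)
    mask = norms >= threshold
    return weights * mask[:, None, None, None], mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3)) * 0.1
w[[1, 5]] *= 1e-4  # simulate two channels driven near zero by the penalty
pruned, mask = prune_channels(w)
print(f"{mask.sum()} of {len(mask)} channels kept")  # → 6 of 8 channels kept
```

During training, the penalty would be added to the task loss so that low-utility channels decay as a group; pruning then removes their compute entirely, which is where the reported FLOPs reduction comes from.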

Author Information

Shreyas Kolala Venkataramanaiah (Arizona State University)
Jian Meng (Arizona State University)
Han-Sok Suh (Arizona State University)
Injune Yeo (Arizona State University)
Jyotishman Saikia (Arizona State University)
Sai Kiran Cherupally (Arizona State University)
Yichi Zhang (Cornell University)
Zhiru Zhang (Cornell University)
Jae-sun Seo (Arizona State University)