Workshop: Hardware-aware efficient training (HAET)

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures

Ivan Chelombiev · Daniel Justus · Douglas Orr · Anastasia Dietrich · Frithjof Gressmann · Alexandros Koliousis · Carlo Luschi


Attention-based language models have high computational requirements due to large parameter counts, dense operations, and large volumes of data. We modify the structure of a Transformer layer by introducing grouped transformations into the dense feed-forward layers and adding a grouped convolution module. The resulting architecture shows superior computational and task performance compared to the BERT model family, which translates into time and energy savings for model training.
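
To make the idea of a grouped transformation concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the usual definition of a grouped linear map: the input features are split into equal chunks, and each chunk is multiplied by its own smaller weight matrix. The function names, the layer sizes, and the group count of 4 are illustrative assumptions, not values taken from the abstract.

```python
def dense_params(d_in, d_out):
    """Weight count of an ordinary dense projection (bias ignored)."""
    return d_in * d_out


def grouped_params(d_in, d_out, groups):
    """Weight count when the projection is split into `groups` independent blocks."""
    assert d_in % groups == 0 and d_out % groups == 0
    return groups * (d_in // groups) * (d_out // groups)


def grouped_linear(x, weights):
    """Apply a grouped linear map to one feature vector.

    x       : list of floats, length divisible by len(weights)
    weights : one matrix per group, each of shape (chunk_in, chunk_out),
              stored as a list of rows
    """
    groups = len(weights)
    assert len(x) % groups == 0
    chunk = len(x) // groups
    out = []
    for g, w in enumerate(weights):
        xs = x[g * chunk:(g + 1) * chunk]  # features belonging to this group only
        for j in range(len(w[0])):
            out.append(sum(xs[i] * w[i][j] for i in range(chunk)))
    return out


# Illustrative sizes (assumed): a 768 -> 3072 projection with 4 groups.
dense_count = dense_params(768, 3072)
grouped_count = grouped_params(768, 3072, 4)
```

With G groups, the weight count and the multiply-accumulate count of the projection both shrink by a factor of G, at the cost of no information flowing between groups within that single projection.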

