Workshop: Hardware-aware efficient training (HAET)

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures

Ivan Chelombiev · Daniel Justus · Douglas Orr · Anastasia Dietrich · Frithjof Gressmann · Alexandros Koliousis · Carlo Luschi


Attention-based language models have high computational requirements due to their large parameter counts, dense operations, and large volumes of training data. We modify the structure of a Transformer layer by introducing grouped transformations to the dense feed-forward layers and by adding a grouped convolution module. The resulting architecture shows superior computational and task performance compared to the BERT model family, translating into time and energy savings for model training.
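The core idea of a grouped transformation is to replace one dense weight matrix with several smaller per-group matrices, which is equivalent to a block-diagonal linear layer and reduces parameters and FLOPs by the group count. The following is a minimal NumPy sketch of this idea under assumed dimensions; it is an illustration of grouped linear layers in general, not the authors' GroupBERT implementation.

```python
import numpy as np

def dense_linear(x, W):
    # Standard dense feed-forward transform: every input feature
    # connects to every output feature.
    return x @ W

def grouped_linear(x, Ws):
    # Grouped transform: split features into len(Ws) groups and apply
    # a separate, smaller weight matrix to each group independently.
    groups = np.split(x, len(Ws), axis=-1)
    return np.concatenate([g @ W for g, W in zip(groups, Ws)], axis=-1)

# Hypothetical sizes for illustration: model width d, group count g.
d, g = 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(2, d))

dense_W = rng.normal(size=(d, d))
grouped_Ws = [rng.normal(size=(d // g, d // g)) for _ in range(g)]

dense_params = dense_W.size                       # d * d
grouped_params = sum(W.size for W in grouped_Ws)  # g * (d/g)^2

print(dense_params, grouped_params)  # 64 16: a g-fold parameter reduction
print(grouped_linear(x, grouped_Ws).shape)  # (2, 8): output width unchanged
```

Because groups do not exchange information, grouped layers are typically paired with an operation that mixes features across groups (for example, the dense projections elsewhere in the Transformer layer).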
