Workshop: ES-FoMo: Efficient Systems for Foundation Models

Reverse Distillation: Training Billion Parameter Models For CTR Prediction

Aditya Anantharaman · Aashiq Muhamed · Hemant Pugaliya · Chong Wang · Sujan Perera · Zhen Ge · Qingjun Cui · Belinda Zeng · Trishul Chilimbi


Pre-training and fine-tuning large transformer models have shown promising results across various ML applications. Training large models brings a host of challenges, including slow convergence, training instabilities, and the increased cost and resources required for hyperparameter sweeps. These challenges are exacerbated when training large models on real-world internet-scale datasets, which are noisy. One such application is Click-Through Rate (CTR) prediction for product advertisements in e-commerce. In this work, we propose a method for training large models (up to 50 billion parameters) on the CTR prediction task by using knowledge from smaller models for initialization through Reverse Distillation (RD). We show that our method improves over vanilla fine-tuning of large language models on a downstream CTR task at Amazon. We also study the effectiveness of this method at different model sizes and label noise levels in the training data. Using the proposed method, we train and deploy a 50 billion parameter model that shows a 6.52% lift in CTR in online A/B experiments.
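
The abstract does not spell out the RD objective. One plausible reading, sketched here purely as an illustration and not as the authors' implementation, is that the large model is first trained to match the click probabilities predicted by a smaller, already-trained model, and only then fine-tuned on the noisy click labels. The linear "large model", the function names, and the learning rate below are all hypothetical stand-ins chosen to keep the example small.

```python
import math


def sigmoid(z):
    """Map a logit to a click probability."""
    return 1.0 / (1.0 + math.exp(-z))


def soft_bce(student_logit, teacher_prob):
    """Binary cross-entropy of the student's prediction against a soft
    target, here the small teacher model's predicted click probability."""
    eps = 1e-12  # guards against log(0)
    p = sigmoid(student_logit)
    return -(teacher_prob * math.log(p + eps)
             + (1.0 - teacher_prob) * math.log(1.0 - p + eps))


def reverse_distill_step(weights, features, teacher_prob, lr=0.5):
    """One gradient step pulling a (hypothetical) linear student toward
    the teacher's prediction. Assumption: a real student would be a large
    transformer updated by an optimizer, not a hand-written linear model."""
    logit = sum(w * x for w, x in zip(weights, features))
    grad_logit = sigmoid(logit) - teacher_prob  # d(soft BCE) / d(logit)
    return [w - lr * grad_logit * x for w, x in zip(weights, features)]


def reverse_distill(weights, features, teacher_prob, steps=200):
    """Repeat the update to obtain an initialization for later fine-tuning."""
    for _ in range(steps):
        weights = reverse_distill_step(weights, features, teacher_prob)
    return weights
```

Under this reading, the weights returned by `reverse_distill` would serve as the starting point for ordinary fine-tuning on click labels, in place of the untouched pre-trained weights.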
