Pre-training and fine-tuning large transformer models has shown promising results across a variety of ML applications. Training such large models brings a host of challenges, including slow convergence, training instabilities, and the increased cost and resources required for hyperparameter sweeps. These challenges are further exacerbated when training on real-world internet-scale datasets, which are inherently noisy. One such real-world internet-scale application is Click-Through Rate (CTR) prediction for product advertisements in e-commerce. In this work, we propose a method for training large models (up to 50 billion parameters) on the CTR prediction task that leverages knowledge from smaller models for initialization through Reverse Distillation (RD). We show that our method improves over vanilla fine-tuning of large language models on a downstream CTR task at Amazon. We also study the effectiveness of the method across model sizes and label-noise levels in the training data. Using the proposed method, we train and deploy a 50-billion-parameter model that shows a lift of 6.52% in CTR during online A/B experiments.
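The abstract describes distilling a smaller model's knowledge into a larger one at initialization. As a minimal illustrative sketch (not the paper's actual method; all function and parameter names here are hypothetical), the large "student" model can be pre-trained against soft CTR targets from the smaller "teacher" blended with the possibly noisy click labels:

```python
import math

def reverse_distillation_loss(student_p, teacher_p, label, alpha=0.5):
    """Hypothetical per-example loss for RD-style initialization:
    blend hard-label binary cross-entropy with a distillation term
    pulling the large student toward the smaller teacher's CTR estimate.
    alpha weights the hard-label term; (1 - alpha) weights distillation."""
    eps = 1e-7
    # Clamp probabilities away from 0/1 for numerical stability.
    student_p = min(max(student_p, eps), 1.0 - eps)
    teacher_p = min(max(teacher_p, eps), 1.0 - eps)
    # Binary cross-entropy against the (possibly noisy) click label.
    bce = -(label * math.log(student_p) + (1 - label) * math.log(1 - student_p))
    # Binary KL(teacher || student): soft target from the smaller teacher.
    kl = (teacher_p * math.log(teacher_p / student_p)
          + (1 - teacher_p) * math.log((1 - teacher_p) / (1 - student_p)))
    return alpha * bce + (1 - alpha) * kl

# When the student already matches the teacher, only the label term remains.
loss = reverse_distillation_loss(student_p=0.25, teacher_p=0.25, label=1)
```

The soft teacher targets can dampen the effect of label noise, which is one intuition for why initializing from a smaller, already-converged model could stabilize large-model training in this setting.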