I will discuss two major trends in the deep learning. The first trend is an exponential growth in the size of models: from 340M (BERT-large) in 2018 to 175B (GPT3) in 2020. We need new, more memory efficient algorithms to train such huge models. The second trend is “BERT-approach”, when a model is first pre-trained in unsupervised or self-supervised manner on large unlabeled dataset, and then it is fine-tuned for another task using a. smaller labeled dataset. This trend sets new theoretical problems. Next, I will discuss a practical need in theoretical foundation for regularization methods used in the deep learning practice: data augmentation, dropout, label smoothing etc. Finally, I will describe an application-driven design of new optimization methods using NovoGrad as example.