In this work, we study how varying the architecture and training data quality affects the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand how they impact the data scaling laws. In particular, we change (1) the architecture and task setup: we compare against a transformer-LSTM hybrid and a decoder-only transformer with a language modeling loss; and (2) the noise level in the training distribution: we experiment with filtering and with adding iid synthetic noise. In all of these cases, we find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data. Lastly, we find that using back-translated data instead of parallel data can significantly degrade the scaling exponent.
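For a concrete sense of what fitting such a data scaling law involves, here is a minimal sketch in Python. The functional form L(D) = L_inf + beta * D^(-alpha) and the toy measurements below are illustrative assumptions, not the paper's code or results.

```python
# Minimal sketch of fitting a data scaling law of the form
# L(D) = L_inf + beta * D**(-alpha), where D is the number of
# training samples, alpha is the data scaling exponent, and
# L_inf is the irreducible loss. The toy data points are
# hypothetical, chosen only to make the fit converge.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(D, L_inf, beta, alpha):
    # Irreducible loss plus a power-law term that decays with data size.
    return L_inf + beta * D ** (-alpha)

# Hypothetical (dataset size, test loss) measurements.
D = np.array([1e5, 5e5, 1e6, 5e6, 1e7, 5e7])
loss = np.array([4.10, 3.42, 3.18, 2.71, 2.55, 2.35])

params, _ = curve_fit(scaling_law, D, loss, p0=[2.0, 50.0, 0.3], maxfev=10000)
L_inf, beta, alpha = params
print(f"L_inf={L_inf:.3f}, beta={beta:.3f}, alpha={alpha:.3f}")
```

Under this parameterization, comparing the fitted alpha across training setups (architectures, noise levels, back-translated vs. parallel data) is what the abstract refers to as comparing data scaling exponents.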
Author Information
Yamini Bansal (Google)
Behrooz Ghorbani (Google Research)
Ankush Garg (Google)
Biao Zhang (University of Edinburgh)
Biao Zhang is a final-year Ph.D. student at the ILCC at the University of Edinburgh under the supervision of Prof. Rico Sennrich and Prof. Ivan Titov. His research focuses on improving neural machine translation (NMT), particularly its efficiency and universality, including developing lightweight (fast and effective) architectures for NMT, low-resource NMT, massively multilingual NMT, speech-to-text translation, context-aware NMT, and their intersections.
Colin Cherry (Google)
Research Scientist at Google Translate working on data quality and speech translation.
Behnam Neyshabur (Google)
Orhan Firat (Google)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Data Scaling Laws in NMT: The Effect of Noise and Architecture »
  Thu, Jul 21 through Fri, Jul 22, Hall E #230
More from the Same Authors
- 2021: Distributional Generalization: A New Kind of Generalization (Extended Abstract) »
  Preetum Nakkiran · Yamini Bansal
- 2021: Understanding the effect of sparsity on neural networks robustness »
  Lukas Timpl · Rahim Entezari · Hanie Sedghi · Behnam Neyshabur · Olga Saukh
- 2023: On Privileged and Convergent Bases in Neural Network Representations »
  Davis Brown · Nikhil Vyas · Yamini Bansal
- 2023: Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction »
  Jonathan Pilault · Xavier Garcia · Arthur Brazinskas · Orhan Firat
- 2023 Poster: The Unreasonable Effectiveness of Few-shot Learning for Machine Translation »
  Xavier Garcia · Yamini Bansal · Colin Cherry · George Foster · Maxim Krikun · Melvin Johnson · Orhan Firat
- 2023 Poster: Scaling Laws for Multilingual Neural Machine Translation »
  Patrick Fernandes · Behrooz Ghorbani · Xavier Garcia · Markus Freitag · Orhan Firat
- 2022 Poster: GLaM: Efficient Scaling of Language Models with Mixture-of-Experts »
  Nan Du · Yanping Huang · Andrew Dai · Simon Tong · Dmitry Lepikhin · Yuanzhong Xu · Maxim Krikun · Yanqi Zhou · Adams Wei Yu · Orhan Firat · Barret Zoph · William Fedus · Maarten Bosma · Zongwei Zhou · Tao Wang · Emma Wang · Kellie Webster · Marie Pellat · Kevin Robinson · Kathleen Meier-Hellstern · Toju Duke · Lucas Dixon · Kun Zhang · Quoc Le · Yonghui Wu · Zhifeng Chen · Claire Cui
- 2022 Poster: Revisiting End-to-End Speech-to-Text Translation From Scratch »
  Biao Zhang · Barry Haddow · Rico Sennrich
- 2022 Spotlight: GLaM: Efficient Scaling of Language Models with Mixture-of-Experts »
  Nan Du · Yanping Huang · Andrew Dai · Simon Tong · Dmitry Lepikhin · Yuanzhong Xu · Maxim Krikun · Yanqi Zhou · Adams Wei Yu · Orhan Firat · Barret Zoph · William Fedus · Maarten Bosma · Zongwei Zhou · Tao Wang · Emma Wang · Kellie Webster · Marie Pellat · Kevin Robinson · Kathleen Meier-Hellstern · Toju Duke · Lucas Dixon · Kun Zhang · Quoc Le · Yonghui Wu · Zhifeng Chen · Claire Cui
- 2022 Spotlight: Revisiting End-to-End Speech-to-Text Translation From Scratch »
  Biao Zhang · Barry Haddow · Rico Sennrich
- 2022 Poster: Examining Scaling and Transfer of Language Model Architectures for Machine Translation »
  Biao Zhang · Behrooz Ghorbani · Ankur Bapna · Yong Cheng · Xavier Garcia · Jonathan Shen · Orhan Firat
- 2022 Spotlight: Examining Scaling and Transfer of Language Model Architectures for Machine Translation »
  Biao Zhang · Behrooz Ghorbani · Ankur Bapna · Yong Cheng · Xavier Garcia · Jonathan Shen · Orhan Firat
- 2020 Poster: XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation »
  Junjie Hu · Sebastian Ruder · Aditya Siddhant · Graham Neubig · Orhan Firat · Melvin Johnson