Natural language understanding and generation models follow one of two dominant architectural paradigms: language models (LMs), which process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec), which use separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, and few studies have investigated the performance of LMs. In this work, we thoroughly examine the role of several architectural design choices on the performance of LMs on bilingual, (massively) multilingual, and zero-shot translation tasks, under systematic variations of data conditions and model sizes. Our results show that: (i) different LMs have different scaling properties, where architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases; (ii) several design choices, including causal masking and language-modeling objectives for the source sequence, have detrimental effects on translation quality; and (iii) when paired with full-visible masking for source sequences, LMs can perform on par with EncDec on supervised bilingual and multilingual translation tasks, and improve greatly on zero-shot directions by facilitating the reduction of off-target translations.
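The contrast between causal masking and full-visible masking over the source is easy to see in matrix form. The sketch below is illustrative and not taken from the paper's implementation; the function names (causal_mask, prefix_lm_mask) and the NumPy formulation are assumptions, showing how a prefix-LM-style mask lets source tokens attend bidirectionally while target tokens remain causal.

import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i attends to j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(src_len, tgt_len):
    # Start from a fully causal mask over the concatenated source + target...
    mask = causal_mask(src_len + tgt_len)
    # ...then make the source block fully visible: every source token attends
    # to every other source token (bidirectional), while target tokens keep
    # causal visibility over the source and earlier target tokens.
    mask[:src_len, :src_len] = True
    return mask

# Example with 3 source and 2 target tokens:
print(prefix_lm_mask(3, 2).astype(int))
# [[1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]

Under purely causal masking the top-left source block would also be lower-triangular, i.e. earlier source tokens could not see later ones; that restriction is one of the design choices the abstract identifies as detrimental to translation quality.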
Author Information
Biao Zhang (University of Edinburgh)
Biao Zhang is a final-year Ph.D. student at the ILCC at the University of Edinburgh under the supervision of Prof. Rico Sennrich and Prof. Ivan Titov. His research focuses on improving neural machine translation (NMT), particularly its efficiency and universality, including developing lightweight (fast and effective) architectures for NMT, low-resource NMT, massively multilingual NMT, speech-to-text translation, context-aware NMT, and their intersections.
Behrooz Ghorbani (Google Research)
Ankur Bapna (Google Research)
Yong Cheng (Google)
Xavier Garcia (Google)
Jonathan Shen (Independent)
Orhan Firat (Google)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Examining Scaling and Transfer of Language Model Architectures for Machine Translation
  Wed, Jul 20 through Thu, Jul 21, Hall E #126
More from the Same Authors
- 2023 Poster: The Unreasonable Effectiveness of Few-shot Learning for Machine Translation
  Xavier Garcia · Yamini Bansal · Colin Cherry · George Foster · Maxim Krikun · Melvin Johnson · Orhan Firat
- 2023 Poster: Measuring the Impact of Programming Language Distribution
  Gabriel Orlanski · Kefan Xiao · Xavier Garcia · Jeffrey Hui · Joshua Howland · Jonathan Malmaud · Jacob Austin · Rishabh Singh · Michele Catasta
- 2023 Poster: Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models
  Yong Cheng · Yu Zhang · Melvin Johnson · Wolfgang Macherey · Ankur Bapna
- 2023 Poster: Scaling Laws for Multilingual Neural Machine Translation
  Patrick Fernandes · Behrooz Ghorbani · Xavier Garcia · Markus Freitag · Orhan Firat
- 2023 Oral: Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models
  Yong Cheng · Yu Zhang · Melvin Johnson · Wolfgang Macherey · Ankur Bapna
- 2023 Session: Oral Panel: AI and Marginalized Languages
  Shruti Rijhwani · Keoni Kealoha Mahelona · Orhan Firat · Hady Elsahar · Jihyung Moon
- 2022 Poster: GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
  Nan Du · Yanping Huang · Andrew Dai · Simon Tong · Dmitry Lepikhin · Yuanzhong Xu · Maxim Krikun · Yanqi Zhou · Adams Wei Yu · Orhan Firat · Barret Zoph · William Fedus · Maarten Bosma · Zongwei Zhou · Tao Wang · Emma Wang · Kellie Webster · Marie Pellat · Kevin Robinson · Kathleen Meier-Hellstern · Toju Duke · Lucas Dixon · Kun Zhang · Quoc Le · Yonghui Wu · Zhifeng Chen · Claire Cui
- 2022 Poster: Revisiting End-to-End Speech-to-Text Translation From Scratch
  Biao Zhang · Barry Haddow · Rico Sennrich
- 2022 Poster: Data Scaling Laws in NMT: The Effect of Noise and Architecture
  Yamini Bansal · Behrooz Ghorbani · Ankush Garg · Biao Zhang · Colin Cherry · Behnam Neyshabur · Orhan Firat
- 2022 Spotlight: Data Scaling Laws in NMT: The Effect of Noise and Architecture
  Yamini Bansal · Behrooz Ghorbani · Ankush Garg · Biao Zhang · Colin Cherry · Behnam Neyshabur · Orhan Firat
- 2022 Spotlight: GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
  Nan Du · Yanping Huang · Andrew Dai · Simon Tong · Dmitry Lepikhin · Yuanzhong Xu · Maxim Krikun · Yanqi Zhou · Adams Wei Yu · Orhan Firat · Barret Zoph · William Fedus · Maarten Bosma · Zongwei Zhou · Tao Wang · Emma Wang · Kellie Webster · Marie Pellat · Kevin Robinson · Kathleen Meier-Hellstern · Toju Duke · Lucas Dixon · Kun Zhang · Quoc Le · Yonghui Wu · Zhifeng Chen · Claire Cui
- 2022 Spotlight: Revisiting End-to-End Speech-to-Text Translation From Scratch
  Biao Zhang · Barry Haddow · Rico Sennrich
- 2021 Poster: Self-supervised and Supervised Joint Training for Resource-rich Machine Translation
  Yong Cheng · Wei Wang · Lu Jiang · Wolfgang Macherey
- 2021 Spotlight: Self-supervised and Supervised Joint Training for Resource-rich Machine Translation
  Yong Cheng · Wei Wang · Lu Jiang · Wolfgang Macherey
- 2020 Poster: XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation
  Junjie Hu · Sebastian Ruder · Aditya Siddhant · Graham Neubig · Orhan Firat · Melvin Johnson