Poster
Measuring the Impact of Programming Language Distribution
Gabriel Orlanski · Kefan Xiao · Xavier Garcia · Jeffrey Hui · Joshua Howland · Jonathan Malmaud · Jacob Austin · Rishabh Singh · Michele Catasta
Event URL: https://github.com/google-research/babelcode
Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3), derived from the Python Programming Puzzles benchmark (Schuster et al., 2021), that involves translating expert-level Python functions to any language. With both BabelCode and the TP3 benchmark, we investigate whether balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease on high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse high-resource $pass@k$.
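The abstract reports all results in terms of $pass@k$, the probability that at least one of $k$ sampled programs passes every test case. As a point of reference, the sketch below implements the commonly used unbiased estimator $pass@k = 1 - \binom{n-c}{k} / \binom{n}{k}$ (Chen et al., 2021); it is illustrative only and is not taken from the BabelCode codebase.

```python
# Minimal sketch of the unbiased pass@k estimator (Chen et al., 2021).
# Illustrative only; not BabelCode's own implementation.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for a single task.

    n: total number of samples generated for the task
    c: number of samples that pass all test cases
    k: sample budget being evaluated
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n - c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples per task, 30 of which pass, evaluated at k = 10.
print(f"pass@10 = {pass_at_k(200, 30, 10):.4f}")
```

Per-language scores such as those quoted above would then be averages of this per-task estimate over all tasks in a given language.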
Author Information
Gabriel Orlanski
Kefan Xiao (Google DeepMind)
Xavier Garcia (Google)
Jeffrey Hui (Google)
Joshua Howland
Jonathan Malmaud (Massachusetts Institute of Technology)
Jacob Austin (DeepMind)
Rishabh Singh (Meta)
Michele Catasta (Stanford University)
More from the Same Authors
- 2023: Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction
  Jonathan Pilault · Xavier Garcia · Arthur Brazinskas · Orhan Firat
- 2023 Poster: The Unreasonable Effectiveness of Few-shot Learning for Machine Translation
  Xavier Garcia · Yamini Bansal · Colin Cherry · George Foster · Maxim Krikun · Melvin Johnson · Orhan Firat
- 2023 Poster: Scaling Laws for Multilingual Neural Machine Translation
  Patrick Fernandes · Behrooz Ghorbani · Xavier Garcia · Markus Freitag · Orhan Firat
- 2022 Poster: Examining Scaling and Transfer of Language Model Architectures for Machine Translation
  Biao Zhang · Behrooz Ghorbani · Ankur Bapna · Yong Cheng · Xavier Garcia · Jonathan Shen · Orhan Firat
- 2022 Spotlight: Examining Scaling and Transfer of Language Model Architectures for Machine Translation
  Biao Zhang · Behrooz Ghorbani · Ankur Bapna · Yong Cheng · Xavier Garcia · Jonathan Shen · Orhan Firat
- 2018 Poster: Programmatically Interpretable Reinforcement Learning
  Abhinav Verma · Vijayaraghavan Murali · Rishabh Singh · Pushmeet Kohli · Swarat Chaudhuri
- 2018 Oral: Programmatically Interpretable Reinforcement Learning
  Abhinav Verma · Vijayaraghavan Murali · Rishabh Singh · Pushmeet Kohli · Swarat Chaudhuri