Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracy. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap this paper aims to fill. Specifically, first, we curate a massive, deduplicated corpus of 6M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks previously proposed in the literature. We fine-tune CuBERT on our benchmark tasks and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and fewer labeled examples. Future work on source-code embeddings can benefit from reusing our benchmark and from comparing against CuBERT models as a strong baseline.
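To make the fine-tuning recipe the abstract describes concrete, the sketch below shows the generic pattern: take a pre-trained BERT-style encoder and train a classification head on labeled code snippets. This is a minimal illustration, not the authors' pipeline; it assumes the HuggingFace transformers library, uses the generic "bert-base-uncased" checkpoint as a stand-in for a CuBERT checkpoint (the released CuBERT models use their own code-specific vocabulary and tokenization), and the toy snippets, labels, and hyperparameters are hypothetical.

# Illustrative fine-tuning sketch: pre-trained encoder + classification head.
# NOT the authors' pipeline; checkpoint name and task are placeholders.
import torch
from torch.optim import AdamW
from transformers import BertTokenizerFast, BertForSequenceClassification

# Stand-in for a CuBERT checkpoint; CuBERT ships a code-specific tokenizer,
# which this generic BERT tokenizer only approximates.
MODEL_NAME = "bert-base-uncased"

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy labeled examples for a binary code-classification task
# (e.g., "does this snippet contain a swapped-operand bug?").
snippets = ["def add(a, b):\n    return a + b",
            "def add(a, b):\n    return b - a"]
labels = torch.tensor([0, 1])

# Tokenize the snippets into a padded batch of encoder inputs.
batch = tokenizer(snippets, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few fine-tuning steps on the toy batch
    out = model(**batch, labels=labels)  # forward pass computes the loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

The learning rate of 2e-5 and the 512-token truncation follow common BERT fine-tuning practice; the paper's actual hyperparameters and task definitions may differ.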
Author Information
Aditya Kanade (Indian Institute of Science and Google Brain)
Petros Maniatis (Google Research)
Gogul Balakrishnan (Google)
Kensen Shi (Google)