Track: Applications (NLP) 3

Thu 22 July 18:00 - 18:20 PDT

Oral

Calibrate Before Use: Improving Few-shot Performance of Language Models

Tony Z. Zhao · Eric Wallace · Shi Feng · Dan Klein · Sameer Singh

GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples. We show that this type of few-shot learning can be unstable: the choice of prompt format, training examples, and even the order of the examples can cause accuracy to vary from near chance to near state-of-the-art. We demonstrate that this instability arises from the bias of language models towards predicting certain answers, e.g., those that are placed near the end of the prompt or are common in the pre-training data. To mitigate this, we first estimate the model's bias towards each answer by asking for its prediction when given a training prompt and a content-free test input such as "N/A". We then fit calibration parameters that cause the prediction for this input to be uniform across answers. On a diverse set of tasks, this contextual calibration procedure substantially improves GPT-3 and GPT-2's accuracy (up to 30.0% absolute) across different choices of the prompt, while also making learning considerably more stable.

Thu 22 July 18:20 - 18:25 PDT

Spotlight

On-the-fly Rectification for Robust Large-Vocabulary Topic Inference

Moontae Lee · Sungjun Cho · Kun Dong · David Mimno · David Bindel

Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. By transforming unsupervised learning problems into decompositions of co-occurrence statistics, spectral algorithms provide transparent and efficient algorithms for posterior inference such as latent topic analysis and community detection. As object vocabularies grow, however, it becomes rapidly more expensive to store and run inference algorithms on co-occurrence statistics. Rectifying co-occurrence, the key process to uphold model assumptions, becomes increasingly more vital in the presence of rare terms, but current techniques cannot scale to large vocabularies. We propose novel methods that simultaneously compress and rectify co-occurrence statistics, scaling gracefully with the size of vocabulary and the dimension of latent space. We also present new algorithms learning latent variables from the compressed statistics, and verify that our methods perform comparably to previous approaches on both textual and non-textual data.

Thu 22 July 18:25 - 18:30 PDT

Spotlight

Towards Understanding and Mitigating Social Biases in Language Models

Paul Liang · Chiyu Wu · Louis-Philippe Morency · Ruslan Salakhutdinov

As machine learning methods are deployed in real-world settings such as healthcare, legal systems, and social science, it is crucial to recognize how they shape social biases and stereotypes in these sensitive decision-making processes. Among such real-world deployments are large-scale pretrained language models (LMs) that can be potentially dangerous in manifesting undesirable representational biases - harmful biases resulting from stereotyping that propagate negative generalizations involving gender, race, religion, and other social constructs. As a step towards improving the fairness of LMs, we carefully define several sources of representational biases before proposing new benchmarks and metrics to measure them. With these tools, we propose steps towards mitigating social biases during text generation. Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information for high-fidelity text generation, thereby pushing forward the performance-fairness Pareto frontier.

Thu 22 July 18:30 - 18:35 PDT

Spotlight

Disentangling syntax and semantics in the brain with deep networks

Charlotte Caucheteux · Alexandre Gramfort · Jean-Remi King

The activations of language transformers like GPT-2 have been shown to linearly map onto brain activity during speech comprehension. However, the nature of these activations remains largely unknown and presumably conflate distinct linguistic classes. Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four combinatorial classes: lexical, compositional, syntactic, and semantic representations. We then introduce a statistical method to decompose, through the lens of GPT-2's activations, the brain activity of 345 subjects recorded with functional magnetic resonance imaging (fMRI) during the listening of ~4.6 hours of narrated text. The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices. Second, contrary to previous claims, syntax and semantics are not associated with separated modules, but, instead, appear to share a common and distributed neural substrate. Overall, this study introduces a versatile framework to isolate, in the brain activity, the distributed representations of linguistic constructs.

Thu 22 July 18:35 - 18:40 PDT

Spotlight

Cross-model Back-translated Distillation for Unsupervised Machine Translation

Xuan-Phi Nguyen · Shafiq Joty · Thanh-Tung Nguyen · Kui Wu · Ai Ti Aw

Recent unsupervised machine translation (UMT) systems usually employ three main principles: initialization, language modeling and iterative back-translation, though they may apply them differently. Crucially, iterative back-translation and denoising auto-encoding for language modeling provide data diversity to train the UMT systems. However, the gains from these diversification processes has seemed to plateau. We introduce a novel component to the standard UMT framework called Cross-model Back-translated Distillation (CBD), that is aimed to induce another level of data diversification that existing principles lack. CBD is applicable to all previous UMT approaches. In our experiments, CBD achieves the state of the art in the WMT'14 English-French, WMT'16 English-German and English-Romanian bilingual unsupervised translation tasks, with 38.2, 30.1, and 36.3 BLEU respectively. It also yields 1.5--3.3 BLEU improvements in IWSLT English-French and English-German tasks. Through extensive experimental analyses, we show that CBD is effective because it embraces data diversity while other similar variants do not.

Thu 22 July 18:40 - 18:45 PDT

Spotlight

Few-shot Language Coordination by Modeling Theory of Mind

Hao Zhu · Graham Neubig · Yonatan Bisk

No man is an island. Humans develop the ability to communicate with a large community by coordinating with different interlocutors within short conversations. This ability is largely understudied by the research on building neural language communicative agents. We study the task of few-shot language coordination: agents quickly adapting to their conversational partners’ language abilities. Different from current communicative agents trained with self-play, we in- investigate this more general paradigm by requiring the lead agent to coordinate with a population of agents each of whom has different linguistic abilities. This leads to a general agent able to quickly adapt to communicating with unseen agents in the population. Unlike prior work, success here requires the ability to model the partner’s beliefs, a vital component of human communication. Drawing inspiration from the study of theory-of-mind (ToM; Premack & Woodruff (1978)), we study the effect of the speaker explicitly modeling the listener’s mental state. Learning by communicating with a population, the speakers, as shown in our experiments, acquire the ability to learn to predict the reactions of their partner upon various messages on-the-fly. The speaker’s predictions for the future actions help it generate the best instructions in order to maximize communicative goal with message costs. To examine our hypothesis that the instructions generated with ToM modeling yield better communication per- performance, we employ our agents in both a referential game and a language navigation task. Positive results from our experiments also hint at the importance of explicitly modeling language acquisition as a socio-pragmatic progress.

Thu 22 July 18:45 - 18:50 PDT

Q&A

Main Navigation

Session

Applications (NLP) 3

Calibrate Before Use: Improving Few-shot Performance of Language Models

On-the-fly Rectification for Robust Large-Vocabulary Topic Inference

Towards Understanding and Mitigating Social Biases in Language Models

Disentangling syntax and semantics in the brain with deep networks

Cross-model Back-translated Distillation for Unsupervised Machine Translation

Few-shot Language Coordination by Modeling Theory of Mind

Q&A