

Poster

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

Tian Jin · Ellie Cheng · Zachary Ankner · Nikunj Saunshi · Blake Elias · Amir Yazdanbakhsh · Jonathan Ragan-Kelley · Suvinay Subramanian · Michael Carbin

East Exhibition Hall A-B #E-2600
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Decoding with autoregressive language models traditionally occurs sequentially, generating one token after another. Recent attempts to introduce parallelism require a pre-determined structure in the generated content, such as pattern-matching on bullet points, to enable parallel generation. In this work, we present a new technique that automates parallel generation by dynamically exploiting the semantic independence of generation outputs to implement asynchronous decoding. We introduce Pasta-Lang, an annotation language that lets language models initiate asynchronous decoding at inference time, along with an accompanying Pasta-Lang interpreter that performs on-the-fly asynchronous decoding, effectively implementing parallel generation and speeding up inference. We also present an instruction-finetuning dataset of Pasta-Lang-annotated responses that teaches LLMs to mark semantic independence with Pasta-Lang, together with the methodology for creating the dataset. Our evaluation shows that pairing the interpreter with a Pasta-Lang-equipped model achieves significant speedups while maintaining generation quality.
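
To make the mechanism concrete, here is a minimal Python sketch of the idea, not the paper's implementation: it assumes a hypothetical <promise topic="..."/> tag marking semantically independent spans (the actual Pasta-Lang syntax is defined in the paper) and a stand-in decode function in place of a real language model call. The interpreter finds the tagged spans, decodes them in parallel threads, and stitches the results back in order.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tag name; the real Pasta-Lang annotation syntax may differ.
PROMISE = re.compile(r'<promise topic="(.*?)"/>')

def decode(topic: str) -> str:
    """Stand-in for an asynchronous call to the autoregressive decoder."""
    return f"[text generated for: {topic}]"

def interpret(annotated: str) -> str:
    """Decode every promised span in parallel, then splice the finished
    chunks back into their original positions."""
    topics = PROMISE.findall(annotated)
    with ThreadPoolExecutor() as pool:
        chunks = iter(pool.map(decode, topics))
    # Each tag is replaced, in order, by its independently decoded chunk.
    return PROMISE.sub(lambda _: next(chunks), annotated)

print(interpret(
    'Pros: <promise topic="advantages"/> Cons: <promise topic="drawbacks"/>'
))
```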

Lay Summary:

Most language models write left-to-right, even when different parts of the reply do not depend on each other. PASTA trains the model to tag those independent spans as it composes. A lightweight interpreter reads the tags, fires off several decoding threads in parallel, and then stitches the finished chunks back into place. This parallel decoding technique largely preserves answer quality while delivering 1.2×–1.9× faster responses. Crucially, the model itself, not hand-written rules, decides what can run in parallel. The approach opens a simple route to faster text generation.
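
To see where the speedup comes from, here is a toy latency model (an illustration only, not a measurement of PASTA): sequential decoding pays for the sum of all span lengths, while parallel decoding pays roughly for the longest span. The reported 1.2×–1.9× gains are smaller than this toy bound, plausibly because real responses are only partially parallelizable and annotation adds overhead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TOKEN_LATENCY = 0.001  # illustrative per-token decoding cost, in seconds

def decode_span(num_tokens: int) -> None:
    """Simulate autoregressively decoding one independent span."""
    time.sleep(num_tokens * TOKEN_LATENCY)

spans = [120, 80, 100]  # token counts of three independent spans

# Sequential decoding: total time ~ sum(spans) tokens.
start = time.perf_counter()
for n in spans:
    decode_span(n)
sequential = time.perf_counter() - start

# Asynchronous decoding: total time ~ max(spans) tokens.
start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    list(pool.map(decode_span, spans))
parallel = time.perf_counter() - start

print(f"toy speedup: {sequential / parallel:.2f}x")  # ~ 300/120 = 2.5x
```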
