Fast Byte Latent Transformer
Abstract
Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their practical deployment is limited by slow inference. In this work, we enhance the Byte Latent Transformer (BLT) with new training and inference techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant. BLT-D is trained with an auxiliary diffusion objective over byte blocks alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially improving decoding efficiency. Second, we propose two extensions inspired by speculative decoding that trade some speed for improved quality: BLT Self-speculation (BLT-S), a faster generation method in which BLT speculates bytes beyond its normal patch boundaries and verifies its own generations; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. Each approach offers distinct advantages, and together they overcome key barriers to the large-scale deployment of byte-level LMs.
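
To make the decoding strategies described above concrete, the sketch below shows a generic draft-then-verify byte decoding loop in the spirit of BLT-DV: a drafter proposes a block of bytes in parallel and an autoregressive verifier accepts the longest matching prefix. The callables `draft_block` and `verify_next_byte`, the block size, and the toy models are illustrative assumptions, not the paper's actual interfaces.

```python
# Minimal sketch of a draft-then-verify byte decoding loop, assuming a parallel
# byte drafter (e.g. diffusion-based) and an autoregressive verifier. The
# callables `draft_block` and `verify_next_byte` are hypothetical stand-ins,
# not the actual BLT-D / BLT-DV interfaces.
from typing import Callable, List


def speculative_byte_decode(
    prompt: List[int],
    draft_block: Callable[[List[int], int], List[int]],  # proposes k bytes at once
    verify_next_byte: Callable[[List[int]], int],        # autoregressive next-byte prediction
    max_new_bytes: int = 64,
    block_size: int = 8,
) -> List[int]:
    """Draft a block of bytes in parallel, keep the longest verified prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_bytes:
        draft = draft_block(out, block_size)  # one parallel decoding step
        for byte in draft:
            expected = verify_next_byte(out)
            if expected == byte:              # verifier agrees: accept drafted byte
                out.append(byte)
            else:                             # disagreement: fall back to verifier's byte
                out.append(expected)
                break
    return out[: len(prompt) + max_new_bytes]


if __name__ == "__main__":
    # Toy usage with dummy models: the drafter proposes repeated 'a' bytes and the
    # verifier always agrees, so every drafted block is accepted in full.
    drafter = lambda ctx, k: [0x61] * k
    verifier = lambda ctx: 0x61
    print(bytes(speculative_byte_decode(list(b"hi: "), drafter, verifier, max_new_bytes=8)))
```

When drafter and verifier agree on a full block, one verification pass accepts several bytes per step; on the first disagreement, the loop falls back to the verifier's byte, so progress is always made.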