Quantifying Temperature Scaling in Discrete Sequence (Language) Models
Abstract
Naive application of token-wise temperature scaling alters the maximum a posteriori (MAP) estimate at the sequence level, degrading model performance. This issue is exacerbated in masked diffusion models (MDMs), which estimate sequence-level likelihoods with high variance across unmasking orders. In this paper, we address the challenge of reliable temperature scaling with a novel fine-tuning procedure and introduce a new metric that measures effective temperature scaling without requiring the partition function. Our method adapts context-dependent, sequence-level temperature scaling to any-order generative models such as MDMs and introduces two new, more stable learning objectives. We achieve this by computing the expected probability of a given sequence under different unmasking orders. Our experiments on language models (bd3lm) show that this leads to more consistent generation, with lower perplexity and lower generation variance.
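As a minimal sketch of the distinction the abstract refers to (the notation $T$, $Z_t$, the vocabulary $\mathcal{V}$, and the order variable $\sigma$ are ours, not from the paper): token-wise temperature scaling renormalizes each step with a context-dependent constant, so the scaled sequence distribution is generally not a monotone transform of the original one and its MAP sequence can change, whereas sequence-level scaling preserves the MAP sequence.

% Autoregressive factorization and the two notions of temperature scaling
% (illustrative notation; not taken from the paper).
\begin{align}
  p_\theta(x) &= \prod_{t=1}^{n} p_\theta(x_t \mid x_{<t}), \\
  % Token-wise scaling: each step is renormalized by a prefix-dependent Z_t,
  % so \arg\max_x p^{\mathrm{tok}}_T(x) may differ from \arg\max_x p_\theta(x).
  p^{\mathrm{tok}}_T(x) &= \prod_{t=1}^{n}
      \frac{p_\theta(x_t \mid x_{<t})^{1/T}}{Z_t(x_{<t})},
  \qquad Z_t(x_{<t}) = \sum_{v \in \mathcal{V}} p_\theta(v \mid x_{<t})^{1/T}, \\
  % Sequence-level scaling: a monotone transform of p_\theta(x),
  % hence the MAP sequence is unchanged.
  p^{\mathrm{seq}}_T(x) &\propto p_\theta(x)^{1/T},
  \qquad \arg\max_x p^{\mathrm{seq}}_T(x) = \arg\max_x p_\theta(x).
\end{align}

For an any-order model, the "expected probability of a given sequence under different unmasking orders" mentioned above can be written, under this illustrative notation, as an average over unmasking orders $\sigma$ (permutations of the positions):

\begin{equation}
  % Order-averaged sequence probability; each order \sigma yields one estimate,
  % and averaging over orders reduces the variance the abstract refers to.
  \bar{p}_\theta(x) \;=\; \mathbb{E}_{\sigma}\!\left[\,
      \prod_{t=1}^{n} p_\theta\!\bigl(x_{\sigma(t)} \mid x_{\sigma(<t)}\bigr)\right].
\end{equation}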